Let’s say you have data that is a mix of random noise and signals. This could be, for example, radio signals over a white-noise background, or visits to an e-commerce website where some visitors are genuine customers and the rest are just browsing at random. Can one separate the real data from the noise? The answer is yes, and it rests on the fact that if you have enough data and a portion of it follows a statistical distribution, you can ‘see’ it. Technically speaking, if part of the data is normally distributed, this shows once you have enough samples, and the chi-square (chi2) test is a test that detects it. The reason to look at a normal distribution is the central limit theorem, together with the assumption that you have a lot of data. Whether one can do something similar for a small data set or for, say, a Poisson distribution, I don’t know.

Below you will find a synthetic example of the effectiveness of chi2. We create a mix of noise and signals. In the first example the signals sit in bands within the data: three bands are added at random positions to the pure noise. In the second example the signals are added at random positions throughout the data rather than in bands. In both cases we show that the chi2 test can detect the signals. It’s both a ‘proof’ and a nice code example which can be used in projects.

You can find the gist here.


Signals in bands

We create a synthetic frame of data where three bands of signals are mixed randomly:
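A minimal sketch of such a frame, assuming a binary label per row, uniform noise in every column, and a normally distributed signal whose mean depends on the label inside the bands. The function name `make_banded_data`, the label `y`, the mask `is_signal` and all numeric values are illustrative assumptions, not taken from the original gist.

```python
import numpy as np

def make_banded_data(n_samples=2000, n_features=300, band_width=20, seed=0):
    """Uniform noise everywhere; three bands of columns carry a label-dependent signal."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n_samples)       # assumed binary label per row
    X = rng.random((n_samples, n_features))      # noise: uniform on [0, 1), independent of y
    is_signal = np.zeros(n_features, dtype=bool)
    third = n_features // 3
    for k in range(3):
        # One band at a random position inside each third, so the bands never overlap.
        start = k * third + int(rng.integers(0, third - band_width + 1))
        cols = slice(start, start + band_width)
        # Inside a band the values are normal with a mean that shifts with the label.
        means = np.where(y == 1, 0.65, 0.35)[:, None]
        values = means + 0.1 * rng.standard_normal((n_samples, band_width))
        X[:, cols] = np.abs(values)              # keep values non-negative for chi2
        is_signal[cols] = True
    return X, y, is_signal

X, y, is_signal = make_banded_data()
print(X.shape, int(is_signal.sum()))
```

The mask `is_signal` records where the bands were placed, so the outcome of the test can be checked against the truth afterwards.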

and the actual chi2 test is as follows:
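One way to write the test is sketched here, assuming non-negative features and a binary label. The function `chi2_scores` is a hypothetical helper, similar in spirit to the per-feature scoring of `sklearn.feature_selection.chi2`: it compares the per-class column sums with what one would expect if the column were independent of the label. The p-value is only a rough guide for continuous features, since the chi-square approximation is meant for counts.

```python
import math
import numpy as np

def chi2_scores(X, y):
    """Chi-square statistic and approximate p-value per column of a non-negative matrix X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    if classes.size != 2:
        raise ValueError("this sketch assumes exactly two classes")
    Y = (y[:, None] == classes[None, :]).astype(float)  # one-hot labels, shape (n_samples, 2)
    observed = Y.T @ X                                   # per-class sum of every column
    expected = np.outer(Y.mean(axis=0), X.sum(axis=0))   # sums expected under independence
    # Columns that sum to zero would divide by zero here; the demo data has none.
    stat = ((observed - expected) ** 2 / expected).sum(axis=0)
    # Two classes give one degree of freedom, where the tail probability is erfc(sqrt(x / 2)).
    p = np.array([math.erfc(math.sqrt(s / 2.0)) for s in stat])
    return stat, p

# Tiny demo: four noise columns followed by one signal column (index 4).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
noise = rng.random((1000, 4))
signal = np.abs(np.where(y == 1, 0.65, 0.35) + 0.1 * rng.standard_normal(1000))
X = np.column_stack([noise, signal])
stat, p = chi2_scores(X, y)
print(np.round(stat, 2))
```

The statistic of the signal column should stand far above those of the noise columns, which is the effect the rest of the experiment relies on.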

Note that the chi2 test is implemented differently in Excel, MATLAB, Mathematica and R.

If you apply the test to the synthetic data and plot the result, you get:

[Figure: Chi2 Filtering]

The result is massively convincing but at the same time not a surprise.

Signals everywhere

We can perform the same experiment with the signals spread across the whole data set rather than confined to intervals:
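A minimal sketch of the scattered variant, under the same assumptions as before (binary label, uniform noise, label-dependent normal signal). The only change is that the signal columns are drawn at random from the whole range. The name `make_scattered_data` and the numbers are illustrative assumptions.

```python
import numpy as np

def make_scattered_data(n_samples=2000, n_features=300, n_signal=60, seed=1):
    """Uniform noise everywhere; n_signal randomly chosen columns carry the signal."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n_samples)       # assumed binary label per row
    X = rng.random((n_samples, n_features))      # noise: uniform on [0, 1), independent of y
    # Pick distinct signal columns anywhere in the frame instead of contiguous bands.
    signal_cols = np.sort(rng.choice(n_features, size=n_signal, replace=False))
    means = np.where(y == 1, 0.65, 0.35)[:, None]
    values = means + 0.1 * rng.standard_normal((n_samples, n_signal))
    X[:, signal_cols] = np.abs(values)           # keep values non-negative for chi2
    return X, y, signal_cols

X, y, signal_cols = make_scattered_data()
print(X.shape, signal_cols[:5])
```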

The chi2 test and the result are just as easily extracted:
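As a sketch of the extraction step, the snippet below scores every column and keeps those whose approximate p-value falls under a Bonferroni-corrected threshold. The threshold `alpha`, the data parameters and the variable names are assumptions chosen for illustration. The data is regenerated inline so the snippet runs on its own.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_features, n_signal = 2000, 300, 60

# Synthetic data: uniform noise with label-dependent signal in random columns.
y = rng.integers(0, 2, size=n_samples)
X = rng.random((n_samples, n_features))
true_cols = np.sort(rng.choice(n_features, size=n_signal, replace=False))
means = np.where(y == 1, 0.65, 0.35)[:, None]
X[:, true_cols] = np.abs(means + 0.1 * rng.standard_normal((n_samples, n_signal)))

# Chi-square statistic per column: observed per-class sums against expected sums.
Y = np.column_stack([y == 0, y == 1]).astype(float)
observed = Y.T @ X
expected = np.outer(Y.mean(axis=0), X.sum(axis=0))
stat = ((observed - expected) ** 2 / expected).sum(axis=0)
p = np.array([math.erfc(math.sqrt(s / 2.0)) for s in stat])  # one degree of freedom

# Keep the columns that pass a Bonferroni-corrected threshold (assumed alpha = 0.01).
alpha = 0.01
detected = np.flatnonzero(p < alpha / n_features)
missed = np.setdiff1d(true_cols, detected)
false_alarms = np.setdiff1d(detected, true_cols)
print(len(detected), len(missed), len(false_alarms))
```

Plotting `stat` against the column index gives a picture of the kind described here: isolated spikes at the signal columns over a flat noise floor.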

[Figure: Chi2 everywhere]

The result is just as convincing as for the previous synthetic sample.