When talking about dimensionality reduction in the context of machine learning you have many options: linear discriminant analysis (LDA), principal component analysis (PCA) and many others. Here I want to highlight a technique that I explored as part of a very large research project and which uses the chi-squared (chi2) test.

Imagine the typical marketing context of clients consuming attributes within some business domain (an online bookshop, marketing interactions through mail and whatnot…), creating a big matrix whose rows hold single clients and whose columns hold single attributes. This binary matrix contains zeros and ones corresponding to ‘client has not used’ and ‘client has used’ respectively. Add to this big matrix an additional binary column which tells whether the client has bought the product or not (technically, whether a conversion occurred). It seems easy enough to use various classifiers on this type of matrix, until you learn that there are half a million columns involved. Obviously you need to single out the significant attributes and feed only those to the classifier. This process is called dimensionality reduction and is both a deep and subtle topic in machine learning.
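As a toy illustration (the values below are made up, with three attributes instead of half a million), such a matrix could look like this in R:

# each row is a client, each attribute column holds 0/1 usage, Target marks a conversion
clients = data.frame(
  A1 = c(1, 0, 1, 0),
  A2 = c(0, 0, 1, 1),
  A3 = c(1, 1, 0, 0),
  Target = c(1, 0, 1, 0)
)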

One of the classic methods you learn in standard statistics courses is the chi2 test, which tells you whether two categorical variables (here, binary vectors) are correlated or not, based on whether a test statistic lands in the body or in the tail of the chi-squared distribution. The question I asked myself is whether the method can filter signal from noise in a situation where you have many thousands of vectors and you have a priori defined some of them as random and some as intentionally correlated to the classification (the last column we added). To my delight the answer is unequivocally ‘yes’: it very sharply picks out the columns which are by design correlated to the classification variable. Of course, like anything else in statistics you have a blurred boundary and you have to define a threshold, but I would hope that the picture below is rather convincing nevertheless.

The code below uses the standard R function ‘chisq.test’, and be careful when using other software, since it seems every application and library has its own way of doing chi2.
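To get a feel for what the test returns, here is a small sketch with two made-up 2x2 contingency tables, one where the attribute clearly depends on the target and one where it does not (note that chisq.test applies Yates’ continuity correction to 2x2 tables by default, exactly the kind of library-specific detail to watch out for):

# an attribute which is mostly 1 when the target is 1
correlated = matrix(c(40, 10, 15, 35), nrow = 2,
                    dimnames = list(attribute = c("0", "1"), target = c("0", "1")))
chisq.test(correlated)$p.value   # tiny p-value: the two variables are dependent

# an attribute with roughly the same distribution in both target classes
independent = matrix(c(26, 24, 25, 25), nrow = 2,
                     dimnames = list(attribute = c("0", "1"), target = c("0", "1")))
chisq.test(independent)$p.value  # large p-value: no evidence of dependence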

So, the plan is as follows:

  • create a data frame with rows corresponding to clients, some of which are classified (Target=1) and some not (Target=0)
  • create some columns which have a positive correlation to the Target and a lot of other columns which are purely random (no correlation)
  • the correlated columns are not scattered at random but grouped within a few contiguous intervals; this helps the visualization but is in no way crucial to the result
  • chi2 is applied to every column and the p-value of Pearson's test is plotted together with the artificially created interval boundaries, showing that despite a lot of randomness the predefined intervals can be recovered, as can be seen in the output below

Chi2 Filtering

We define three column intervals within which the attributes are correlated to the target. Within these intervals the probability of having a 1 is higher than outside them.

library(dplyr)
library(rpart)

probabilityInsideCluster = 0.5
probabilityOutsideCluster = 0.2
probabilityNoise = 0.3
# three intervals of correlated columns: [10,25], [56,81] and [151,293]
clusterBoundaries=c(10,25,56,81,151,293)
# counts how many clients of each of the four types are generated
clientTypeCount = c(0,0,0,0)

These values define a custom stochastic variable which generates a single client; the data frame is then simply a binding of as many clients as you wish. Concretely, it goes along the following lines (the exact way the client types are mapped to the intervals is a detail you can vary at will):

makeClientsFrame = function(attribCount = 1000, clientCount = 1000, typeDistribution = c(0.0,0.3,0.3,0.4) ){
  # a stochastic cluster variable
  # if k is null or 0 the noise is uniform
  stoch = function(k, i){
    if(is.null(k) || k == 0) return(ifelse(runif(1)<probabilityNoise, 1, 0))
    # inside the k-th interval a 1 is more likely than outside it
    if(clusterBoundaries[2*k-1] <= i && i <= clusterBoundaries[2*k]) return(ifelse(runif(1)<probabilityInsideCluster, 1, 0))
    return(ifelse(runif(1)<probabilityOutsideCluster, 1, 0))
  }
  # assumption: a client of type k=0 is pure noise, a type k=1,2,3 client favors the k-th
  # interval, and the type is stored in the Target column
  makeClient = function(k){
    clientTypeCount[k+1] <<- clientTypeCount[k+1] + 1
    c(sapply(1:attribCount, function(i){ stoch(k, i) }), k)
  }
  types = sample(0:3, clientCount, replace = TRUE, prob = typeDistribution)
  frame = as.data.frame(do.call(rbind, lapply(types, makeClient)))
  colnames(frame) = c(paste("A", 1:attribCount, sep = ""), "Target")
  frame
}
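
The plot further down runs over 500 attributes, so the base frame is built along the following lines (the exact parameter values used for the picture are an assumption):

# assumed parameters: 500 attributes, 1000 clients, default type distribution
baseFrame = makeClientsFrame(attribCount = 500, clientCount = 1000)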

The chi2 test is then applied to each attribute column by building a contingency table of the attribute values against the Target classes:

chi = function(frame, index){
  # build a 2 x 3 contingency table: attribute value (0/1) against the three client types in Target
  colName = paste("A", index, sep = "")
  colIndex = as.numeric( which(colnames(frame) == colName))
  targetIndex= as.numeric( which(colnames(frame) == "Target"))
  q = frame[,c(colIndex, targetIndex)]
  # count_ is defunct in current dplyr; count() with the .data pronoun does the same job
  c1 = count(filter(q, Target == 1), .data[[colName]])$n
  c2 = count(filter(q, Target == 2), .data[[colName]])$n
  c3 = count(filter(q, Target == 3), .data[[colName]])$n
  df = data.frame(c1, c2, c3)
  # Pearson's chi-squared test of independence; only the p-value is kept
  chisq.test(df)$p.value
}
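
As a quick sanity check: column 20 sits inside the first interval [10,25] while column 40 sits outside all three, so the former should come back with a far smaller p-value than the latter.

chi(baseFrame, 20)   # attribute inside a correlated interval: p-value essentially zero
chi(baseFrame, 40)   # purely random attribute: p-value typically far from zero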

Applying this to every attribute of our base frame and plotting the p-values

chis = sapply(1:500, function(x){ chi(baseFrame, x) })
plot(1:500, chis, xlab = "attribute index", ylab = "p-value")
abline(v = clusterBoundaries, col = "red")

you get the image displayed above. While I did know that chi2 was a method to get correlations out of data, I did not expect such a contrasted picture. What does this mean with respect to dimensionality reduction? It means that you can throw away every attribute whose p-value is above, say, 0.5. Of course, with real-world data things are not so clean, and the choice of threshold is dictated by the business context or some other external input.
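
For completeness, a minimal sketch of the actual reduction step, using the threshold mentioned above:

# keep only the attributes whose p-value falls below the chosen threshold
keep = which(chis < 0.5)
reducedFrame = baseFrame[, c(paste("A", keep, sep = ""), "Target")]
dim(reducedFrame)   # the surviving attributes plus the Target column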