Common wisdom says that, given a dataset on which you wish to apply some machine learning, you should train on 80% of the data and test on the remaining 20%. The problem is that this implicitly assumes the data is balanced: for a binary classification problem, say, it assumes both classes appear roughly equally often. In many real-world situations this is not the case. Imagine for instance that you analyze data related to plane or car accidents; there are (luckily) many more small incidents than large ones. Or imagine you wish to optimize the conversion of website visitors; there are many more people just looking around than actually buying something.
How to apply the 80/20 idea to imbalanced data is the general topic of imbalanced machine learning.
Collecting more data and over/undersampling are often suggested when dealing with imbalanced data. While collecting more data is always a good idea, it is not always feasible or practical. Oversampling means duplicating samples of the minority class to increase its share; undersampling means keeping only a subset of the majority class to reduce the imbalance. Undersampling is in fact already what happens above with the 80/20 proportions. The problem is: how do you undersample in a statistically sound fashion? Oversampling can help, but with a 1:100 class ratio the question is whether replicating the minority class 100 times is acceptable. A minimal sketch of both ideas is shown below.
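To make the two concepts concrete, here is a minimal sketch, assuming a hypothetical data frame df with a binary column class containing the values "rare" and "common" (neither df nor class appears elsewhere in this post):
minority <- df[df$class == "rare", ]    # hypothetical minority rows
majority <- df[df$class == "common", ]  # hypothetical majority rows
## oversampling: replicate minority rows (with replacement) until the sizes match
oversampled <- rbind(majority,
                     minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])
## undersampling: keep only as many majority rows as there are minority rows
undersampled <- rbind(minority,
                      majority[sample(nrow(majority), nrow(minority)), ])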
Synthetic data is another approach: it amounts to creating new data from the existing data in such a way that it mimics the characteristics of the original. It differs from over/undersampling in that it does not simply replicate data; it shifts the data slightly while keeping its essence. The example below shows how it works.
The SMOTE approach (Synthetic Minority Over-sampling Technique) is a well-documented technique for creating such synthetic samples, and it works well for us (see below).
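The core idea is simple: a synthetic sample is placed at a random point on the segment between a minority sample and one of its nearest minority-class neighbours. A tiny sketch with made-up coordinates:
x          <- c(5.1, 3.5)  # an existing minority sample (made-up values)
xNeighbour <- c(4.9, 3.0)  # one of its nearest minority-class neighbours
xSynthetic <- x + runif(1) * (xNeighbour - x)  # lands somewhere between the two points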
Finally, penalized modeling means that misclassifications are not weighted equally across the classes; errors on one class (typically the minority class) are penalized more heavily than on the other. This technique aims to improve predictive accuracy but does not alter anything on the level of the data.
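As an illustration (not part of the workflow below), rpart accepts a loss matrix that makes one kind of misclassification more costly than the other. The sketch assumes a two-class factor with levels "common" and "rare", like the data frame built further down:
## rows = true class, columns = predicted class, in factor-level order ("common", "rare")
lossMatrix <- matrix(c(0,  1,   # true "common": predicting "common" costs 0, "rare" costs 1
                       10, 0),  # true "rare":   predicting "common" costs 10, "rare" costs 0
                     nrow = 2, byrow = TRUE)
penalizedTree <- rpart(Species ~ ., data, parms = list(loss = lossMatrix))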
Below we highlight the SMOTE technique and show that by balancing the data you can increase predictive accuracy. More information on imbalanced data can also be found here:
We will use the following R-packages:
- caret: package for getting the confusion matrix
- rpart: package for decision trees
- ROCR: package for ROC/AUC
- DMwR: package for SMOTE'ing data
If you need to install things use
install.packages("DMwR", dependencies=TRUE, repos='http://cran.rstudio.com/')
and possibly set the repository location
#options(repos=structure(c(CRAN="https://lib.ugent.be/CRAN/")))
options(repos=structure(c(CRAN="https://cloud.r-project.org/")))
## load the required libraries
library(data.table)
library(rpart)
library(caret)
library(ROCR)
library(DMwR)
SMOTE’d data: illustration
This is a simple example showing that the addition of synthetic data (using the SMOTE algorithm) does increase the predictive power of decision trees. The visualization is particularly helpful for seeing that the synthetic samples are not purely random shifts but are generated in a smart way.
The iris dataset is often used in R to experiment with machine learning.
data(iris)
## keep Sepal.Length, Sepal.Width and the Species label
data <- iris[, c(1, 2, 5)]
## recode: "setosa" becomes the rare class, the two other species the common class
data$Species <- factor(ifelse(data$Species == "setosa", "rare", "common"))
## checking the class distribution of this artificial data set
table(data$Species)
In this case there is a 1:2 imbalance which we will try to shift. If we fit a decision tree to this dataset and evaluate it on some test data we get:
## random train and test samples (note: drawn independently, so rows may overlap)
testData = data[sample(1:nrow(data),50),]
trainingData = data[sample(1:nrow(data),80),]
## decision tree with the X-SE post-pruning rule from DMwR
iniTree = rpartXse(Species ~ .,trainingData, se=0.5)
## class probabilities; column 2 holds the probability of the "rare" class
predictions = as.data.frame(predict(iniTree, testData))
rocrpred <- prediction(predictions[,2], testData$Species)
as.numeric(performance(rocrpred, "auc")@y.values) # 0.678
plot(performance(rocrpred, "tpr", "fpr"), colorize = TRUE)
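The caret package is loaded mainly for its confusionMatrix function; as a complementary, class-level view of the same tree, here is a sketch using the default class predictions rather than a tuned threshold:
## confusion matrix of the initial tree on the test data, "rare" as the positive class
classPred <- predict(iniTree, testData, type = "class")
confusionMatrix(classPred, testData$Species, positive = "rare")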
## perc.over = 600: create 6 synthetic "rare" cases per original rare case;
## perc.under = 100: keep 1 "common" case per synthetic rare case generated
newData <- SMOTE(Species ~ ., data, perc.over = 600, perc.under = 100)
table(newData$Species)
par(mfrow = c(1, 2))
plot(data[, 1], data[, 2], pch = 19 + as.integer(data[, 3]),
main = "Original Data")
plot(newData[, 1], newData[, 2], pch = 19 + as.integer(newData[,3]),
main = "SMOTE'd Data")
With this data we have an almost 1:1 balanced set, and the plot shows that the new rows interpolate between the existing ones rather than merely duplicating them. If we now create a decision tree based on the SMOTE'd data and use the same test data we get:
## decision tree on the SMOTE'd data, evaluated on the original test data
smoTree <- rpartXse(Species ~ .,newData, se=0.5)
predictions = as.data.frame(predict(smoTree, testData))
rocrpred <- prediction(predictions[,2], testData$Species)
print(as.numeric(performance(rocrpred, "auc")@y.values) )
plot(performance(rocrpred, "tpr", "fpr"), colorize = TRUE)
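For completeness, the confusion-matrix check from above can be repeated for the SMOTE'd tree, so both models are compared on identical test data (a sketch mirroring the earlier caret call):
classPredSmote <- predict(smoTree, testData, type = "class")
confusionMatrix(classPredSmote, testData$Species, positive = "rare")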