The essence of propensity scoring in pseudo-code is as follows:

  • a distance metric is defined between vectors of arbitrary dimension
  • for each unclassified vector u:
    • for each classified vector c:
      • measure the distance between u and c
    • retain the smallest distance for u. In a kNN approach the specific c would matter, but not in this case.

At the end of all this looping you get a list of customers with a score (corresponding to the shortest distance to a classified one).
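
The loop above can be sketched in a few lines of plain C++. This is only an illustration under assumptions: the Euclidean metric is an assumed choice (the pseudo-code leaves the metric open), the names euclidean and propensityScores are hypothetical, and nothing here depends on Rcpp.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One customer is one feature vector; all vectors are assumed to have the same length.
using Vec = std::vector<double>;

// Euclidean distance (an assumed metric; any other metric could be swapped in).
double euclidean(const Vec& a, const Vec& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// For every unclassified vector, return the distance to its nearest classified vector.
// Which classified vector is nearest is deliberately not recorded.
std::vector<double> propensityScores(const std::vector<Vec>& classified,
                                     const std::vector<Vec>& unclassified) {
    std::vector<double> scores;
    scores.reserve(unclassified.size());
    for (const Vec& u : unclassified) {
        double best = std::numeric_limits<double>::infinity();
        for (const Vec& c : classified) {
            best = std::min(best, euclidean(u, c));
        }
        scores.push_back(best);
    }
    return scores;
}
```

A smaller score means the visitor sits closer to some existing customer. If the classified set is empty, every score stays at infinity.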

The typical picture you get out of this looks like the following:

(Figure: Propensity Spikes.)


A sharp spike of customers lies close to the existing customers, well separated from the rest. That is, a small set of almost-buying ones always shows up. Plotting the distance against the number of customers found at that distance gives the following sigmoid curve

(Figure: Propensity Curve.)

which again emphasizes the small set of almost-classifieds and a large distance to the rest.
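
The data behind such a curve can be computed from the scores alone. The sketch below assumes the curve is cumulative, that is, for each distance threshold it counts the customers whose score is at most that threshold. The name cumulativeCounts is hypothetical and the thresholds are whatever grid you choose.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// For each threshold, count how many scores are less than or equal to it.
// The scores are taken by value so the caller's vector is left unsorted.
std::vector<std::size_t> cumulativeCounts(std::vector<double> scores,
                                          const std::vector<double>& thresholds) {
    std::sort(scores.begin(), scores.end());
    std::vector<std::size_t> counts;
    counts.reserve(thresholds.size());
    for (double t : thresholds) {
        // upper_bound points just past the last score <= t in the sorted copy.
        auto it = std::upper_bound(scores.begin(), scores.end(), t);
        counts.push_back(static_cast<std::size_t>(it - scores.begin()));
    }
    return counts;
}
```

A flat stretch in the counts between two thresholds corresponds to the gap between the almost-classified group and the rest.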

Rcpp code

This code is really a straightforward implementation of the pseudo-code above. There are some technical tricks involved, but those are specific to Rcpp and not to the algorithm at hand.

Using it in R

With this Rcpp algorithm one can move on to creating the scores for actual data. You will need a data frame containing your existing customers and a data frame with the visitors or to-be customers:


One important note here is that the method works on the transpose of the frame, simply because looping over columns is much easier in Rcpp than looping over rows. Hence the “t” of the transposition when passing the data.
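
One way to see why columns are the convenient unit: R stores a matrix column by column, so after transposing, each customer occupies one contiguous block of memory. The plain C++ sketch below mimics that layout. ColMajor, column and row are hypothetical names for illustration and are not part of Rcpp.

```cpp
#include <cstddef>
#include <vector>

// A dense matrix stored column by column, the way R lays out its matrices.
struct ColMajor {
    std::size_t rows;
    std::size_t cols;
    std::vector<double> data;  // rows * cols values, column after column
};

// Column j is one contiguous slice of the underlying storage.
std::vector<double> column(const ColMajor& m, std::size_t j) {
    auto first = m.data.begin() + static_cast<std::ptrdiff_t>(j * m.rows);
    return std::vector<double>(first, first + static_cast<std::ptrdiff_t>(m.rows));
}

// Row i, by contrast, has to be gathered element by element with a stride of m.rows.
std::vector<double> row(const ColMajor& m, std::size_t i) {
    std::vector<double> out(m.cols);
    for (std::size_t j = 0; j < m.cols; ++j) {
        out[j] = m.data[j * m.rows + i];
    }
    return out;
}
```

With one customer per column, the distance loop can walk over whole columns without any strided access.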