Probabilistic programming is not just another way of thinking about modeling; on standard tasks it is as effective as any other machine learning algorithm. Applied to the iris dataset (the hello-world of ML), it looks like the following.

We’ll model the binary classification of the ‘setosa’ and ‘versicolor’ types using the sepal length. So, we assume that the sigmoid

$$\theta = \frac{1}{1 + e^{-\mu}}$$

can differentiate the two classes with

$$\mu = \alpha + \beta x$$

and the binary classification (success) distributed as Bernoulli:

$$y \sim \mathrm{Bernoulli}(\theta)$$

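A quick sanity check on this setup: the decision boundary sits at $x = -\alpha/\beta$, because that is where $\mu = 0$ and the sigmoid returns exactly 0.5. A small NumPy sketch, with made-up parameter values for illustration:

```python
import numpy as np

def sigmoid(mu):
    """Logistic function: maps the linear predictor to a probability."""
    return 1 / (1 + np.exp(-mu))

# Made-up parameter values for illustration
alpha, beta = -8.0, 1.5

# theta = 0.5 exactly where mu = 0, i.e. at x = -alpha/beta
bd = -alpha / beta
theta_at_bd = sigmoid(alpha + beta * bd)  # ≈ 0.5
```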
The data has to be transformed into a binary problem first:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pymc3 as pm

sns.set()
%matplotlib inline

iris = sns.load_dataset("iris")
df = iris.query("species == ('setosa', 'versicolor')")
y_0 = pd.Categorical(df['species']).codes
x_n = 'sepal_length'
x_0 = df[x_n].values
```
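If you want to see what the label encoding does without pandas, the same 0/1 codes can be produced by hand. A sketch: `pd.Categorical` orders the categories alphabetically, so 'setosa' maps to 0 and 'versicolor' to 1.

```python
# Toy stand-in for df['species']
species = ['setosa', 'versicolor', 'setosa', 'versicolor']

# Alphabetical category order, matching pd.Categorical's default
categories = sorted(set(species))              # ['setosa', 'versicolor']
codes = [categories.index(s) for s in species]  # [0, 1, 0, 1]
```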

and the translation of the model description amounts to:

```python
with pm.Model() as model_0:
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10)
    mu = alpha + pm.math.dot(x_0, beta)
    theta = pm.Deterministic('theta', 1 / (1 + pm.math.exp(-mu)))
    bd = pm.Deterministic('bd', -alpha / beta)
    yl = pm.Bernoulli('yl', theta, observed=y_0)

    start = pm.find_MAP()
    step = pm.NUTS()
    trace_0 = pm.sample(5000, step, start)

chain_0 = trace_0[1000:]
varnames = ['alpha', 'beta', 'bd']
pm.traceplot(chain_0, varnames)
```

where you are free to experiment with different samplers, initializations, and tuning settings, much as one would with neural networks.
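To make that point concrete, here is a minimal random-walk Metropolis sampler for the same alpha/beta posterior, written in plain NumPy. This is a sketch on synthetic stand-in data, not a replacement for PyMC3's NUTS, but it shows what "experimenting with the sampler" amounts to under the hood:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the iris data: one feature, binary labels
x = rng.normal(5.5, 1.0, size=100)
y = (rng.random(100) < 1 / (1 + np.exp(-(x - 5.5)))).astype(int)

def log_post(alpha, beta):
    """Log posterior: Normal(0, 10) priors plus Bernoulli likelihood."""
    mu = alpha + beta * x
    # log(sigmoid(mu)) = -logaddexp(0, -mu), computed stably
    loglik = np.sum(y * -np.logaddexp(0, -mu) + (1 - y) * -np.logaddexp(0, mu))
    logprior = -(alpha**2 + beta**2) / (2 * 10**2)
    return loglik + logprior

samples = []
theta = np.array([0.0, 0.0])  # current (alpha, beta)
lp = log_post(*theta)
for _ in range(5000):
    proposal = theta + rng.normal(0, 0.3, size=2)
    lp_prop = log_post(*proposal)
    # Accept with probability min(1, posterior ratio)
    if np.log(rng.random()) < lp_prop - lp:
        theta, lp = proposal, lp_prop
    samples.append(theta)

samples = np.array(samples)[1000:]  # drop burn-in, as with trace_0
```

NUTS explores the posterior far more efficiently than this random walk, which is why it is the default, but the accept/reject logic above is all a basic MCMC sampler needs.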

A nice visualization of the decision boundary can be obtained as follows:

```python
theta = trace_0['theta'].mean(axis=0)
idx = np.argsort(x_0)

plt.plot(x_0[idx], theta[idx], color='b', lw=3)
plt.axvline(trace_0['bd'].mean(), ymax=1, color='red')
bd_hpd = pm.hpd(trace_0['bd'])
plt.fill_betweenx([0, 1], bd_hpd[0], bd_hpd[1], color='r', alpha=0.5)

plt.plot(x_0, y_0, 'o', color='k')
theta_hpd = pm.hpd(trace_0['theta'])[idx]
plt.fill_between(x_0[idx], theta_hpd[:, 0], theta_hpd[:, 1], color='b', alpha=0.5)

plt.xlabel(x_n, fontsize=16)
plt.ylabel(r'$\theta$', rotation=0, fontsize=16)
```

For me, this is where the difference from non-probabilistic approaches is clearest: the uncertainty of the priors propagates through the model, so instead of a single point prediction you get a distribution over predictions.
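That "fuzziness" can be made explicit: push every posterior draw through the sigmoid and summarize the spread. A sketch, using made-up posterior draws standing in for `trace_0['alpha']` and `trace_0['beta']`:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up posterior draws standing in for trace_0['alpha'], trace_0['beta']
alpha_draws = rng.normal(-8.0, 0.8, size=4000)
beta_draws = rng.normal(1.5, 0.15, size=4000)

# Class probability for a new flower, one value per posterior draw
x_new = 5.4
theta_draws = 1 / (1 + np.exp(-(alpha_draws + beta_draws * x_new)))

# Posterior mean and a 94% credible interval for the probability
mean = theta_draws.mean()
lo, hi = np.percentile(theta_draws, [3, 97])
```

A point-estimate classifier would report only `mean`; the interval `[lo, hi]` is the extra information the Bayesian treatment gives you for free.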

Similarly for the parameter distributions:

Finally, to use this model to predict new cases you can assemble something like this:

```python
def classify(n, threshold):
    """A simple classifying function."""
    n = np.array(n)
    mu = trace_0['alpha'].mean() + trace_0['beta'].mean() * n
    prob = 1 / (1 + np.exp(-mu))
    return prob, prob > threshold

# usage:
classify([5, 5.5, 6], 0.4)