Probabilistic programming is not just another way of thinking, it’s just as effective as any other machine learning algorithm. If applied to the iris dataset (the hello-world of ML) you get something like the following.

We’ll model the binary classification of the ‘setosa’ and ‘versicolor’ types using the sepal length. So, we assume that the sigmoid \theta
\theta \sim (1 + \exp-(\alpha +\beta.x))^{-1}

can differentiate the two classes with

\alpha \sim N(0, 10)
\beta \sim N(0, 10)

and the binary classification (success) distributed as Bernoulli:

bin \sim B(\theta)

The data has to be transformed into a binary problem first:

import seaborn as sns
import pandas as pd
import pymc3 as pm

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sea
%matplotlib inline
iris = sea.load_dataset("iris")

df = iris.query("species == ('setosa', 'versicolor')")
y_0 = pd.Categorical(df['species']).codes
x_n = 'sepal_length'
x_0 = df[x_n].values

and the translation of the model description amounts to:

with pm.Model() as model_0:
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10)
    mu = alpha +, beta)
    theta = pm.Deterministic('theta', 1 / (1 + pm.math.exp(-mu)))
    bd = pm.Deterministic('bd', -alpha/beta)

    yl = pm.Bernoulli('yl', theta, observed=y_0)
    start = pm.find_MAP()
    step = pm.NUTS()
    trace_0 = pm.sample(5000, step, start)
chain_0 = trace_0[1000:]
varnames = ['alpha', 'beta', 'bd']
pm.traceplot(chain_0, varnames)

where you are free to experiment with various optimizations. Much like one would with neural networks in fact.

A nice visualization of the decision boundary can be obtained as follows:

theta = trace_0['theta'].mean(axis=0)
idx = np.argsort(x_0)
plt.plot(x_0[idx], theta[idx], color='b', lw=3);
plt.axvline(trace_0['bd'].mean(), ymax=1, color='red')
bd_hpd = pm.hpd(trace_0['bd'])
plt.fill_betweenx([0, 1], bd_hpd[0], bd_hpd[1], color='r', alpha=0.5)

plt.plot(x_0, y_0, 'o', color='k')
theta_hpd = pm.hpd(trace_0['theta'])[idx]
plt.fill_between(x_0[idx], theta_hpd[:,0], theta_hpd[:,1], color='b', alpha=0.5)

plt.xlabel(x_n, fontsize=16)
plt.ylabel(r'$\theta$', rotation=0, fontsize=16)

For myself, this is where one sees most clearly the difference with non-probabilistic approaches. The uncertainty of the prior propagates into the model and you get fuzzy predictive power.

Similarly for the parms distributions:

Finally, to use this model to predict new cases you can assemble something like this:

def classify(n, threshold):
    A simple classifying function
    n = np.array(n)
    mu = trace_0['alpha'].mean() + trace_0['beta'].mean() * n
    prob = 1 / (1 + np.exp(-mu))
    return prob, prob > threshold 

#usage: classify([5, 5.5, 6], 0.4)