In a previous article I explained the way I see neural networks and gave some basic examples. Personally I believe in ‘simple examples’ as a way to grasp crucial principles, and this article continues in that fashion. By looking at a single cell, the activation functions are highlighted, and it is shown that the most appropriate one can be picked using grid search. Along the way you can see that simple feedforward networks are a way to dissipate noise and that neural networks are really just functions.

Like the previous article, the examples are on top of the Keras framework but you can recreate all of this in TensorFlow, Caffe or any other neural framework.

Feedforward networks are fairly simple but can nevertheless produce great results. One would sometimes forget, considering what the internet is buzzing about, that not everything needs to be cast into convolutional and/or recurrent topologies.

None of the examples require a GPU or a datacenter; the synthetic data is designed to highlight a particular aspect rather than model a real-world case.

This article is also available as a Jupyter notebook on Gist.

Counting from 0 to 9

Let’s start with teaching a network to count from 0 to 9. The code predicts the next number for a given sequence of previous numbers; the number 9 is followed by 0, in cycles.

Note that the data is partitioned into buckets, so the to-be-predicted number is not based on a single digit but on a bucket of digits. When data has some time-like ordering one typically uses networks with memory, a.k.a. recurrent networks, but this bucket approach works just as well in simple situations (i.e. no variations and few features). You should try to build the same prediction network with only a single number as input. The bucket approach is useful preparation for recurrent networks, where one typically has this look-back situation.
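A minimal sketch of this bucket approach, written against the tf.keras API (the window size, layer width and epoch count here are assumptions for illustration, not the article's original values):

```python
import numpy as np
from tensorflow import keras

WINDOW = 5  # bucket of previous digits fed to the network

def make_buckets(n_cycles=200, window=WINDOW):
    """Cycle 0..9 and cut the sequence into (window, next digit) pairs."""
    seq = np.tile(np.arange(10), n_cycles)
    X = np.array([seq[i:i + window] for i in range(len(seq) - window)])
    y = seq[window:]
    return X / 9.0, keras.utils.to_categorical(y, 10)

X, y = make_buckets()

model = keras.Sequential([
    keras.Input(shape=(WINDOW,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=50, verbose=0)

# the network should learn that the sequence wraps: 9 is followed by 0
pred = model.predict(np.array([[5, 6, 7, 8, 9]]) / 9.0, verbose=0)
print(pred.argmax())
```

Since a purely cyclic sequence only contains ten distinct windows, this is trivially learnable and trains in seconds.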

Bit shift operator

Like the counting example, we take some binary buckets and shift the 1-bits one position to the right.

The output is not precise but you can truncate it and plot it to see it more clearly.
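A sketch of the single-layer ("single") variant, with made-up data sizes; rounding the raw output does the truncation:

```python
import numpy as np
from tensorflow import keras

BITS = 8

def make_shift_data(n=1024, seed=0):
    """Random bit vectors paired with their right-shifted versions."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n, BITS)).astype(float)
    y = np.roll(X, 1, axis=1)
    y[:, 0] = 0  # bits shifted past the edge are dropped, not wrapped
    return X, y

X, y = make_shift_data()

model = keras.Sequential([
    keras.Input(shape=(BITS,)),
    keras.layers.Dense(BITS, activation="linear"),  # the single-layer solution
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=100, verbose=0)

raw = model.predict(X[:1], verbose=0)
print(np.round(raw))  # rounding should recover clean bits
```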

Using the score you can see that all solutions give 100% accuracy, but the loss differs:
single: 20.62%
one extra: 4.05%
two extra: 5.82%
one (20): 0.24%
two (20): 0.00%

So, you don’t need to increase complexity in order to achieve accuracy, but the signal will be sharper if you do.

The truncation to integers can also be achieved by means of custom layers. Below you can find the Round layer which does precisely this.
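A sketch of such a layer against the tf.keras API (the article's original version would have used the Keras backend functions, but the idea is the same):

```python
import tensorflow as tf
from tensorflow import keras

class Round(keras.layers.Layer):
    """Custom layer that rounds its input to the nearest integer.

    Note: rounding has a zero gradient almost everywhere, so this layer
    is only useful at inference time, stacked on an already-trained model.
    """
    def call(self, inputs):
        return tf.round(inputs)

out = Round()(tf.constant([[0.1, 0.9, 2.4]]))
print(out)  # [[0., 1., 2.]]
```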

Neurons as functions

A single neuron, node or cell is just a function and if you play a bit with the API you can also visualize the various activation functions.
Let’s explicitly assign weights to a single cell so that random initialization does not affect the output:
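For example (a sketch with tf.keras; sigmoid is just one choice of activation), a single cell with the weight pinned to 1 and the bias to 0:

```python
import numpy as np
from tensorflow import keras

cell = keras.Sequential([
    keras.Input(shape=(1,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
# weight = 1, bias = 0, so the cell computes exactly sigmoid(x)
cell.layers[0].set_weights([np.array([[1.0]]), np.array([0.0])])

out = cell.predict(np.array([[0.0]]), verbose=0)
print(out)  # sigmoid(0) = 0.5
```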

Note that if no activation is specified it defaults to linear, and that compiling a network typically assigns random weights to the nodes. Details of how the activations are effectively implemented (really straightforward) can be found here, but you can also easily plot the activation functions by using the single cell as a function:
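A sketch of that plotting trick, assuming matplotlib is available; each iteration builds a fresh single-cell "function" with fixed weights and evaluates it over a grid:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from tensorflow import keras

x = np.linspace(-4, 4, 200).reshape(-1, 1)
curves = {}

plt.figure()
for name in ["linear", "sigmoid", "tanh", "relu", "softplus"]:
    cell = keras.Sequential([
        keras.Input(shape=(1,)),
        keras.layers.Dense(1, activation=name),
    ])
    # weight = 1, bias = 0: the cell is exactly the activation function
    cell.layers[0].set_weights([np.array([[1.0]]), np.array([0.0])])
    curves[name] = cell.predict(x, verbose=0).ravel()
    plt.plot(x, curves[name], label=name)
plt.legend()
plt.savefig("activations.png")
```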

If you want a custom activation function you can simply plug in your own function instead of a name (string). Whether something like the sine below makes sense is of course another matter, but you can indeed use anything you like:
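For instance (a sketch; tf.sin stands in for any callable you might choose):

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# a callable can be passed wherever an activation name string would go
cell = keras.Sequential([
    keras.Input(shape=(1,)),
    keras.layers.Dense(1, activation=tf.sin),
])
cell.layers[0].set_weights([np.array([[1.0]]), np.array([0.0])])

out = cell.predict(np.array([[np.pi / 2]]), verbose=0)
print(out)  # sin(pi/2) = 1
```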

Note that the activation can also be added to the model as a separate layer, like so:
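A sketch of the separate-layer form (weights pinned again so the output is predictable):

```python
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(1,)),
    keras.layers.Dense(1),            # linear by default
    keras.layers.Activation("tanh"),  # activation added as its own layer
])
model.layers[0].set_weights([np.array([[1.0]]), np.array([0.0])])

out = model.predict(np.array([[1.0]]), verbose=0)
print(out)  # tanh(1) ≈ 0.7616
```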

Finally, there are also advanced activation functions in Keras which exist for specific tasks. Though you can use them like any other activation function, they work particularly well for image-oriented learning. For instance, the parametric rectified linear unit (PReLU) was introduced in the work that surpassed human-level performance on ImageNet classification.

Picking the most appropriate activation functions

The activation functions seem only slightly different, but they actually make a big difference. In the example below we have some artificial data consisting of a line with a bump in the middle, together with some noise, and we try to make the neural network learn the shape of the curve. You can see from the plot below that relu does a much worse job than the hyperbolic tangent activation.

How can one optimize this and pick the most appropriate activation? You can loop over the various activations, or use the sklearn wrapper for Keras, which allows you to use Keras networks as machine learning models in sklearn.
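A sketch of the looping route (the bump data, layer sizes and epoch counts are made up for illustration); the sklearn route wraps the same build function in the KerasRegressor wrapper and hands the candidate activations to GridSearchCV:

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200).reshape(-1, 1)
# a line with a bump in the middle, plus noise
y = 0.5 * x + np.exp(-(x / 0.2) ** 2) + rng.normal(0, 0.05, x.shape)

def build(activation):
    model = keras.Sequential([
        keras.Input(shape=(1,)),
        keras.layers.Dense(16, activation=activation),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

scores = {}
for act in ["relu", "tanh", "sigmoid", "linear"]:
    model = build(act)
    model.fit(x, y, epochs=150, verbose=0)
    scores[act] = model.evaluate(x, y, verbose=0)

print(sorted(scores.items(), key=lambda kv: kv[1]))  # lowest loss first
```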

The scores can be seen from grid_scores_ (renamed cv_results_ in scikit-learn 0.18 and later):


You can refine the network further by grid-searching the appropriate optimizer, loss function and pretty much every other parameter (including the weights). Ain’t it wonderful that you can combine Keras and Scikit-learn? Jason Brownlee has a great blog post on how to do all of this.

The neural networks (especially the low-loss ones) approximate the synthetic function quite well and can see through the superimposed noise. One could of course filter out the noise in other ways (chi-square tests or moving averages), but the fact that the network does this without it being explicitly encoded is a nice feature on its own.

Cellular automata

Cellular automata are in a way primitive neural networks in the sense that they encapsulate state machines, which can also be found inside e.g. LSTM nodes. From another angle, a cellular automaton is just a (discrete) function and, like any other function, it can be mimicked or approximated by neural nets. Rule 30, used below, is a world on its own, and one could probably find interesting morphisms (same category?) between the world of automata and the world of neural networks.
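For reference, rule 30 itself is a one-liner per step; a minimal sketch with periodic boundaries (an assumption for simplicity):

```python
import numpy as np

def rule30_step(row):
    """One step of rule 30: new cell = left XOR (center OR right)."""
    left, right = np.roll(row, 1), np.roll(row, -1)
    return left ^ (row | right)

def rule30(width=101, steps=50):
    grid = np.zeros((steps, width), dtype=int)
    grid[0, width // 2] = 1  # single seed cell in the middle
    for t in range(1, steps):
        grid[t] = rule30_step(grid[t - 1])
    return grid

grid = rule30()
# the first rows grow as 1, 111, 11001, ... (the familiar rule 30 triangle)
```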

Let’s try to use the data to train a dense network. Note that we define a custom layer to output bits and that it’s really easy to add your own modules or layers. Like above, there are other ways to truncate data but this shows how you can plug into the API.

It’s clear that this approach is not successful. One way to proceed would be to engage recurrent or convolutional networks. The other is to model the actual rule and not the instances produced by the rule.
The following is a straightforward function-prediction model with 100% accuracy.

Reuters classification

In this last example we pick up the Reuters data which has been preprocessed (as part of the Keras framework). The original data consists of paragraphs, but the words have already been mapped to integer indices, and you can immediately extract training and test data.

A word about dropout. This is a way to regularize networks and suppress overfitting. It effectively switches off some of the neurons (in a random fashion) so that feedback does not affect all of the neurons all the time. Typically you will see that dropping half of the nodes is a common approach. Like everything else, some experimentation reveals what works best for your data and your aims.
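A hedged sketch of such a network (the vocabulary size and the 46 topic classes match the Keras Reuters defaults; the layer width and the dropout placement are illustrative):

```python
from tensorflow import keras

VOCAB, CLASSES = 10000, 46  # Keras Reuters defaults

model = keras.Sequential([
    keras.Input(shape=(VOCAB,)),             # multi-hot encoded paragraphs
    keras.layers.Dense(512, activation="relu"),
    keras.layers.Dropout(0.5),               # randomly silence half the units
    keras.layers.Dense(CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```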

This gives around 80% accuracy in very little time (a couple of minutes). If you try the same dataset with XGBoost it will take quite a long time (around 10 minutes) to reach the same 80% accuracy:

For sure one can tune both approaches, but it shows that neural nets are not necessarily data- and processing-hungry in all cases, and that neural networks are easy to play with. At least, if you use a framework like Keras.