This is another straightforward example of deep learning in Keras. Let's summarize what is going on here:

  • IMDB reviews are bits of text consisting of words (duh!), but all these words are converted to numbers. A word's number corresponds to its frequency rank, so a small number means the word occurs more often than one with a large number. The small sketch after this list shows what such encoded reviews look like and how they are padded.
  • reviews have different lengths, so they are all padded or truncated to the same length (here 500 words)
  • a part of the reviews is set aside for testing (here 33% of the whole)
  • only the 5000 most frequent words are kept; rarer words are dropped
  • each word is hence effectively a one-hot vector of size 5000, which is large and sparse, so an embedding is applied that maps every word to a dense vector of size 32. Word embeddings are a topic of their own and you should have a look at the GloVe site, for instance.
  • the one-dimensional convolution is well explained in this article
  • the max-pooling picks the maximum out of a neighborhood and is a way to reduce the dimension; here the pool covers the whole sequence, leaving one value per filter
  • flattening a matrix is simply putting the rows of the matrix one after another
  • the dense layers are the usual deep learning layers whose weights are trained. Note that we want a yes/no classification, so the last dense layer has dimension one (with a sigmoid activation).
  • the whole network reaches a modest accuracy of about 87% but is trained in about a minute, which is altogether not too bad

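To make the first few bullets concrete, here is a minimal sketch (toy sequences rather than actual IMDB reviews, with arbitrary small sizes, using the same Keras 1.x API as the script below) of the integer encoding, the padding and the shape produced by each layer:

from keras.preprocessing import sequence as prep
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.layers.convolutional import Convolution1D, MaxPooling1D
from keras.layers.embeddings import Embedding

# two "reviews" of different lengths, already converted to word indices
toy_reviews = [[14, 22, 3, 51], [2, 8]]
padded = prep.pad_sequences(toy_reviews, maxlen=6)  # pad (or truncate) to length 6
print(padded)
# [[ 0  0 14 22  3 51]
#  [ 0  0  0  0  2  8]]

# a scaled-down version of the network below: vocabulary of 100 words, reviews of length 6
toy = Sequential()
toy.add(Embedding(100, 32, input_length=6))          # each word index -> 32-dim vector: (6, 32)
toy.add(Convolution1D(nb_filter=32, filter_length=3,
                      border_mode='same', activation='relu'))  # keeps the length: (6, 32)
toy.add(MaxPooling1D(pool_length=6))                 # one maximum per filter: (1, 32)
toy.add(Flatten())                                   # (32,)
toy.add(Dense(1, activation='sigmoid'))              # yes/no probability: (1,)
print(toy.output_shape)  # (None, 1)
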
The beauty of Keras really lies in how easy it is to add or change things (a small sketch after the script below shows how one layer can be swapped for another). It runs on top of Theano or TensorFlow, so you get a truly delicious framework with everything you could wish for.

import numpy as np
import os
os.environ['THEANO_FLAGS'] = "device=gpu"
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.convolutional import Convolution1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence as prep
import time
# fix random seed for reproducibility
seed = 7
np.random.seed(seed)
# load the dataset but only keep the top n words, zero the rest
top_words = 5000
test_split = 0.33
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words, test_split=test_split)
# pad (or truncate) each review to a maximum length in words
max_words = 500
X_train = prep.pad_sequences(X_train, maxlen=max_words)
X_test = prep.pad_sequences(X_test, maxlen=max_words)
# create the model
model = Sequential()
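# map each of the 500 word indices to a dense 32-dimensional vector -> output shape (500, 32)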
model.add(Embedding(top_words, 32, input_length=max_words))
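# 32 convolution filters of length 3; 'same' padding keeps the sequence length at 500 -> (500, 32)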
model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
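# pool over the whole sequence: one maximum per filter -> (1, 32)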
model.add(MaxPooling1D(pool_length=max_words))
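# flatten to a 32-dimensional vector, then two dense layers; the final sigmoid gives the yes/no probability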
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
start = time.time()
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=5, batch_size=128, verbose=2)
print("> Training is done in %.2f seconds." % (time.time() - start))
scores = model.evaluate(X_test, y_test, verbose=2)
print("Accuracy: %.2f%%" % (scores[1] * 100))
# Accuracy: 86.87%
# gpu: 62s
# cpu: 58s
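
As an illustration of how easy it is to change things, here is a minimal sketch (not part of the script above, and untested here; the number of LSTM units is an arbitrary choice) that swaps the convolution and pooling layers for a recurrent layer:

from keras.layers.recurrent import LSTM

model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
# the LSTM reads the 500 embedded words in order and returns its last state,
# replacing Convolution1D + MaxPooling1D + Flatten
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])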