This is another straightforward example of deep learning in Keras. Let's summarize what is going on here:
- IMDB reviews are bits of text consisting of words (duh!), but all these words are converted to integers. The integer assigned to a word corresponds to its frequency rank in the corpus, so a small number means the word occurs more often than a word with a large number (see the decoding sketch after this list).
- reviews have different lengths, so everything is brought to the same length (here 500): longer reviews are truncated and shorter ones are zero-padded (see the padding example after this list)
- part of the reviews is held out for testing (here 33% of the whole dataset)
- only the 5000 most frequent words are kept in the vocabulary; everything else is dropped
- each word is hence effectively a one-hot vector of size 5000, which is large and sparse, so an embedding layer is applied that maps every word to a dense vector of size 32 (a lookup-table sketch follows the list). Word embeddings are a topic of their own and you should have a look at the GloVe site, for instance.
- the one-dimensional convolution is well explained in this article
- max-pooling picks the maximum out of a neighborhood and is a way to reduce the dimension (a small numerical example follows the list)
- flattening a matrix is simply putting the rows of the matrix one after another
- the dense layers are the usual fully connected deep-learning layers whose weights are trained. Note that we want a yes/no classification, so the last dense layer has dimension one and a sigmoid activation.
- the whole network reaches only a modest accuracy (about 87%) but is trained in about a minute, which is altogether not too bad
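As a small illustration of the first bullet, here is a minimal sketch (not from the original post) that decodes a review back into words using `imdb.get_word_index()`; note that `load_data` offsets the word indices by 3 by default, so the mapping is shifted before the lookup.

```python
from keras.datasets import imdb

# word -> frequency rank: frequent words such as 'the' get small indices
word_index = imdb.get_word_index()
print(word_index['the'])

# reverse the mapping; load_data reserves indices 0-2 and shifts words by 3
index_word = {i + 3: w for w, i in word_index.items()}

(X_train, y_train), _ = imdb.load_data(nb_words=5000)
print(' '.join(index_word.get(i, '?') for i in X_train[0][:20]))
```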
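The padding step boils down to `pad_sequences`, which truncates long sequences and left-pads short ones with zeros; a quick sketch with made-up sequences:

```python
from keras.preprocessing import sequence as prep

seqs = [[1, 2, 3],
        [4, 5, 6, 7, 8, 9]]
print(prep.pad_sequences(seqs, maxlen=5))
# [[0 0 1 2 3]    <- short review padded with zeros
#  [5 6 7 8 9]]   <- long review truncated (from the front, by default)
```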
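The embedding mentioned above is nothing more than a trainable lookup table of shape (vocabulary size, embedding dimension). The following NumPy sketch, with a random table standing in for the learned weights, shows the shape bookkeeping:

```python
import numpy as np

vocab_size, embed_dim = 5000, 32
table = np.random.randn(vocab_size, embed_dim)  # stands in for the learned embedding weights

review = np.array([14, 22, 930, 4])             # a (very short) review as word indices
embedded = table[review]                        # one 32-dimensional vector per word
print(embedded.shape)                           # (4, 32)
```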
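Max-pooling and flattening are equally simple operations; here is a NumPy sketch of both (the numbers are made up):

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0, 8.0, 5.0, 4.0])    # one feature channel over six positions
window = 3
pooled = np.array([x[i:i + window].max() for i in range(0, len(x), window)])
print(pooled)                                   # [3. 8.] -- the maximum of each neighborhood

m = np.arange(6).reshape(2, 3)                  # flattening lays the rows end to end
print(m.flatten())                              # [0 1 2 3 4 5]
```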
The beauty of Keras really resides in how easy it is to add or change things. It runs on top of Theano and TensorFlow, so you get a delicious framework with everything you could wish for.
```python
import os
os.environ['THEANO_FLAGS'] = "device=gpu"

import time
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.convolutional import Convolution1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence as prep

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

# load the dataset but only keep the top n words, zero the rest
top_words = 5000
test_split = 0.33
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words, test_split=test_split)

# pad/truncate every review to a maximum length of 500 words
max_words = 500
X_train = prep.pad_sequences(X_train, maxlen=max_words)
X_test = prep.pad_sequences(X_test, maxlen=max_words)

# create the model: embedding -> 1D convolution -> max-pooling -> flatten -> dense layers
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=max_words))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

# train and time the model
start = time.time()
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=5, batch_size=128, verbose=2)
print("> Training is done in %.2f seconds." % (time.time() - start))

# evaluate on the held-out test set
scores = model.evaluate(X_test, y_test, verbose=2)
print("Accuracy: %.2f%%" % (scores[1] * 100))
# Accuracy: 86.87%
# gpu: 62s
# cpu: 58s
```