Updated August 2019 for TensorFlow 2.0 RC0
You can find datasets in many places, especially on Kaggle. They usually come as separate ‘positive’ and ‘negative’ parts, but sometimes as a single set with a column denoting the sentiment. If you don’t have two sets like this, NLTK has it all, but you need to assemble things a bit because the reviews come as separate files.
Easy enough, use something like this to compile the separate files into just two files:
```python
from sklearn.datasets import load_files

moviedir = r'/Users/You/nltk_data/corpora/movie_reviews'
movie_train = load_files(moviedir, shuffle=True)

pos = ""
neg = ""
for i, item in enumerate(movie_train.data):
    if movie_train.target[i] == 0:  # negative review
        neg += item.decode("utf-8").replace("\n", " ") + "\n"
    else:
        pos += item.decode("utf-8").replace("\n", " ") + "\n"

with open('/Users/You/desktop/positive.txt', 'wt') as f:
    f.write(pos)
with open('/Users/You/desktop/negative.txt', 'wt') as f:
    f.write(neg)
```
Of course, you need some packages. Note that this example is based on TensorFlow 2.0 RC0.
```python
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import numpy as np
import random
import pickle
from collections import Counter

# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras

# Helper libraries
import matplotlib.pyplot as plt

print(tf.__version__)
```

```
2.0.0-rc0
```
Normally you would use an embedding layer (with pretrained vectors like GloVe), but let’s approach things in a simplistic fashion here. For every review we’ll create one big vector with an entry for every word that appears more than 50 times across the reviews.
Lemmatization is the process of converting a word to its base form. This is where you need NLTK to perform this conversion.
```python
lemmatizer = WordNetLemmatizer()
max_lines = 10000000
pos = '/Users/swa/Desktop/LargeFiles/MovieReviews/positive.txt'
neg = '/Users/swa/Desktop/LargeFiles/MovieReviews/negative.txt'

def create_lexicon(pos, neg):
    '''
    Returns a vector with the most important words
    for the given positive and negative reviews.
    '''
    lexicon = []
    for fi in [pos, neg]:
        with open(fi, 'r') as f:
            contents = f.readlines()
            for l in contents[:max_lines]:
                all_words = word_tokenize(l.lower())
                lexicon += list(all_words)
    lexicon = [lemmatizer.lemmatize(i) for i in lexicon]
    w_counts = Counter(lexicon)
    l2 = []  # vector with the words appearing more than 50 times
    for w in w_counts:
        if 1000 > w_counts[w] > 50:
            l2.append(w)
    return l2
```
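The frequency filter at the end is easiest to see on a toy corpus. A minimal sketch (the sentence and thresholds here are made up for illustration; the real code uses `1000 > count > 50`):

```python
from collections import Counter

words = "the movie was great and the plot was great".split()
counts = Counter(words)

# Keep words that are frequent but not ubiquitous (here: a count of exactly 2).
kept = sorted(w for w in counts if 3 > counts[w] > 1)
print(kept)  # → ['great', 'the', 'was']
```

The upper bound serves the same purpose as a stop-word list: very common words like ‘the’ carry little sentiment. In this tiny example ‘the’ survives only because the corpus is too small for the bound to bite.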
With the lexicon in hand, we now convert each review to a vector of the size of the lexicon, counting for each word how often it appears in the review.
This is a poor man’s way of embedding text in a low-dimensional vector space. The simplification is that we do not capture affinity between words; there is no statistical distribution or minimization involved in this simple algorithm.
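The counting idea can be sketched on a toy lexicon (the words here are invented for illustration, not taken from the movie-review data):

```python
import numpy as np

lexicon = ['good', 'bad', 'plot', 'actor']  # hypothetical 4-word lexicon
review = "good plot good actor".split()

features = np.zeros(len(lexicon))
for word in review:
    if word in lexicon:
        features[lexicon.index(word)] += 1

print(features.tolist())  # → [2.0, 0.0, 1.0, 1.0]
```

Each review becomes a fixed-length count vector, regardless of its original length; words outside the lexicon are simply dropped.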
```python
def create_embedding(sample, lexicon, classification):
    '''
    Returns a lexicon-sized vector for each review.
    '''
    featureset = []
    with open(sample, 'r') as f:
        contents = f.readlines()
        for l in contents[:max_lines]:
            current_words = word_tokenize(l.lower())
            current_words = [lemmatizer.lemmatize(i) for i in current_words]
            features = np.zeros(len(lexicon))
            for word in current_words:
                if word.lower() in lexicon:
                    index_value = lexicon.index(word.lower())
                    features[index_value] += 1
            features = list(features)
            featureset.append([features, classification])
    return featureset
```
Now we apply this to the dataset.
```python
lexicon = create_lexicon(pos, neg)
features = []
features += create_embedding(pos, lexicon, [1, 0])
features += create_embedding(neg, lexicon, [0, 1])
random.shuffle(features)
features = np.array(features)
```
We take 10% of the data for testing purposes and create NumPy arrays because that’s what TensorFlow expects. The switch between list and NumPy array is because of the ease with which you can subset things when you have an array.
```python
testing_size = int(0.1 * len(features))

X_train = np.array(list(features[:, 0][:-testing_size]))
y_train = np.array(list(features[:, 1][:-testing_size]))
X_test = np.array(list(features[:, 0][-testing_size:]))
y_test = np.array(list(features[:, 1][-testing_size:]))
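The `features[:, 0]` column trick is exactly why the list was turned into an array. A minimal sketch with two made-up (vector, label) pairs standing in for the real data:

```python
import numpy as np

# Two toy [feature_vector, label] pairs, like the `features` list built above.
pairs = [[[1, 0, 2], [1, 0]],
         [[0, 3, 1], [0, 1]]]
features = np.array(pairs, dtype=object)  # shape (2, 2): column 0 holds vectors, column 1 labels

X = np.array(list(features[:, 0]))  # stack the feature vectors -> shape (2, 3)
y = np.array(list(features[:, 1]))  # stack the one-hot labels  -> shape (2, 2)
```

With a plain Python list you cannot slice out a column like this; with the object array it is one expression, and `np.array(list(...))` then stacks the ragged column into a proper 2-D numeric array.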
Each vector in the sets has the dimension of the lexicon:
```python
assert X_train.shape[1] == len(lexicon)
```
From this point on you recognize that AI is as much an art as it is a science. The way you assemble the network is where experience and insight show. For playing purposes, anything works.
```python
model = keras.Sequential([
    keras.layers.Dense(13, activation='relu', input_dim=len(lexicon)),
    keras.layers.Dense(10, activation='relu'),
    keras.layers.Dense(2, activation='sigmoid')
])
model.summary()
```

```
Model: "sequential_15"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_35 (Dense)             (None, 13)                30121
_________________________________________________________________
dense_36 (Dense)             (None, 10)                140
_________________________________________________________________
dense_37 (Dense)             (None, 2)                 22
=================================================================
Total params: 30,283
Trainable params: 30,283
Non-trainable params: 0
_________________________________________________________________
```

```python
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=5)
```

```
Train on 1800 samples
Epoch 1/5
1800/1800 [==============================] - 0s 65us/sample - loss: 0.0374 - accuracy: 0.9967
Epoch 2/5
1800/1800 [==============================] - 0s 55us/sample - loss: 0.0222 - accuracy: 0.9994
Epoch 3/5
1800/1800 [==============================] - 0s 51us/sample - loss: 0.0147 - accuracy: 1.0000
Epoch 4/5
1800/1800 [==============================] - 0s 50us/sample - loss: 0.0102 - accuracy: 1.0000
Epoch 5/5
1800/1800 [==============================] - 0s 50us/sample - loss: 0.0076 - accuracy: 1.0000
```

```python
results = model.evaluate(X_test, y_test, verbose=0)
print(results)
```

```
[0.604368144646287, 0.8225]
```

```python
history_dict = history.history
acc = history_dict['accuracy']
loss = history_dict['loss']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.title('Training loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
```