Updated August 2019 for TensorFlow 2.0 RC0

This is a recipe for classifying positive vs. negative sentiment in text using NLTK and TensorFlow 2.

You can find suitable datasets in many places, especially on Kaggle. They usually come as a 'positive' and a 'negative' part, but sometimes as a single set with a column denoting the sentiment. If you don't have two sets like this, NLTK has it all in its movie_reviews corpus, but you need to assemble things a bit because the reviews come as separate files.
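
If you have not used these NLTK resources before, you may need to download them once: the tokenizer models for word_tokenize, the WordNet data for the lemmatizer, and the movie_reviews corpus itself.

import nltk

# One-off downloads of the NLTK resources used in this recipe.
nltk.download('movie_reviews')  # the review corpus itself
nltk.download('punkt')          # models used by word_tokenize
nltk.download('wordnet')        # data used by the WordNetLemmatizer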

Easy enough: use something like this to compile the separate files into just two files:


from sklearn.datasets import load_files

# Load the NLTK movie_reviews corpus: one file per review, in pos/neg folders.
moviedir = r'/Users/You/nltk_data/corpora/movie_reviews'
movie_train = load_files(moviedir, shuffle=True)

# Concatenate the individual reviews into one line per review.
pos = ""
neg = ""
for i, item in enumerate(movie_train.data):
    if movie_train.target[i] == 0:  # 0 = negative, 1 = positive
        neg += item.decode("utf-8").replace("\n", " ") + "\n"
    else:
        pos += item.decode("utf-8").replace("\n", " ") + "\n"

with open('/Users/You/desktop/positive.txt', 'wt') as f:
    f.write(pos)
with open('/Users/You/desktop/negative.txt', 'wt') as f:
    f.write(neg)

Of course, you need some packages. Note that this example is based on TensorFlow 2.0 RC0.

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import random
from collections import Counter

# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras

# Helper libraries
import numpy as np
import matplotlib.pyplot as plt

print(tf.__version__)


2.0.0-rc0

Normally you would use an embedding layer (with pretrained vectors such as GloVe), but let's keep things simple here. For every review we'll create one big count vector, with an entry for every word that appears more than 50 times (and fewer than 1,000 times) across the reviews.
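
To make the idea concrete, here is a toy illustration with a made-up five-word lexicon (the real lexicon is built from the reviews below):

# Hypothetical mini-lexicon, purely for illustration.
toy_lexicon = ['movie', 'great', 'bad', 'plot', 'acting']
review = "great movie , great acting , bad plot"
# Count how often each lexicon word occurs in the review.
vector = [review.split().count(w) for w in toy_lexicon]
print(vector)  # [1, 2, 1, 1, 1]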

Lemmatization is the process of converting a word to its base form, and NLTK's WordNetLemmatizer performs this conversion.
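
For example (assuming the WordNet data has been downloaded as shown earlier):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('cats'))              # cat
print(lemmatizer.lemmatize('running', pos='v'))  # run
print(lemmatizer.lemmatize('better', pos='a'))   # good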

lemmatizer = WordNetLemmatizer()
max_lines = 10000000  # effectively no limit: process all reviews
# The two files assembled above.
pos = '/Users/You/desktop/positive.txt'
neg = '/Users/You/desktop/negative.txt'


def create_lexicon(pos, neg):
    '''
       Returns the list of (lemmatized) words that are common enough
       to serve as the dimensions of the review vectors.
    '''
    lexicon = []
    for fi in [pos, neg]:
        with open(fi, 'r') as f:
            contents = f.readlines()
            for l in contents[:max_lines]:
                all_words = word_tokenize(l.lower())
                lexicon += list(all_words)

    lexicon = [lemmatizer.lemmatize(i) for i in lexicon]
    w_counts = Counter(lexicon)

    # Keep the words appearing more than 50 but fewer than 1000 times;
    # this drops both rare words and stop-word-like noise.
    l2 = []
    for w in w_counts:
        if 1000 > w_counts[w] > 50:
            l2.append(w)
    return l2

With the lexicon in hand we turn each review into a vector of the size of the lexicon, counting for each word how often it appears in the review.
This is a poor man's way of embedding text in a low-dimensional vector space. The simplification is that we do not capture affinity between words: there is no statistical distribution or minimization involved in this simple algorithm.

def create_embedding(sample, lexicon, classification):
    '''
        Returns a [count-vector, label] pair for each review in the file.
    '''
    featureset = []
    with open(sample, 'r') as f:
        contents = f.readlines()
        for l in contents[:max_lines]:
            current_words = word_tokenize(l.lower())
            current_words = [lemmatizer.lemmatize(i) for i in current_words]
            features = np.zeros(len(lexicon))
            for word in current_words:
                if word.lower() in lexicon:
                    # Increment the count at this word's position in the lexicon.
                    index_value = lexicon.index(word.lower())
                    features[index_value] += 1

            features = list(features)
            featureset.append([features, classification])

    return featureset

Now we apply this to the whole dataset.

lexicon = create_lexicon(pos, neg)
features = []
features += create_embedding(pos, lexicon, [1, 0])  # positive reviews
features += create_embedding(neg, lexicon, [0, 1])  # negative reviews
random.shuffle(features)
features = np.array(features)

We set aside 10% of the data for testing and convert everything to NumPy arrays, because that is what TensorFlow expects. The switch from lists to NumPy arrays also makes it easy to slice out the feature and label columns.

testing_size = int(0.1*len(features)) 
X_train = np.array(list(features[:,0][:-testing_size])) 
y_train = np.array(list(features[:,1][:-testing_size])) 
X_test = np.array(list(features[:,0][-testing_size:])) 
y_test = np.array(list(features[:,1][-testing_size:]))

Each vector in the sets has the dimension of the lexicon:

assert(X_train.shape[1] == len(lexicon))

From this point on you will notice that AI is as much an art as it is a science. The way you assemble the network is where experience and insight show. For playing purposes, anything works.

model = keras.Sequential([
    keras.layers.Dense(13, activation='relu', input_dim=len(lexicon)),
    keras.layers.Dense(10, activation='relu'),
    keras.layers.Dense(2, activation='sigmoid')
])
model.summary()


Model: "sequential_15"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_35 (Dense)             (None, 13)                30121     
_________________________________________________________________
dense_36 (Dense)             (None, 10)                140       
_________________________________________________________________
dense_37 (Dense)             (None, 2)                 22        
=================================================================
Total params: 30,283
Trainable params: 30,283
Non-trainable params: 0
_________________________________________________________________




model.compile(optimizer='adam', 
              loss='binary_crossentropy',
              metrics=['accuracy'])




history = model.fit(X_train, y_train, epochs=5)


Train on 1800 samples
Epoch 1/5
1800/1800 [==============================] - 0s 65us/sample - loss: 0.0374 - accuracy: 0.9967
Epoch 2/5
1800/1800 [==============================] - 0s 55us/sample - loss: 0.0222 - accuracy: 0.9994
Epoch 3/5
1800/1800 [==============================] - 0s 51us/sample - loss: 0.0147 - accuracy: 1.0000
Epoch 4/5
1800/1800 [==============================] - 0s 50us/sample - loss: 0.0102 - accuracy: 1.0000
Epoch 5/5
1800/1800 [==============================] - 0s 50us/sample - loss: 0.0076 - accuracy: 1.0000
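
If you also want a validation curve to plot later, you could pass the test set (or a validation_split) to fit. A small variation, not what was run above:

# Optional: also track validation metrics during training.
history = model.fit(X_train, y_train,
                    epochs=5,
                    validation_data=(X_test, y_test))
# history.history then also contains 'val_loss' and 'val_accuracy'.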




results = model.evaluate(X_test, y_test, verbose=0)
print(results)


[0.604368144646287, 0.8225]
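
To actually use the trained model on new text, turn a review into the same lexicon-sized count vector and call predict. A minimal sketch, reusing the model, lexicon and lemmatizer from above (the classify helper and the sample review are made up for illustration):

def classify(review, model, lexicon):
    # Tokenize and lemmatize exactly as during training.
    words = [lemmatizer.lemmatize(w) for w in word_tokenize(review.lower())]
    features = np.zeros(len(lexicon))
    for word in words:
        if word in lexicon:
            features[lexicon.index(word)] += 1
    # The network outputs [positive, negative] scores, matching the [1,0]/[0,1] labels.
    scores = model.predict(np.array([features]))[0]
    return 'positive' if scores[0] > scores[1] else 'negative'

print(classify("A wonderful film with great acting.", model, lexicon))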




history_dict = history.history

acc = history_dict['accuracy']
loss = history_dict['loss']
epochs = range(1, len(acc) + 1)

# Only the training loss is available here, since no validation data was passed to fit.
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.title('Training loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

(Figure: training loss per epoch.)