# Analyzing sentiments

Updated August 2019 for TensorFlow 2.0 RC0

This is a recipe to analyze positive vs. negative sentiments in text using NLTK and TensorFlow v2.

You can find datasets in many places, especially on Kaggle. They usually come as separate ‘positive’ and ‘negative’ parts, but sometimes as a single set with a column denoting the sentiment. If you don’t have two sets like this, NLTK ships a movie review corpus, but you need to assemble things a bit because the reviews come as separate files.

That is easy enough. Something like this compiles the separate files into just two files:

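from sklearn.datasets import load_files

moviedir = r'/Users/You/nltk_data/corpora/movie_reviews'

# Assumption: movie_train comes from scikit-learn's load_files, which reads one
# sub-folder per class and sorts them alphabetically (neg = 0, pos = 1)
movie_train = load_files(moviedir, shuffle=True)

pos = ""
neg = ""
for i, item in enumerate(movie_train.data):
    # Put each review on a single line
    line = item.decode("utf-8").replace("\n", " ") + "\n"
    if movie_train.target[i] == 0:  # neg
        neg += line
    else:
        pos += line

with open('/Users/You/desktop/positive.txt', 'wt') as f:
    f.write(pos)
with open('/Users/You/desktop/negative.txt', 'wt') as f:
    f.write(neg)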
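
If your dataset comes as one file with a sentiment column rather than two separate sets, a split along these lines produces the same two one-review-per-line texts. This is a minimal sketch: the column names `review` and `sentiment`, the label values, and the helper name `split_by_sentiment` are assumptions made for illustration, not the layout of any particular Kaggle dataset.

```python
import csv
import io


def split_by_sentiment(csv_file, text_col="review", label_col="sentiment"):
    """Split rows of a labelled CSV into positive and negative text.

    Column names and the 'positive' label value are hypothetical;
    adjust them to the dataset at hand.
    """
    pos_lines, neg_lines = [], []
    for row in csv.DictReader(csv_file):
        # Keep every review on a single line
        line = row[text_col].replace("\n", " ")
        if row[label_col].strip().lower() == "positive":
            pos_lines.append(line)
        else:
            neg_lines.append(line)
    return "\n".join(pos_lines) + "\n", "\n".join(neg_lines) + "\n"


# Tiny in-memory sample standing in for a real file
sample = io.StringIO(
    "review,sentiment\n"
    "A wonderful film,positive\n"
    "Dull and far too long,negative\n"
)
pos_text, neg_text = split_by_sentiment(sample)
print(pos_text, neg_text)
```

The two returned strings can be written to `positive.txt` and `negative.txt` exactly like `pos` and `neg`.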


Of course, you need some packages. Note that this example is based on TensorFlow 2.0 RC0.

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import numpy as np
import random
import pickle
from collections import Counter

# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras

# Helper libraries
import matplotlib.pyplot as plt

print(tf.__version__)

2.0.0-rc0


Normally you would use an embedding layer (possibly with pretrained vectors such as GloVe), but let’s keep things simple here. For every review we’ll create one big vector with an entry for every word that appears more than 50 but fewer than 1000 times across the reviews.

Lemmatization is the process of converting a word to its base form (for example, ‘movies’ becomes ‘movie’). NLTK’s WordNetLemmatizer performs this conversion.

lemmatizer = WordNetLemmatizer()
max_lines = 10000000
pos = '/Users/swa/Desktop/LargeFiles/MovieReviews/positive.txt'
neg = '/Users/swa/Desktop/LargeFiles/MovieReviews/negative.txt'

def create_lexicon(pos, neg):
    '''
    Returns a vector with the most important words
    for the given positive and negative reviews.
    '''
    lexicon = []
    for fi in [pos, neg]:
        with open(fi, 'r') as f:
            contents = f.readlines()  # one review per line
            for l in contents[:max_lines]:
                all_words = word_tokenize(l.lower())
                lexicon += list(all_words)

    lexicon = [lemmatizer.lemmatize(i) for i in lexicon]
    w_counts = Counter(lexicon)

    l2 = []  # words appearing more than 50 but fewer than 1000 times
    for w in w_counts:
        if 1000 > w_counts[w] > 50:
            l2.append(w)
    return l2
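
To see what the frequency band does, here is a toy version of the same idea. It is a sketch only: it swaps `word_tokenize` and the lemmatizer for a plain `str.split` so it runs without NLTK data, and the thresholds `low` and `high` are hypothetical parameters scaled down to fit the tiny corpus.

```python
from collections import Counter


def toy_lexicon(lines, low=1, high=4):
    """Keep words whose total count lies strictly between low and high."""
    counts = Counter()
    for line in lines:
        # Plain whitespace split stands in for word_tokenize + lemmatize
        counts.update(line.lower().split())
    return sorted(w for w, c in counts.items() if low < c < high)


reviews = [
    "the plot is great",
    "the acting is great",
    "the ending is dull",
    "the cast is dull",
]
print(toy_lexicon(reviews))  # ['dull', 'great']
```

Words that occur in every review (‘the’, ‘is’) and words that occur only once drop out, which is the reasoning behind the 50 and 1000 cut-offs.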


With the lexicon in hand, we convert each review to a vector of the size of the lexicon, where each entry counts how often the corresponding word appears in the review.
This is a poor man’s way of embedding text in a low-dimensional vector space. The simplification is that we do not embed any affinity between words: there is no statistical distribution or minimization involved in this simple algorithm.

def create_embedding(sample, lexicon, classification):
    '''
    Returns a lexicon-sized vector for each review.
    '''
    featureset = []
    with open(sample, 'r') as f:
        contents = f.readlines()  # one review per line
        for l in contents[:max_lines]:
            current_words = word_tokenize(l.lower())
            current_words = [lemmatizer.lemmatize(i) for i in current_words]
            features = np.zeros(len(lexicon))
            for word in current_words:
                if word.lower() in lexicon:
                    # Count the word at its position in the lexicon
                    index_value = lexicon.index(word.lower())
                    features[index_value] += 1

            features = list(features)
            featureset.append([features, classification])

    return featureset
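
The count vector is easy to check by hand on a small example. The sketch below also shows one possible variation: it looks up positions in a word-to-index dictionary rather than calling `lexicon.index` for every word, which avoids scanning the list each time. The names `vectorize`, `small_lexicon` and `word_index` are illustrative, and tokenization is again reduced to `str.split`.

```python
def vectorize(text, word_index):
    """Return a list with one count per lexicon word."""
    features = [0] * len(word_index)
    for word in text.lower().split():
        position = word_index.get(word)  # None for out-of-lexicon words
        if position is not None:
            features[position] += 1
    return features


small_lexicon = ["dull", "great", "plot"]
# Map each word to its position once, so every lookup is a dict access
word_index = {word: i for i, word in enumerate(small_lexicon)}

print(vectorize("Great plot great cast", word_index))  # [0, 2, 1]
```

‘cast’ is not in the lexicon, so it is ignored, just as out-of-lexicon words are skipped in `create_embedding`.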


Now we apply this to the dataset.

lexicon = create_lexicon(pos,neg)
features = []
features += create_embedding(pos, lexicon,[1,0])
features += create_embedding(neg, lexicon,[0,1])
random.shuffle(features)
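# Each row pairs a feature list with a label list of a different length,
# so recent NumPy versions need an explicit object dtype here
features = np.array(features, dtype=object)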


We hold out 10% of the data for testing and create NumPy arrays because that’s what TensorFlow expects. The detour from list to NumPy array is there because arrays are much easier to slice.

testing_size = int(0.1*len(features))
X_train = np.array(list(features[:,0][:-testing_size]))
y_train = np.array(list(features[:,1][:-testing_size]))
X_test = np.array(list(features[:,0][-testing_size:]))
y_test = np.array(list(features[:,1][-testing_size:]))
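
An alternative that sidesteps the mixed feature/label array is to keep features and labels in two arrays from the start and shuffle them with a shared permutation. This is a sketch under that assumption, using small stand-in data; `split_train_test` is a hypothetical helper, not part of the recipe above.

```python
import numpy as np


def split_train_test(X, y, test_fraction=0.1, seed=0):
    """Shuffle X and y together and hold out the last test_fraction."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))  # same order for features and labels
    X, y = X[order], y[order]
    # Hold out at least one sample so the negative slices stay valid
    test_size = max(1, int(test_fraction * len(X)))
    return X[:-test_size], y[:-test_size], X[-test_size:], y[-test_size:]


# Stand-in data: 20 "reviews", 5 lexicon words, one-hot labels
X = np.arange(100, dtype=float).reshape(20, 5)
y = np.tile([[1, 0], [0, 1]], (10, 1))

X_train, y_train, X_test, y_test = split_train_test(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

Because both arrays are indexed with the same permutation, every review keeps its own label after shuffling.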


Each vector in the sets has the dimension of the lexicon:

assert(X_train.shape[1] == len(lexicon))


From this point on you recognize that AI is as much an art as it is a science. The way you assemble the network is where experience and insight show. For playing purposes, anything works.

model = keras.Sequential([
    keras.layers.Dense(13, activation='relu', input_dim=len(lexicon)),
    keras.layers.Dense(10, activation='relu'),
    keras.layers.Dense(2, activation='sigmoid')
])
model.summary()

Model: "sequential_15"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_35 (Dense)             (None, 13)                30121
_________________________________________________________________
dense_36 (Dense)             (None, 10)                140
_________________________________________________________________
dense_37 (Dense)             (None, 2)                 22
=================================================================
Total params: 30,283
Trainable params: 30,283
Non-trainable params: 0
_________________________________________________________________
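
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. The sketch below redoes that arithmetic. The lexicon size of 2316 is not stated in the text; it is inferred from the 30,121 parameters of the first layer, since 13 × (2316 + 1) = 30,121.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


lexicon_size = 2316  # inferred from the summary, not stated in the text
layer_sizes = [lexicon_size, 13, 10, 2]

counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # [30121, 140, 22] 30283
```

Nearly all of the parameters sit in the first layer, because that is where the lexicon-sized input is compressed to 13 units.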

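# The start of this call is missing in the original; optimizer='adam' is an assumption
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])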

history = model.fit(X_train, y_train, epochs=5)

Train on 1800 samples
Epoch 1/5
1800/1800 [==============================] - 0s 65us/sample - loss: 0.0374 - accuracy: 0.9967
Epoch 2/5
1800/1800 [==============================] - 0s 55us/sample - loss: 0.0222 - accuracy: 0.9994
Epoch 3/5
1800/1800 [==============================] - 0s 51us/sample - loss: 0.0147 - accuracy: 1.0000
Epoch 4/5
1800/1800 [==============================] - 0s 50us/sample - loss: 0.0102 - accuracy: 1.0000
Epoch 5/5
1800/1800 [==============================] - 0s 50us/sample - loss: 0.0076 - accuracy: 1.0000

results = model.evaluate(X_test, y_test, verbose=0)
print(results)

[0.604368144646287, 0.8225]

history_dict = history.history

import matplotlib.pyplot as plt

acc = history_dict['accuracy']

loss = history_dict['loss']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.title('Training loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

