Gist

This is a straightforward approach to mapping natural language to intents, represented as labels. The real effort lies in converting the natural-language input into vectors and training a network that can cope with arbitrary variations of the same request.

Mapping input to intents is a common approach in chatbots for selecting the appropriate response. For example, there are many ways to request a train ticket, but they all map to the same purchase procedure. This kind of learning attempts to cover all those variations.

If you need a solid intent-mapping solution, have a look at Rasa NLU.

First we’ll go through the somewhat easier task of using a limited vocabulary.

from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.core import Dense, Dropout, Flatten
from keras.layers import LSTM
from keras.preprocessing.text import Tokenizer
from keras.utils.np_utils import to_categorical
import numpy as np

The data we’ll use in the limited case is as follows:

train = ["What would it cost to travel to the city on Monday?",
         "Need to travel this afternoon",
         "I want to buy a ticket",
         "Can I order a trip?", 
         "I would like to buy a ticket to Brussels", 

         "What will be the weather tomorrow?",
         "Will it rain this afternoon?",
         "The sunshine feels great",
         "Can you predict rain?",
         "Guess I should wear a jacket hey!",

        "Dit is geheel iets anders",
         "Kan ik dit goed vinden",
         "Wat is dit soms goed",
        "Maar anders is soms goed"]

T = "Buy a train ticket"
W = "Asking about the weather"
F = "Babble in 't Vlaamsch"
labelsTrain = [T,
               T,
               T,
               T,
               T,

               W,
               W,
               W,
               W,
               W,

               F,
               F,
               F,
               F]

test = [
        "Do you think it will be sunny tomorrow?",
        "What a wonderful feeling in the sun!",
        "How can I travel to Leuven?",
        "Can I buy it from you?",
        "Anders is heel goed"
       ]
labelsTest = [W, W, T, T, F]

This data is constrained in the sense that the test questions only use words that also occur in the training material. In this case we can use the Keras Tokenizer class:

tokenizer = Tokenizer()
all_texts = train + test
tokenizer.fit_on_texts(all_texts)
# print(tokenizer.word_index)

X_train = tokenizer.texts_to_matrix(train)
X_test = tokenizer.texts_to_matrix(test)
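
It helps to check what the tokenizer actually produces: by default texts_to_matrix returns a binary bag-of-words matrix, one row per sentence and one column per vocabulary word. A quick, optional inspection:

print(X_train.shape)       # (14, len(tokenizer.word_index) + 1): 14 training sentences
print(np.unique(X_train))  # only 0s and 1s: each entry flags whether a word occurs in the sentence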

The labels are converted to one-hot vectors so we can use categorical cross-entropy.

all_labels = labelsTest + labelsTrain
labels = set(all_labels)
# note: a Python set has no guaranteed iteration order, so the label-to-index mapping can vary between runs
idx2labels = list(labels)
label2idx = dict((v, i) for i, v in enumerate(labels))

y_train = to_categorical([label2idx[w] for w in labelsTrain])
y_test = to_categorical([label2idx[w] for w in labelsTest])
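
For illustration only, the resulting mapping and targets can be inspected (the exact indices depend on the set iteration order noted above):

print(label2idx)      # e.g. {"Babble in 't Vlaamsch": 0, 'Buy a train ticket': 1, 'Asking about the weather': 2}
print(y_train.shape)  # (14, 3): one row per training sentence, one column per intent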

The following network is a direct variation of the examples you can find in the article Embedding and Tokenizer in Keras:

# vocab_size equals the number of columns produced by texts_to_matrix
vocab_size = len(tokenizer.word_index) + 1

model = Sequential()
# the input is a binary bag-of-words matrix, so only the values 0 and 1 occur,
# hence the embedding input dimension of 2
model.add(Embedding(2, 45, input_length=X_train.shape[1], dropout=0.2))
model.add(Flatten())
model.add(Dense(50, name='middle'))
model.add(Dropout(0.2))
model.add(Dense(3, activation='softmax', name='output'))

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

model.fit(X_train, y=y_train, nb_epoch=1500, verbose=0, validation_split=0.2, shuffle=True)

scores = model.evaluate(X_test, y_test, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
acc: 100.00%

Even without a sophisticated RNN you get all the accuracy you could wish for. Note that you still get a prediction for a sentence containing words that were never seen during training:

# "Welke dag is het vandaag?" is Dutch for "What day is it today?"
model.predict(tokenizer.texts_to_matrix(["Welke dag is het vandaag?"])).round()
array([[ 1.,  0.,  0.]], dtype=float32)

This indeed corresponds to Flemish, since the label order is

idx2labels
["Babble in 't Vlaamsch", 'Buy a train ticket', 'Asking about the weather']

Using pretrained word vectors (GloVe) and LSTM

Below you can find networks built on pretrained GloVe word vectors and an LSTM. Details can be found in the article Embedding and Tokenizer in Keras.
Why is the accuracy of these networks so low? There are a few factors:

  • the training data set is tiny (measure zero, for all practical purposes)
  • the pretrained embeddings are English-only, so the Flemish words are not embedded and hence cannot be learned (a quick coverage check after the embedding-matrix loop below makes this visible)
  • there is hardly any temporal structure to exploit and we are not predicting the next word, so an LSTM is not particularly meaningful here

Still, an accuracy of 20% where the plain dense network reaches 100% is surprising. It is clear evidence that ‘more’ is not automatically better in this field.

import os

embeddings_index = {}
# see here to download the pretrained model
# http://nlp.stanford.edu/projects/glove/
glove_data = os.path.expanduser('~/Desktop/AIML/Glove/glove.6B.50d.txt')  # open() does not expand '~'
with open(glove_data) as f:
    for line in f:
        values = line.split()
        word = values[0]
        value = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = value

print('Loaded %s word vectors.' % len(embeddings_index))
Loaded 400000 word vectors.

embedding_dimension = 50
word_index = tokenizer.word_index
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dimension))

# words not found in the embedding index remain all-zeros in embedding_matrix
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector[:embedding_dimension]
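
As a quick illustration of the coverage problem mentioned in the list above, one can count how many vocabulary words actually received a pretrained vector; the Flemish words are absent from GloVe, so their rows in embedding_matrix stay all-zeros:

covered = sum(1 for w in word_index if w in embeddings_index)
print('%d of %d words have a pretrained GloVe vector' % (covered, len(word_index)))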

embedding_layer = Embedding(embedding_matrix.shape[0],
                            embedding_matrix.shape[1],
                            weights=[embedding_matrix],
                            name='w2v_embedding',  # layer names should not contain spaces
                            input_length=len(word_index) + 1)

from keras.preprocessing.sequence import pad_sequences
# both the train and the test sentences need to be converted to padded index sequences
X_train = tokenizer.texts_to_sequences(train)
X_train = pad_sequences(X_train, maxlen=len(word_index) + 1)
X_test = tokenizer.texts_to_sequences(test)
X_test = pad_sequences(X_test, maxlen=len(word_index) + 1)

model = Sequential()
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(50, activation='sigmoid', name='middle_layer'))
model.layers[0].trainable = False  # set after adding the layer; works around a Keras/Theano issue
model.add(Dropout(0.2))
model.add(Dense(3, activation='softmax', name='output'))

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

model.fit(X_train, y=y_train, nb_epoch=2500, verbose=0, validation_split=0.2, shuffle=True)

scores = model.evaluate(X_test, y_test, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
acc: 40.00%

The same embedding layer, now feeding an LSTM instead of a flattened dense layer:

model = Sequential()
model.add(embedding_layer)
model.add(LSTM(128, dropout_W=0.2, dropout_U=0.2)) 
model.add(Dense(3, activation='softmax', name='output')) 

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train, y=y_train, nb_epoch=1000, verbose=0, validation_split=0.2, shuffle=True)

scores = model.evaluate(X_test, y_test, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

acc: 20.00%