How do you transform text into data for machine learning? Here are a few ideas and techniques.

Counting words (CountVectorizer)

Given a piece of text, collect all the distinct words and assign a different index to each. This gives a vocabulary through which you can map a new piece of text to a vector.

Consequences:

  • any permutation of words in a given sentence leads to the same vector
  • the number of times a word appears increases the weight of its entry
  • a word not in the vocabulary does not contribute

    from sklearn.feature_extraction.text import CountVectorizer
    # some text from which a vocabulary will be extracted
    text = ["Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away."]
    # create the transform
    vectorizer = CountVectorizer()
    # learn the vocabulary
    vectorizer.fit(text)
    # summarize
    print(vectorizer.vocabulary_)
    # encode something with the same vocabulary
    vector = vectorizer.transform(["Nothing is left of perfection"])
    # what does it look like?
    ar = vector.toarray()
    print(ar)
    # inverse transform
    print(" ".join(vectorizer.inverse_transform(ar)[0]))
    # any permutation will give the same vector
    print(vectorizer.transform(["Is left nothing of perfection?"]).toarray())
    # if not in the vocabulary it does not contribute
    print(vectorizer.transform(["What is time?"]).toarray())
    print(vectorizer.transform(["What is life?"]).toarray())
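
To see why permutations do not matter, here is a minimal sketch of the same bag-of-words idea in plain Python, reusing the vocabulary_ mapping learned above. The helper manual_count_vector is hypothetical and only approximates sklearn's tokenizer.

    from collections import Counter
    # rebuild the count vector by hand from the fitted vocabulary;
    # assumes `vectorizer` was fitted as in the snippet above
    def manual_count_vector(sentence, vocabulary):
        counts = Counter(sentence.lower().split())
        vec = [0] * len(vocabulary)
        for word, count in counts.items():
            if word in vocabulary:  # unknown words simply do not contribute
                vec[vocabulary[word]] = count
        return vec
    # word order does not matter: both calls yield the same vector
    print(manual_count_vector("nothing is left of perfection", vectorizer.vocabulary_))
    print(manual_count_vector("perfection of left is nothing", vectorizer.vocabulary_))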
    

TfIdf

TfIdf stands for term frequency–inverse document frequency. A term gets a weight proportional to how often it occurs in a document, but that weight is diminished if the term appears in many documents. For example, the string “the cupidity of man” contains the common words “the” and “of”, so their importance is in general less than that of the infrequent word “cupidity”. The per-document statistics are thus reweighted according to global properties of the whole corpus.

Consequences:

  • just like the count vectorizer, permutations do not affect the vector
  • rare, ‘important’ words get a larger weight than common ones
  • a word not in the vocabulary does not contribute

    from sklearn.feature_extraction.text import TfidfVectorizer
    # list of text documents
    text = ["Time is the essence of life.",
      "Time is life.",
      "The quantum mechanics of anyons."]
    # create the transform
    vectorizer = TfidfVectorizer()
    # tokenize and build vocab
    vectorizer.fit(text)
    # summarize
    print(vectorizer.vocabulary_)
    print(vectorizer.idf_)
    # encode document
    print(vectorizer.transform(["Quantum of life"]).toarray())
    print(vectorizer.transform(["Life of lions"]).toarray())
    print(vectorizer.transform(["Lions of life"]).toarray())
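
To make the weighting concrete, here is a small sketch that reproduces the idf_ values printed above, assuming TfidfVectorizer's default smooth_idf=True, for which idf(t) = ln((1 + n) / (1 + df(t))) + 1 with n the number of documents and df(t) the number of documents containing term t:

    import numpy as np
    # sklearn's own analyzer, so tokenization matches the fitted vectorizer
    analyzer = vectorizer.build_analyzer()
    # terms in column order
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    n = len(text)
    # document frequency of each term
    df = np.array([sum(term in analyzer(doc) for doc in text) for term in terms])
    # smoothed idf; should match vectorizer.idf_ above
    print(np.log((1 + n) / (1 + df)) + 1)

Note that the tf-idf vectors returned by transform are additionally normalized to unit length by default (norm='l2').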
    

Hashing

The previous approaches mean that the more words you consider, the bigger the vectors become, and also that the vectors become sparse. The hashing approach takes a fixed-length vector and manages to incorporate all words into that same-length vector. In producing fixed-length vectors it is reminiscent of the Word2Vec approach below, but it is much less sophisticated.

Advantages:

  • it has a very low memory footprint and scales to large datasets, as there is no need to store a vocabulary dictionary in memory
  • it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
  • it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.

Cons:

  • there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
  • there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
  • no IDF weighting as this would render the transformer stateful.

    from sklearn.feature_extraction.text import HashingVectorizer
    # list of text documents
    text = ["Before I got married I had six theories about bringing up children; now I have six children and no theories."]
    # create the transform
    vectorizer = HashingVectorizer(n_features=10)
    print(vectorizer.transform(["I got married."]).toarray())
    print(vectorizer.transform(["Married, I got."]).toarray())
    print(vectorizer.transform(["Married, I got on Monday."]).toarray())
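
For intuition, here is a rough sketch of the idea behind the hashing trick: hash each token into one of n_features buckets. The helper below is hypothetical and uses MD5; HashingVectorizer itself uses a signed MurmurHash3 and l2 normalization, so the exact numbers will differ.

    import hashlib
    import re
    # hash every token into a fixed number of buckets; collisions are possible
    def manual_hash_vector(sentence, n_features=10):
        vec = [0] * n_features
        for token in re.findall(r"\w+", sentence.lower()):
            digest = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
            vec[digest % n_features] += 1
        return vec
    # permutations land in the same buckets, so the vectors are identical
    print(manual_hash_vector("I got married."))
    print(manual_hash_vector("Married, I got."))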
    

Inside Keras there is also a hashing utility (hashing_trick) which works well.

from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'An essential component of an artist’s toolkit, we have vastly improved our visual effects features.'
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document
result = hashing_trick(text, round(vocab_size*1.3), hash_function='md5')
print(result)

Word2Vec

This method uses a shallow neural network to position word vectors in a vector space such that words sharing common contexts in the corpus lie close to one another. Of the approaches discussed here, it is the most subtle.

Pros:

  • perhaps the first scalable model that generated word embeddings for large corpora (millions of unique words); feed the model raw text and it outputs word vectors
  • it takes context into account

Cons:

  • word senses are not captured separately: a word like “cell”, which could mean “prison cell”, “biological cell”, “phone” and so on, is represented by a single vector

import spacy
nlp = spacy.load('en')
# read a raw text file (placeholder path) and run it through the spaCy pipeline
txt = open('Some_text_somewhere.txt', encoding='utf_8').read()
doc = nlp(txt)
# keep lowercased lemmas, dropping punctuation, whitespace and digits
words = [token.lemma_.lower() for token in doc
         if not (token.is_punct or token.is_space or token.is_digit)]
 
import gensim
# let X be a list of tokenized texts (i.e. list of lists of tokens)
model = gensim.models.Word2Vec([words], size=100)
w2v = dict(zip(model.wv.index2word, model.wv.syn0))
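
Once trained, the proximity structure can be queried directly. A quick sketch, assuming the words 'time' and 'life' occur often enough in whatever corpus you trained on to survive gensim's default min_count threshold (the neighbours you get depend entirely on that text):

# words whose vectors lie closest to a given word in the embedding space
print(model.wv.most_similar('time', topn=5))
# cosine similarity between two word vectors
print(model.wv.similarity('time', 'life'))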

One hot

If you use Keras to train networks, you can use some of its utilities as well. The one_hot function, for example, is sometimes useful for encoding text; note that despite the name it returns integer word indices via the hashing trick, not one-hot vectors.

from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'Designers can switch effortlessly between motion graphics and VFX, within a unified system built on the most intuitive particle software on the market.'
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document
result = one_hot(text, round(vocab_size))
print(result)
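
The integers above are word indices rather than one-hot vectors. A short sketch turning them into an actual one-hot matrix, one row per word, using keras.utils.to_categorical:

from keras.utils import to_categorical
# expand the integer indices into a one-hot matrix (one row per word);
# num_classes must cover every index produced above
one_hot_matrix = to_categorical(result, num_classes=vocab_size)
print(one_hot_matrix.shape)
print(one_hot_matrix)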

Keras text to matrix

Finally, if you want the shortest path between textual data and network input, you can use the texts_to_matrix utility of the Keras Tokenizer to quickly get vectors.

from keras.preprocessing.text import Tokenizer
# define 5 documents
docs = ['The time is now',
        'Time is life',
        'All will be well',
        'All in time',
        'Life is the essence of time.']

# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
# integer encode documents
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)
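
The mode argument controls the weighting. Besides 'count', texts_to_matrix also accepts 'binary', 'freq' and 'tfidf', so you can switch between the schemes discussed above without changing the rest of the pipeline:

# the same documents under the other weighting schemes
print(t.texts_to_matrix(docs, mode='binary'))
print(t.texts_to_matrix(docs, mode='freq'))
print(t.texts_to_matrix(docs, mode='tfidf'))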