How do you transform text into data for machine learning? Here are a few ideas and techniques.
Counting words (CountVectorizer)
Given a piece of text, collect all the distinct words and assign a different number to each. This gives a dictionary through which you can map a new text to a vector.
- any permutation of the words in a sentence leads to the same vector
- the number of times a word appears increases the weight of its entry
- a word not in the vocabulary does not contribute
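The second point above is easy to check directly: if a word occurs twice in the input, its entry in the count vector becomes 2. A minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

# fit a tiny vocabulary, then encode a text with a repeated word
vectorizer = CountVectorizer()
vectorizer.fit(["nothing is perfect"])

vector = vectorizer.transform(["nothing nothing is"]).toarray()
print(vector)

# the entry for the repeated word holds its count
count = vector[0][vectorizer.vocabulary_["nothing"]]
print(count)  # 2
```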
```python
from sklearn.feature_extraction.text import CountVectorizer

# some text from which a vocabulary will be extracted
text = ["Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away."]

# create the transform
vectorizer = CountVectorizer()

# learn the vocabulary
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)

# encode something with the same vocabulary
vector = vectorizer.transform(["Nothing is left of perfection"])

# what does it look like?
ar = vector.toarray()
print(ar)

# inverse transform (returns a list with one array of terms)
print(" ".join(vectorizer.inverse_transform(ar)[0]))

# any permutation will give the same vector
print(vectorizer.transform(["Is left nothing of perfection?"]).toarray())

# if not in the vocabulary it does not contribute
print(vectorizer.transform(["What is time?"]).toarray())
print(vectorizer.transform(["What is life?"]).toarray())
```
Term frequency (TfidfVectorizer)
TF-IDF stands for term frequency–inverse document frequency. A term gets a weight proportional to how often it occurs in a document, but that weight is diminished if the term appears in many documents. For example, the string “the cupidity of man” contains the common words “the” and “of”, so their importance is in general less than that of the infrequent word “cupidity”. The per-document statistics are thus reweighted according to global properties of the whole corpus.
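You can see this down-weighting directly in the learned IDF values. In the hypothetical mini-corpus below, “the” and “of” occur in every document while “cupidity” occurs in only one, so the rare word ends up with a larger IDF:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "the" and "of" appear in every document, "cupidity" in only one
corpus = [
    "the cupidity of man",
    "the nature of things",
    "the history of time",
]
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# look up the IDF weight of each word via the learned vocabulary
vocab = vectorizer.vocabulary_
idf = vectorizer.idf_
print(idf[vocab["cupidity"]])  # larger: the word is rare
print(idf[vocab["the"]])       # smaller: the word is everywhere
```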
- just like the count vectorizer, permutations do not affect the vector
- the weight of “important” words is taken into account
- a word not in the vocabulary does not contribute
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents
text = ["Time is the essence of life.",
        "Time is life.",
        "The quantum mechanics of anyons."]

# create the transform
vectorizer = TfidfVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

# encode documents
print(vectorizer.transform(["Quantum of life"]).toarray())
print(vectorizer.transform(["Life of lions"]).toarray())
print(vectorizer.transform(["Lions of life"]).toarray())
```
Hashing (HashingVectorizer)
The previous approaches mean that the more words you consider, the bigger the vector becomes, and the vectors also become sparse. The hashing approach takes a fixed-length vector and manages to incorporate all words into that same-length vector. This is similar to the Word2Vec approach but much less sophisticated.
- it is very low-memory and scales to large datasets, as there is no need to store a vocabulary dictionary in memory
- it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
- it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.
- there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
- there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
- there is no IDF weighting as this would render the transformer stateful.
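The collision point above can be made concrete with a deliberately tiny feature space. With eight distinct tokens hashed into four slots, the pigeonhole principle guarantees that at least two tokens share a feature index:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# a toy setting; real applications use something like n_features=2**18
vectorizer = HashingVectorizer(n_features=4)

words = ["one", "two", "three", "four", "five", "six", "seven", "eight"]

# feature index each word is hashed to
indices = [vectorizer.transform([w]).nonzero()[1][0] for w in words]
print(indices)  # at most 4 distinct values for 8 distinct words
```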
```python
from sklearn.feature_extraction.text import HashingVectorizer

# list of text documents
text = ["Before I got married I had six theories about bringing up children; now I have six children and no theories."]

# create the transform (no fit needed: the hasher is stateless)
vectorizer = HashingVectorizer(n_features=10)

print(vectorizer.transform(["I got married."]).toarray())
print(vectorizer.transform(["Married, I got."]).toarray())
print(vectorizer.transform(["Married, I got on Monday."]).toarray())
```
Inside Keras there is also a hashing utility which works well.
```python
from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence

# define the document
text = 'An essential component of an artist’s toolkit, we have vastly improved our visual effects features.'

# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)

# integer encode the document
result = hashing_trick(text, round(vocab_size * 1.3), hash_function='md5')
print(result)
```
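Under the hood the idea is simply to hash each word to an integer index in a fixed range. A minimal pure-Python sketch of the same mechanism (the helper `hashing_trick_md5` is our own, not part of Keras):

```python
import hashlib

def hashing_trick_md5(text, n):
    # hypothetical helper mirroring the hashing-trick idea:
    # each word is hashed to an integer index in the range [1, n)
    words = text.lower().split()
    return [int(hashlib.md5(w.encode()).hexdigest(), 16) % (n - 1) + 1
            for w in words]

result = hashing_trick_md5("an essential component of an artist", 20)
print(result)
```

The same word always hashes to the same index, so no vocabulary has to be stored; the price is that distinct words may collide.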
Word2Vec
Word2Vec uses shallow neural networks; word vectors are positioned in the vector space such that words sharing common contexts in the corpus are located close to one another. This makes it the most subtle of the methods discussed here.
- perhaps the first scalable model that generated word embeddings for large corpora (millions of unique words): feed the model raw text and it outputs word vectors
- it takes context into account
- word sense is not captured separately; for example, a word like “cell” that could mean “prison cell”, “biological cell”, or “phone” is represented by a single vector
```python
import spacy
import gensim

nlp = spacy.load('en')
txt = open('Some_text_somewhere.txt', encoding='utf_8').read()
doc = nlp(txt)
words = [token.lemma_.lower() for token in doc
         if not (token.is_punct | token.is_space | token.is_digit)]

# Word2Vec expects a list of tokenized texts (i.e. a list of lists of tokens)
model = gensim.models.Word2Vec([words], size=100)
w2v = dict(zip(model.wv.index2word, model.wv.syn0))
```
One-hot encoding (Keras)
If you use Keras to train networks, you can use some of its utilities as well. The one_hot encoder, for example, is sometimes useful to encode text as a vector.
```python
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence

# define the document
text = 'Designers can switch effortlessly between motion graphics and VFX, within a unified system built on the most intuitive particle software on the market.'

# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)

# integer encode the document
result = one_hot(text, round(vocab_size))
print(result)
```
Keras text to matrix
Finally, if you want the shortest path between textual data and network input, you can use the text-to-matrix utility in Keras to quickly get vectors.
```python
from keras.preprocessing.text import Tokenizer

# define 5 documents
docs = ['The time is now',
        'Time is life',
        'All will be well',
        'All in time',
        'Life is the essence of time.']

# create the tokenizer
t = Tokenizer()

# fit the tokenizer on the documents
t.fit_on_texts(docs)

# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)

# integer encode documents
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)
```