Code on Github

Natural language processing is, let’s be honest, seriously biased towards English, and if your customer happens to be in a region where English is not the main language you have an issue. Or an opportunity to articulate something new. In any case, you face a few challenges:

  • finding relevant data: preferably free to use and pre-tagged. A ton of PDFs will not do.
  • some language understanding: stop-words, sentence structure and such.
  • framework understanding: if your favorite NLP framework doesn’t contain pre-trained models for the language you have to dig a bit deeper to build things yourself.

In this article I’ll focus on Dutch since this happens to be one of the common languages here in Belgium. While this isn’t spoken by many (compared to, say, French or Spanish) it nevertheless

  • happens to have some support in frameworks like Gensim, SpaCy and NLTK
  • isn’t as challenging as Chinese, Greek or Bulgarian: Dutch uses the standard Latin alphabet
  • is not wildly different from English when considering things like lemmatization or sentence splitting.

The availability of (tagged) training data is a real issue, however, and Dutch is in this respect the prototypical case: it’s difficult to find quality tagged data. The emphasis here is on tagged and quality:

  • of course you have things like Gutenberg but books alone won’t get you very far if you are interested in extracting information out of data. You need labels, tags or something which relates to what you’re after. Machine (and human) learning is based on patterns in relationships.
  • unstructured text means you’ll first spend an eternity cleaning data before actually turning to the task at hand. A customer handing over a million PDFs and expecting you to magically extract everything they need is a common misunderstanding.

Now, assuming the data is present, you can start exploring various NLP tasks. For marketing purposes one can look at sentiment analysis for product placement or customer propensity. If you need to classify documents you can look at summarization or keyword extraction. Note that if your corpus is large you might benefit from things like Lucene rather than a homemade engine. For a real-time NLP engine (e.g. processing telephone calls) you can look at entity extraction, aka named entity recognition (NER).

Every NLP process is really based on the following recipe:

  • gather data, preferably tons of it. As many docs as possible.
  • clean the data: remove stop words, noise, irrelevant bits, whatnot
  • chunk the data in a way that suits your aims
  • label the parts in a way the learning step can use it: if you are interested in sentiments label paragraphs or words with some emotional coefficient, if you are after part-of-speech label words with POS tags etc.
  • find a way to convert words and labels to numbers: machine learning does not handle words or characters, only numbers. Use anything which makes it happen: word vectors, feature functions, counting (bag of words, a document-term matrix…); see the sketch just after this list
  • figure out what algorithm works best, use whatever framework to train a model
  • define test data or a way to test the accuracy of the model
  • optimize the accuracy through grid search or whatever works best
  • wrap the model in a consumable, say a REST service or server-less micro-service
  • keep trying to improve the model in whatever way you can
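
To make the ‘convert words to numbers’ step concrete, here is a minimal bag-of-words sketch using scikit-learn’s CountVectorizer; the two Dutch sentences are made up purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["de kat zit op de mat", "de hond ligt op de mat"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # the learned vocabulary (get_feature_names() on older scikit-learn)
print(X.toarray())                          # word counts per sentence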

Sometimes some of the steps can be skipped. For example, if you find tagged POS data you don’t need to clean/tag things yourself. Similarly, an NLP framework often has various algorithms at your disposal which fit perfectly. The devil is in the details, of course. Maybe NLTK is great for a small corpus but will not do if you need a Spark cluster to deal with petabytes. For a medium corpus you might need to develop your own out-of-core algorithm. Plenty of subtleties indeed, and the road to a good model is never linear.

In what follows I’ll focus on NER for Dutch input and show how you can train your own NER model. The recipe described above translates to:

  • the CoNLL 2002 conference in Taiwan produced the so-called CoNLL2002 corpus with a mix of Spanish and Dutch tagged data. This corpus ships with the NLTK framework and is easily accessible. A much larger corpus, the so-called SoNaR corpus, also exists: around 500 million words in roughly 60 GB compressed, but it is not free and demands an out-of-core approach different from the one we outline below.
  • cleaning the data is not necessary in this case because POS and IOB tags are present. If this is not the case in your project you have a major logistic challenge. Annotating text data is in many ways a hurdle.
  • the NLTK framework knows about Dutch stop-words and tokenization. This makes it easy to chunk the raw text into sentences and words.
  • the CoNLL2002 corpus contains IOB and POS labels. That doesn’t mean the info can be used as-is, however: feature extraction and engineering is an art of its own in any machine learning (ML) task.
  • we will use conditional random fields (CRF) to create a NER model. In terms of complexity a CRF sits between basic standard ML algorithms (say, SVM) and non-linear algorithms like LSTM neural networks. Just like an LSTM network, a CRF has knowledge about how neighbouring bits of data are related. It also has similarities with hidden Markov models (HMM) but generalizes the notion of dependency. Using an LSTM would engender a whole process of optimizing layers and transition functions. Note that I’m not saying a GRU or LSTM would not perform better than a CRF, only that they are more complex to optimize and to describe.
  • using scikit-learn’s grid search the accuracy is optimized across a hyperparameter domain
  • we’ll wrap the model in a simple REST service using Flask. This can be deployed on AWS or as a Docker container. In fact, you can find the Dockerfile for dockerization in the GitHub repo.

 

Conditional random fields in a nutshell

For a thorough overview see An introduction to conditional random fields by Charles Sutton and Andrew McCallum in Foundations and Trends in Machine Learning, Vol. 4, No. 4 (2011), 267–373. Here I’ll only skim the surface to give you an idea of how things work.

When dealing with textual data you need to find a way to convert text to numbers. A common approach is to use embeddings like word2vec or doc2vec. Another, or complementary, approach is to use feature functions: ad-hoc mappings from words to numbers. Say you have a sentence s with labelled words (w_i, \lambda_i), where the labels could be POS or IOB tags. A feature function f_j maps (w_i, \lambda_i) \mapsto f_j(w_i, \lambda_i) to some number. For example, a feature function could single out adjectives and return one when an adjective is found and zero otherwise. In general a feature function can take a window into account: it could look at the previous word or the next-next word as well. This is where it differs from an LSTM or HMM: a hidden Markov model only takes the current state into account, while an LSTM tries to remember things as a function of its defined window. So the feature function could be f_j(w_i, w_{i-1}, \lambda_i) if the previous word is included.
For a given sentence you can have multiple feature functions: one to pick up names of locations, one for names of organizations and so on. Each feature function gets a weight and for one sentence you obtain the sum

S(s, \lambda) = \sum_{ij} \rho_j f_j(w_i, w_{i-1}, \lambda_i)

specific to the labeling and the sentence under consideration. Now to turn this into a probability you use the usual softmax and get

p(s, \lambda) = \frac{1}{Z}\exp\big(-S(s, \lambda)\big)

with Z the partition function, i.e. the normalization which ensures that the probabilities sum up to one. The machine learning part consists of optimizing the weights \rho_j to maximize the probability of the correct labelings. This happens by means of gradient ascent and is similar to training a neural network. Of course, you need lots of sentences and feature functions which effectively capture what you are looking for.

Assuming the training returns optimal weights one can use (polynomial-time) dynamic programming algorithms to find the optimal labels, similar to the Viterbi algorithm for HMMs.
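
As a toy illustration (not the model we train below, just the idea), here is what a couple of hand-rolled feature functions and the weighted sum S(s, \lambda) could look like; the weights are chosen arbitrarily:

def f_adjective(word, prev_word, label):
    # fires when the current label marks an adjective
    return 1.0 if label == 'Adj' else 0.0

def f_capitalised_after_article(word, prev_word, label):
    # fires for capitalised words preceded by an article, a weak hint for named entities
    return 1.0 if word.istitle() and prev_word.lower() in ('de', 'het', 'een') else 0.0

def score(sentence, labels, weights):
    # S(s, \lambda) = \sum_{ij} \rho_j f_j(w_i, w_{i-1}, \lambda_i)
    feature_functions = [f_adjective, f_capitalised_after_article]
    total = 0.0
    for i, (word, label) in enumerate(zip(sentence, labels)):
        prev_word = sentence[i - 1] if i > 0 else ''
        for rho, f in zip(weights, feature_functions):
            total += rho * f(word, prev_word, label)
    return total

words  = ['de', 'tekst', 'is', 'nog', 'niet', 'schriftelijk', 'beschikbaar']
labels = ['Art', 'N', 'V', 'Adv', 'Adv', 'Adj', 'Adj']
print(score(words, labels, weights=[0.3, 1.2]))  # 0.6 with these arbitrary weights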

NLTK in a nutshell

The natural language toolkit is ideal for experimenting with NLP. It’s probably not the tool you’d use for large-scale text processing, but that’s another topic. It contains all you need to experiment with text, and there are many subtleties you need to be aware of when looking at something other than English.

  • splitting documents into useful paragraphs or blocks is usually outside the scope of NLTK. If you want to split stanzas in poems you will have to look at line separations. If you want to extract addresses out of Word documents you will have to find appropriate ways to delete obsolete parts or look at markers which define the beginning and end of the blocks.
  • splitting paragraphs into sentences is language dependent. This might be a surprise since you could naively assume that splitting at ‘.?!’ is all you need to do. Things like ‘Ph.D.’ (English) and ‘dhr.’ (Dutch) spoil the fun however. Language-specific sentence splitting is not too difficult using NLTK’s Punkt trainer and we’ll highlight the procedure below.
  • splitting sentences in words is also language dependent. The Dutch ‘s avonds is one word but in English the ‘s will be considered as the word ‘is’ and hence a word on its own. Word tokenizing is hence something which has to be trained as well. Here again, there are tools and open source projects which can help you. The problem is usually finding quality data to train the tokenizer.
  • removing punctuation is the easy part and can often even be done with regular expressions
  • removing stop-words is also easy since Dutch, like English, has a limited set of stop words and NLTK actually contains them as a resource
  • verbs and tenses: the proliferation of the same thing in many shapes. The process of normalizing words to a common root shape is called lemmatization or stemming (the difference is subtle). NLTK can help you with this but, like anything else, maybe you need to train your own model. For example, using the 1611 King James Bible with the current English stemmer will not give the expected result. Dialects (the difference between Flemish and Dutch for instance) can also inject mistakes.

At the end of this article you will find a collection of code snippets which show you how NLTK deals with the aspects enumerated above.

 

Named entities

With all of this contextual info out of the way we can focus on training a model for entity recognition.

Named entities are recognized through POS tags and so-called IOB (aka BIO) tags. The IOB tags indicate whether a word is of a particular type (organization, person etc.). The NLTK conll2002 corpus has all you need:

list(nltk.corpus.conll2002.iob_sents('ned.train'))[:1]
[[('De', 'Art', 'O'),
      ('tekst', 'N', 'O'),
      ('van', 'Prep', 'O'),
      ('het', 'Art', 'O'),
      ('arrest', 'N', 'O'),
      ('is', 'V', 'O'),
      ('nog', 'Adv', 'O'),
      ('niet', 'Adv', 'O'),
      ('schriftelijk', 'Adj', 'O'),
      ('beschikbaar', 'Adj', 'O'),
      ('maar', 'Conj', 'O'),
      ('het', 'Art', 'O'),
      ('bericht', 'N', 'O'),
      ('werd', 'V', 'O'),
      ('alvast', 'Adv', 'O'),
      ('bekendgemaakt', 'V', 'O'),
      ('door', 'Prep', 'O'),
      ('een', 'Art', 'O'),
      ('communicatiebureau', 'N', 'O'),
      ('dat', 'Conj', 'O'),
      ('Floralux', 'N', 'B-ORG'),
      ('inhuurde', 'V', 'O'),
      ('.', 'Punc', 'O')]]

The ‘B-ORG’ indicates that the word ‘Floralux’ is an organization. With this type of info one can train a model to recognize unseen sentences. But if recognition is based on IOB tags, how can you use a normal (i.e. untagged) sentence with the model? You need to train a tagger which learns how to attach those tags. To this end you can clone the NLTK-trainer project on GitHub, which, in its own words, ‘…trains NLTK objects with zero code…’.

This training is a machine learning project on its own but if you are in a hurry, all you need to do is run:

 python train_chunker.py conll2002 --fileids ned.train --classifier NaiveBayes --filename ~/nltk_data/chunkers/conll2002_ned_NaiveBayes.pickle

This will create a pickled model which tags arbitrary Dutch sentences with IOB tags. This tagged array can thereafter be used with the NER model we will build.

If you are not in a hurry you should replace the ‘NaiveBayes’ classifier in the instruction above with ‘DecisionTree’. It will take around 15 minutes more but your tagger will be about 4% more accurate (something like 98% accuracy). Besides DecisionTree you can also experiment with Maxent, GIS, IIS, MEGAM and TADM. See the docs for more.
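
Once the pickle exists you can load it and tag a sentence. A minimal sketch, mirroring the tagger.tag call used in the Flask service further down (the exact interface depends on the NLTK-trainer version you cloned):

import os
import pickle

from nltk import word_tokenize

# path produced by the train_chunker.py command above
path = os.path.expanduser('~/nltk_data/chunkers/conll2002_ned_NaiveBayes.pickle')
with open(path, 'rb') as f:
    tagger = pickle.load(f)

# same call as in the REST service further down; returns tagged tokens for the sentence
print(tagger.tag(word_tokenize("Floralux huurde een communicatiebureau in.")))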

The NER training and testing data is easily extracted from the NLTK resources like so:

from itertools import chain

import nltk
import sklearn
import scipy.stats
from sklearn.metrics import make_scorer
# note: in recent scikit-learn versions these live in sklearn.model_selection
# instead of the old sklearn.cross_validation and sklearn.grid_search modules
from sklearn.model_selection import cross_val_score, RandomizedSearchCV

import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

train_sents = list(nltk.corpus.conll2002.iob_sents('ned.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('ned.testb'))

If you’d rather train a Spanish NER model you can replace ‘ned’ with ‘spa’ above.

Referring to the conditional random fields above, we observed that one can use a window for the feature functions. In the feature mapping below you can see how we look at the previous and the next word (a three-token window) and collect various bits of info as input for the CRF:

def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i - 1][0]
        postag1 = sent[i - 1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent) - 1:
        word1 = sent[i + 1][0]
        postag1 = sent[i + 1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features

With this feature extraction we assemble the actual data for the training and testing:

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    return [label for token, postag, label in sent]


def sent2tokens(sent):
    return [token for token, postag, label in sent]


X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

This type of manipulation is fairly standard if you have used scikit-learn before.
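
It’s worth peeking at what the feature extraction produces. For the first token of the first training sentence (‘De’, tagged ‘Art’) you should get something like:

X_train[0][0]
{'bias': 1.0,
 'word.lower()': 'de',
 'word[-3:]': 'De',
 'word[-2:]': 'De',
 'word.isupper()': False,
 'word.istitle()': True,
 'word.isdigit()': False,
 'postag': 'Art',
 'postag[:2]': 'Ar',
 'BOS': True,
 '+1:word.lower()': 'tekst',
 '+1:word.istitle()': False,
 '+1:word.isupper()': False,
 '+1:postag': 'N',
 '+1:postag[:2]': 'N'}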

Training then is as simple as this:

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

giving

CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=True, averaging=None, c=None, c1=0.1, c2=0.1,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=False)

This takes very little time but it’s not optimized. Optimization consists of tuning the hyperparameters of the algorithm; in the case of CRF, the c1 and c2 regularization parameters.
The sklearn framework has a wonderful (randomized) grid-search mechanism which allows you to automatically figure out which parameters maximize a metric. To use it you need:

  • to define the intervals or enumerations inside which the optimization has to search (the hyperparameter space)
  • the metric which tells the optimization what ‘better’ means

You can find plenty of docs and info around this topic.

params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

# use the same metric for evaluation
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted')

# search
rs = RandomizedSearchCV(crf, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=50,
                        scoring=f1_scorer)

rs.fit(X_train, y_train)

This will take a while. The optimized NER model ends up in rs.best_estimator_ (the search works on clones, so the crf instance fitted above is left untouched) and can be used independently of the training and everything we’ve done above.

Once the training returns you can save the model via:

from sklearn.externals import joblib  # on recent scikit-learn versions: import joblib
joblib.dump(rs.best_estimator_, 'Dutch.pkl')
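
Before wrapping the model it’s worth checking it against the held-out test set. A short sketch using the flat F1 score from sklearn_crfsuite.metrics, excluding the dominant ‘O’ tag:

best = rs.best_estimator_

# score on the held-out test set, ignoring the dominant 'O' tag
labels = list(best.classes_)
labels.remove('O')

y_pred = best.predict(X_test)
print(metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels))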

One important remark is in order. You can see that the whole training and testing data is loaded in memory. This approach obviously does not work with large datasets and there are so-called out-of-core algorithms which can help. This means however that you need to figure out how training can happen incrementally. Or you need a different approach, for example using MLlib with a Spark cluster which eventually demands some experience with Scala or PySpark.
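
To give an idea of what incremental training looks like, here is the generic scikit-learn out-of-core pattern with a stateless HashingVectorizer and partial_fit; note this is not a CRF, and iter_batches and all_classes are hypothetical placeholders:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless, so it never needs to hold the full corpus in memory
vectorizer = HashingVectorizer(n_features=2 ** 18)
clf = SGDClassifier()

# iter_batches and all_classes are hypothetical: a generator yielding (texts, labels)
# chunks from disk and the complete list of target classes known up front
for texts, labels in iter_batches():
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=all_classes)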

RESTifying the model

The trained model can be reused in a Python pipeline but it’s very likely that your architecture is heterogeneous and your consumer is not based on Python. One way to deploy the model is by means of a server-less AWS service.

Another way is to create a docker container with a REST service calling the model. Creating a REST service can be done via Bottle, Django, Flask or whatever your favorite framework is. In the repo you will find a Flask service along these lines:

import os
import pickle
from flask import Flask, jsonify
from nltk import word_tokenize
from sklearn.externals import joblib  # on recent scikit-learn versions: import joblib

app = Flask(__name__)
# load the CRF model saved earlier and the IOB chunker trained with NLTK-trainer;
# sent2features is the same helper defined in the training section above
crf = joblib.load('Dutch.pkl')
with open(os.path.expanduser('~/nltk_data/chunkers/conll2002_ned_NaiveBayes.pickle'), 'rb') as f:
    tagger = pickle.load(f)

@app.route('/ner/<sen>')
def ner(sen):
    tagged = [tagger.tag(word_tokenize(sen))]
    p = crf.predict([sent2features(s) for s in tagged])
    r = []
    e = tagged[0]
    for i in range(len(e)):
        tag = p[0][i]
        parts = tag.split("-")
        if len(parts) > 1:
            what = parts[1]
        else:
            what = parts[0]
        if what != "O":
            r.append({"word": e[i][0], "pos": e[i][1], "entity": what})
    return jsonify(r)

As advertised earlier, the only thing happening on this level is transforming natural language to a format the model expects. The NER service really is just a call to the ‘predict’ method.

In order to dockerize all this you need a basic Linux-with-Python image and some YAML files. You can find all of this in the repo. The only thing you need to do is call

    docker-compose up

in the directory where the YAML files reside. See the Docker Compose docs for more info.
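
Assuming the service ends up listening on Flask’s default port 5000 (check the compose file in the repo), a quick smoke test from Python could look like this:

from urllib.parse import quote
from urllib.request import urlopen

sentence = "Het bericht werd bekendgemaakt door een communicatiebureau dat Floralux inhuurde."
with urlopen("http://localhost:5000/ner/" + quote(sentence)) as response:
    print(response.read().decode("utf-8"))  # JSON list of recognized entities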

NLTK snippets

import nltk
# a corpus can be downloaded using nltk.download() which brings up a neat UI
from nltk.corpus import conll2002

Regarding Dutch there are a few key resources:

  • the CoNLL2002 corpus included with NLTK. It is, however, a mixture of Spanish and Dutch, so it’s vital to select the Dutch sentences only.
  • the Groningen Meaning Bank
  • the SoNaR corpus which seems to be the most complete one can find (500 million words!).

The conll2002 corpus contains both Spanish and Dutch so you need to filter out only the Dutch part, for example

for doc in conll2002.chunked_sents('ned.train')[:1]:
    print(doc)
(S
  De/Art
  tekst/N
  van/Prep
  het/Art
  arrest/N
  is/V
  nog/Adv
  niet/Adv
  schriftelijk/Adj
  beschikbaar/Adj
  maar/Conj
  het/Art
  bericht/N
  werd/V
  alvast/Adv
  bekendgemaakt/V
  door/Prep
  een/Art
  communicatiebureau/N
  dat/Conj
  (ORG Floralux/N)
  inhuurde/V
  ./Punc)

The sent_tokenize method uses English by default, so you need to override this:

from nltk.tokenize import sent_tokenize
raw = "Een goede reputatie is beter dan het duurste parfum. Dhr. Jansen heeft mevr. Vandaele gehuwd."
for sent in sent_tokenize(raw, language='dutch'):
     print(sent)
     print('--------')
Een goede reputatie is beter dan het duurste parfum.
--------
Dhr.
--------
Jansen heeft mevr.
--------
Vandaele gehuwd.
--------

This is not at all what you want, so you need to train your own tokenizer and explicitly add the abbreviations which should not split a sentence.

The PunktTrainer class is an unsupervised learner which can be used for this purpose:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
 
text = conll2002.raw("ned.train")

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True
trainer.train(text)
tokenizer = PunktSentenceTokenizer(trainer.get_params())

The abbreviations which are considered as not splitting a sentence can be obtained from

print(sorted(tokenizer._params.abbrev_types))
["'dr", "'mr", "'w", '1.ple', 'a.b', 'a.d', 'a.j', 'a.s', 'ami', 'ang', 'ant', 'ara', 'av', 'ave', 'b', 'b.b', 'banq', 'bap', 'blz', 'boa', 'br', 'bru', 'brut', 'burs', 'c', 'cai', 'calp', ...]

Let’s look at an example:

sentence = "Mr. Jansen vertelde aan Mevr. Vandaele het tragische verhaal."
tokenizer.tokenize(sentence)
['Mr.', 'Jansen vertelde aan Mevr. Vandaele het tragische verhaal.']

Not what you want. You can ask NLTK how splitting decisions are made:

from pprint import pprint

for decision in tokenizer.debug_decisions(sentence):
    pprint(decision)
    print('=' * 30)
{'break_decision': True,
 'collocation': False,
 'period_index': 2,
 'reason': 'default decision',
 'text': 'Mr. Jansen',
 'type1': 'mr.',
 'type1_in_abbrs': False,
 'type1_is_initial': False,
 'type2': 'jansen',
 'type2_is_sent_starter': False,
 'type2_ortho_contexts': {'UNK-UC'},
 'type2_ortho_heuristic': 'unknown'}
==============================
{'break_decision': None,
 'collocation': False,
 'period_index': 28,
 'reason': 'default decision',
 'text': 'Mevr. Vandaele',
 'type1': 'mevr.',
 'type1_in_abbrs': True,
 'type1_is_initial': False,
 'type2': 'vandaele',
 'type2_is_sent_starter': False,
 'type2_ortho_contexts': {'UNK-UC'},
 'type2_ortho_heuristic': 'unknown'}
==============================

Adding your own non-splitting tokens is now as simple as

tokenizer._params.abbrev_types.add('mr') 
tokenizer.tokenize(sentence)
['Mr. Jansen vertelde aan Mevr. Vandaele het tragische verhaal.']

which now correctly interprets the whole string as one sentence.
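
Since training the Punkt tokenizer on the full corpus takes a while, you probably want to persist it once it behaves; a minimal sketch using pickle (the file name is arbitrary):

import pickle

# save the tuned tokenizer, abbreviations included
with open('dutch_sentence_tokenizer.pickle', 'wb') as f:
    pickle.dump(tokenizer, f)

# later, in another process
with open('dutch_sentence_tokenizer.pickle', 'rb') as f:
    tokenizer = pickle.load(f)
print(tokenizer.tokenize(sentence))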

Word tokenization is similar to sentence splitting. Maybe you need to train your own tokenizer, maybe not. The default approach works sometimes:

from nltk import word_tokenize
word_tokenize(sentence)
['Mr.',
 'Jansen',
 'vertelde',
 'aan',
 'Mevr',
 '.',
 'Vandaele',
 'het',
 'tragische',
 'verhaal',
 '.']

The REPP parser can help if you need to run your own.

Removing punctuation can be implemented with regular expressions:

import re, string
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

def remove_punctuation(data):
    """
    Removes unwanted punctuation and digits from the given sentences.
    Args:
       data (list): a list of lists of words (one list per sentence).
    """
    cleaned = []
    ignored_punctuation = string.punctuation + '’'
    #see documentation here: http://docs.python.org/2/library/string.html
    regex = re.compile('[0-9%s]' % re.escape(ignored_punctuation))
    for sent in data:
        new_sent = []
        for word in sent: 
            new_token = regex.sub(u'', word)
            if not new_token == u'':
                new_sent.append(new_token)

        cleaned.append(new_sent)
    return cleaned

You can add whatever language-specific characters to the ignored_punctuation variable above.

Removing stop-words can be based on the predefined ones:

def remove_stopwords(stanzas):
    tokenized_docs_no_stopwords = []
    noise = stopwords.words('dutch')
    for doc in stanzas:
        new_term_vector = []
        for word in doc:
            if not word in noise:
                new_term_vector.append(word)
        tokenized_docs_no_stopwords.append(new_term_vector)
    return tokenized_docs_no_stopwords

So is normalization:

def normalize(stanzas):
    snowball = SnowballStemmer("dutch")
    result = []
    for doc in stanzas:
        final_doc = []
        for word in doc:        
            final_doc.append(snowball.stem(word))

        result.append(final_doc)
    return result

If you assemble the above snippets you can go from raw text to clean arrays of arrays of words, each array representing one sentence.

normalize(remove_punctuation([word_tokenize(sentence)]))
[['mr',
  'jans',
  'verteld',
  'aan',
  'mevr',
  'vandael',
  'het',
  'tragisch',
  'verhal']]

At this point you can start converting the arrays to numbers via word2vec, doc2vec and the like. Once the words have become numbers you can use TensorFlow, Gensim or MXNet to learn from the data.
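
As a pointer in that direction, here is a minimal Gensim word2vec sketch on the cleaned output above; with a real corpus you would of course feed it far more than one sentence, and on older Gensim versions the parameter is called size rather than vector_size:

from gensim.models import Word2Vec

# 'sentences' is the output of the cleaning pipeline above: a list of word lists
sentences = normalize(remove_punctuation([word_tokenize(sentence)]))

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)
print(model.wv['mevr'])  # the 50-dimensional vector for one of the tokens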