Semantic Role Labeling (SRL), also called Thematic Role Labeling, Case Role Assignment, or Shallow Semantic Parsing, is the task of automatically finding the thematic roles of each predicate in a sentence. It answers the questions of who did what to whom, when, where, why, and how. Finding these relations is a prerequisite for question answering and information extraction.

For example, the sentence “Peter bought a car from Anna for more than €2000.” can be transformed with SRL like so:

It indicates who bought what from whom and for how much, with the verb being central to the set of questions.
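To make this concrete, here is a sketch of what such an analysis can look like in PropBank-style BIO tags (the tags below are illustrative, not the output of any particular model; the role names follow PropBank's frame for "buy": ARG0 buyer, ARG1 thing bought, ARG2 seller, ARG3 price), together with a small helper that groups the tags into labeled spans:

```python
# Illustrative BIO tags for the example sentence (hypothetical model output).
tokens = ["Peter", "bought", "a", "car", "from", "Anna",
          "for", "more", "than", "2000", "euros"]
tags = ["B-ARG0", "B-V", "B-ARG1", "I-ARG1", "B-ARG2", "I-ARG2",
        "B-ARG3", "I-ARG3", "I-ARG3", "I-ARG3", "I-ARG3"]

def spans_from_bio(tokens, tags):
    """Group BIO-tagged tokens into (role, phrase) pairs."""
    spans, role, words = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if role:
                spans.append((role, " ".join(words)))
            role, words = tag[2:], [token]
        elif tag.startswith("I-") and role == tag[2:]:
            words.append(token)
        else:
            if role:
                spans.append((role, " ".join(words)))
            role, words = None, []
    if role:
        spans.append((role, " ".join(words)))
    return spans

print(spans_from_bio(tokens, tags))
# [('ARG0', 'Peter'), ('V', 'bought'), ('ARG1', 'a car'),
#  ('ARG2', 'from Anna'), ('ARG3', 'for more than 2000 euros')]
```

Each span answers one of the questions: who (ARG0), what (ARG1), from whom (ARG2), and for how much (ARG3).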

Why care?

SRL is a crucial component in understanding language:

  • it enables question answering: the basic wh-questions (who, what, where, …) can be answered, which is a step toward a deeper understanding of text (and speech)
  • it can be a way to summarize content and capture the essence of large bodies of text
  • it can enable a graph representation of language and, through this, the construction of semantic representations
  • if you wish to develop chatbots, SRL can be an alternative to training utterance-intent networks
  • it can be a way to compare translations and equivalent content in different languages


Solutions and implementations

Implementing SRL is hard. Building on the successes of neural networks, there is a newer approach called deep semantic role labeling (see e.g. this article), but whether shallow or deep, you need annotated data and linguistic insight to achieve anything.

  • The screenshot above was taken from the AllenNLP demo, which not only performs SRL but also provides a nice visualization of the result. The framework is open source and allows you to train your own network.
  • FrameNet is an early framework (around 1998), developed at Berkeley, which encodes semantic frames for target words and lexical predicates:

FrameNet is building a lexical database of English that is both human- and machine-readable, based on annotating examples of how words are used in actual texts. From the student’s point of view, it is a dictionary of more than 13,000 word senses, most of them with annotated examples that show the meaning and usage. For the researcher in Natural Language Processing, the more than 200,000 manually annotated sentences linked to more than 1,200 semantic frames provide a unique training dataset for semantic role labeling, used in applications such as information extraction, machine translation, event recognition, sentiment analysis, etc. For students and teachers of linguistics it serves as a valence dictionary, with uniquely detailed evidence for the combinatorial properties of a core set of the English vocabulary. The project has been in operation at the International Computer Science Institute in Berkeley since 1997, supported primarily by the National Science Foundation, and the data is freely available for download. It has been downloaded and used by researchers around the world for a wide variety of purposes (see FrameNet downloaders). FrameNet-like databases have been built for a number of languages and a new project is working on aligning the FrameNets across languages.
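To make the idea of a semantic frame concrete, here is a minimal sketch in plain Python (this is an illustration, not FrameNet's actual API or data format; the frame name and its core elements Buyer and Goods follow FrameNet's published Commerce_buy frame):

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """A toy semantic frame: a schematic situation with named roles."""
    name: str
    core_elements: list = field(default_factory=list)

# Sketch of FrameNet's Commerce_buy frame.
commerce_buy = Frame(name="Commerce_buy", core_elements=["Buyer", "Goods"])

# "Peter bought a car from Anna" evokes Commerce_buy; annotating the
# sentence fills the frame's roles with text spans:
annotation = {"Buyer": "Peter", "Goods": "a car", "Seller": "Anna"}
print(commerce_buy.name, annotation)
```

A frame-semantic parser does exactly this filling step automatically: it detects which frame a target word evokes and which spans realize the frame elements.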

In a way, FrameNet is the next level of language understanding after WordNet. FrameNet on its own does not produce output for you; you need to integrate it. There are various open source projects in this direction:

  1. The Mateplus semantic role labeler.
  2. Open Sesame is a frame-semantic parser for automatically detecting semantic frames and their arguments.
  3. Semafor is yet another frame-semantic parser.
  • The Illinois Curator package contains a lot of NLP tools and various SRL versions.
  • PractNLPTools: Practical Natural Language Processing Tools for Humans. Dependency Parsing, Syntactic Constituent Parsing, Semantic Role Labeling, Named Entity Recognition, Shallow Chunking, Part-of-Speech Tagging, all in Python.
  • Although spaCy does not have SRL out of the box, you can merge a bit of spaCy and AllenNLP:
    from allennlp.commands import DEFAULT_MODELS
    from allennlp.common.file_utils import cached_path
    from allennlp.service.predictors import SemanticRoleLabelerPredictor
    from allennlp.models.archival import load_archive
    import spacy
    from spacy.tokens import Token

    # Register the custom attributes the component writes to.
    Token.set_extension("srl_arg0", default=None)
    Token.set_extension("srl_arg1", default=None)

    class SRLComponent(object):
        name = 'Semantic Role Labeler'

        def __init__(self):
            archive = load_archive(self._get_srl_model())
            self.predictor = SemanticRoleLabelerPredictor.from_archive(archive, "semantic-role-labeling")

        def __call__(self, doc):
            words = [token.text for token in doc]
            for i, word in enumerate(doc):
                if word.pos_ == "VERB":
                    # One-hot vector marking the predicate under consideration.
                    verb_labels = [0 for _ in words]
                    verb_labels[i] = 1
                    instance = self.predictor._dataset_reader.text_to_instance(doc, verb_labels)
                    output = self.predictor._model.forward_on_instance(instance, -1)
                    tags = output['tags']
                    # TODO: tagging/dependencies could be handled more elegantly
                    if "B-ARG0" in tags:
                        start = tags.index("B-ARG0")
                        end = max([j for j, x in enumerate(tags) if x == "I-ARG0"] + [start]) + 1
                        word._.set("srl_arg0", doc[start:end])
                    if "B-ARG1" in tags:
                        start = tags.index("B-ARG1")
                        end = max([j for j, x in enumerate(tags) if x == "I-ARG1"] + [start]) + 1
                        word._.set("srl_arg1", doc[start:end])
            return doc

        def _get_srl_model(self):
            return cached_path(DEFAULT_MODELS['semantic-role-labeling'])

    def demo():
        nlp = spacy.load("en")
        nlp.add_pipe(SRLComponent(), after='ner')
        doc = nlp("Apple sold 1 million Plumbuses this month.")
        for w in doc:
            if w.pos_ == "VERB":
                print("('{}', '{}', '{}')".format(w._.srl_arg0, w, w._.srl_arg1))
                # ('Apple', 'sold', '1 million Plumbuses')
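The triples printed by the demo can be turned into the kind of bracketed description the AllenNLP demo shows. Here is a small, standalone helper (plain Python, independent of the snippet above; the function name and output format are my own choice):

```python
def proposition(verb, roles):
    """Render a verb and its role spans as a PropBank-style description."""
    parts = ["[{}: {}]".format(role, text) for role, text in roles.items()]
    return "[V: {}] ".format(verb) + " ".join(parts)

print(proposition("sold", {"ARG0": "Apple", "ARG1": "1 million Plumbuses"}))
# [V: sold] [ARG0: Apple] [ARG1: 1 million Plumbuses]
```

One such proposition per verb gives a compact, queryable summary of a sentence.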