Download Mathematica notebook

A large-scale unsupervised language model which generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question answering, and summarization—all without task-specific training.

The original GitHub code is available at https://github.com/openai/gpt-2

The original blog post is available here.

Resource retrieval

Get the pre-trained net:

GPT2_1.png

GPT2_2.gif
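
As a minimal Wolfram Language sketch, this step would look roughly like the following (the repository name "GPT-2 Transformer Trained on WebText Data" is assumed here):

    gpt2 = NetModel["GPT-2 Transformer Trained on WebText Data"]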

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

GPT2_3.png

GPT2_4.png

Pick a non-default net by specifying the parameters:

GPT2_5.png

GPT2_6.gif
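
A sketch of these two steps, assuming the same repository name and "Task"/"Size" parameter names for this model family (the actual parameter names and values may differ):

    (* list the available parameter combinations *)
    NetModel["GPT-2 Transformer Trained on WebText Data", "ParametersInformation"]

    (* request a specific, non-default combination *)
    NetModel[{"GPT-2 Transformer Trained on WebText Data",
      "Task" -> "LanguageModeling", "Size" -> "345M"}]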

Basic usage

Given a piece of text, the GPT-2 net produces a sequence of feature vectors of size 768, which correspond to the sequence of input words or subwords:

GPT2_7.png

GPT2_8.png

Obtain dimensions of the embeddings:

GPT2_9.png

GPT2_10.png
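
A sketch of these two steps, reusing gpt2 from the earlier sketch (the sample sentence is arbitrary):

    embeddings = gpt2["Hello world! I am here"];
    Dimensions[embeddings]   (* {number of subword tokens, 768} *)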

Visualize the embeddings:

GPT2_11.png

GPT2_12.gif
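
One possible visualization, using MatrixPlot on the array computed above (the notebook's own plot may differ):

    MatrixPlot[embeddings]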

Transformer architecture

The input string is first normalized and then tokenized, or split into words or subwords. This two-step process is accomplished using the "Function" NetEncoder:

GPT2_13.png

GPT2_14.png
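
In a sketch, the encoder attached to the net's input port can be read off with NetExtract:

    netenc = NetExtract[gpt2, "Input"]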

The tokenization step is performed by the "BPESubwordTokens" NetEncoder and can be extracted using the following steps:

GPT2_15.gif

GPT2_16.png

The encoder produces an integer index for each subword token that corresponds to the position in the vocabulary:

GPT2_17.png

GPT2_18.png
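
Applying the extracted encoder shows these indices directly (the exact values depend on the BPE vocabulary):

    ids = netenc["Hello world! I am here"]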

Each subword token is also assigned a positional index:

GPT2_19.png

GPT2_20.png

A lookup is done to map these indices to numeric vectors of size 768:

GPT2_21.gif

GPT2_22.gif

For each subword token, these two embeddings are summed elementwise using a ThreadingLayer:

GPT2_23.png

GPT2_24.gif
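
The following is an illustrative, stand-alone sketch of this lookup-and-sum scheme; it uses freshly initialized EmbeddingLayer objects and made-up indices rather than the arrays extracted from GPT-2 itself:

    sampleIds = {1, 2, 3, 4, 5};   (* hypothetical subword indices, for illustration only *)
    tokenEmbedding = NetInitialize@EmbeddingLayer[768, 50257];     (* vocabulary lookup table *)
    positionEmbedding = NetInitialize@EmbeddingLayer[768, 1024];   (* positional lookup table *)

    (* sum the two embeddings elementwise, token by token *)
    ThreadingLayer[Plus][{tokenEmbedding[sampleIds], positionEmbedding[Range[Length[sampleIds]]]}]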

The transformer architecture then processes the vectors using 12 structurally identical self-attention blocks stacked in a chain:

GPT2_25.png

GPT2_26.gif

The key part of these blocks is the attention module, consisting of 12 parallel self-attention transformations, also called “attention heads”:

GPT2_27.png

GPT2_28.gif

Each head uses an AttentionLayer at its core:

GPT2_29.png

GPT2_30.gif

Attention uses causal masking, which means that the embedding of a given subword token depends only on the preceding subword tokens and not on the following ones. This is a prerequisite for generating text with the language model. The following figures compare causal attention to other forms of connectivity between input tokens:

GPT2_31.gif
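
As a rough stand-alone sketch, a single causally masked dot-product attention head could be written as follows; the "Mask" -> "Causal" option setting is an assumption, and the actual net may be wired differently:

    (* each position attends only to itself and to earlier positions *)
    AttentionLayer["Dot", "Mask" -> "Causal"]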

Language modeling: Basic usage

Retrieve the language model by specifying the "Task" parameter:

GPT2_32.png

GPT2_33.gif
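
A sketch of this step, again assuming the repository name and the "Task" parameter name:

    lm = NetModel[{"GPT-2 Transformer Trained on WebText Data", "Task" -> "LanguageModeling"}]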

Predict the next word in a given sequence:

GPT2_34.png

GPT2_35.png
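
Assuming the language model ends in a "Class" decoder over the subword vocabulary, applying it to a string would return the most likely next token:

    lm["Where have you "]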

Obtain the top 15 probabilities:

GPT2_36.png

GPT2_37.png

Plot the top 15 probabilities:

GPT2_38.png

GPT2_39.gif
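
A sketch covering both of these steps, assuming the decoder supports the {"TopProbabilities", n} property (the prompt is arbitrary):

    top = lm["Where have you ", {"TopProbabilities", 15}]
    BarChart[Association[top], BarOrigin -> Left, ChartLabels -> Automatic]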

Text generation

Define a function to predict the next token:

GPT2_40.gif
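
A minimal sketch of such a function, assuming the model's "Class" decoder supports "RandomSample" with a "Temperature" setting; generateText is a hypothetical name, not part of the model:

    generateText[languageModel_][input_String, tokens_Integer : 20, temperature_ : 1] :=
      Nest[
        StringJoin[#, languageModel[#, {"RandomSample", "Temperature" -> temperature}]] &,
        input, tokens]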

Generate the next 20 tokens by using it on a piece of text:

GPT2_41.png

GPT2_42.png

The third optional argument is a “temperature” parameter that scales the input to the final softmax. A high temperature flattens the distribution from which tokens are sampled, increasing the probability of sampling less likely tokens:

GPT2_43.png

GPT2_44.png

Decreasing the temperature sharpens the peaks of the sampling distribution, decreasing the probability of sampling less likely tokens:

GPT2_45.png

GPT2_46.png

Very high temperature settings approach uniform random sampling over the vocabulary:

GPT2_47.png

GPT2_48.png

Very low temperature settings approach greedy decoding, that is, always picking the token with maximum probability. In this regime, it is typical for sampling to “get stuck in a loop”:

GPT2_49.png

GPT2_50.png
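
Using the hypothetical generateText sketch above, the temperature settings discussed here would be passed as the third argument:

    generateText[lm]["Where have you ", 20, 1.5]    (* flatter distribution: more surprising tokens *)
    generateText[lm]["Where have you ", 20, 0.5]    (* sharper distribution: more conservative text *)
    generateText[lm]["Where have you ", 20, 10.]    (* near-uniform sampling *)
    generateText[lm]["Where have you ", 20, 0.01]   (* near-greedy decoding; tends to loop *)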

Sentence analogies

Define a sentence embedding that consists of the last subword embedding of GPT-2 (this choice is justified by the fact that GPT-2 is a forward causal model):

GPT2_51.png

GPT2_52.gif
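
One way to sketch such a sentence embedding is to chain the feature extractor with a SequenceLastLayer, which keeps only the final subword vector (gpt2 is the feature-extraction net from the earlier sketch):

    sentenceEmbedding = NetChain[{gpt2, SequenceLastLayer[]}]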

Define some sentences in two broad categories for comparison:

GPT2_53.png

Precompute the embeddings for a list of sentences:

GPT2_54.png

Visualize the similarity between the sentences using the net as a feature extractor:

GPT2_55.png

GPT2_56.gif
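
A sketch covering these three steps; the sentences are made-up examples, and FeatureSpacePlot stands in for whatever visualization the notebook actually uses:

    sentences = {
       "I enjoyed this movie a lot", "A wonderful, moving film", "The acting was superb",
       "The plot was dull and predictable", "I would not recommend this film",
       "Two hours I will never get back"};
    vectors = sentenceEmbedding[sentences];   (* one 768-vector per sentence *)
    FeatureSpacePlot[sentences, FeatureExtractor -> sentenceEmbedding]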

Train a classifier model with the subword embeddings

Get a text-processing dataset:

GPT2_57.gif

View a random sample of the dataset:

GPT2_58.png

GPT2_59.png

GPT2_60.png
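
A sketch of this step, assuming the "MovieReview" set from ExampleData's MachineLearning collection (the dataset actually used may differ):

    trainData = ExampleData[{"MachineLearning", "MovieReview"}, "TrainingData"];
    testData = ExampleData[{"MachineLearning", "MovieReview"}, "TestData"];
    RandomSample[trainData, 3]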

Define a sentence embedding that consists of the last subword embedding of GPT-2 (this choice is justified by the fact that GPT-2 is a forward causal model):

GPT2_61.png

GPT2_62.gif

Precompute the GPT-2 vectors for the training and the validation datasets (a GPU is recommended, if available), using the last embedded vector as a representation of the entire text:

GPT2_63.gif
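
A sketch of the precomputation, reusing sentenceEmbedding and the datasets from the sketches above (set TargetDevice -> "GPU" if one is available):

    trainVectors = sentenceEmbedding[Keys[trainData], TargetDevice -> "CPU"];
    testVectors = sentenceEmbedding[Keys[testData], TargetDevice -> "CPU"];
    trainSet = Thread[trainVectors -> Values[trainData]];
    validSet = Thread[testVectors -> Values[testData]];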

Define a simple network for classification:

GPT2_64.png

GPT2_65.gif

Train the network on the precomputed GPT-2 vectors:

GPT2_66.png

GPT2_67.gif
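
A sketch of a small classification head on top of the 768-dimensional vectors and of its training; the layer choices here are illustrative, not the notebook's exact ones:

    classes = Union[Values[trainData]];
    classifierHead = NetChain[
       {DropoutLayer[], LinearLayer[Length[classes]], SoftmaxLayer[]},
       "Input" -> 768,
       "Output" -> NetDecoder[{"Class", classes}]];
    trainedGPT2 = NetTrain[classifierHead, trainSet, ValidationSet -> validSet]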

Check the classification error rate on the validation data:

GPT2_68.png

GPT2_69.png
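
For example, with NetMeasurements (reusing the trainedGPT2 and validSet names from the sketches above):

    NetMeasurements[trainedGPT2, validSet, "ErrorRate"]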

Compare the results with the performance of a classifier trained on context-independent word embeddings. Precompute the GloVe vectors for the training and the validation datasets (a GPU is recommended, if available):

GPT2_70.png

GPT2_71.gif

Define a simple network for classification, using a max-pooling strategy:

GPT2_72.png

GPT2_73.gif

Train the classifier on the precomputed GloVe vectors:

GPT2_74.png

GPT2_75.gif
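
A sketch of the GloVe baseline, assuming the "GloVe 100-Dimensional Word Vectors Trained on Wikipedia and Gigaword 5 Data" repository name; each text becomes a varying-length sequence of 100-dimensional word vectors, pooled with a max over the sequence dimension:

    glove = NetModel["GloVe 100-Dimensional Word Vectors Trained on Wikipedia and Gigaword 5 Data"];
    gloveTrainSet = Thread[glove[Keys[trainData], TargetDevice -> "CPU"] -> Values[trainData]];
    gloveValidSet = Thread[glove[Keys[testData], TargetDevice -> "CPU"] -> Values[testData]];
    gloveHead = NetChain[
       {AggregationLayer[Max, 1], DropoutLayer[], LinearLayer[Length[classes]], SoftmaxLayer[]},
       "Input" -> {"Varying", 100},
       "Output" -> NetDecoder[{"Class", classes}]];
    trainedGloVe = NetTrain[gloveHead, gloveTrainSet, ValidationSet -> gloveValidSet]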

Compare the results obtained with GPT-2 and with GloVe:

GPT2_76.png

GPT2_77.png
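
For instance, the two validation error rates could be placed side by side (reusing the hypothetical names from the sketches above):

    <|"GPT-2" -> NetMeasurements[trainedGPT2, validSet, "ErrorRate"],
      "GloVe" -> NetMeasurements[trainedGloVe, gloveValidSet, "ErrorRate"]|>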

Net information

Inspect the number of parameters of all arrays in the net:

GPT2_78.png

GPT2_79.png

Obtain the total number of parameters:

GPT2_80.png

GPT2_81.png

Obtain the layer type counts:

GPT2_82.png

GPT2_83.png

Display the summary graphic:

GPT2_84.png

GPT2_85.gif
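
These four inspections correspond to standard NetInformation properties, roughly as follows:

    NetInformation[gpt2, "ArraysElementCounts"]
    NetInformation[gpt2, "ArraysTotalElementCount"]
    NetInformation[gpt2, "LayerTypeCounts"]
    NetInformation[gpt2, "SummaryGraphic"]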

Export to MXNet

Export the net into a format that can be opened in MXNet:

GPT2_86.png

GPT2_87.png

Export also creates a net.params file containing parameters:

GPT2_88.png

GPT2_89.png

Get the size of the parameter file:

GPT2_90.png

GPT2_91.png
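
A sketch of this section's export and checks, assuming the Wolfram Language version in use supports the "MXNet" export format (file names are illustrative):

    Export["net.json", gpt2, "MXNet"]
    (* the export also writes a companion "net.params" file next to the JSON architecture file *)
    FileByteCount["net.params"]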