Node2Vec embedding

Node embeddings are obtained via word2vec by means of a smart trick: using random walks over the graph to generate ‘word’ sequences.
StellarGraph has a direct method to perform the embedding, but the intermediate steps illustrate the process better. So, below we generate the node2vec embedding via an explicit walk and show that it yields a really good community separation.
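
To make the analogy concrete before using the library, here is a minimal sketch of the idea: a plain, unbiased random walk that turns graph neighbourhoods into ‘sentences’ (simplified illustration, not the biased walk used below):

    import random
    import networkx as nx

    def simple_walk(graph, start, length):
        """Unbiased random walk: repeatedly hop to a uniformly chosen neighbour."""
        walk = [start]
        for _ in range(length - 1):
            neighbours = list(graph.neighbors(walk[-1]))
            if not neighbours:
                break
            walk.append(random.choice(neighbours))
        return walk

    # Each walk becomes a 'sentence' whose 'words' are stringified node indices.
    g = nx.karate_club_graph()
    print([str(n) for n in simple_walk(g, start=0, length=10)])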

    import networkx as nx
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE
    %matplotlib inline

We’ll use the karate club graph to demonstrate the process. The graph consists of two sets of nodes which are well separated according to the ‘club’ property.

    g_nx = nx.karate_club_graph()
    cols = ["green" if g_nx.nodes[n]["club"]=='Officer' else "orange" for n in g_nx.nodes()]
    nx.draw(g_nx, node_color=cols)

*Figure: the karate club graph with nodes coloured by club membership.*
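
For reference (a quick check added here, not in the original notebook), the ‘club’ attribute splits the 34 nodes into two equally sized groups:

    from collections import Counter
    # The ground-truth communities we hope to rediscover from the graph structure alone.
    print(Counter(g_nx.nodes[n]["club"] for n in g_nx.nodes()))
    # expected: Counter({'Mr. Hi': 17, 'Officer': 17})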
From this graph we create a StellarGraph and perform a biased random walk on it. This generates the ‘word’ sequences, each word being the string value of a node index.

    from stellargraph.data import BiasedRandomWalk
    from stellargraph import StellarGraph
    from gensim.models import Word2Vec
    rw = BiasedRandomWalk(StellarGraph(g_nx))  # newer StellarGraph versions use StellarGraph.from_networkx(g_nx)
    walks = rw.run(
          nodes=list(g_nx.nodes()), # root nodes
          length=100,  # maximum length of a random walk
          n=10,        # number of random walks per root node
          p=0.5,       # defines the (unnormalised) probability, 1/p, of returning to the source node
          q=2.0        # defines the (unnormalised) probability, 1/q, of moving away from the source node
    )
    walks = [list(map(str, walk)) for walk in walks]
    # Train a skip-gram word2vec model (sg=1) on the walks; gensim 3.x API (size, iter).
    model = Word2Vec(walks, size=128, window=5, min_count=0, sg=1, workers=2, iter=1)
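
It is instructive to peek at the corpus that was just fed to Word2Vec: each walk is simply a list of stringified node indices (the exact walks vary from run to run):

    print(len(walks))     # 340 sentences: 10 walks for each of the 34 root nodes
    print(walks[0][:10])  # e.g. ['0', '10', '0', '4', '6', ...]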

The embedding vector of a node can then be looked up via its string index, for instance

    model.wv['29']
array([ 0.0283457 ,  0.06906749, -0.09740856,  0.08761664,  0.0240158 ,
       -0.04252268,  0.05366189,  0.12255755, -0.14192946, -0.12441556,
        0.14022443,  0.16821992,  0.01899681,  0.02525605, -0.129657  ,
       -0.00075872, -0.10963597, -0.24603637,  0.14481993,  0.04069758,
       ...
       -0.03686432,  0.28888953,  0.06754036], dtype=float32)
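
Since this is an ordinary word2vec model, one can also query for the nearest neighbours in embedding space, i.e. nodes that frequently co-occur in walks (results vary per run):

    print(model.wv['29'].shape)             # (128,)
    # The nodes closest to node 29 by cosine similarity of their embeddings.
    print(model.wv.most_similar('29', topn=5))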

In order to visualize the embedding one has to reduce its dimensionality. This is most easily done via t-SNE.

    # Retrieve node embeddings and corresponding subjects
    node_ids = model.wv.index2word  # list of node IDs, ordered by word frequency (gensim 3.x)
    node_embeddings = model.wv.vectors  # numpy.ndarray of shape (number of nodes, embedding dimensionality)
    node_targets = [g_nx.nodes[int(node_id)]['club'] for node_id in node_ids]
    # Apply t-SNE transformation on node embeddings
    tsne = TSNE(n_components=2)
    node_embeddings_2d = tsne.fit_transform(node_embeddings)
    alpha=0.9
    label_map = { l: i for i, l in enumerate(np.unique(node_targets))}
    node_colours = [ label_map[target] for target in node_targets]
    plt.figure(figsize=(10,8))
    plt.scatter(node_embeddings_2d[:,0],
                node_embeddings_2d[:,1],
                c=node_colours, cmap="jet", alpha=alpha)
<matplotlib.collections.PathCollection at 0x1463cfc50>

*Figure: t-SNE projection of the node embeddings, coloured by club membership.*
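
t-SNE is stochastic, so the picture changes from run to run; as a deterministic cross-check one could project with PCA instead (an alternative added here, not part of the original notebook):

    from sklearn.decomposition import PCA
    # Deterministic linear projection onto the two directions of largest variance.
    node_embeddings_2d_pca = PCA(n_components=2).fit_transform(node_embeddings)
    plt.figure(figsize=(10, 8))
    plt.scatter(node_embeddings_2d_pca[:, 0],
                node_embeddings_2d_pca[:, 1],
                c=node_colours, cmap="jet", alpha=0.9)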
This looks like a clean separation indeed. The split is not 100% correct though, as a look at the corresponding values of the ‘club’ property shows.

    [g_nx.nodes[i] for i,v in enumerate(node_colours) if v==1]
[{'club': 'Mr. Hi'},
 {'club': 'Mr. Hi'},
 {'club': 'Mr. Hi'},
 {'club': 'Officer'},
 {'club': 'Mr. Hi'},
 {'club': 'Officer'},
 {'club': 'Officer'},
 {'club': 'Officer'},
 {'club': 'Mr. Hi'},
 {'club': 'Officer'},
 {'club': 'Officer'},
 {'club': 'Officer'},
 {'club': 'Officer'},
 {'club': 'Officer'},
 {'club': 'Officer'},
 {'club': 'Officer'},
 {'club': 'Officer'}]

Five out of seventeen are incorrect. This is still remarkable considering that the node2vec process did not know anything about the ‘club’ property; the community structure is an emergent feature of the embedding.
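
To put a number on the separation one could cluster the raw 128-dimensional embeddings and compare the clusters with the club labels. A minimal sketch using KMeans (an addition to the notebook above; the exact score varies per run):

    from sklearn.cluster import KMeans

    # Two clusters, mirroring the two clubs; rows align with node_colours.
    kmeans = KMeans(n_clusters=2, random_state=0).fit(node_embeddings)

    # A cluster id carries no inherent meaning, so score both possible label
    # assignments and keep the better one.
    matches = sum(int(c == t) for c, t in zip(kmeans.labels_, node_colours))
    accuracy = max(matches, len(node_colours) - matches) / len(node_colours)
    print(f"community detection accuracy: {accuracy:.2f}")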