The Microsoft IsA concept graph is a collection of facts distilled from billions of web pages, which allows you to infer relationships between terms. It’s one of the many AI efforts Microsoft is pushing, and the IsA data is free to download.

For example, ‘company’ is a concept and ‘Apple’ is an instance of this concept. Similarly, ‘Florida’ is an instance of ‘State’. We human beings are chock-full of such semantic relationships, and they help us build a mental model of the world around us. If someone speaks of Microsoft we immediately infer ‘company’ and see Microsoft as part of a larger group from which we deduce characteristics (and opinions). Creating such a network from the web is not an easy task, and I think it’s a great thing that Microsoft is making this data freely available.

The data is a flat file containing rows like

Microsoft   Company 4587
Apple       Fruit   741

the number being an indication of how strong the relationship is. Indeed, ‘apple’ can refer both to a ‘fruit’ and to a ‘company’: the term ‘apple’ is an instance of a fruit and an instance of a company. The strength of the relationship is based on how often ‘apple’ appears in the corresponding context. So, automatically picking up ‘apple’ in a piece of text and inferring that it’s a ‘company’ requires more than just the network; you need metadata. In this sense, the current concept graph is an intermediate result, and the roadmap makes this clear.
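To make the idea concrete, here is a toy sketch of count-based disambiguation. Note that the triples and counts below are made up for illustration; they are not taken from the actual download:

```python
# Toy (instance, concept, count) triples in the same shape as the flat file.
# The counts are hypothetical, chosen so that 'apple' leans towards 'company'.
rows = [
    ("microsoft", "company", 4587),
    ("apple", "fruit", 741),
    ("apple", "company", 9000),
]

def best_concept(instance, rows):
    """Naive disambiguation: pick the concept with the highest count."""
    candidates = [(count, concept) for inst, concept, count in rows if inst == instance]
    return max(candidates)[1] if candidates else None

print(best_concept("apple", rows))  # company
```

Real disambiguation would of course look at the surrounding text as well, which is exactly the metadata the roadmap refers to.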

Having a 2 GB text file with all this knowledge is obviously not convenient and begs for a graph database. So I picked up Neo4j (the de facto graph database nowadays) to see what it takes. The result: importing this kind of semantic data is, technically speaking, really easy. On the other hand, the concept graph is much like a social network with big hubs (high-centrality nodes), which means that whatever you ask the network, you always end up at a handful of main nodes. In a social network you inevitably encounter Justin Bieber; in a concept graph you inevitably encounter ‘concept’ or ‘factor’. Obviously, everything is a ‘concept’, so you can link anything to this node. This means that asking questions like

how is 'dog' connected to 'quantum mechanics'?

should in principle return an interesting path, but in reality the two are linked by just one ‘concept’ node. If you used the concept graph for an AI-like chat, it would return the obvious. So it seems that at this stage the aforementioned metadata is more than just a v2 feature; it’s a crucial next step if the graph is to be used for anything meaningful beyond the obvious.

Aside from the less-than-expected usefulness, the experience with Neo4j is great. Version 3 is a big leap forward compared to prior versions; Neo4j is now a mature piece of software with a great management interface. The Cypher query language is still a bit odd to use, but you get the hang of it in no time. Access via NodeJS and Python is straightforward.

So, how do you load a conceptual universe in a dozen lines of Cypher? You need to download and install Neo4j. On Mac it used to be a ‘brew’ service setup, but it’s now a clean app. Love it.

To make sure the concepts and instances are unique, you need to add unique constraints before starting to load the data:

CREATE CONSTRAINT ON (c:Concept) ASSERT c.name IS UNIQUE;
CREATE CONSTRAINT ON (i:Instance) ASSERT i.name IS UNIQUE;

If you need to drop an index you can use

DROP INDEX ON :Concept(name)

For reference, here is the way you would create a standard index:

CREATE INDEX ON :Concept(name);

Note that the integrated tutorials in the Neo4j management UI are fun and useful. For example, the shortest path example in the movie database is directly applicable to the concept graph.

To import the large concept graph you can proceed in two ways: loop over the CSV lines and use Cypher’s MERGE command, or use the super-fast LOAD CSV import. Below you’ll find both. I used Python, but the approach would be almost identical in Java or NodeJS.

import pandas as pd

This loads the first 1000 lines in one go:

df1000 = pd.read_csv("/data/data-concept-instance-relations.txt", nrows=1000, sep="\t", header=None, names=['concept','instance','count'])

You can also load/filter things out by means of the csv-iterator:

# there are 7433 instances of 'state'
iter_csv = pd.read_csv("/data/data-concept-instance-relations.txt", iterator=True, chunksize=1000, sep="\t", header=None, names=['concept','instance','count'])
df = pd.concat([chunk[chunk['concept'] =='state'] for chunk in iter_csv])


# records where the instance is 'texas'
iter_csv = pd.read_csv("/data/data-concept-instance-relations.txt", iterator=True, chunksize=10000, sep="\t", header=None, names=['concept','instance','count'])
df = pd.concat([chunk[chunk['instance'] =='texas'] for chunk in iter_csv])

The first twenty ‘texas’ records are:

df.sort_values(by='count', ascending=False).head(20)
concept instance count
16 state texas 8056
9809 place texas 263
1351 area texas 235
2005 city texas 226
5256 jurisdiction texas 128
6232 large state texas 125
7504 southern state texas 120
3540 school texas 84
5845 region texas 61
5990 market texas 61
2177 community texas 57
3668 case texas 56
4589 community property state texas 56
804 location texas 52
4591 border state texas 51
6510 populous state texas 50
3975 u s state texas 47
5330 team texas 46
6424 company texas 37
6920 program texas 37
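Since the raw counts differ by orders of magnitude, a quick way to compare them is to turn them into relative weights. This is just a sketch, not part of the official scoring; the numbers are the top counts from the table above:

```python
# Top 'texas' counts from the table above, normalized into relative weights.
counts = {"state": 8056, "place": 263, "area": 235, "city": 226}
total = sum(counts.values())
weights = {concept: n / total for concept, n in counts.items()}

# 'state' carries the overwhelming share of the mass.
top = max(weights, key=weights.get)
print(top, round(weights[top], 2))  # state 0.92
```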

This loads the whole lot:

df = pd.read_csv("/data/data-concept-instance-relations.txt", sep="\t", header=None, names=['concept','instance','count'])

If you loop over the first 10K items and merge them into the database you discover that things are a bit slow:

import time
from neo4j.v1 import GraphDatabase, basic_auth

start = time.time()
driver = GraphDatabase.driver("bolt://localhost", auth=basic_auth("neo4j", "goforit"))
session = driver.session()
for index, row in df.head(10000).iterrows():
    # MERGE keeps concepts/instances unique, but costs a round-trip per row
    create = "MERGE (c:Concept {name:\"" + row['concept'] + "\"}) MERGE (i:Instance {name:\"" + row['instance'] + "\"}) MERGE (i)-[:isa {order:" + str(row['count']) + "}]->(c)"
    session.run(create)
session.close()
print("> Loading done in %.2f seconds." % (time.time() - start))

> Loading done in 40.87 seconds.

That’s 10K rows in around 40 seconds, far too slow for the millions of rows in the file.
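As an aside, before abandoning the driver route entirely: much of the overhead comes from one round-trip per row. A batched variant with UNWIND (a sketch; it assumes the same driver and session as above) sends a thousand rows per query instead:

```python
# Split a list of row dicts into batches (pure Python, no server needed).
def to_batches(records, size=1000):
    return [records[i:i + size] for i in range(0, len(records), size)]

# One parameterized query per batch instead of one MERGE per row.
UNWIND_QUERY = """
UNWIND $rows AS row
MERGE (c:Concept {name: row.concept})
MERGE (i:Instance {name: row.instance})
MERGE (i)-[:isa {order: row.count}]->(c)
"""

# Hypothetical usage with the session from the snippet above:
# records = df.head(10000).to_dict('records')
# for batch in to_batches(records):
#     session.run(UNWIND_QUERY, rows=batch)
```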
So, let’s try the fast LOAD CSV method instead. As a side note, the big file should either be inside Neo4j’s import directory or you need to comment out the dbms.directories.import line in neo4j.conf.

There is a nice collection of tips related to csv import here if you want to see how the import script can be fine-tuned for best results.

In the end, the Cypher script to import the (tab-separated) CSV in one go is this simple:

    USING PERIODIC COMMIT 10000
    LOAD CSV FROM "file:///data/data-concept-instance-relations.txt" AS row FIELDTERMINATOR '\t'
        MERGE (c:Concept {name: row[0]}) MERGE (i:Instance {name: row[1]}) MERGE (i)-[:isa]->(c)

Added 7001 labels, created 7001 nodes, set 7001 properties, created 11900 relationships, statement executed in 10126 ms.

The loading of 10K rows now happens in about 10 seconds, so this is definitely the faster approach. Still, loading the whole lot takes more than an hour and creates millions of concepts and instances. On my Mac I ended up with a graph database of 15 GB, so make sure you have plenty of space.

As mentioned above, you can immediately ask interesting questions like ‘what concepts are related to Rembrandt?’ or ‘what is the shortest path between algebra and red flowers?’.

    MATCH (i:Instance)-[k]->() WHERE i.name =~ '(?i).*Rembrandt.*' RETURN k
    MATCH p = shortestPath((i:Concept {name:"flower"})-[*]-(j:Instance {name:"equation"})) RETURN p

The results are sometimes useful but often obvious. Maybe the obvious results are proof that it works, in the sense that you do want to see ‘painter’ connected to ‘Rembrandt’. At the same time, you only see what you already know.

Quantum dogs

Some queries run indefinitely, like asking how a subset of things is connected:

   MATCH (a)-[r]->(b) WHERE labels(a) <> [] AND labels(b) <> [] RETURN DISTINCT head(labels(a)) AS This, type(r) AS To, head(labels(b)) AS That LIMIT 10

By the way, there is a useful Cypher cheat-sheet here. All the queries I launched returned short paths, and this results (as mentioned above) from the presence of high-centrality hubs in the network. This is true of our own brain as well; in the end, everything around you qualifies as ‘stuff’ or ‘a thing’. So, no surprise there, but it makes the concept graph immediately less interesting. It takes more than a big concept graph and a great graph database system to return smart answers. Ultimately, though, this is a crucial part of a true AI, as much as personalization, memory and neural learning.