Knowledge Bases in Neo4j

From the second we are born we are collecting a wealth of knowledge about the world. This knowledge is accumulated and interrelated inside our brains and it represents what we know. If we could export this knowledge and give it to a computer, it would look like ConceptNet. ConceptNet is a semantic network that…

…is built from nodes representing concepts, in the form of words or short phrases of natural language, and labeled relationships between them. These are the kinds of things computers need to know to search for information better, answer questions, and understand people’s goals.


I wrote a little Ruby script to import ConceptNet5 into Neo4j, and it gives us a nice graph (243MB) to work with. ConceptNet5 as presented in the CSV files is actually a hypergraph: each assertion also records the sources that back it up:

/a/[/r/NotHasProperty/,/c/en/old_map/,/c/en/very_accurate/]    /r/NotHasProperty    /c/en/old_map    /c/en/very_accurate    /ctx/all    -1    /s/activity/omcs/vote,/s/contributor/omcs/PJ    /e/e529e3a070783cbbe212bc5e721b6938c0a6df6b    /d/conceptnet/4/en    [[Old maps]] are not [[very accurate]]
/a/[/r/NotHasProperty/,/c/en/old_map/,/c/en/very_accurate/]    /r/NotHasProperty    /c/en/old_map    /c/en/very_accurate    /ctx/all    -1    /s/activity/omcs/vote,/s/contributor/omcs/aghanford    /e/a8ecaed55f5ffba88b6d02da99ecf3fe42bffe55    /d/conceptnet/4/en    [[Old maps]] are not [[very accurate]]

Here two contributors let us know that old maps are not very accurate. That’s great to know, but we don’t really need to represent this twice in our graph. So instead we detect and skip duplicate relationships by using a Bloom filter to check whether we have already seen them.

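require 'bloomfilter-rb' # gem that provides BloomFilter::Native

# Bloom filter that remembers every relationship written so far.
@edge_bf = BloomFilter::Native.new(:size => 212000000, :hashes => 23, :bucket => 8, :raise => false)

# Returns true the first time a from/to/rel combination is seen, false on repeats.
def is_unique_rel(from,to,rel)
  key = "#{from}-#{to}-#{rel}"
  return false if @edge_bf.include?(key)
  @edge_bf.insert(key)
  true
end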
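
One caveat with this approach: a Bloom filter can occasionally report a key as present when it has never been inserted, so a few genuinely new relationships may get skipped. For a smaller extract, a rough sketch of an exact alternative is shown below. It is not what the import script does: it swaps the Bloom filter for a plain Set, and the field names, the `parse_assertion` helper, and the `EdgeDeduper` class are hypothetical names made up for this example, with the column order inferred from the two sample rows above.

```ruby
require 'set'

# Column order inferred from the sample rows; these names are labels
# chosen for this sketch, not taken from the import script.
FIELDS = %i[uri rel start_id end_id context weight sources edge_id dataset surface_text].freeze

# Split one tab-separated ConceptNet5 row into a Hash keyed by FIELDS.
def parse_assertion(line)
  FIELDS.zip(line.chomp.split("\t")).to_h
end

# Exact duplicate check backed by a Set instead of a Bloom filter.
class EdgeDeduper
  def initialize
    @seen = Set.new
  end

  # True the first time a (from, to, rel) triple is seen, false afterwards.
  # Set#add? returns nil when the element was already present.
  def unique?(from, to, rel)
    !@seen.add?([from, to, rel]).nil?
  end
end

# The two sample rows from the post, rebuilt as tab-separated strings.
common = ["/a/[/r/NotHasProperty/,/c/en/old_map/,/c/en/very_accurate/]",
          "/r/NotHasProperty", "/c/en/old_map", "/c/en/very_accurate", "/ctx/all", "-1"]
tail = ["/d/conceptnet/4/en", "[[Old maps]] are not [[very accurate]]"]
rows = [
  (common + ["/s/activity/omcs/vote,/s/contributor/omcs/PJ",
             "/e/e529e3a070783cbbe212bc5e721b6938c0a6df6b"] + tail).join("\t"),
  (common + ["/s/activity/omcs/vote,/s/contributor/omcs/aghanford",
             "/e/a8ecaed55f5ffba88b6d02da99ecf3fe42bffe55"] + tail).join("\t")
]

deduper = EdgeDeduper.new
unique_edges = rows.map { |row| parse_assertion(row) }
                   .select { |a| deduper.unique?(a[:start_id], a[:end_id], a[:rel]) }

puts unique_edges.length # => 1
```

The trade-off is memory: the Set keeps every key it has seen, while the Bloom filter stays at a fixed size no matter how many relationships pass through it.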

Once it’s all said and done, we end up with about 2.5 million nodes and 7.5 million relationships:

For example, let’s see everything ConceptNet5 knows about Sushi:

START sushi=node:Concepts(id="/c/en/sushi")
MATCH sushi-[r]-other_concepts
RETURN TYPE(r), other_concepts.id

We imported all of the concepts into a “Concepts” index to make the graph easy to work with.
Here we are asking for all other concepts connected to the sushi concept, and asking the graph to tell us what type of relationship exists between them.

==> +--------------------------------------------------------+
==> | TYPE(r)           | other_concepts.id                  |
==> +--------------------------------------------------------+
==> | "MadeOf"          | "/c/en/raw_fish"                   |
==> | "MotivatedByGoal" | "/c/en/eat_in_restaurant"          |
==> | "AtLocation"      | "/c/en/japanese_restaurant"        |
==> | "HasProperty"     | "/c/en/delicious"                  |
==> | "HasProperty"     | "/c/en/japanese_in_origin"         |
==> | "IsA"             | "/c/en/asian_food"                 |
==> | "IsA"             | "/c/en/from_japan"                 |
==> | "IsA"             | "/c/en/japanese_food"              |
==> | "IsA"             | "/c/en/food"                       |
==> | "IsA"             | "/c/en/fish"                       |
==> | "NotIsA"          | "/c/en/raw_fish"                   |
==> | "CapableOf"       | "/c/en/consist_mainly_of_raw_fish" |
==> | "ReceivesAction"  | "/c/en/eat_by_many_westerner"      |
==> +--------------------------------------------------------+

The results are quite interesting. Our graph knows sushi is made of raw fish and is eaten in a restaurant, specifically a Japanese restaurant (it is hard to find sushi at an Italian or Indian restaurant). The graph thinks sushi is delicious (I would agree, but some folks would violently disagree). Notice also that it has a “NotIsA” link to raw_fish and a “CapableOf” link to consist_mainly_of_raw_fish, so our graph is smart enough to know that some sushi is not raw.

If you ever happen to stop by the Neo4j office in San Mateo, CA, you’ll want to go to Sushi Sam’s for the best sushi in San Mateo. Let’s see what else the graph thinks is delicious:

START delicious=node:Concepts(id="/c/en/delicious")
MATCH delicious-[r]-other_concepts
RETURN TYPE(r), other_concepts.id
==> +------------------------------------+
==> | TYPE(r)       | other_concepts.id  |
==> +------------------------------------+
==> | "IsA"         | "/c/en/single"     |
==> | "NotIsA"      | "/c/en/nutricious" |
==> | "HasProperty" | "/c/en/ice_cream"  |
==> | "HasProperty" | "/c/en/atangerine" |
==> | "HasProperty" | "/c/en/banana"     |
==> | "HasProperty" | "/c/en/chicken"    |
==> | "HasProperty" | "/c/en/chocolate"  |
==> | "HasProperty" | "/c/en/beef"       |
==> | "HasProperty" | "/c/en/fruit"      |
==> | "HasProperty" | "/c/en/butter"     |
==> | "HasProperty" | "/c/en/meat"       |
==> | "HasProperty" | "/c/en/cake"       |
==> | "HasProperty" | "/c/en/sushi"      |
==> | "HasProperty" | "/c/en/marmite"    |
==> | "HasProperty" | "/c/en/cheese"     |
==> | "HasProperty" | "/c/en/tortilla"   |
==> +------------------------------------+

The graph links “delicious” and “nutricious” (they probably meant nutritious) with a NotIsA relationship, so tasty apparently does not mean healthy. I agree with most of the other things on here… but marmite? Seriously.
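
Both queries in this post have the same shape and differ only in the concept id, so the id can be passed as a parameter. Here is a minimal sketch of doing that from Ruby. It assumes a Neo4j server of this era running locally on the default port 7474 and exposing the legacy Cypher REST endpoint at /db/data/cypher. The `neighbors_payload` and `run_cypher` helpers are hypothetical names for this example and are not part of the import script.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Same pattern as the queries above, with the concept id as a parameter.
NEIGHBORS_QUERY = <<~CYPHER.freeze
  START concept=node:Concepts(id={id})
  MATCH concept-[r]-other_concepts
  RETURN TYPE(r), other_concepts.id
CYPHER

# Build the JSON-ready request body for one concept id.
def neighbors_payload(concept_id)
  { "query" => NEIGHBORS_QUERY, "params" => { "id" => concept_id } }
end

# POST the payload to the (assumed) legacy Cypher endpoint and parse the reply.
def run_cypher(payload, base_url = "http://localhost:7474")
  uri = URI.join(base_url, "/db/data/cypher")
  request = Net::HTTP::Post.new(uri.path,
                                "Content-Type" => "application/json",
                                "Accept" => "application/json")
  request.body = JSON.generate(payload)
  response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
  JSON.parse(response.body)
end

payload = neighbors_payload("/c/en/delicious")
puts JSON.pretty_generate(payload)

# With a server running, each result row should come back as [type, id]:
# run_cypher(payload)["data"].each { |type, id| puts "#{type} #{id}" }
```

Keeping the id out of the query text means the same query string can be reused for sushi, delicious, or any other concept.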

If you want to tackle something a bit bigger, you can look at the YAGO knowledge base, which has 10 million entities (like persons, organizations, cities, etc.) and contains more than 120 million facts about these entities.

13 thoughts on “Knowledge Bases in Neo4j”

  1. Rik Van Bruggen says:

    START
    sushi=node:Concepts(id="/c/en/sushi"),
    beer=node:Concepts(id="/c/en/beer")
    MATCH
    p = AllShortestPaths(sushi-[*..10]-beer)
    return p;

  2. soujanya poria says:

    Hey… is there any chance to retrieve the weight of an assertion?

    • maxdemarzi says:

      Sure. The weights are in the properties.

      • Thanks for your reply. But I faced a problem while developing a Spanish ConceptNet graph. The graph should contain only Spanish concepts and assertions, along with the frequency and weight. Any thoughts on it? Also, I am looking forward to having a Python API to access the database.

        How to load the database?

      • Actually, professor, I want to obtain a concept-by-concept matrix in Spanish, where each cell of the matrix contains the weight of the edge between the two concepts. Can you please let me know the steps to obtain that?
        Thanks in advance.

      • this code returns 0 rows :(

        START table=node:Concepts(id="/c/en/table/")
        MATCH table-[r]-other_concepts
        RETURN table.id, TYPE(r), other_concepts.id

      • Sorry… the previous reply was a mistake, but please let me know about retrieving Spanish concepts.

  3. Rob Speer says:

    That’s great!

    I’m not sure if you know this, but I had *originally* designed ConceptNet 5 data to be indexed in Neo4j, instead of in Solr like it is now. I loved the idea of actually representing the graph as a graph. However, the tools for inserting bulk data into Neo4j just weren’t there at the time. It’s great to see that this is now possible.

    There’s a new release of ConceptNet 5, with multilingual data that is both expanded and cleaned up a bit, if you want it: http://conceptnet5.media.mit.edu/downloads/20130529/

  4. adam says:

    Pretty awesome, Max, as always. How does Neo handle the conversion from hypergraph to property graph? I can’t read Ruby very well—is it converting hyperedges into multiple single edges or introducing intermediate nodes? Or just using multiple labels on the edges?

    • maxdemarzi says:

      I just ignore them. For the actual use of the knowledge, we don’t really need to know what user added the content, just the content itself.

  5. QuestionGuy says:

    Maybe it’s just me, but I can’t find the ‘weight’ (or any other) property within the graph nodes.
    The only property I can see is the “id” …. Any thoughts?

    P.S.
    I’m using the downloaded graph for now, I’m too lazy to (re)install Ruby … Should I try with the csv import?

    Cheers

  6. Olivier says:

    Hello Max and thank you for your awesome work :)

    I’m a novice trying to import ConceptNet 5.2 with your ruby script. Here is the output of the “rake neo4j:load” :

    ——————————————–
    Running the following:
    java -server -Xmx4G -jar ./../batch-import/target/batch-import-jar-with-dependencies.jar neo4j/data/graph.db nodes.csv edges.csv node_index Concepts exact nodes_index.csv
    Usage: Importer data/dir nodes.csv relationships.csv [node_index node-index-name fulltext|exact nodes_index.csv rel_index rel-index-name fulltext|exact rels_index.csv ....]
    Using: Importer neo4j/data/graph.db nodes.csv edges.csv node_index Concepts exact nodes_index.csv

    Using Existing Configuration File
    …………..
    Importing 1456223 Nodes took 2301 seconds
    …….
    Total import time: 10171 seconds
    Exception in thread "main" org.neo4j.graphdb.NotFoundException: id=1488418
    at org.neo4j.unsafe.batchinsert.BatchInserterImpl.getNodeRecord(BatchInserterImpl.java:917)
    at org.neo4j.unsafe.batchinsert.BatchInserterImpl.createRelationship(BatchInserterImpl.java:471)
    at org.neo4j.batchimport.Importer.importRelationships(Importer.java:128)
    at org.neo4j.batchimport.Importer.doImport(Importer.java:206)
    at org.neo4j.batchimport.Importer.main(Importer.java:77)

    ——————————————–

    To me the import looks complete because I see the total import time, but I only have 1456223 nodes instead of the 2500000 mentioned in your article.

    On the other hand, the generated graph.db size is around 250 MB, which is larger than the one you provide, so it looks OK.

    And when I try the sushi queries (yum), I get 0 results.

    I’ve fetched the CSV from http://conceptnet5.media.mit.edu/downloads/current/conceptnet5_csv_20130917.tar.bz2, size is 176,3 MB.

    The size is relatively small compared to the other formats available, so maybe the 5.2 CSV archive is incomplete, hence the exception? It contains parts named part_12.csv to part19.csv.

    Can you confirm that the problem is the current CSV archive, or am I doing it wrong?

    Thank you !
