Importing Wikipedia into Neo4j with Graphipedia

Wouldn’t it be cool to import Wikipedia into Neo4j?

Mirko Nasato thought so, and built graphipedia using the batch importer that does just that.

It’s written in Java, so if you’re a pure ruby guy, I’ll walk you through the steps.

Let’s clone the project and jump in.

git clone git://github.com/mirkonasato/graphipedia.git
cd graphipedia

If you look in here you’ll see a pom.xml file which means you’ll need to download Maven and build the project.

sudo apt-get install maven2
mvn install

You’ll see a bunch of stuff flying by, that’s just the dependencies being downloaded. At the end you should see this:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] ------------------------------------------------------------------------
[INFO] Graphipedia Parent .................................... SUCCESS [1:08.932s]
[INFO] Graphipedia DataImport ................................ SUCCESS [1:16.018s]
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2 minutes 25 seconds
[INFO] Finished at: Thu Feb 16 11:36:55 CST 2012
[INFO] Final Memory: 28M/434M
[INFO] ------------------------------------------------------------------------

Ok, so now let’s get the file from wikipedia we need. You can download it with wget.

wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Whoa, hold up. That’s a 7.6 G file… can we try a smaller data set first?

Sure. Let’s go with Lea faka-Tonga ’cause it just sounds cool…and we’ll unzip it.

wget http://dumps.wikimedia.org/towiki/latest/towiki-latest-pages-articles.xml.bz2
bzip2 -d towiki-latest-pages-articles.xml.bz2

It is a two step process, so first lets create a smaller intermediate XML file containing page titles and links only:

java -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.ExtractLinks towiki-latest-pages-articles.xml towiki-links.xml

You should see:

Parsing pages and extracting links...
..
2835 pages parsed in 0 seconds.

Then we run the batch importer on this file and dump the contents on to the graphdb directory:

java -Xmx3G -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.neo4j.ImportGraph towiki-links.xml graph.db

You should see:

Importing pages...
..
2835 pages imported in 0 seconds.
Importing links...
.....
5799 links imported in 0 seconds; 6383 broken links ignored

Go inside and take a look and you’ll see our neostore files.

cd graph.db
ls

You can copy this folder over any existing neo4j database by overwriting the /neo4j/data/graph.db folder and enjoy.

Tagged , , ,

5 thoughts on “Importing Wikipedia into Neo4j with Graphipedia

  1. Demetrius says:

    $ java -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.ExtractLinks towiki-latest-pages-articles.xml towiki-links.xml

    Exception in thread “main” java.lang.NoClassDefFoundError: org/graphipedia/dataimport/ExtractLinks
    Caused by: java.lang.ClassNotFoundException: org.graphipedia.dataimport.ExtractLinks
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

    Love the post, trying to follow along. I’m on OSX 10.7 and get the above error. Any Ideas?

    • maxdemarzi says:

      Hum… make sure after you run “mvn install” that the target directory exists and the jar files are there.

      Then run the command from the graphipedia dir, not the graphipedia-dataimport dir.

  2. Esther D. says:

    How do I run this code on Eclipse so I do not have to install maven?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: