Wouldn’t it be cool to import Wikipedia into Neo4j?
Mirko Nasato thought so, and built graphipedia, which uses the Neo4j batch importer to do just that.
It’s written in Java, so if you’re a pure Ruby guy, I’ll walk you through the steps.
Let’s clone the project and jump in.
git clone git://github.com/mirkonasato/graphipedia.git
cd graphipedia
If you look in here you’ll see a pom.xml file, which means you’ll need to install Maven and build the project.
sudo apt-get install maven2
mvn install
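That’s the Ubuntu incantation; on OS X, Homebrew should do the trick:

brew install maven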
You’ll see a bunch of stuff flying by; that’s just the dependencies being downloaded. At the end you should see this:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] ------------------------------------------------------------------------
[INFO] Graphipedia Parent .................................... SUCCESS [1:08.932s]
[INFO] Graphipedia DataImport ................................ SUCCESS [1:16.018s]
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2 minutes 25 seconds
[INFO] Finished at: Thu Feb 16 11:36:55 CST 2012
[INFO] Final Memory: 28M/434M
[INFO] ------------------------------------------------------------------------
Ok, so now let’s get the file we need from Wikipedia. You can download it with wget.
wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Whoa, hold up. That’s a 7.6 GB file… can we try a smaller dataset first?
Sure. Let’s go with Lea faka-Tonga (the Tongan-language Wikipedia) ’cause it just sounds cool… and we’ll unzip it.
wget http://dumps.wikimedia.org/towiki/latest/towiki-latest-pages-articles.xml.bz2
bzip2 -d towiki-latest-pages-articles.xml.bz2
It’s a two-step process, so first let’s create a smaller intermediate XML file containing page titles and links only:
java -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.ExtractLinks towiki-latest-pages-articles.xml towiki-links.xml
You should see:
Parsing pages and extracting links...
..
2835 pages parsed in 0 seconds.
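Curious what’s in that intermediate file? It’s just page titles with the titles of the pages they link to, something along these lines (the element names here are made up for illustration; the actual tags ExtractLinks writes may differ):

<pages>
  <!-- element names are illustrative, not necessarily ExtractLinks’ actual tags -->
  <page>
    <title>Some page</title>
    <link>Another page</link>
    <link>Yet another page</link>
  </page>
</pages>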
Then we run the batch importer on this file and dump the contents into the graph.db directory:
java -Xmx3G -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.neo4j.ImportGraph towiki-links.xml graph.db
You should see:
Importing pages...
..
2835 pages imported in 0 seconds.
Importing links...
.....
5799 links imported in 0 seconds; 6383 broken links ignored
Go inside and take a look; you’ll see our neostore files.
cd graph.db
ls
You can drop this graph into any existing Neo4j installation by overwriting its /neo4j/data/graph.db folder, and enjoy.
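If you’d rather poke at the data from code before wiring it into a server, here’s a minimal sketch using the Neo4j 1.x embedded API to open the store and count what made it in. Nothing here is graphipedia-specific; it just opens the directory the importer wrote:

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class PeekAtGraph {
    public static void main(String[] args) {
        // Point this at the graph.db directory the importer created
        GraphDatabaseService db = new EmbeddedGraphDatabase("graph.db");
        try {
            long nodes = 0;
            long rels = 0;
            // Walk every node and count its outgoing relationships
            for (Node node : db.getAllNodes()) {
                nodes++;
                for (Relationship rel : node.getRelationships(Direction.OUTGOING)) {
                    rels++;
                }
            }
            // getAllNodes() includes Neo4j's reference node, so the node
            // count can be off by one compared to the importer's output
            System.out.println(nodes + " nodes, " + rels + " relationships");
        } finally {
            db.shutdown();
        }
    }
}

Compile and run it with ./graphipedia-dataimport/target/graphipedia-dataimport.jar on the classpath; since the import commands above run with just that jar, it evidently bundles the Neo4j classes it needs.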
$ java -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.ExtractLinks towiki-latest-pages-articles.xml towiki-links.xml
Exception in thread "main" java.lang.NoClassDefFoundError: org/graphipedia/dataimport/ExtractLinks
Caused by: java.lang.ClassNotFoundException: org.graphipedia.dataimport.ExtractLinks
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Love the post, trying to follow along. I’m on OSX 10.7 and get the above error. Any ideas?
Hmm… after you run "mvn install", make sure the target directory exists and the jar files are there.
Then run the command from the graphipedia dir, not the graphipedia-dataimport dir.
Nailed it! "brew install maven" doesn’t install Maven for you apparently :P
How do I run this code in Eclipse so I don’t have to install Maven?