
Data is everywhere… all around us, but sometimes the medium it is stored in can be a problem when analyzing it. Chances are you have a ton of data sitting around in a relational database in your current application… or you have begged, borrowed or scraped to get the data from somewhere and now you want to use Neo4j to find how this data is related.
Michael Hunger wrote a batch importer to load CSV data quickly, but for some reason it hasn’t received a lot of love. We’re going to change that today, and I’m going to walk you through getting your data out of tables and into nodes and edges.
Let’s clone the project and jump in.
git clone git://github.com/jexp/batch-import.git
cd batch-import
It uses Maven, so if you haven’t already, go ahead and install it.
sudo apt-get install maven2
Now let’s assemble the project per the instructions:
mvn clean compile assembly:single
If you did it right, you should see:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 47 seconds
[INFO] Finished at: Tue Feb 28 15:50:14 UTC 2012
[INFO] Final Memory: 13M/33M
[INFO] ------------------------------------------------------------------------
Awesome… let’s create some test data. Michael packed in a data generator, so let’s compile it and run it.
javac ./src/test/java/TestDataGenerator.java -d .
java TestDataGenerator
It will take a little while, and then you should see this:
Creating 7500000 and 41242882 Relationships took 13 seconds.
Really? Where?
ls -al
-rw-r--r-- 1 max max 111388909 2012-02-28 16:11 nodes.csv
-rw-r--r-- 1 max max 1217775358 2012-02-28 16:11 rels.csv
So what’s in nodes.csv?
head -5 nodes.csv
Node	Rels	Property
0	4	TEST
1	0	TEST
2	1	TEST
3	1	TEST
The format is property_1, property_2, property_3 separated by tabs… and rels.csv:
head -5 rels.csv
Start	Ende	Type	Property
5496772	6842185	FIVE	Property
7416995	6166503	FOUR	Property
6712458	6853172	THREE	Property
1291639	296708	TWO	Property
The format is start node reference, end node reference, relationship type, property_1, also separated by tabs.
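If you want to see both layouts side by side without wading through millions of rows, a tiny hand-made pair of files can help. This is only a sketch: the sample-data directory and the three nodes and two relationships in it are made up, and it assumes a relationship refers to a node by its row position, with the first data row as 0, which is what the Node column of the generated file suggests. Check the importer's README before relying on that.

```shell
# Hypothetical scratch directory, so the generated nodes.csv and rels.csv are left alone.
mkdir -p sample-data

# nodes.csv: a header row, then one node per row, columns separated by tabs.
printf 'Node\tRels\tProperty\n'  > sample-data/nodes.csv
printf '0\t1\tTEST\n'           >> sample-data/nodes.csv
printf '1\t1\tTEST\n'           >> sample-data/nodes.csv
printf '2\t0\tTEST\n'           >> sample-data/nodes.csv

# rels.csv: start reference, end reference, relationship type, one property.
# ASSUMPTION: the references are node row positions counted from 0.
printf 'Start\tEnde\tType\tProperty\n'  > sample-data/rels.csv
printf '0\t1\tONE\tProperty\n'         >> sample-data/rels.csv
printf '1\t2\tTWO\tProperty\n'         >> sample-data/rels.csv

# Show both files.
cat sample-data/nodes.csv sample-data/rels.csv
```

Pointing the importer at these two files instead of the generated ones should make a formatting mistake much easier to spot than it is in a 1.2 GB file.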
Now we are ready to try out this test data. Run the command:
java -server -Xmx4G -jar target/batch-import-jar-with-dependencies.jar target/db nodes.csv rels.csv
…and go grab a soda or a cup of coffee unless you happen to like watching dots on the screen, as this will take a minute or three depending on your hardware. If you are doing this test on an EC2 c1.medium instance it ain’t gonna work (trust me, I know), so do it on a box with at least 4 GB of RAM:
Importing 7500000 Nodes took 17 seconds
Lots of dots....
Importing 41242882 Relationships took 164 seconds
203 seconds
Ok so where is it?
ls -al target/db
-rw-r--r-- 1 max max 67500025 2012-02-28 08:58 neostore.nodestore.db
-rw-r--r-- 1 max max 1998458182 2012-02-28 08:58 neostore.propertystore.db
-rw-r--r-- 1 max max 1361015130 2012-02-28 08:58 neostore.relationshipstore.db
...and a bunch of other files.
Great. Now assuming you have my Neography gem installed, let’s get a fresh copy of Neo4j and put these in there.
echo "require 'neography/tasks'" >> Rakefile
rake neo4j:install
mv target/db neo4j/data/graph.db
rake neo4j:start
Go to your Neo4j Dashboard and take a look:

Now everything should be working correctly. In part 2 of this series, I’ll show you how to write some SQL queries to get your data into Neo4j.
Hi Max,
My thesis work requires filling a Neo4j server instance with at least 1M nodes (plus their relationships) as quickly as possible. (I am using Neo4j server instead of embedded, as I need to communicate between servers running on different machines.)
I tried the REST API batch operations (via Neography), but I realised that is not the way to go. Then I found your entry, and now I am trying to use the batch importer. It works, but it takes too much time. My testbed is an AWS Large instance with 7.5 GB of RAM and 2 virtual cores.
As a comparison, you wrote that “Importing 7500000 Nodes took 17 seconds”; the same value for me is 8 times larger, 138 seconds.
The batch importer has been running for 2.5 hours and is still printing dots, but the last and only thing it printed out was “Importing 7500000 Nodes took 138 seconds”.
Do you have any idea what slows down the operation?
Could you please share your test configuration?
Thanks a lot for your great blog and for neography…
Hi Volkan,
Did you ever solve this issue? I’m facing the exact same problem: I want to add a lot of data to a remote Neo4j Server instance, and I don’t want to (or can’t) shut down the DB for that or take the embedded approach. Did you have any luck in the end?
Thanks!
Erik
Volkan,
2.5 hours? Something is not right. Do your nodes and relationships have a ton of properties? Can you check inside the graph.db folder being created and see the file sizes growing? Are you indexing (that’s a bit slower than creating nodes and relationships)? Post your answers on the neo4j google forum and we’ll figure this out.
Thanks,
Max
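One way to check whether the store files are still growing is to sample the size of the output directory a few times. What follows is a minimal sketch rather than anything from the batch importer itself: the store_growth function name and its default arguments are made up, and target/db is assumed to be the output directory passed to the importer.

```shell
# Hypothetical helper: print the total size of a store directory several
# times, pausing between samples, so you can see whether it is still growing.
# Arguments (all optional): directory, number of samples, seconds between samples.
store_growth() {
  dir="${1:-target/db}"
  samples="${2:-5}"
  pause="${3:-10}"
  i=1
  while [ "$i" -le "$samples" ]; do
    # du -sk reports the directory total in kilobytes.
    size_kb="$(du -sk "$dir" | awk '{print $1}')"
    echo "sample $i: ${size_kb} KB"
    if [ "$i" -lt "$samples" ]; then
      sleep "$pause"
    fi
    i=$((i + 1))
  done
}
```

Running something like store_growth target/db 6 30 in a second terminal while the import is going gives six readings half a minute apart. If the number stops changing, the import has probably stalled rather than just being slow.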
Hi Max,
I was trying to install this using maven as your instructions suggest but I’m getting the following error:
C:\Users\GBS\git\batch-import>mvn clean compile assembly:single
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Simple Batch Importer 0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[WARNING] The POM for org.neo4j:neo4j-kernel:jar:1.8-SNAPSHOT is missing, no dependency information available
[WARNING] The POM for org.neo4j:neo4j-lucene-index:jar:1.8-SNAPSHOT is missing, no dependency information available
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 0.453s
[INFO] Finished at: Sat Oct 20 18:38:24 EDT 2012
[INFO] Final Memory: 6M/77M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project batch-import: Could not resolve dependencies for project org.neo4j:batch-import:jar:0.1-SNAPSHOT: The following artifacts could not be resolved: org.neo4j:neo4j-kernel:jar:1.8-SNAPSHOT, org.neo4j:neo4j-lucene-index:jar:1.8-SNAPSHOT: Failure to find org.neo4j:neo4j-kernel:jar:1.8-SNAPSHOT in http://m2.neo4j.org/content/repositories/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of Neo4j Snapshots has elapsed or updates are forced -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
C:\Users\GBS\git\batch-import>
I’ve searched and can’t seem to find anything concerning the error above. Hopefully you can point me in the right direction.
Thanks
Open up the pom.xml and change the two 1.8-SNAPSHOT to 1.8. Michael will update his repo shortly.
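For anyone who would rather not hand-edit the file, the same change could be sketched as a small shell helper. The function name pin_neo4j_version is made up here, and the sketch assumes 1.8-SNAPSHOT appears in pom.xml only in the places that need changing; it keeps a pom.xml.bak copy in case that turns out to be wrong.

```shell
# Hypothetical helper: rewrite every 1.8-SNAPSHOT version in a pom file to 1.8.
# The project's own 0.1-SNAPSHOT version does not match the pattern and is left alone.
pin_neo4j_version() {
  pom="${1:-pom.xml}"
  # Keep a backup, then write the edited copy over the original.
  cp "$pom" "$pom.bak"
  sed 's/1\.8-SNAPSHOT/1.8/g' "$pom.bak" > "$pom"
}

# Only touch pom.xml if we are actually inside the project directory.
if [ -f pom.xml ]; then
  pin_neo4j_version pom.xml
fi
```

After running it from the batch-import directory, comparing pom.xml with pom.xml.bak using diff is a quick way to confirm that only the two Neo4j versions changed before rebuilding.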
Thanks Max! That got it
Hi
Tried to compile TestDataGenerator. Initially couldn’t find the file, then found it in /src/test/java/org/neo4j/batchimport/TestDataGenerator.java.
But then got compile errors:
./src/test/java/org/neo4j/batchimport/TestDataGenerator.java:3: package org.junit does not exist
import org.junit.Ignore;
^
./src/test/java/org/neo4j/batchimport/TestDataGenerator.java:14: cannot find symbol
symbol: class Ignore
@Ignore
^
Help please!!
Enzo
Enzo,
Can you post your error to https://groups.google.com/forum/?fromgroups#!forum/neo4j ?
We can better help you there.
Regards,
Max
Hi,
is there a release (JAR file) available somewhere? Building it is such a pain…thanks!
You can grab this one from my public dropbox => https://dl.dropbox.com/u/57740873/batch-import-jar-with-dependencies.jar
Thanks! For those who rarely use Maven projects, it’s a real help.
Anyway, here is how to do it with NetBeans:
- clone the project
- open it in NetBeans
- right-click on the project name, select Properties, then the Actions panel
- select Build with Dependencies and add this goal to the Execute Goals setting: ‘assembly:single’
- also add the property Skip Tests
- press OK
- right-click again on the project name and select Resolve Problems at the bottom to download the dependencies
- right-click again and select Build with Dependencies
cheers,
Seb
Max, thanks for all these tutorials. Have you noticed that the batch-import tool does not support UTF-8 encoding? No accents, no non-English characters at all; this is a massive problem for many of us. I have already raised the issue on GitHub. Do you have any idea how to make it work?