
If you’ve been following along, we got Michael’s Batch Importer, compiled it, created some test data, ran it and saw millions of nodes and relationships loaded into Neo4j.
So now we’re ready for our own data. I am going to show you how to get data from a Relational Database like PostgreSQL into a format we can use. If you’re using SQL Server, MySQL, Oracle, etc, the directions will be slightly different, but you’ll get the picture.
So let’s say we have an identities table with the following layout:
Table "public.identities"
Column | Type | Modifiers
----------------------+-----------------------------+-------------------------------------
id | character varying(36) | not null default uuid_generate_v4()
nickname | character varying(255) |
We need to assign reference ids to the items in the tables that will become our nodes. The row_number window function is perfectly suited for this trick.
SELECT row_number() OVER (ORDER BY id), id, nickname FROM identities ORDER BY id LIMIT 10;
row_number | id | nickname
------------+--------------------------------------+------------
1 | 00004b82-86de-4757-bf9e-9bebabbd234e | clkao
2 | 0000539c-aec2-45ba-b353-22fad81c1ae2 | Erre
3 | 00006460-2834-466a-88b4-7583bdf3142f | nanha
4 | 0000b675-e9fc-4091-bb2c-f2e6388220b9 | joemocha
5 | 0001951c-03dd-4c70-8e67-ff8ffb11c628 | BlueHotDog
Let’s go ahead and create a new table with this data by using the INTO command:
SELECT row_number() OVER (ORDER BY id) as node_id, id, nickname INTO iden_nodes FROM identities ORDER BY id;
Getting the nodes was easy enough… how about relationships? In our database we have a join table called vouches where our identities relate to each other.
Table "public.vouches"
Column | Type | Modifiers
----------------+-----------------------------+-------------------------------------------------
id | character varying(36) | not null default uuid_generate_v4()
requester_id | character varying(36) | not null
grantor_id | character varying(36) | not null
term_id | character varying(36) | not null
status | character varying(10) | not null default 'requested'::character varying
Grantors vouch for Requesters using a Term. The name of the Term is stored in another table, but the relationship has a property called status. Grantors and Requesters are Identities and we need to join them to their reference ids we created earlier. The format we want is start – end – relationship type – relationship property_1…
So we can build that with:
SELECT FN.node_id AS start, TN.node_id AS end, T.name AS type, V.status
INTO vouch_rels
FROM vouches AS V
INNER JOIN iden_nodes AS FN ON V.grantor_id = FN.id
INNER JOIN iden_nodes AS TN ON v.requester_id = TN.id
INNER JOIN terms AS T ON V.term_id = T.id
ORDER BY grantor_id, requester_id;
start | end | type | status
-------+-------+------------+-----------
1 | 1 | Perl | confirmed
1 | 1 | JavaScript | confirmed
1 | 4096 | Perl | confirmed
1 | 12338 | JavaScript | confirmed
1 | 12338 | Perl | confirmed
In this case we have about 75 terms that I will use as relationship types. If I had 40,000 terms, I would use a type of “vouched_for” and make the term name one of the properties of the relationship. Now we can get the data out of these tables and into files the batch importer can use.
psql -c "copy (SELECT id, nickname FROM iden_nodes ORDER BY node_id) TO stdout csv header delimiter E'\t'" -o mynodes.csv -d prod_getvouched
The command above will create a mynodes.csv file that looks like:
head -5 mynodes.csv id nickname 00004b82-86de-4757-bf9e-9bebabbd234e clkao 0000539c-aec2-45ba-b353-22fad81c1ae2 Erre 00006460-2834-466a-88b4-7583bdf3142f nanha 0000b675-e9fc-4091-bb2c-f2e6388220b9 joemocha
Notice we did not export the reference number since we do not need to have it as a property of our nodes. Now let’s get relationships.
psql -c "copy (SELECT start, \"end\", type, status FROM vouch_rels ORDER BY start) TO stdout csv header delimiter E'\t'" -o myrels.csv -d prod_getvouched
We had to escape the end column since it is a keyword, and myrels.csv contains:
head -5 myrels.csv start end type status 1 1 JavaScript confirmed 1 1 Perl confirmed 1 12338 Perl confirmed 1 25879 Perl confirmed
Let’s go ahead and stop the Neo4j instance we have running and delete it’s database.
rake neo4j:stop rm -rf neo4j/data/graph.db
Then we’ll run our batch import command putting the database in neo4j directly.
java -server -Xmx4G -jar target/batch-import-jar-with-dependencies.jar neo4j/data/graph.db mynodes.csv myrels.csv
…and we’ll go get another soda…whoops, no time, it’s done.
.. Importing 295951 Nodes took 1 seconds ............ Importing 1218043 Relationships took 4 seconds 9 seconds
…and we’ll restart our Neo4j server.
rake neo4j:start
Then we fire up our Neo4j Data Browser:

Awesome. This works great for social networks where the nodes are all in the same table…but what about if you are building a recommendation system and you have identities and items as nodes?
No problem. You follow the same technique to create your first nodes table and then…
SELECT max(node_id) from iden_nodes; 295951 SELECT 295951 + row_number() OVER (ORDER BY id), id, name, price FROM items ORDER BY id;
You just start from where your other node table left off. You can use UNION ALL to put them together on your export.
Give it a shot and let me know if you run into any trouble and I’ll be glad to help.
Thanks for the blog posts.
The best follow-up post would be to access the postgresql db from java via jdbc and pumping that data directly into neo4j.
Michael,
I’m trying to import data using Spring Data Neo4j, because I want to let the graph database prepared for use with Spring Data Neo4j because the code is more simple, readable and easy to mantain, but I have a serious performance problems, specially when I look up for a node to set a relation.
Could you recommend me what I should do if I want to let the graph database prepared for use it in Spring Data Neo4j and improve considerably the import performance?
[...] everything should be working correctly. In part 2 of this series, I’ll show you how to write some SQL queries to get your data into Neo4j. Share [...]
This is great. What are you opinions on the correct way to keep the two databases synchronized several times an hour?
[...] Batch Importer – Part 2: Use of SQL to prepare files for import. [...]