Batch Importer – Part 2

If you’ve been following along, we got Michael’s Batch Importer, compiled it, created some test data, ran it and saw millions of nodes and relationships loaded into Neo4j.

So now we’re ready for our own data. I am going to show you how to get data from a Relational Database like PostgreSQL into a format we can use. If you’re using SQL Server, MySQL, Oracle, etc, the directions will be slightly different, but you’ll get the picture.

So let’s say we have an identities table with the following layout:

                                Table "public.identities"
        Column        |            Type             |              Modifiers
----------------------+-----------------------------+-------------------------------------
 id                   | character varying(36)       | not null default uuid_generate_v4()
 nickname             | character varying(255)      |

We need to assign reference ids to the items in the tables that will become our nodes. The row_number window function is perfectly suited for this trick.

SELECT row_number() OVER (ORDER BY id), id, nickname FROM identities ORDER BY id LIMIT 10;

 row_number |                  id                  |  nickname  
------------+--------------------------------------+------------
          1 | 00004b82-86de-4757-bf9e-9bebabbd234e | clkao      
          2 | 0000539c-aec2-45ba-b353-22fad81c1ae2 | Erre       
          3 | 00006460-2834-466a-88b4-7583bdf3142f | nanha      
          4 | 0000b675-e9fc-4091-bb2c-f2e6388220b9 | joemocha   
          5 | 0001951c-03dd-4c70-8e67-ff8ffb11c628 | BlueHotDog 

Let’s go ahead and create a new table with this data by using the INTO command:

SELECT row_number() OVER (ORDER BY id) as node_id, id, nickname INTO iden_nodes FROM identities ORDER BY id;

Getting the nodes was easy enough… how about relationships? In our database we have a join table called vouches where our identities relate to each other.

                                     Table "public.vouches"
     Column     |            Type             |                    Modifiers
----------------+-----------------------------+-------------------------------------------------
 id             | character varying(36)       | not null default uuid_generate_v4()
 requester_id   | character varying(36)       | not null
 grantor_id     | character varying(36)       | not null
 term_id        | character varying(36)       | not null
 status         | character varying(10)       | not null default 'requested'::character varying

Grantors vouch for Requesters using a Term. The name of the Term is stored in another table, but the relationship has a property called status. Grantors and Requesters are Identities and we need to join them to their reference ids we created earlier. The format we want is start – end – relationship type – relationship property_1…

So we can build that with:

SELECT FN.node_id AS start, TN.node_id AS end, T.name AS type, V.status
INTO vouch_rels
FROM vouches AS V
INNER JOIN iden_nodes AS FN ON V.grantor_id = FN.id
INNER JOIN iden_nodes AS TN ON v.requester_id = TN.id
INNER JOIN terms AS T ON V.term_id = T.id
ORDER BY grantor_id, requester_id;

 start |  end  |    type    |  status
-------+-------+------------+-----------
     1 |     1 | Perl       | confirmed
     1 |     1 | JavaScript | confirmed
     1 |  4096 | Perl       | confirmed
     1 | 12338 | JavaScript | confirmed
     1 | 12338 | Perl       | confirmed

In this case we have about 75 terms that I will use as relationship types. If I had 40,000 terms, I would use a type of “vouched_for” and make the term name one of the properties of the relationship. Now we can get the data out of these tables and into files the batch importer can use.

psql -c "copy (SELECT id, nickname FROM iden_nodes ORDER BY node_id) TO stdout csv header delimiter E'\t'" -o mynodes.csv -d prod_getvouched 

The command above will create a mynodes.csv file that looks like:

head -5 mynodes.csv

id      nickname  
00004b82-86de-4757-bf9e-9bebabbd234e    clkao   
0000539c-aec2-45ba-b353-22fad81c1ae2    Erre    
00006460-2834-466a-88b4-7583bdf3142f    nanha   
0000b675-e9fc-4091-bb2c-f2e6388220b9    joemocha

Notice we did not export the reference number since we do not need to have it as a property of our nodes. Now let’s get relationships.

 psql -c "copy (SELECT start, \"end\", type, status FROM vouch_rels ORDER BY start) TO stdout csv header delimiter E'\t'" -o myrels.csv -d prod_getvouched

We had to escape the end column since it is a keyword, and myrels.csv contains:

head -5 myrels.csv

start   end     type    status
1       1       JavaScript      confirmed
1       1       Perl    confirmed
1       12338   Perl    confirmed
1       25879   Perl    confirmed

Let’s go ahead and stop the Neo4j instance we have running and delete it’s database.

rake neo4j:stop
rm -rf neo4j/data/graph.db

Then we’ll run our batch import command putting the database in neo4j directly.

java -server -Xmx4G -jar target/batch-import-jar-with-dependencies.jar neo4j/data/graph.db mynodes.csv myrels.csv

…and we’ll go get another soda…whoops, no time, it’s done.

..
Importing 295951 Nodes took 1 seconds
............
Importing 1218043 Relationships took 4 seconds
9 seconds

…and we’ll restart our Neo4j server.

rake neo4j:start

Then we fire up our Neo4j Data Browser:

Awesome. This works great for social networks where the nodes are all in the same table…but what about if you are building a recommendation system and you have identities and items as nodes?

No problem. You follow the same technique to create your first nodes table and then…

SELECT max(node_id) from iden_nodes;
295951

SELECT 295951 + row_number() OVER (ORDER BY id), id, name, price FROM items ORDER BY id;

You just start from where your other node table left off. You can use UNION ALL to put them together on your export.

Give it a shot and let me know if you run into any trouble and I’ll be glad to help.

Tagged , , ,

5 thoughts on “Batch Importer – Part 2

  1. Thanks for the blog posts.

    The best follow-up post would be to access the postgresql db from java via jdbc and pumping that data directly into neo4j.

    • mapomi says:

      Michael,

      I’m trying to import data using Spring Data Neo4j, because I want to let the graph database prepared for use with Spring Data Neo4j because the code is more simple, readable and easy to mantain, but I have a serious performance problems, specially when I look up for a node to set a relation.

      Could you recommend me what I should do if I want to let the graph database prepared for use it in Spring Data Neo4j and improve considerably the import performance?

  2. [...] everything should be working correctly. In part 2 of this series, I’ll show you how to write some SQL queries to get your data into Neo4j. Share [...]

  3. Justin says:

    This is great. What are you opinions on the correct way to keep the two databases synchronized several times an hour?

  4. [...] Batch Importer – Part 2: Use of SQL to prepare files for import. [...]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 229 other followers