Batch Importer – Part 2

If you’ve been following along, we got Michael’s Batch Importer, compiled it, created some test data, ran it and saw millions of nodes and relationships loaded into Neo4j.

So now we’re ready for our own data. I am going to show you how to get data from a Relational Database like PostgreSQL into a format we can use. If you’re using SQL Server, MySQL, Oracle, etc, the directions will be slightly different, but you’ll get the picture.

So let’s say we have an identities table with the following layout:

                                Table "public.identities"
        Column        |            Type             |              Modifiers
----------------------+-----------------------------+-------------------------------------
 id                   | character varying(36)       | not null default uuid_generate_v4()
 nickname             | character varying(255)      |

We need to assign reference ids to the items in the tables that will become our nodes. The row_number window function is perfectly suited for this trick.

SELECT row_number() OVER (ORDER BY id), id, nickname FROM identities ORDER BY id LIMIT 10;

 row_number |                  id                  |  nickname  
------------+--------------------------------------+------------
          1 | 00004b82-86de-4757-bf9e-9bebabbd234e | clkao      
          2 | 0000539c-aec2-45ba-b353-22fad81c1ae2 | Erre       
          3 | 00006460-2834-466a-88b4-7583bdf3142f | nanha      
          4 | 0000b675-e9fc-4091-bb2c-f2e6388220b9 | joemocha   
          5 | 0001951c-03dd-4c70-8e67-ff8ffb11c628 | BlueHotDog 

Let’s go ahead and create a new table with this data by using the INTO command:

SELECT row_number() OVER (ORDER BY id) as node_id, id, nickname INTO iden_nodes FROM identities ORDER BY id;

Getting the nodes was easy enough… how about relationships? In our database we have a join table called vouches where our identities relate to each other.

                                     Table "public.vouches"
     Column     |            Type             |                    Modifiers
----------------+-----------------------------+-------------------------------------------------
 id             | character varying(36)       | not null default uuid_generate_v4()
 requester_id   | character varying(36)       | not null
 grantor_id     | character varying(36)       | not null
 term_id        | character varying(36)       | not null
 status         | character varying(10)       | not null default 'requested'::character varying

Grantors vouch for Requesters using a Term. The name of the Term is stored in another table, but the relationship has a property called status. Grantors and Requesters are Identities and we need to join them to their reference ids we created earlier. The format we want is start – end – relationship type – relationship property_1…

So we can build that with:

SELECT FN.node_id AS start, TN.node_id AS end, T.name AS type, V.status
INTO vouch_rels
FROM vouches AS V
INNER JOIN iden_nodes AS FN ON V.grantor_id = FN.id
INNER JOIN iden_nodes AS TN ON v.requester_id = TN.id
INNER JOIN terms AS T ON V.term_id = T.id
ORDER BY grantor_id, requester_id;

 start |  end  |    type    |  status
-------+-------+------------+-----------
     1 |     1 | Perl       | confirmed
     1 |     1 | JavaScript | confirmed
     1 |  4096 | Perl       | confirmed
     1 | 12338 | JavaScript | confirmed
     1 | 12338 | Perl       | confirmed

In this case we have about 75 terms that I will use as relationship types. If I had 40,000 terms, I would use a type of “vouched_for” and make the term name one of the properties of the relationship. Now we can get the data out of these tables and into files the batch importer can use.

psql -c "copy (SELECT id, nickname FROM iden_nodes ORDER BY node_id) TO stdout csv header delimiter E'\t'" -o mynodes.csv -d prod_getvouched 

The command above will create a mynodes.csv file that looks like:

head -5 mynodes.csv

id      nickname  
00004b82-86de-4757-bf9e-9bebabbd234e    clkao   
0000539c-aec2-45ba-b353-22fad81c1ae2    Erre    
00006460-2834-466a-88b4-7583bdf3142f    nanha   
0000b675-e9fc-4091-bb2c-f2e6388220b9    joemocha

Notice we did not export the reference number since we do not need to have it as a property of our nodes. Now let’s get relationships.

 psql -c "copy (SELECT start, \"end\", type, status FROM vouch_rels ORDER BY start) TO stdout csv header delimiter E'\t'" -o myrels.csv -d prod_getvouched

We had to escape the end column since it is a keyword, and myrels.csv contains:

head -5 myrels.csv

start   end     type    status
1       1       JavaScript      confirmed
1       1       Perl    confirmed
1       12338   Perl    confirmed
1       25879   Perl    confirmed

Let’s go ahead and stop the Neo4j instance we have running and delete it’s database.

rake neo4j:stop
rm -rf neo4j/data/graph.db

Then we’ll run our batch import command putting the database in neo4j directly.

java -server -Xmx4G -jar target/batch-import-jar-with-dependencies.jar neo4j/data/graph.db mynodes.csv myrels.csv

…and we’ll go get another soda…whoops, no time, it’s done.

..
Importing 295951 Nodes took 1 seconds
............
Importing 1218043 Relationships took 4 seconds
9 seconds

…and we’ll restart our Neo4j server.

rake neo4j:start

Then we fire up our Neo4j Data Browser:

Awesome. This works great for social networks where the nodes are all in the same table…but what about if you are building a recommendation system and you have identities and items as nodes?

No problem. You follow the same technique to create your first nodes table and then…

SELECT max(node_id) from iden_nodes;
295951

SELECT 295951 + row_number() OVER (ORDER BY id), id, name, price FROM items ORDER BY id;

You just start from where your other node table left off. You can use UNION ALL to put them together on your export.

Give it a shot and let me know if you run into any trouble and I’ll be glad to help.

Tagged , , ,

31 thoughts on “Batch Importer – Part 2

  1. Thanks for the blog posts.

    The best follow-up post would be to access the postgresql db from java via jdbc and pumping that data directly into neo4j.

    • mapomi says:

      Michael,

      I’m trying to import data using Spring Data Neo4j, because I want to let the graph database prepared for use with Spring Data Neo4j because the code is more simple, readable and easy to mantain, but I have a serious performance problems, specially when I look up for a node to set a relation.

      Could you recommend me what I should do if I want to let the graph database prepared for use it in Spring Data Neo4j and improve considerably the import performance?

  2. [...] everything should be working correctly. In part 2 of this series, I’ll show you how to write some SQL queries to get your data into Neo4j. Share [...]

  3. Justin says:

    This is great. What are you opinions on the correct way to keep the two databases synchronized several times an hour?

  4. [...] Batch Importer – Part 2: Use of SQL to prepare files for import. [...]

  5. Thanks, thats great!
    We want to import the users and groups from active directory for different projects. We want to do this on regular basis every 120 minutes. This could be up to 100,000 users and 500,000 groups per import.
    Can you give us a recomendation?

    Regards, Thomas

    • maxdemarzi says:

      As you can see above 300k nodes, and 1.2 Million relationships took 9 seconds to import into Neo4j, so you won’t have an issue there. If you can import your users and groups into just two tables on MS SQL Server with a numeric ID key as your primary key, you should be able to re-use that as your node id, and then just add the max value to the groups numeric ID key when doing the relationships. You can create a SSIS job to handle the ETL, or use something like bcp.

      • Hi,
        we get the nodes and the relationships directly over the Active Directory API with his properties. Is that your recomendation:
        1. build to CSV files for importing(without SQL-Server between): one for the Nodes with prop. and the second one for the relationsships?
        In the relationship file we use the “SID” property für START and END?
        Is there any indexing nessescary in the mean time?

  6. Feedback from importer?
    We want to start the Importer from a .net application. How we get the Feedback, that the import is ended?

    • maxdemarzi says:

      Thomas… I’m not sure how you’d get feedback from the importer into .net… I’m mostly a Ruby guy, so maybe somebody else can chime in. As far as your earlier question, I like having a staging database so I can do a proper ETL. You can clean up your data, check for duplicates or invalid input, add properties from another system etc. It is a bit of a safely blanket, but from my old relational db days, that’s just the way its done.

  7. [...] first one, I showed you how to install the Batch Importer, in the second one, I showed you how to use data in your relational database to generate the csv files to create your graph, and just recently I showed you how to quickly index your [...]

  8. [...] needed to import this into Neo4j. I followed the examples from Batch Importer Part 2, and Batch Importer Part 3 to do some ETL, but first I needed to load the data into Postgresql so I [...]

  9. Nims says:

    Hi Max,

    In this post you took example wherein you have one table is analogues to nodes (identity) and other is analogues to relations (vouches). What could be done if you have 10-12 tables
    and creating a single csv for nodes is unfeasible as well as single relation’s csv?

    Thanks,
    Nirmal

  10. maxdemarzi says:

    Take a look at http://maxdemarzi.com/2012/07/03/graph-generator/ to see how multiple object types are handled

    >8-]

  11. Kevin Burton says:

    I would like to hear comments on how to do this from SQL Server and a Windows platform.

  12. Raj Kapoor says:

    Is there a way to rebuild indices through cypher ! I went through the route of importing into gephi and then using the export to create neo4j database. However , i am not able to use any index related query , even after i make changes to the config file.

    dont want to use any other language than cypher or unix shell to make these changes.
    any suggestions?

  13. Kalyan says:

    Using Batch Importer, I am able import n number of nodes and relationships.

    I would like to add some more new nodes and relationships to existing Neo4j database. How could I reach this without loosing existing data.

    When I tried to import new nodes and relationships using batch importer its overwriting the existing data.

    Could you please suggest any efficient way to add more new nodes and relationship to Neo4j DB.

  14. xurxo says:

    Hi Max

    Thank you for post, is great.

    I want insert into neo4j different node type with diferent properties.

    Can you give me a recomendation?

  15. Mark says:

    Is there a way to load the nodes with a predefined ID? I’d really like to be able to do this so that I don’t need to do a transformation on my ‘start’, ‘end’ columns in myrels.csv

    Not a deal breaker but it will make things simpler (for me) when loading lots of data sets

    • maxdemarzi says:

      Not really… Maybe you can tweak the batch import code to do that?

      • Mark says:

        Yup. Seems the place to make the mod is the call to the BatchInserterImpl. This method is available in the API.

        void createNode(long id,
        Map properties)
        Creates a node with supplied id and properties. If a node with the given id exist a runtime exception will be thrown.
        Parameters:
        id – the id of the node to create.
        properties – map containing properties or null if no properties should be added.

        I’ll let you know how I get on. This will be really useful because the data I’m importing comes from a relational DB with referential integrity so I know the ‘node’ id’s referred to by the ‘relationships’ are already validated.

  16. Daniel says:

    Hi Max, in the above explanation create nodes with data from the same table identities?

  17. rockieNeo4j says:

    Hi Max,

    I’m able to create nodes and relationships using this approach in Neo4j 2.0; but the problem is that labels are not being created even though I have a column to specify them in the nodes.csv file. I’m using branch 20 which supports Neo4j 2.0 and I can’t see any warning when I run the importer.

    Do you know what it could be happening that labels are not being created? Is there any configuration that I may be missing?

    Thanks.

  18. shihab rahman says:

    I have multiple types of nodes.In that case what should be the structure of nodes.csv?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 1,579 other followers

%d bloggers like this: