If you’ve been following along, we got Michael’s Batch Importer, compiled it, created some test data, ran it and saw millions of nodes and relationships loaded into Neo4j.
So now we’re ready for our own data. I am going to show you how to get data from a relational database like PostgreSQL into a format we can use. If you’re using SQL Server, MySQL, Oracle, etc., the directions will be slightly different, but you’ll get the picture.
So let’s say we have an identities table with the following layout:
Table "public.identities"

  Column  |          Type           |              Modifiers
----------+-------------------------+-------------------------------------
 id       | character varying(36)   | not null default uuid_generate_v4()
 nickname | character varying(255)  |
We need to assign reference ids to the items in the tables that will become our nodes. The row_number window function is perfectly suited for this trick.
SELECT row_number() OVER (ORDER BY id), id, nickname
FROM identities ORDER BY id LIMIT 10;

 row_number |                  id                  |  nickname
------------+--------------------------------------+------------
          1 | 00004b82-86de-4757-bf9e-9bebabbd234e | clkao
          2 | 0000539c-aec2-45ba-b353-22fad81c1ae2 | Erre
          3 | 00006460-2834-466a-88b4-7583bdf3142f | nanha
          4 | 0000b675-e9fc-4091-bb2c-f2e6388220b9 | joemocha
          5 | 0001951c-03dd-4c70-8e67-ff8ffb11c628 | BlueHotDog
Let’s go ahead and create a new table with this data by using the INTO command:
SELECT row_number() OVER (ORDER BY id) AS node_id, id, nickname
INTO iden_nodes
FROM identities ORDER BY id;
Getting the nodes was easy enough… how about relationships? In our database we have a join table called vouches where our identities relate to each other.
Table "public.vouches"

    Column    |          Type          |                    Modifiers
--------------+------------------------+-------------------------------------------------
 id           | character varying(36)  | not null default uuid_generate_v4()
 requester_id | character varying(36)  | not null
 grantor_id   | character varying(36)  | not null
 term_id      | character varying(36)  | not null
 status       | character varying(10)  | not null default 'requested'::character varying
Grantors vouch for Requesters using a Term. The name of the Term is stored in another table, but the relationship has a property called status. Grantors and Requesters are Identities and we need to join them to their reference ids we created earlier. The format we want is start – end – relationship type – relationship property_1…
So we can build that with:
SELECT FN.node_id AS start, TN.node_id AS "end", T.name AS type, V.status
INTO vouch_rels
FROM vouches AS V
INNER JOIN iden_nodes AS FN ON V.grantor_id   = FN.id
INNER JOIN iden_nodes AS TN ON V.requester_id = TN.id
INNER JOIN terms      AS T  ON V.term_id      = T.id
ORDER BY grantor_id, requester_id;

 start |  end  |    type    |  status
-------+-------+------------+-----------
     1 |     1 | Perl       | confirmed
     1 |     1 | JavaScript | confirmed
     1 |  4096 | Perl       | confirmed
     1 | 12338 | JavaScript | confirmed
     1 | 12338 | Perl       | confirmed

Note that end has to be quoted, since it is a reserved word in PostgreSQL.
In this case we have about 75 terms that I will use as relationship types. If I had 40,000 terms, I would use a type of “vouched_for” and make the term name one of the properties of the relationship. Now we can get the data out of these tables and into files the batch importer can use.
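If you did go the single-type route, the query could be sketched like this (vouch_rels2 is a hypothetical table name; everything else comes from the tables above):

```sql
-- Sketch: one relationship type ('vouched_for'), with the term name
-- demoted to a relationship property alongside status.
SELECT FN.node_id    AS start,
       TN.node_id    AS "end",
       'vouched_for' AS type,
       T.name        AS term,
       V.status
INTO vouch_rels2
FROM vouches AS V
INNER JOIN iden_nodes AS FN ON V.grantor_id   = FN.id
INNER JOIN iden_nodes AS TN ON V.requester_id = TN.id
INNER JOIN terms      AS T  ON V.term_id      = T.id
ORDER BY grantor_id, requester_id;
```

The batch importer's relationships file is start, end, type, and then property columns, so both term and status would end up as properties of the relationship.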
psql -c "copy (SELECT id, nickname FROM iden_nodes ORDER BY node_id) TO stdout csv header delimiter E'\t'" -o mynodes.csv -d prod_getvouched
The command above will create a mynodes.csv file that looks like:
head -5 mynodes.csv

id                                      nickname
00004b82-86de-4757-bf9e-9bebabbd234e    clkao
0000539c-aec2-45ba-b353-22fad81c1ae2    Erre
00006460-2834-466a-88b4-7583bdf3142f    nanha
0000b675-e9fc-4091-bb2c-f2e6388220b9    joemocha
Notice we did not export the reference number since we do not need to have it as a property of our nodes. Now let’s get relationships.
psql -c "copy (SELECT start, \"end\", type, status FROM vouch_rels ORDER BY start) TO stdout csv header delimiter E'\t'" -o myrels.csv -d prod_getvouched
We had to escape the end column since it is a keyword, and myrels.csv contains:
head -5 myrels.csv

start   end     type        status
1       1       JavaScript  confirmed
1       1       Perl        confirmed
1       12338   Perl        confirmed
1       25879   Perl        confirmed
Let’s go ahead and stop the Neo4j instance we have running and delete its database.
rake neo4j:stop
rm -rf neo4j/data/graph.db
Then we’ll run our batch import command, writing the database directly into the neo4j directory.
java -server -Xmx4G -jar target/batch-import-jar-with-dependencies.jar neo4j/data/graph.db mynodes.csv myrels.csv
…and we’ll go get another soda…whoops, no time, it’s done.
.. Importing 295951 Nodes took 1 seconds
............ Importing 1218043 Relationships took 4 seconds
9 seconds
…and we’ll restart our Neo4j server.
rake neo4j:start
Then we fire up our Neo4j Data Browser:
Awesome. This works great for social networks where the nodes are all in the same table…but what about if you are building a recommendation system and you have identities and items as nodes?
No problem. You follow the same technique to create your first nodes table and then…
SELECT max(node_id) FROM iden_nodes;
 295951

SELECT 295951 + row_number() OVER (ORDER BY id), id, name, price
FROM items ORDER BY id;
You just start from where your other node table left off. You can use UNION ALL to put them together on your export.
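For example, assuming the items query above was saved into a hypothetical item_nodes table (with its offset node_id), the combined export could be sketched as:

```sql
-- Sketch: one nodes file covering both identities and items.
-- UNION ALL requires matching column lists, so nickname and name share a
-- column, and NULL fills in what each side lacks (identities have no price).
-- Ordering by node_id makes row N become node N for the importer.
SELECT node_id, nickname AS name, NULL AS price FROM iden_nodes
UNION ALL
SELECT node_id, name, price FROM item_nodes
ORDER BY node_id;
```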
Give it a shot and let me know if you run into any trouble and I’ll be glad to help.
Thanks for the blog posts.
The best follow-up post would be accessing the PostgreSQL db from Java via JDBC and pumping that data directly into Neo4j.
Michael,
I’m trying to import data using Spring Data Neo4j, because I want to leave the graph database prepared for use with Spring Data Neo4j, since the code is simpler, more readable, and easier to maintain. But I have serious performance problems, especially when I look up a node to set a relationship.
Could you recommend what I should do if I want to keep the graph database prepared for use with Spring Data Neo4j while considerably improving the import performance?
[…] everything should be working correctly. In part 2 of this series, I’ll show you how to write some SQL queries to get your data into Neo4j. Share […]
This is great. What are your opinions on the correct way to keep the two databases synchronized several times an hour?
[…] Batch Importer – Part 2: Use of SQL to prepare files for import. […]
Thanks, that’s great!
We want to import the users and groups from Active Directory for different projects. We want to do this on a regular basis, every 120 minutes. This could be up to 100,000 users and 500,000 groups per import.
Can you give us a recommendation?
Regards, Thomas
As you can see above, 300k nodes and 1.2 million relationships took 9 seconds to import into Neo4j, so you won’t have an issue there. If you can import your users and groups into just two tables on MS SQL Server with a numeric ID key as your primary key, you should be able to re-use that as your node id, and then just add the max value to the groups’ numeric ID key when doing the relationships. You can create an SSIS job to handle the ETL, or use something like bcp.
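On SQL Server, that offset trick could be sketched like this (the staging table and column names are assumptions, and it presumes the numeric keys are dense and start at 1, since the importer expects sequential node ids):

```sql
-- Sketch: users keep their numeric PK as the node id; group ids are
-- shifted past the highest user id so the two ranges don't collide.
SELECT u.user_id AS node_id, u.name
INTO user_nodes
FROM users AS u;

SELECT g.group_id + (SELECT MAX(user_id) FROM users) AS node_id, g.name
INTO group_nodes
FROM groups AS g;
```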
Hi,
We get the nodes and the relationships, with their properties, directly from the Active Directory API. Is this your recommendation:
1. Build two CSV files for importing (without SQL Server in between): one for the nodes with their properties, and a second one for the relationships?
2. In the relationship file, use the “SID” property for START and END?
Is any indexing necessary in the meantime? Is there feedback from the importer?
We want to start the importer from a .NET application. How do we get feedback that the import has finished?
Thomas… I’m not sure how you’d get feedback from the importer into .NET… I’m mostly a Ruby guy, so maybe somebody else can chime in. As far as your earlier question, I like having a staging database so I can do a proper ETL. You can clean up your data, check for duplicates or invalid input, add properties from another system, etc. It is a bit of a safety blanket, but from my old relational db days, that’s just the way it’s done.
[…] first one, I showed you how to install the Batch Importer, in the second one, I showed you how to use data in your relational database to generate the csv files to create your graph, and just recently I showed you how to quickly index your […]
[…] needed to import this into Neo4j. I followed the examples from Batch Importer Part 2, and Batch Importer Part 3 to do some ETL, but first I needed to load the data into Postgresql so I […]
Hi Max,
In this post you used an example where one table is analogous to nodes (identities) and another is analogous to relationships (vouches). What can be done if you have 10-12 tables,
and creating a single CSV for all the nodes is infeasible, as is a single relationships CSV?
Thanks,
Nirmal
Take a look at https://maxdemarzi.com/2012/07/03/graph-generator/ to see how multiple object types are handled
>8-]
I would like to hear comments on how to do this from SQL Server and a Windows platform.
The SQL query should work on SQL Server (the ROW_NUMBER function is the same): http://msdn.microsoft.com/en-us/library/ms186734.aspx
As far as getting the data into CSV format, you can do that with SQL Server Management Studio => http://stackoverflow.com/questions/799465/how-to-export-sql-server-2005-query-to-csv
Is there a way to rebuild indices through Cypher? I went the route of importing into Gephi and then using its export to create the Neo4j database. However, I am not able to use any index-related query, even after I make changes to the config file.
I don’t want to use any language other than Cypher or the Unix shell to make these changes.
Any suggestions?
Edit conf/neo4j.properties:
node_auto_indexing=true
node_keys_indexable=foo
Restart server.
Run:
START n=node(*)
SET n.foo = n.foo
RETURN COUNT(n)
Awesome! It works. Glad I asked.
Using the Batch Importer, I am able to import any number of nodes and relationships.
I would like to add some new nodes and relationships to an existing Neo4j database. How can I do this without losing the existing data?
When I tried to import new nodes and relationships using the batch importer, it overwrote the existing data.
Could you please suggest an efficient way to add new nodes and relationships to a Neo4j DB?
There are ways to do it with the batch-import, you just have to hack it up yourself.
I created an issue to see if Michael can make it part of the tool => https://github.com/jexp/batch-import/issues/27
Hi Max
Thank you for post, is great.
I want to insert different node types with different properties into Neo4j.
Can you give me a recommendation?
Hi,
I have a more complicated example blog post coming up… you can take a look at neo_doc.commands in https://github.com/maxdemarzi/neo_doc .
Regards,
Max
Is there a way to load the nodes with a predefined ID? I’d really like to be able to do this so that I don’t need to do a transformation on my ‘start’ and ‘end’ columns in myrels.csv.
Not a deal breaker, but it will make things simpler (for me) when loading lots of data sets.
Not really… Maybe you can tweak the batch import code to do that?
Yup. It seems the place to make the mod is the call to BatchInserterImpl. This method is available in the API:

void createNode(long id, Map<String,Object> properties)

Creates a node with the supplied id and properties. If a node with the given id already exists, a runtime exception will be thrown.

Parameters:
id – the id of the node to create.
properties – map containing properties, or null if no properties should be added.
I’ll let you know how I get on. This will be really useful because the data I’m importing comes from a relational DB with referential integrity, so I know the node IDs referred to by the relationships are already validated.
Hi Max, in the above explanation, are the nodes created with data from the same identities table?
check this post: https://github.com/jexp/neo4j-shell-tools#import-data-into-your-neo4j-database-from-the-neo4j-shell-command
Hi Max,
I’m able to create nodes and relationships using this approach in Neo4j 2.0, but the problem is that labels are not being created, even though I have a column specifying them in the nodes.csv file. I’m using branch 20, which supports Neo4j 2.0, and I don’t see any warnings when I run the importer.
Do you know what could be happening so that labels are not being created? Is there any configuration I may be missing?
Thanks.
Actually, you have to use the 2.0 branch of the batch importer to get labels.
https://github.com/jexp/batch-import/tree/20
I have multiple types of nodes. In that case, what should be the structure of nodes.csv?
[…] part 1 ( https://maxdemarzi.com/2012/02/28/batch-importer-part-1/ ) and part 2 ( https://maxdemarzi.com/2012/02/28/batch-importer-part-2/ ) It operates in 3 […]