Let’s try tackling something a little bigger. In Part 1 we created a small graph to test our permission resolution graph algorithm, and it worked like a charm on our dozen or so nodes and edges. I don’t have fast hands, so instead of typing out a million-node graph we’ll build a graph generator and use the batch importer to load it into Neo4j. What I want to create is a set of files to feed to the batch importer.
A nodes.csv file (which is actually tab separated; the “c” is just there to make sure you were paying attention) looks like the following:
unique_id                               type
9a984170-71cc-0130-92a0-20c9d042eca9    user
9a984450-71cc-0130-92a0-20c9d042eca9    user
9a984550-71cc-0130-92a0-20c9d042eca9    user
...
a67769a0-71cc-0130-92a0-20c9d042eca9    doc
a6776a40-71cc-0130-92a0-20c9d042eca9    doc
The node ids are not set explicitly above; instead, the line number represents the node id that will be created in our graph. The unique_id and type are properties of the nodes. A rels.csv file is also needed, and it looks like the following:
start     end      type           flags
1         3003     IS_MEMBER_OF
1         3060     IS_MEMBER_OF
2         3032     IS_MEMBER_OF
...
754949    272265   IS_CHILD_OF
825621    283395   IS_CHILD_OF
In this case the start and end columns are the node ids the relationship connects, via a type (required) and some properties (if any). To make these files I built a quick Rakefile to help me generate two sets of them. One graph will have a million nodes, the other 10 million, and we will see how well Neo4j scales in this regard. Will a 10x increase in the number of documents in the graph leave our algorithm running at a tenth of its speed?
require 'neography/tasks'
require './neo_generate.rb'

namespace :neo4j do
  task :create do
    %x[rm *.csv]
    create_graph
  end

  task :create_bigger do
    %x[rm *.csv]
    create_bigger_graph
  end

  task :load do
    %x[rm -rf neo4j/data/graph.db]
    load_graph
  end
end
The create_graph and create_bigger_graph methods are almost identical; the only real difference is how many nodes they end up creating. (As an aside, the rake neo4j:start and rake neo4j:stop tasks used later in this post aren’t defined in the Rakefile; they come along with require 'neography/tasks', which bundles tasks for managing a local Neo4j server.)
def create_graph
  create_node_properties
  create_nodes
  create_nodes_index
  create_relationship_properties
  create_relationships
end
def create_bigger_graph
  create_node_properties
  create_more_nodes
  create_nodes_index
  create_relationship_properties
  create_relationships
end
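Both variants call the same set of helpers, which live in neo_generate.rb and aren’t listed in this post. To give you a rough idea, here is a minimal sketch of what create_relationships could look like, covering only the two relationship types visible in the rels.csv sample above; the membership counts and the document tree shape are my assumptions, and the real generator clearly emits many more edges (about 11 million for the 1 million node graph):

# A minimal sketch, not the post's actual code.
def create_relationships
  File.open("rels.csv", "a") do |file|
    file.puts ["start", "end", "type", "flags"].join("\t")
    # Each user IS_MEMBER_OF a few random groups (group ids run 3001..3100);
    # duplicate pairs are possible, which is fine for test data.
    (1..3000).each do |user|
      3.times { file.puts [user, rand(3001..3100), "IS_MEMBER_OF"].join("\t") }
    end
    # Documents form a tree: every doc IS_CHILD_OF some earlier doc.
    (3102..1_003_100).each do |doc|
      file.puts [doc, rand(3101...doc), "IS_CHILD_OF"].join("\t")
    end
  end
end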
We are going to set the first 3000 nodes to be users, the next 100 to be groups, and the next 1 million to be documents.
def create_nodes
  @nodes = {
    "user"  => { "start" => 1,    "end" => 3000 },
    "group" => { "start" => 3001, "end" => 3100 },
    "doc"   => { "start" => 3101, "end" => 1003100 }
  }
  @nodes.each { |node| generate_nodes(node[0], node[1]) }
end
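The generate_nodes helper isn’t shown in the post either; here is a sketch of what it might look like, assuming the UUIDs come from Ruby’s SecureRandom (the actual helper may generate them differently):

require 'securerandom'

# The batch importer derives node ids from line numbers, so we only write
# the unique_id and type properties, one tab-separated row per node.
def generate_nodes(type, range)
  File.open("nodes.csv", "a") do |file|
    file.puts ["unique_id", "type"].join("\t") if range["start"] == 1  # header once
    (range["start"]..range["end"]).each do
      file.puts [SecureRandom.uuid, type].join("\t")
    end
  end
end

With the helpers in place, we can generate the smaller graph: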
rake neo4j:create
The csv files created for the 1 Million node graph aren’t very large. The two *_index.csv files pair node ids with the properties headed for the Users and Documents indexes referenced in the load task below:
-rw-r--r--  1 maxdemarzi  staff   47M Mar 18 02:37 documents_index.csv
-rw-r--r--  1 maxdemarzi  staff   40M Mar 18 02:37 nodes.csv
-rw-r--r--  1 maxdemarzi  staff  259M Mar 18 02:46 rels.csv
-rw-r--r--  1 maxdemarzi  staff  140K Mar 18 02:37 users_index.csv
Now let’s load these in:
rake neo4j:load

This will run the following:

java -server -Xmx4G -jar ./batch-import-jar-with-dependencies.jar \
     neo4j/data/graph.db nodes.csv rels.csv \
     node_index Users exact users_index.csv \
     node_index Documents exact documents_index.csv

That in turn produces this output:
Using Existing Configuration File
..........
Importing 1003100 Nodes took 2 seconds
.................................................................................................... 19508 ms for 10000000
.........
Importing 10976303 Relationships took 21 seconds
Importing 3000
Done inserting into Users Index took 0 seconds
..........
Importing 1000000
Done inserting into Documents Index took 7 seconds
Total import time: 34 seconds
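The “Using Existing Configuration File” line means the importer picked up a batch.properties file sitting next to the jar. I’m not claiming these are the exact values used for this run, but for Neo4j of this era such a file typically sets the memory-mapped store sizes, along these lines:

neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=2G
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=500M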
Not bad for 1 million nodes and 10 million relationships:
rake neo4j:start
Once we start Neo4j and take a look at the web admin, we can see our graph.
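If you’d rather check the numbers than squint at the dashboard, you can also count nodes through Neography; a quick sketch, assuming a default local server and the Cypher syntax of that era:

require 'neography'

# Count every node through the REST API; Neography::Rest defaults to
# http://localhost:7474.
neo = Neography::Rest.new
puts neo.execute_query("START n=node(*) RETURN count(n)")["data"]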
For our bigger graph, we just add another zero and create 10 Million documents.
def create_more_nodes
  @nodes = {
    "user"  => { "start" => 1,    "end" => 3000 },
    "group" => { "start" => 3001, "end" => 3100 },
    "doc"   => { "start" => 3101, "end" => 10003100 }
  }
  @nodes.each { |node| generate_nodes(node[0], node[1]) }
end
We’ll run a different rake task, which will overwrite the smaller csv files.
rake neo4j:create_bigger
The csv files created for the 10 Million node graph are just a tad bigger than those for the 1 Million node graph:
-rw-r--r--  1 maxdemarzi  staff  476M Mar 19 00:33 documents_index.csv
-rw-r--r--  1 maxdemarzi  staff  401M Mar 19 00:31 nodes.csv
-rw-r--r--  1 maxdemarzi  staff  510M Mar 19 05:16 rels.csv
-rw-r--r--  1 maxdemarzi  staff  140K Mar 19 00:33 users_index.csv
Let’s stop the neo4j server and load the bigger graph instead:
rake neo4j:stop
rake neo4j:load

java -server -Xmx4G -jar ./batch-import-jar-with-dependencies.jar \
     neo4j/data/graph.db nodes.csv rels.csv \
     node_index Users exact users_index.csv \
     node_index Documents exact documents_index.csv
I wonder how long this will take:
Using Existing Configuration File
.................................................................................................... 14052 ms for 10000000
Importing 10003100 Nodes took 14 seconds
.................................................................................................... 19242 ms for 10000000
..................................................................................................
Importing 19812750 Relationships took 37 seconds
Importing 3000
Done inserting into Users Index took 0 seconds
.................................................................................................... 64223 ms for 10000000
Importing 10000000
Done inserting into Documents Index took 64 seconds
Total import time: 135 seconds
That’s not bad either: just over two minutes. A graph ten times bigger took roughly four times longer to import (135 seconds versus 34).
rake neo4j:start
Alright. Now we have two bigger graphs we can play with. Stay tuned for the next part where I’ll add two Gatling performance tests to the mix.