Let’s try tackling something a little bigger. In Part 1 we created a small graph to test our permission resolution graph algorithm, and it worked like a charm on our dozen or so nodes and edges. I don’t have fast hands, so instead of typing out a million-node graph we’ll build a graph generator and use the batch importer to load it into Neo4j. What I want to create is a set of files to feed to the batch importer.
A nodes.csv file (which is actually tab separated; the “c” is just there to make sure you were paying attention) looks like the following:
unique_id                               type
9a984170-71cc-0130-92a0-20c9d042eca9    user
9a984450-71cc-0130-92a0-20c9d042eca9    user
9a984550-71cc-0130-92a0-20c9d042eca9    user
...
a67769a0-71cc-0130-92a0-20c9d042eca9    doc
a6776a40-71cc-0130-92a0-20c9d042eca9    doc
The node ids are not set explicitly above; instead, the line number represents the node id that will be created in our graph. The unique_id and type are properties of the nodes. A rels.csv file is also needed, and it looks like the following:
start     end      type           flags
1         3003     IS_MEMBER_OF
1         3060     IS_MEMBER_OF
2         3032     IS_MEMBER_OF
...
754949    272265   IS_CHILD_OF
825621    283395   IS_CHILD_OF
In this case the start and end columns are the node ids the relationship connects, via a type (required) and some properties (if any). To make these files I built a quick Rakefile to help me generate two sets of them. One graph will have a million nodes, the other 10 million, and we will see how well Neo4j scales in this regard. Will a 10x increase in the number of documents in the graph leave our algorithm running at a tenth of its speed?
require 'neography/tasks'
require './neo_generate.rb'

namespace :neo4j do
  task :create do
    %x[rm *.csv]
    create_graph
  end

  task :create_bigger do
    %x[rm *.csv]
    create_bigger_graph
  end

  task :load do
    %x[rm -rf neo4j/data/graph.db]
    load_graph
  end
end
The create_graph and create_bigger_graph methods are almost identical; the only real difference is how many nodes they end up creating. (As an aside, the rake neo4j:start and rake neo4j:stop tasks used later in this post aren’t defined in the Rakefile; they come along with require 'neography/tasks', which bundles tasks for managing a local Neo4j server.)
def create_graph
  create_node_properties
  create_nodes
  create_nodes_index
  create_relationship_properties
  create_relationships
end
def create_bigger_graph
  create_node_properties
  create_more_nodes
  create_nodes_index
  create_relationship_properties
  create_relationships
end
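Both variants call the same set of helpers, which live in neo_generate.rb and aren’t listed in this post. To give you a rough idea, here is a minimal sketch of what create_relationships could look like, covering only the two relationship types visible in the rels.csv sample above; the membership counts and the document tree shape are my assumptions, and the real generator clearly emits many more edges (about 11 million for the 1 million node graph):

# A minimal sketch, not the post's actual code.
def create_relationships
  File.open("rels.csv", "a") do |file|
    file.puts ["start", "end", "type", "flags"].join("\t")
    # Each user IS_MEMBER_OF a few random groups (group ids run 3001..3100);
    # duplicate pairs are possible, which is fine for test data.
    (1..3000).each do |user|
      3.times { file.puts [user, rand(3001..3100), "IS_MEMBER_OF"].join("\t") }
    end
    # Documents form a tree: every doc IS_CHILD_OF some earlier doc.
    (3102..1_003_100).each do |doc|
      file.puts [doc, rand(3101...doc), "IS_CHILD_OF"].join("\t")
    end
  end
end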
We are going to set the first 3000 nodes to be users, the next 100 to be groups, and the next 1 million to be documents.
def create_nodes
  @nodes = {
    "user"  => { "start" => 1,    "end" => 3000 },
    "group" => { "start" => 3001, "end" => 3100 },
    "doc"   => { "start" => 3101, "end" => 1003100 }
  }
  @nodes.each { |node| generate_nodes(node[0], node[1]) }
end
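The generate_nodes helper isn’t shown in the post either; here is a sketch of what it might look like, assuming the UUIDs come from Ruby’s SecureRandom (the actual helper may generate them differently):

require 'securerandom'

# The batch importer derives node ids from line numbers, so we only write
# the unique_id and type properties, one tab-separated row per node.
def generate_nodes(type, range)
  File.open("nodes.csv", "a") do |file|
    file.puts ["unique_id", "type"].join("\t") if range["start"] == 1  # header once
    (range["start"]..range["end"]).each do
      file.puts [SecureRandom.uuid, type].join("\t")
    end
  end
end

With the helpers in place, we can generate the smaller graph: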
rake neo4j:create
The csv files created for the 1 Million node graph aren’t very large. The two *_index.csv files pair node ids with the properties headed for the Users and Documents indexes referenced in the load task below:
-rw-r--r--  1 maxdemarzi  staff   47M Mar 18 02:37 documents_index.csv
-rw-r--r--  1 maxdemarzi  staff   40M Mar 18 02:37 nodes.csv
-rw-r--r--  1 maxdemarzi  staff  259M Mar 18 02:46 rels.csv
-rw-r--r--  1 maxdemarzi  staff  140K Mar 18 02:37 users_index.csv
Now let’s load these in:
rake neo4j:load

This will run the following:

java -server -Xmx4G -jar ./batch-import-jar-with-dependencies.jar \
     neo4j/data/graph.db nodes.csv rels.csv \
     node_index Users exact users_index.csv \
     node_index Documents exact documents_index.csv

That in turn produces this output:
Using Existing Configuration File
..........
Importing 1003100 Nodes took 2 seconds
.................................................................................................... 19508 ms for 10000000
.........
Importing 10976303 Relationships took 21 seconds
Importing 3000
Done inserting into Users Index took 0 seconds
..........
Importing 1000000
Done inserting into Documents Index took 7 seconds
Total import time: 34 seconds
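The “Using Existing Configuration File” line means the importer picked up a batch.properties file sitting next to the jar. I’m not claiming these are the exact values used for this run, but for Neo4j of this era such a file typically sets the memory-mapped store sizes, along these lines:

neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=2G
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=500M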
Not bad for 1 million nodes and 10 million relationships:
rake neo4j:start
Once we start Neo4j and take a look at the web admin, we can see our graph.
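If you’d rather check the numbers than squint at the dashboard, you can also count nodes through Neography; a quick sketch, assuming a default local server and the Cypher syntax of that era:

require 'neography'

# Count every node through the REST API; Neography::Rest defaults to
# http://localhost:7474.
neo = Neography::Rest.new
puts neo.execute_query("START n=node(*) RETURN count(n)")["data"]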
For our bigger graph, we just add another zero and create 10 Million documents.
def create_more_nodes
  @nodes = {
    "user"  => { "start" => 1,    "end" => 3000 },
    "group" => { "start" => 3001, "end" => 3100 },
    "doc"   => { "start" => 3101, "end" => 10003100 }
  }
  @nodes.each { |node| generate_nodes(node[0], node[1]) }
end
We’ll run a different rake task, which will overwrite the smaller csv files.
rake neo4j:create_bigger
The csv files created for the 10 Million node graph are just a tad bigger than those for the 1 Million node graph:
-rw-r--r--  1 maxdemarzi  staff  476M Mar 19 00:33 documents_index.csv
-rw-r--r--  1 maxdemarzi  staff  401M Mar 19 00:31 nodes.csv
-rw-r--r--  1 maxdemarzi  staff  510M Mar 19 05:16 rels.csv
-rw-r--r--  1 maxdemarzi  staff  140K Mar 19 00:33 users_index.csv
Let’s stop the neo4j server and load the bigger graph instead:
rake neo4j:stop
rake neo4j:load

java -server -Xmx4G -jar ./batch-import-jar-with-dependencies.jar \
     neo4j/data/graph.db nodes.csv rels.csv \
     node_index Users exact users_index.csv \
     node_index Documents exact documents_index.csv
I wonder how long this will take:
Using Existing Configuration File
.................................................................................................... 14052 ms for 10000000
Importing 10003100 Nodes took 14 seconds
.................................................................................................... 19242 ms for 10000000
..................................................................................................
Importing 19812750 Relationships took 37 seconds
Importing 3000
Done inserting into Users Index took 0 seconds
.................................................................................................... 64223 ms for 10000000
Importing 10000000
Done inserting into Documents Index took 64 seconds
Total import time: 135 seconds
That’s not bad either: just over two minutes. A graph ten times bigger took roughly four times longer to import (135 seconds versus 34).
rake neo4j:start
Alright. Now we have two bigger graphs we can play with. Stay tuned for the next part where I’ll add two Gatling performance tests to the mix.