Update: Code for this project is available on GitHub.
In the US Air Guitar Championships, competitors use their talents to fret on an “invisible” guitar to rock a live crowd and deliver a performance that transcends the imitation of a real guitar and becomes an art form in and of itself. The key factor that determines the winner is the elusive quality of “Airness”. When you consider using Neo4j in a project, one of the key considerations is having a domain model that lends itself to a graph representation. In other words, does your data have “Graphiness”? However, it didn’t dawn on me until recently that when you start a proof of concept, you probably don’t have that data (or enough of it), or maybe your security guys won’t let you within 100 miles of the company production data with this newfangled NoSQL thingamajig.
So in order to validate our ideas and build a proof of concept, we’ll need to generate sample data and test our algorithms (aka Cypher and Gremlin queries) against it. I will show you how to build a rudimentary graph generator. You’ll have to tweak it to match your domain, but it’s a start. We’re also going to use the Batch Importer to quickly load our data into Neo4j.
If you recall, I’ve had three blog posts about the Batch Importer. In the first one, I showed you how to install the Batch Importer; in the second one, I showed you how to use data in your relational database to generate the csv files to create your graph; and just recently I showed you how to quickly index your data.
The Batch Importer expects a series of tab separated files for input. So let’s generate these files. We will create a graph with 6 node types. Here they are with the number of each we are going to create:
    # Nodes
    # Users      21,000
    # Companies  4,000
    # Activity   1.2M
    # Item       1.3M
    # Entity     3.5M
    # Tags       20,000
We’ll link these together with a set of relationships:
    # Relationships
    # Users -[:belongs_to]-> Companies
    # User -[:performs]-> Activity
    # Activity -[:belongs_to]-> Item
    # Item -[:references]-> Entity
    # Item -[:tagged]-> Tags
    #
To make this example easier, we’ll create just two indexes. A fulltext node index called “vertices” and an exact relationship index called “edges”. You’ll probably want to create multiple indexes for each type of node or relationship.
I want to make running this straightforward, so we’ll do this in a series of rake commands:
    rake neo4j:install
    rake neo4j:create
    rake neo4j:load
    rake neo4j:start
If you’ve been following my blog, you know what install and start do, but we need to build the methods that will handle create and load. We can whip up a quick Rakefile for these:
    require 'neography/tasks'
    require './neo_generate.rb'

    namespace :neo4j do
      task :create do
        create_graph
      end

      task :load do
        load_graph
      end
    end
Now we can start with create_graph. If you recall, the batch importer is looking for a series of tab separated files: one which contains the nodes, another for the relationships, and optionally other files for each index you want to create. Each file has a header with some properties, so our create_graph method will look like this:
    def create_graph
      create_node_properties
      create_nodes
      create_nodes_index
      create_relationship_properties
      create_relationships
      create_relationships_index
    end
I’m going to arbitrarily decide here that each one of my nodes will have two properties, and we’ll call these property1 and property2 because I am super creative when it comes to naming things.
    def create_node_properties
      @node_properties = ["type", "property1", "property2"]
      generate_node_properties(@node_properties)
    end
Did I say two? I meant three. Just for my own sanity I like to give nodes a type property and put the type of node that they are, so we’ll include “type” as the first property.
    # Recreate nodes.csv and set the node properties
    #
    def generate_node_properties(properties)
      File.open("nodes.csv", "w") do |file|
        file.puts properties.join("\t")
      end
    end
With our header out of the way, we can turn our attention to actually creating these nodes. We’ll use a hash which will have the type of node, the start id and end id of the nodes, and some properties. But what properties should our nodes have? What should their values be? This is a bit tricky. The simplest solution is to just generate gobbledygook with random strings:
    # Generate random lowercase text of a given length
    #
    # Args
    #   length - Integer (default = 8)
    #
    def generate_text(length=8)
      chars = 'abcdefghjkmnpqrstuvwxyz'
      key = ''
      length.times { |i| key << chars[rand(chars.length)] }
      key
    end
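If you want to regenerate exactly the same gobbledygook on every run (handy when comparing query timings between runs), one option is to pass in a seeded random number generator. This is just a sketch: generate_seeded_text and the seed value are names I made up for illustration, they are not part of the generator in this post.

```ruby
# Hypothetical variant of generate_text that takes its own random number
# generator, so a fixed seed reproduces the same strings on every run.
def generate_seeded_text(rng, length = 8)
  chars = 'abcdefghjkmnpqrstuvwxyz'
  Array.new(length) { chars[rng.rand(chars.length)] }.join
end

first  = generate_seeded_text(Random.new(42))
second = generate_seeded_text(Random.new(42))
puts first == second   # two generators with the same seed give the same text
```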
Another possibility is to use one of the Random Data Generator Gems like Forgery to create more intelligent and specific random data (like female first names for example). We are going to take the easy way out this time and just give each node two random properties, except for user and company which will get properties from Forgery.
    def create_nodes
      # Define Node Property Values
      node_values = [lambda { generate_text }, lambda { generate_text }]
      user_values = [lambda { Forgery::Name.full_name }, lambda { Forgery::Personal.language }]
      company_values = [lambda { Forgery::Name.company_name }, lambda { Forgery::Name.industry }]

      @nodes = {"user"     => { "start" => 1,       "end" => 21000,   "props" => user_values},
                "company"  => { "start" => 21001,   "end" => 25000,   "props" => company_values},
                "activity" => { "start" => 25001,   "end" => 1225000, "props" => node_values},
                "item"     => { "start" => 1225001, "end" => 2525000, "props" => node_values},
                "entity"   => { "start" => 2525001, "end" => 6025000, "props" => node_values},
                "tag"      => { "start" => 6025001, "end" => 6045000, "props" => node_values}
      }

      # Write nodes to file
      @nodes.each{ |node| generate_nodes(node[0], node[1])}
    end
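The start and end ids above are typed in by hand, which gets error prone once you start tweaking the counts for your own domain. Here is a minimal sketch of deriving them from the counts instead. The build_id_ranges helper is hypothetical (my name, not something in the project), and it assumes ids start at 1 and that each type gets one contiguous, inclusive block.

```ruby
# Hypothetical helper: turn an ordered list of [type, count] pairs into
# contiguous, inclusive id ranges, starting at id 1.
def build_id_ranges(counts)
  next_id = 1
  counts.each_with_object({}) do |(type, count), ranges|
    ranges[type] = { "start" => next_id, "end" => next_id + count - 1 }
    next_id += count
  end
end

counts = [["user", 21_000], ["company", 4_000], ["activity", 1_200_000],
          ["item", 1_300_000], ["entity", 3_500_000], ["tag", 20_000]]
ranges = build_id_ranges(counts)
ranges.each { |type, range| puts "#{type}\t#{range["start"]}\t#{range["end"]}" }
```

You would still merge in the "props" lambdas for each type before handing the hash to generate_nodes.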
Great, now to finally generate these nodes, we’ll write to nodes.csv the type of the node and we’ll call our lambda so each node gets a different random string.
    # Generate nodes given a type and hash
    #
    def generate_nodes(type, hash)
      puts "Generating #{(1 + hash["end"] - hash["start"])} #{type} nodes..."
      nodes = File.open("nodes.csv", "a")

      (1 + hash["end"] - hash["start"]).times do |t|
        properties = [type] + hash["props"].collect{|l| l.call}
        nodes.puts properties.join("\t")
      end
      nodes.close
    end
Our nodes.csv file will look like this once it’s done:
    type  property1       property2
    user  Helen Harvey    Kashmiri
    user  Sean Matthews   Afrikaans
    user  William Harper  Haitian Creole
    user  Bruce Hill      Macedonian
    user  Chris Riley     Swahili
With nodes out of the way, it’s time for relationships. We’ll keep it simple and say each relationship also has two properties.
    def create_relationship_properties
      @rel_properties = ["property1", "property2"]
      generate_rel_properties(@rel_properties)
    end
I meant three properties. Once again I’m adding type, but this is different from the node type above as each relationship in Neo4j MUST have a type, it is not an optional property. The “\t” you see below is putting tabs between each field, sorry if I didn’t mention this earlier and you were like what the heck is that?
    # Recreate rels.csv and set the relationship properties
    #
    def generate_rel_properties(properties)
      File.open("rels.csv", "w") do |file|
        header = ["start", "end", "type"] + properties
        file.puts header.join("\t")
      end
    end
I showed you how to create nice fake data for the nodes, so we’ll keep it simple here and just do bland random 8 character strings. Each entry has a number field that sets how many of these relationships will be created, the required type, and some properties. You’ll also notice I have this “connection” key which is either :sequential or :random. I’ll explain that in a bit.
    def create_relationships
      # Define Relationship Property Values
      rel_values = [lambda { generate_text }, lambda { generate_text }]

      rels = {"user_to_company"  => { "from" => @nodes["user"],
                                      "to" => @nodes["company"],
                                      "number" => 21000,
                                      "type" => "belongs_to",
                                      "props" => rel_values,
                                      "connection" => :sequential },
              "user_to_activity" => { "from" => @nodes["user"],
                                      "to" => @nodes["activity"],
                                      "number" => 1200000,
                                      "type" => "performs",
                                      "props" => rel_values,
                                      "connection" => :random },
              "activity_to_item" => { "from" => @nodes["activity"],
                                      "to" => @nodes["item"],
                                      "number" => 3000000,
                                      "type" => "belongs",
                                      "props" => rel_values,
                                      "connection" => :random },
              "item_to_entity"   => { "from" => @nodes["item"],
                                      "to" => @nodes["entity"],
                                      "number" => 6000000,
                                      "type" => "references",
                                      "props" => rel_values,
                                      "connection" => :random },
              "item_to_tag"      => { "from" => @nodes["item"],
                                      "to" => @nodes["tag"],
                                      "number" => 250000,
                                      "type" => "tagged",
                                      "props" => rel_values,
                                      "connection" => :random }
      }

      # Write relationships to file
      rels.each{ |rel| generate_rels(rel[1])}
    end
I am using the “connection” to decide how to connect these nodes together. I’m generating either random connections between nodes or generating sequential connections (as in each “from node” connects to one “to node” until there are no more connections, and if there are more connections than nodes, we loop around).
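To make the “loop around” part concrete, here is a tiny worked example. It is a sketch rather than the generator itself: sequential_pairs is a hypothetical helper that returns the pairs in memory instead of writing them to rels.csv, and it assumes the start and end ids are both inclusive.

```ruby
# Hypothetical helper that returns the [from_id, to_id] pairs a sequential
# connection would produce, treating "start" and "end" as inclusive ids.
def sequential_pairs(from, to, number)
  from_size = from["end"] - from["start"] + 1
  to_size   = to["end"] - to["start"] + 1
  Array.new(number) do |t|
    [from["start"] + (t % from_size), to["start"] + (t % to_size)]
  end
end

# 3 "from" nodes, 2 "to" nodes and 5 relationships, so both sides wrap around
pairs = sequential_pairs({ "start" => 1, "end" => 3 },
                         { "start" => 10, "end" => 11 }, 5)
pairs.each { |from_id, to_id| puts "#{from_id}\t#{to_id}" }
```

With these inputs the pairs come out as 1 to 10, 2 to 11, 3 to 10, 1 to 11 and 2 to 10: the “to” side wraps after two relationships and the “from” side after three.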
Feel free to combine the two or create new connection types (clustered for example).
    def generate_rels(hash)
      puts "Generating #{hash["number"]} #{hash["type"]} relationships..."
      File.open("rels.csv", "a") do |file|

        case hash["connection"]
        when :random
          hash["number"].times do |t|
            file.puts "#{rand(hash["from"]["start"]..hash["from"]["end"])}\t#{rand(hash["to"]["start"]..hash["to"]["end"])}\t#{hash["type"]}\t#{hash["props"].collect{|l| l.call}.join("\t")}"
          end
        when :sequential
          # Start and end ids are inclusive, so add 1 to count every node
          from_size = 1 + hash["from"]["end"] - hash["from"]["start"]
          to_size = 1 + hash["to"]["end"] - hash["to"]["start"]
          hash["number"].times do |t|
            file.puts "#{hash["from"]["start"] + (t % from_size)}\t#{hash["to"]["start"] + (t % to_size)}\t#{hash["type"]}\t#{hash["props"].collect{|l| l.call}.join("\t")}"
          end
        end
      end
    end
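As for a clustered connection type, here is one way it could look. Treat this as a sketch under my own assumptions: the clustered_pairs name, the spread parameter and the idea of giving each “from” node a random anchor are all made up for illustration, and “clustered” could reasonably mean something different in your domain.

```ruby
# Hypothetical "clustered" connection: every from node gets a random anchor in
# the to range, and its relationships land within `spread` ids of that anchor.
def clustered_pairs(from, to, number, spread = 50, rng = Random.new)
  anchors = Hash.new { |h, id| h[id] = rng.rand(to["start"]..to["end"]) }
  Array.new(number) do
    from_id = rng.rand(from["start"]..from["end"])
    low  = [anchors[from_id] - spread, to["start"]].max
    high = [anchors[from_id] + spread, to["end"]].min
    [from_id, rng.rand(low..high)]
  end
end

users = { "start" => 1, "end" => 100 }
items = { "start" => 1_000, "end" => 9_999 }
pairs = clustered_pairs(users, items, 1_000, 25, Random.new(7))
pairs.first(5).each { |from_id, to_id| puts "#{from_id}\t#{to_id}" }
```

To use it in generate_rels you would add a when :clustered branch that writes each pair plus the type and properties to the file, the same way the other two branches do.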
Our rels.csv file will look like this once it’s done:
    start  end    type        property1  property2
    1      21001  belongs_to  sjqwkvag   vpxahvcr
    2      21002  belongs_to  pfxnxznu   vrprnpky
    3      21003  belongs_to  gcyxumgy   nrxepdzb
    4      21004  belongs_to  aayyejkw   xpenqebd
    5      21005  belongs_to  hvhjexas   kmyqucmn
To create our node index, we will simply open nodes.csv and output it, adding the node id as the first column. Michael is working on using the nodes.csv headers as a way to tell the Batch Importer to index the nodes, but until that work is done, this will work.
    def create_nodes_index
      puts "Generating Node Index..."
      nodes = File.open("nodes.csv", "r")
      nodes_index = File.open("nodes_index.csv","w")
      counter = 0

      while (line = nodes.gets)
        nodes_index.puts "#{counter}\t#{line}"
        counter += 1
      end

      nodes.close
      nodes_index.close
    end
Therefore nodes_index.csv will look like:
    0  type  property1       property2
    1  user  Helen Harvey    Kashmiri
    2  user  Sean Matthews   Afrikaans
    3  user  William Harper  Haitian Creole
    4  user  Bruce Hill      Macedonian
    5  user  Chris Riley     Swahili
We’ll do something similar with the relationships, but skip the starting and ending nodes as well as the relationship type.
    def create_relationships_index
      puts "Generating Relationship Index..."
      rels = File.open("rels.csv", "r")
      rels_index = File.open("rels_index.csv","w")
      counter = -1

      while (line = rels.gets)
        size ||= line.split("\t").size
        rels_index.puts "#{counter}\t#{line.split("\t")[3..size].join("\t")}"
        counter += 1
      end

      rels.close
      rels_index.close
    end
Our rels_index.csv file will look like:
    -1  property1  property2
    0   nwjsbmgg   gnsnefrf
    1   szqqygra   maumqtnp
    2   pdtamztw   uvcserrp
    3   wewdtztx   bkezsmva
    4   gynprabv   eszjgmfs
    5   drcaxsse   ungxbzzm
Let’s run neo4j:create to generate these files. Now would be a good time for a quick stretch, bio-break, etc. as this could take a couple of minutes.
    Generating 21000 user nodes...
    Generating 4000 company nodes...
    Generating 1200000 activity nodes...
    Generating 1300000 item nodes...
    Generating 3500000 entity nodes...
    Generating 20000 tag nodes...
    Generating Node Index...
    Generating 21000 belongs_to relationships...
    Generating 1200000 performs relationships...
    Generating 3000000 belongs relationships...
    Generating 6000000 references relationships...
    Generating 250000 tagged relationships...
    Generating Relationship Index...
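Before feeding millions of rows to the importer, it can be worth a quick sanity check that every row has as many columns as its header. This is an optional sketch, not part of the project: mismatched_rows is a hypothetical helper, and the demo runs against a small throwaway file rather than the real nodes.csv.

```ruby
require "tempfile"

# Hypothetical check: every row of a tab separated file should have the same
# number of columns as its header. Returns the line numbers that do not.
def mismatched_rows(path)
  expected = nil
  bad = []
  File.foreach(path).with_index(1) do |line, number|
    columns = line.chomp.split("\t", -1).size
    expected ||= columns
    bad << number unless columns == expected
  end
  bad
end

# Small self-contained demonstration: the third line is missing a column.
sample = Tempfile.new("nodes")
sample.write("type\tproperty1\tproperty2\nuser\tHelen Harvey\tKashmiri\nuser\tbroken row\n")
sample.close
bad_rows = mismatched_rows(sample.path)
puts "Rows with the wrong column count: #{bad_rows.inspect}"
sample.unlink
```

Against the real files you would call mismatched_rows("nodes.csv") and mismatched_rows("rels.csv") and expect an empty array back from each.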
Welcome back. Now that we have these four csv files generated, we need to actually run the batch importer to get them into Neo4j. So we will run rake neo4j:load to make this happen, which as you remember calls the load_graph method. It looks like this:
    # Execute the command needed to import the generated files
    #
    def load_graph
      puts "Running the following:"
      command = "java -server -Xmx4G -jar ../batch-import/target/batch-import-jar-with-dependencies.jar neo4j/data/graph.db nodes.csv rels.csv node_index vertices fulltext nodes_index.csv rel_index edges exact rels_index.csv"
      puts command
      exec command
    end
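That command string gets unwieldy once you add more indexes. A possible refactoring, offered as a sketch only, is to assemble it from its parts. The import_command helper and its hash keys are names I invented; the argument order just mirrors the command string in load_graph, so nothing here is a new importer option.

```ruby
require "shellwords"

# Hypothetical helper: assemble the importer command from its parts. The
# argument order simply mirrors the command string used in load_graph.
def import_command(jar, store, nodes, rels, indexes, heap = "4G")
  index_args = indexes.flat_map do |index|
    [index[:kind], index[:name], index[:type], index[:file]]
  end
  (["java", "-server", "-Xmx#{heap}", "-jar", jar, store, nodes, rels] + index_args).shelljoin
end

command = import_command(
  "../batch-import/target/batch-import-jar-with-dependencies.jar",
  "neo4j/data/graph.db", "nodes.csv", "rels.csv",
  [{ kind: "node_index", name: "vertices", type: "fulltext", file: "nodes_index.csv" },
   { kind: "rel_index",  name: "edges",    type: "exact",    file: "rels_index.csv" }]
)
puts command
```

Adding another index then becomes one more hash in the list instead of more surgery on a long string.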
The batch importer will now do its thing:
    ............................................................
    Importing 6045000 Nodes took 39 seconds
    ....................................................................................................377369 ms for 10000000
    ....
    Importing 10471000 Relationships took 476 seconds
    ............................................................
    Importing 6045000 Nodes into vertices Index took 226 seconds
    ....................................................................................................261031 ms for 10000000
    ....
    Importing 10471000 Relationships into edges Index took 266 seconds
    1153 seconds
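A quick way to convince yourself nothing got lost is to add up what we asked the generator for and compare it with what the importer reports. The counts below are copied from create_nodes and create_relationships; the variable names are just for this sketch.

```ruby
# Counts copied from the generator configuration earlier in the post
node_counts = { "user" => 21_000, "company" => 4_000, "activity" => 1_200_000,
                "item" => 1_300_000, "entity" => 3_500_000, "tag" => 20_000 }
rel_counts  = { "belongs_to" => 21_000, "performs" => 1_200_000,
                "belongs" => 3_000_000, "references" => 6_000_000,
                "tagged" => 250_000 }

total_nodes = node_counts.values.sum
total_rels  = rel_counts.values.sum

# These should match the "Importing ... Nodes" and "Importing ... Relationships" lines
puts "#{total_nodes} nodes, #{total_rels} relationships"
```

Both totals line up with the importer output: 6045000 nodes and 10471000 relationships.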
Now we can run rake neo4j:start to see our graph in Neo4j.
Let’s jump into the Console and make sure our data is there:
START me = node:vertices(type="user") RETURN me LIMIT 5
Success!
    ==> +-------------------------------------------------------------------------------+
    ==> | me                                                                            |
    ==> +-------------------------------------------------------------------------------+
    ==> | Node[1]{property2->"Kashmiri",property1->"Helen Harvey",type->"user"}         |
    ==> | Node[2]{property2->"Afrikaans",property1->"Sean Matthews",type->"user"}       |
    ==> | Node[3]{property2->"Haitian Creole",property1->"William Harper",type->"user"} |
    ==> | Node[4]{property2->"Macedonian",property1->"Bruce Hill",type->"user"}         |
    ==> | Node[5]{property2->"Swahili",property1->"Chris Riley",type->"user"}           |
    ==> +-------------------------------------------------------------------------------+
    ==> 5 rows, 111 ms