Modeling Events in Neo4j

No. Not modeling events, I’m talking about modeling events. Things that happen at different times typically in some known sequence. If you are a long time follower of my blog you know I love promoting the date property of an event into the relationship type to make use of Neo4j’s individual Node-RelationshipType partitioning to speed up my queries, but I’m going to show you something different today.

About two years ago Instacart open sourced “The Instacart Online Grocery Shopping Dataset 2017”, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 which we will use today. Read their blog post on the subject for more information. So what does this data look like:

Since this is anonymized data, we don’t have the actual dates on when these orders happened, but we do have a sequence number for each order allowing us to know what a User bought first, and next, and so on. We will augment our model with this information by adding previous relationships between orders.

Let’s start off by creating some Schema by running these one at a time:

 
    CREATE CONSTRAINT ON (n:Aisle) ASSERT n.id IS UNIQUE;
    CREATE CONSTRAINT ON (n:Aisle) ASSERT n.name IS UNIQUE;
    CREATE CONSTRAINT ON (n:Department) ASSERT n.id IS UNIQUE;
    CREATE CONSTRAINT ON (n:Department) ASSERT n.name IS UNIQUE;                
    CREATE CONSTRAINT ON (n:Product) ASSERT n.id IS UNIQUE;
    CREATE CONSTRAINT ON (n:Product) ASSERT n.name IS UNIQUE;  
    CREATE CONSTRAINT ON (n:User) ASSERT n.id IS UNIQUE;  
    CREATE CONSTRAINT ON (n:Order) ASSERT n.id IS UNIQUE;                
    CREATE INDEX ON :Order(user_id, number);

We need to do a bit of data cleanup:

 
Replace: \"" with \" in the products.csv file, and copy everything into the "import" directory within Neo4j.

And now we can import some of this data into Neo4j. Run the following one at a time, mind you some will take a little while to finish:

 
    LOAD CSV WITH HEADERS FROM 'file:///aisles.csv' AS csvLine
    CREATE (a:Aisle {id: toInteger(csvLine.aisle_id), name:csvLine.aisle});
    
    LOAD CSV WITH HEADERS FROM 'file:///departments.csv' AS csvLine
    CREATE (d:Department {id: toInteger(csvLine.department_id), name:csvLine.department});
    
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM 'file:///products.csv' AS csvLine
    CREATE (p:Product {id: toInteger(csvLine.product_id), name:csvLine.product_name});
    
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM 'file:///products.csv' AS csvLine
    MATCH  (p:Product {id: toInteger(csvLine.product_id)}), (a:Aisle {id: toInteger(csvLine.aisle_id)})
    CREATE (p)-[:IN_AISLE]->(a);
    
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM 'file:///products.csv' AS csvLine
    MATCH  (d:Department {id: toInteger(csvLine.department_id)}), (a:Aisle {id: toInteger(csvLine.aisle_id)})
    MERGE (a)-[:IN_DEPARTMENT]->(d);
    
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS csvLine
    WITH DISTINCT csvLine.user_id AS user_id
    MERGE (u:User {id: toInteger(user_id)});

    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS csvLine
    CREATE (o:Order {id: toInteger(csvLine.order_id), set: csvLine.eval_set, user_id: toInteger(csvLine.user_id), number: toInteger(csvLine.order_number),
            dow: toInteger(csvLine.order_dow), hour: toInteger(csvLine.order_hour_of_day) });
    
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS csvLine
    MATCH  (u:User {id: toInteger(csvLine.user_id)}), (o:Order {id: toInteger(csvLine.order_id)})
    CREATE (u)<-[:ORDERED_BY]-(o);
    
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS csvLine
    MATCH  (o:Order {id: toInteger(csvLine.order_id)}), (po:Order {user_id: toInteger(csvLine.user_id), number: toInteger(csvLine.order_number) - 1})
    CREATE (o)-[:PREV]->(po);
    
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM 'file:///order_products__prior.csv' AS csvLine
    MATCH  (o:Order {id: toInteger(csvLine.order_id)}), (p:Product {id: toInteger(csvLine.product_id)})
    CREATE (o)-[:HAS {order: toInteger(csvLine.add_to_cart_order), reordered: toBoolean(csvLine.reordered)} ]->(p);
    
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM 'file:///order_products__train.csv' AS csvLine
    MATCH  (o:Order {id: toInteger(csvLine.order_id)}), (p:Product {id: toInteger(csvLine.product_id)})
    CREATE (o)-[:HAS {order: toInteger(csvLine.add_to_cart_order), reordered: toBoolean(csvLine.reordered)} ]->(p);

Once we import this data, we will be able to see chains of orders:

 
MATCH (n:Order) RETURN n LIMIT 25

Ok, let’s explore this dataset. How many Aisles do we have?

 
MATCH (n:Aisle) 
RETURN COUNT(n)

134. That is one way to look at the Product Breadth. Let’s look at Product Depth by seeing how much variety we have in those aisles by finding the number of products of each Aisle.

 
MATCH (n:Aisle) 
RETURN n.name, SIZE((n)--()) AS number_of_products
ORDER BY number_of_products DESC 

Our answer below shows us we have a bunch that don’t really have one, tons of chocolate, ice cream, vitamin supplements, etc and at the bottom bulk dried fruits and vegetables.

 
╒═══════════════════════════════╤════════════════════╕
│"n.name"                       │"number_of_products"│
╞═══════════════════════════════╪════════════════════╡
│"missing"                      │1259                │
├───────────────────────────────┼────────────────────┤
│"candy chocolate"              │1247                │
├───────────────────────────────┼────────────────────┤
│"ice cream ice"                │1092                │
├───────────────────────────────┼────────────────────┤
│"vitamins supplements"         │1039                │
├───────────────────────────────┼────────────────────┤
│"yogurt"                       │1027                │
├───────────────────────────────┼────────────────────┤
│"chips pretzels"               │990                 │
├───────────────────────────────┼────────────────────┤
│"tea"                          │895                 │
├───────────────────────────────┼────────────────────┤
...
├───────────────────────────────┼────────────────────┤
│"packaged produce"             │33                  │
├───────────────────────────────┼────────────────────┤
│"bulk grains rice dried goods" │27                  │
├───────────────────────────────┼────────────────────┤
│"bulk dried fruits vegetables" │13                  │
└───────────────────────────────┴────────────────────┘

Ok, let’s now look at just one product. I’m going to pick Organic Bananas since I buy these all the time.

 
MATCH (productA:Product {name: "Bag of Organic Bananas"})
RETURN productA

We have an id property of 13176, from here on out we will use that to make the queries shorter.

 
╒════════════════════════════════════════════╕
│"productA"                                  │
╞════════════════════════════════════════════╡
│{"name":"Bag of Organic Bananas","id":13176}│
└────────────────────────────────────────────┘

What can we learn about this product?

 
// What Aisle is it in?
MATCH (productA:Product {id: 13176})-[:IN_AISLE]->(aisle)
RETURN productA, aisle
 
// What other products are in that Aisle?
MATCH (productA:Product {id: 13176})-[:IN_AISLE]->(aisle)<-[:IN_AISLE]-(productB)
RETURN productA, aisle, productB
 
// How many other products are in that Aisle?
MATCH (productA:Product {id: 13176})-[:IN_AISLE]->(aisle:Aisle)
RETURN productA, aisle, SIZE((aisle)<-[:IN_AISLE]-()) AS number_of_products
 
// Frequently Ordered With a "Bag of Organic Bananas"
MATCH (productA:Product {id: 13176})<-[:HAS]-(order)-[:HAS]->(productB)
RETURN productB, COUNT(*)
ORDER BY COUNT(*) DESC
 
// Frequently Ordered With a "Bag of Organic Bananas" and what aisle?
MATCH (productA:Product {id: 13176})<-[:HAS]-(order)-[:HAS]->(productB)-[:IN_AISLE]->(aisle)
RETURN productB, aisle, COUNT(*)
ORDER BY COUNT(*) DESC
 
// Frequently Ordered With a "Bag of Organic Bananas" in the same aisle?
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 13176})<-[:HAS]-(order)-[:HAS]->(productB)-[:IN_AISLE]->(aisle)
RETURN productB, COUNT(*)
ORDER BY COUNT(*) DESC
 
// Counts of Frequently Ordered With Products to a "Bag of Organic Bananas"
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 13176})<-[:HAS]-(order)-[:HAS]->(productB)-[:IN_AISLE]->(aisle)
RETURN productA, SIZE((productA)-[:HAS]-()) AS productAOrders, productB, SIZE((productB)-[:HAS]-()) AS productBOrders, COUNT(*) AS both
ORDER BY both DESC
 
// Score of Frequently Ordered With Products to a "Bag of Organic Bananas"
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 13176})<-[:HAS]-(order)-[:HAS]->(productB)-[:IN_AISLE]->(aisle)
WITH productA, SIZE((productA)-[:HAS]-()) AS productAOrders, productB, SIZE((productB)-[:HAS]-()) AS productBOrders, COUNT(*) AS both
RETURN productA.name, productB.name, both/(1.0 * (productAOrders + productBOrders - both)) AS score
ORDER BY score DESC

This query takes a few seconds to run, but let’s go ahead and look at the results. People who buy a Bag of Organic Bananas tend to also purchase Organic Hass Avocados and Organic Strawberries. That kinda makes sense. What is the product in the same Aisle that people who buy a Bag of Organic Bananas purchases the least…

 
╒════════════════════════╤═══════════════════════════════════════════╤════════════════════════╕
│"productA.name"         │"productB.name"                            │"score"                 │
╞════════════════════════╪═══════════════════════════════════════════╪════════════════════════╡
│"Bag of Organic Bananas"│"Organic Hass Avocado"                     │0.11752376389629904     │
├────────────────────────┼───────────────────────────────────────────┼────────────────────────┤
│"Bag of Organic Bananas"│"Organic Strawberries"                     │0.10680334431046294     │
├────────────────────────┼───────────────────────────────────────────┼────────────────────────┤
│"Bag of Organic Bananas"│"Organic Large Extra Fancy Fuji Apple"     │0.05424507800084253     │
├────────────────────────┼───────────────────────────────────────────┼────────────────────────┤
│"Bag of Organic Bananas"│"Apple Honeycrisp Organic"                 │0.05325903905255133     │
├────────────────────────┼───────────────────────────────────────────┼────────────────────────┤
│"Bag of Organic Bananas"│"Organic Lemon"                            │0.05067739140018715     │
├────────────────────────┼───────────────────────────────────────────┼────────────────────────┤
...
├────────────────────────┼───────────────────────────────────────────┼────────────────────────┤
│"Bag of Organic Bananas"│"Bananas"                                  │0.0000024973216225598046│
└────────────────────────┴───────────────────────────────────────────┴────────────────────────┘

No surprise here, it was regular Bananas. Maybe if you needed lots of bananas and the store ran out of the organic ones, you would buy the regular ones, but typically you pick one or the other. So people who tend to buy Organic Bananas tend to also buy other Organic Produce from the same Aisle, makes sense. What else can we learn from this dataset? What about products that you tend to buy only one thing from an aisle?

 
// Like Pasta Sauce?
MATCH (n:Aisle {name:"pasta sauce"})<-[:IN_AISLE]-(productA)
RETURN n, productA
 
// Let's focus on one item in the pasta sauce
MATCH (productA:Product {name:"Tomato Basil Pasta Sauce"})
RETURN  productA
 
// How many times have people purchased this Product?
MATCH (productA:Product {name:"Tomato Basil Pasta Sauce"})
RETURN  productA, SIZE((productA)<-[:HAS]-()) AS ordered

The answer is 1213 btw. Keep this in mind.

 
// Are there Frequently Ordered With Products in the same Aisle as "Tomato Basil Pasta Sauce" ?
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 48341})<-[:HAS]-(order)-[:HAS]->(productB)-[:IN_AISLE]->(aisle)
WITH productA, SIZE((productA)-[:HAS]-()) AS productAOrders, productB, SIZE((productB)-[:HAS]-()) AS productBOrders, COUNT(*) AS both
RETURN productA.name, productB.name, both/(1.0 * (productAOrders + productBOrders - both)) AS score
ORDER BY score DESC

Let’s look at the results and comparing it with the Organic Bananas. Avocados and Strawberries had a score of 0.11 and 0.10. The best frequently ordered with score for Tomato Basil Pasta Sauce is Garlic Marinara Pasta Sauce at 0.01. Users tend to purchase one type of pasta sauce at a time.

 
╒══════════════════════════╤═══════════════════════════════════════════╤══════════════════════╕
│"productA.name"           │"productB.name"                            │"score"               │
╞══════════════════════════╪═══════════════════════════════════════════╪══════════════════════╡
│"Tomato Basil Pasta Sauce"│"Garlic Marinara Pasta Sauce"              │0.012709710218607015  │
├──────────────────────────┼───────────────────────────────────────────┼──────────────────────┤
│"Tomato Basil Pasta Sauce"│"Spicy Marinara Pasta Sauce"               │0.006402048655569782  │
├──────────────────────────┼───────────────────────────────────────────┼──────────────────────┤
│"Tomato Basil Pasta Sauce"│"Organico Bello Tomato Basil Pasta Sauce"  │0.0036380172805820826 │
├──────────────────────────┼───────────────────────────────────────────┼──────────────────────┤
│"Tomato Basil Pasta Sauce"│"Vodka Pasta Sauce"                        │0.003054989816700611  │
├──────────────────────────┼───────────────────────────────────────────┼──────────────────────┤
...

Now we get to take advantage of the previous order relationships we created earlier to trace the sequence of orders and see what our users bought after buying Tomato Basil Pasta Sauce.

 
// What Products did people who bought "Tomato Basil Pasta Sauce" purchase in their next order from the same aisle?
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 48341})<-[:HAS]-(order)<-[:PREV]-(nextOrder)-[:HAS]->(productB)-[:IN_AISLE]->(aisle)
RETURN productB, COUNT(*) 
ORDER BY COUNT(*) DESC

Interesting. Garlic Marinara and Basil Pesto come out on top:

 
╒══════════════════════════════════════════════════════════════════════╤══════════╕
│"productB"                                                            │"COUNT(*)"│
╞══════════════════════════════════════════════════════════════════════╪══════════╡
│{"name":"Garlic Marinara Pasta Sauce","id":26695}                     │14        │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Basil Pesto","id":12206}                                     │14        │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Marinara Pasta Sauce","id":38273}                            │11        │
├──────────────────────────────────────────────────────────────────────┼──────────┤
...

But that was just the immediate next order, what if they still had some Tomato Basil Pasta Sauce left and didn’t really need to buy more pasta sauce in general. Lets go to the next 3 orders by using the magical * in Cypher which means “keep going” and limiting it to 1..3:

 
// What about in the last 3 orders?
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 48341})<-[:HAS]-(order)<-[:PREV*1..3]-(nextOrder)-[:HAS]->(productB)-[:IN_AISLE]->(aisle)
RETURN productB, COUNT(*) 
ORDER BY COUNT(*) DESC
 
// // What about all of the previous orders?
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 48341})<-[:HAS]-(order)<-[:PREV*]-(nextOrder)-[:HAS]->(productB)-[:IN_AISLE]->(aisle)
RETURN productB, COUNT(*) 
ORDER BY COUNT(*) DESC

Looks like it’s neck and neck between Marinara Pasta sauce and the Organic Tomato Basil Pasta sauce.

 
╒══════════════════════════════════════════════════════════════════════╤══════════╕
│"productB"                                                            │"COUNT(*)"│
╞══════════════════════════════════════════════════════════════════════╪══════════╡
│{"name":"Marinara Pasta Sauce","id":38273}                            │206       │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Organic Tomato Basil Pasta Sauce","id":39619}                │205       │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Basil Pesto","id":12206}                                     │205       │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Garlic Marinara Pasta Sauce","id":26695}                     │142       │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Organico Bello Tomato Basil Pasta Sauce","id":8088}          │135       │
├──────────────────────────────────────────────────────────────────────┼──────────┤

That’s a bit weird isn’t it. Why doesn’t the regular Tomato Basil Pasta sauce show up in our result set? Ah… well you see in Cypher we can only traverse a relationship ONCE in a MATCH clause. This part:

 
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 48341})... (productB)-[:IN_AISLE]->(aisle)

…is causing us to filter out the original product. So what can we do?

We can move it to the WHERE clause.

 
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 48341})<-[:HAS]-(order)<-[:PREV*]-(nextOrder)-[:HAS]->(productB)
WHERE (productB)-[:IN_AISLE]->(aisle)
RETURN productB, COUNT(*) 
ORDER BY COUNT(*) DESC

Now things make more sense as people tend to repurchase the same item 15x more than purchasing something else.

 
╒══════════════════════════════════════════════════════════════════════╤══════════╕
│"productB"                                                            │"COUNT(*)"│
╞══════════════════════════════════════════════════════════════════════╪══════════╡
│{"name":"Tomato Basil Pasta Sauce","id":48341}                        │3159      │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Marinara Pasta Sauce","id":38273}                            │206       │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Organic Tomato Basil Pasta Sauce","id":39619}                │205       │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Basil Pesto","id":12206}                                     │205       │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Garlic Marinara Pasta Sauce","id":26695}                     │142       │
├──────────────────────────────────────────────────────────────────────┼──────────┤
...

But what a minute how did we get 3159? That’s more than the actual purchases (1213) of the Tomato Basil Pasta Sauce! Why is that? Oh because we are counting Paths of variable lengths and if a customer bought the same sauce multiple times it would multiply the number of paths even though the orders only go up by one. Instead let us count the number of distinct orders where this happened instead.

 
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 48341})<-[:HAS]-(order)<-[:PREV*]-(nextOrder)-[:HAS]->(productB)
WHERE (productB)-[:IN_AISLE]->(aisle)
WITH productA, productB, COUNT(DISTINCT order) AS orders
RETURN productA.name, productB.name, orders
ORDER BY orders DESC
 
╒══════════════════════════╤════════════════════════════════════════╤════════╕
│"productA.name"           │"productB.name"                         │"orders"│
╞══════════════════════════╪════════════════════════════════════════╪════════╡
│"Tomato Basil Pasta Sauce"│"Tomato Basil Pasta Sauce"              │611     │
├──────────────────────────┼────────────────────────────────────────┼────────┤
│"Tomato Basil Pasta Sauce"│"Basil Pesto"                           │95      │
├──────────────────────────┼────────────────────────────────────────┼────────┤
│"Tomato Basil Pasta Sauce"│"Organic Tomato Basil Pasta Sauce"      │92      │
├──────────────────────────┼────────────────────────────────────────┼────────┤
...

Let’s try to get a score to make more sense of this:

 
// Score of Frequently Ordered NEXT Products to a "Tomato Basil Pasta Sauce"
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 48341})<-[:HAS]-(order)<-[:PREV*]-(nextOrder)-[:HAS]->(productB)
WHERE (productB)-[:IN_AISLE]->(aisle)
WITH productA, SIZE((productA)-[:HAS]-()) AS productAOrders, productB, SIZE((productB)-[:HAS]-()) AS productBOrders, COUNT(DISTINCT order) AS orders
RETURN productA.name, productB.name, orders/(1.0 * (productAOrders + productBOrders - orders)) AS score
ORDER BY score DESC

… and here we go:

 
╒══════════════════════════╤══════════════════════════════════════════════╤══════════════════════╕
│"productA.name"           │"productB.name"                               │"score"               │
╞══════════════════════════╪══════════════════════════════════════════════╪══════════════════════╡
│"Tomato Basil Pasta Sauce"│"Tomato Basil Pasta Sauce"                    │0.33663911845730027   │
├──────────────────────────┼──────────────────────────────────────────────┼──────────────────────┤
│"Tomato Basil Pasta Sauce"│"Garlic Marinara Pasta Sauce"                 │0.02945736434108527   │
├──────────────────────────┼──────────────────────────────────────────────┼──────────────────────┤
│"Tomato Basil Pasta Sauce"│"Organico Bello Tomato Basil Pasta Sauce"     │0.018928901200369344  │
├──────────────────────────┼──────────────────────────────────────────────┼──────────────────────┤
│"Tomato Basil Pasta Sauce"│"Spicy Marinara Pasta Sauce"                  │0.015503875968992248  │
├──────────────────────────┼──────────────────────────────────────────────┼──────────────────────┤
...

As you can see, using the previous order relationships to build sequences of events make these kind of queries super easy to write. The Event model is a great pattern to know, you can learn more about it in a different use case. It’s so good you may even call it a Super Model.

Tagged , , , , , , , , , ,

One thought on “Modeling Events in Neo4j

  1. Hi Max,
    I think when you are passing the reordered property [1/0] from the csv it will catch null for all products as in cypher toBoolean needs an integer =”True”/”true” as input.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: