If you are friends with Jessie, and Jessie is friends with Amy, there is a good chance you’ll eventually become friends with Amy too. In graph terms, this is a graph with three nodes and two relationships eventually gaining a third relationship to form a clique. This simple concept is one of the bases for recommendation engines. There are fancy terms for it, like “triadic closure”, but basically it just means we are making triangles. But what about Amy’s friend Delilah? Now that you are friends with Amy, is there a good chance you’ll become friends with Delilah? What about Jessie and Delilah? Can we extend the pattern to four or five nodes and go beyond our simple triangle?
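Before we get to graph databases, here is a minimal sketch of triadic closure as a friend recommendation in plain Python — the friendship data and names are made up for illustration:

```python
# A toy friendship graph as adjacency sets (hypothetical data).
friends = {
    "you":     {"jessie"},
    "jessie":  {"you", "amy"},
    "amy":     {"jessie", "delilah"},
    "delilah": {"amy"},
}

def recommend(person):
    """Suggest friends-of-friends we aren't already connected to (triadic closure)."""
    suggestions = set()
    for friend in friends[person]:
        suggestions |= friends[friend]
    return suggestions - friends[person] - {person}

print(recommend("you"))  # amy is a friend-of-a-friend via jessie
```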

These distinctive patterns are called motifs. I ran into a paper by Ghadeer Abuoda and friends titled “Link Prediction via Higher-Order Motif Features” on using these patterns to make better predictions (duh) and it got my interest. Neo4j’s query language, Cypher, is specifically designed to find patterns in a graph. Let’s take a look at some of these motifs and how we can represent each one as a Cypher query. We’ll start with M3.1:

The Cypher is: `MATCH (n1)-[r1]-(n2)-[r2]-(n3)`

Pretty straightforward. Now with M3.2, we need to close the triangle. In Cypher we do this by repeating the first node of our pattern at the end.

The Cypher is: `MATCH (n1)-[r1]-(n2)-[r2]-(n3)-[r3]-(n1)`
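To see what “repeating the first node” buys us, here is a hedged Python sketch that enumerates the same closed-triangle pattern over a toy adjacency list (the graph data is made up):

```python
from itertools import permutations

# Toy undirected graph (made-up data): one triangle plus a dangling node.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

def triangles(adj):
    """Mirror MATCH (n1)--(n2)--(n3)--(n1): three-hop paths that loop back
    to the starting node. Each triangle is kept once via a frozenset."""
    found = set()
    for n1, n2, n3 in permutations(adj, 3):
        if n2 in adj[n1] and n3 in adj[n2] and n1 in adj[n3]:
            found.add(frozenset((n1, n2, n3)))
    return found

print(triangles(adj))  # only a-b-c closes into a triangle
```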

Let’s move on to 4-node motifs and try M4.2:

Ok, we have a problem here. Cypher patterns are written in a single dimension, and a linear pattern can only connect a node to two other nodes, but in this motif node 2 needs to connect to node 1, node 3, and node 4! So what can we do? Well, in Cypher we can use the comma as an “AND” to signify that the pattern continues. The Cypher in this case is: `MATCH (n1)-[r1]-(n2)-[r2]-(n3), (n2)-[r3]-(n4)`
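To make the comma-as-“AND” idea concrete, here is a hedged Python sketch that enumerates the same M4.2 pattern over a toy adjacency list (the node data is made up):

```python
from itertools import permutations

# Toy undirected star-ish graph (hypothetical data): node 2 is the hub.
adj = {
    1: {2},
    2: {1, 3, 4},
    3: {2},
    4: {2},
}

def match_m42(adj):
    """Find (n1, n2, n3, n4) where n1-n2-n3 is a path AND n2-n4 exists,
    mirroring: MATCH (n1)-[r1]-(n2)-[r2]-(n3), (n2)-[r3]-(n4)."""
    for n1, n2, n3, n4 in permutations(adj, 4):
        if n2 in adj[n1] and n3 in adj[n2] and n4 in adj[n2]:
            yield (n1, n2, n3, n4)

matches = list(match_m42(adj))
```

Note that, just like Cypher, this returns one match per ordering of the outer nodes — node 2 is the only node with three neighbors, so it sits in the hub position of every match.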

So far so good. Let’s move on to 5 nodes and try M5.8:

The pattern is pretty crazy, but we can actually write the whole thing without using any commas in this case: `MATCH (n1)-[r1]-(n3)-[r2]-(n4)-[r3]-(n5)-[r4]-(n1)-[r5]-(n4)-[r6]-(n2)-[r7]-(n5)-[r8]-(n3)-[r9]-(n2)`

Ok, that’s neat and all, but how do these help with recommendations? Well, we can follow the same path Mark Needham and Amy Hodler walk us through in Chapter 8 of their book “Graph Algorithms”. You can get a free digital version of the book by following this link.

We are going to trace the same path, but instead of calculating “common authors”, “preferential attachment”, and “total neighbors”, we’ll try calculating a motif. In this case, we will try Motif 4.1.

```python
def apply_motif_41_training_feature(data):
    query = """
    UNWIND $pairs AS pair
    MATCH (n1) WHERE id(n1) = pair.node1
    MATCH (n4) WHERE id(n4) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           size([(n1)-[r1:CO_AUTHOR_EARLY]-(n2)-[r2:CO_AUTHOR_EARLY]-(n3)-[r3:CO_AUTHOR_EARLY]-(n4)
                 WHERE n1 <> n3 AND n1 <> n4 AND n2 <> n4 | n1]) AS motif_41
    """
    pairs = [{"node1": row["node1"], "node2": row["node2"]}
             for row in data.collect()]
    features = spark.createDataFrame(graph.run(query, {"pairs": pairs}).to_data_frame())
    return data.join(features, ["node1", "node2"])
```

We will do the same for the testing feature, except using the CO_AUTHOR relationship type instead of CO_AUTHOR_EARLY. Then we’ll apply the features to the training and test data.

```python
training_data = apply_motif_41_training_feature(training_data)
test_data = apply_motif_41_testing_feature(test_data)
```

Next, we’ll train our model:

```python
motif_41_model = train_model(["motif_41"], training_data)
```

Evaluate it and display its results:

```python
motif_41_results = evaluate_model(motif_41_model, test_data)
display_results(motif_41_results)
```
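The helpers `train_model`, `evaluate_model`, and `display_results` come from the book’s companion notebook. As a rough idea of what they do, here is a minimal sketch using scikit-learn in place of the book’s Spark random forest — the toy data, column names, and hyperparameters here are my assumptions, not the book’s code:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

def train_model(fields, training_data):
    """Fit a random forest on the given feature columns (sketch of the book's helper)."""
    model = RandomForestClassifier(n_estimators=30, random_state=0)
    model.fit(training_data[fields], training_data["label"])
    return model

def evaluate_model(model, test_data, fields=("motif_41",)):
    """Return accuracy/recall/precision for the model on held-out pairs."""
    predictions = model.predict(test_data[list(fields)])
    actual = test_data["label"]
    return pd.DataFrame({
        "Measure": ["accuracy", "recall", "precision"],
        "Score": [accuracy_score(actual, predictions),
                  recall_score(actual, predictions),
                  precision_score(actual, predictions)],
    })

# Hypothetical toy data: author pairs with a motif count and a link label.
train = pd.DataFrame({"motif_41": [0, 0, 1, 3, 5, 2], "label": [0, 0, 0, 1, 1, 1]})
test = pd.DataFrame({"motif_41": [0, 4], "label": [0, 1]})

model = train_model(["motif_41"], train)
results = evaluate_model(model, test)
```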

…and here are our results:

| Measure | Score |
| --- | --- |
| accuracy | 0.936013 |
| recall | 0.872027 |
| precision | 1.000000 |
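As a quick sanity check on those numbers: a precision of 1.0 means zero false positives, so every negative pair was classified correctly. Assuming the test set is balanced between positive and negative pairs (the book downsamples negative examples to match positives), accuracy should then be the average of recall and a perfect true-negative rate:

```python
# Reported scores from the table above.
recall = 0.872027
true_negative_rate = 1.0  # precision of 1.0 implies zero false positives

# Assumes a balanced test set, so accuracy averages the two per-class rates.
accuracy = (recall + true_negative_rate) / 2
print(accuracy)  # ~0.936, matching the reported accuracy
```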

Our model is at about 94% accuracy. Not bad for using just a single graph feature, “Motif 4.1”, which in our case means finding the authors who were three relationships away from us in 2006 and ended up becoming co-authors later in time. You can see the Jupyter notebook on Anaconda. Anyway, feel free to experiment and try adding graph features to your models.