In Mr. Rogers’ Neighborhood, the question “Won’t you be my neighbor?” is an invitation for somebody to be close to you. In graphs, it’s an invitation to traverse. The closest neighbors of a node are those reachable by a single relationship hop, but we can also consider nodes two, three or more hops away our neighbors as well. How can we find them in Neo4j? Using the “star”:

MATCH (b:Person{personId:'15751062'}) -[*1..4]-> (c) RETURN count(DISTINCT c);

The query above says find a Person with this personId and then traverse out 4 hops and return the count of all the distinct neighbors we find along the way. If we add the word PROFILE to that cypher query we can see what its doing:

It starts out by finding the first node using an index seek, and then it does a “VarLengthExpand(Pruning)” operation, counts the rows and outputs the result. In this test dataset, the result is 2,516,551 neighbors 4 hops out. That’s the size of a City, so this person has a ton of neighbors. It takes about 12 seconds to find all of these. What if we go out one more hop to 5 hops? In this test dataset, the result is 5,296,421 neighbors and it takes about a minute to complete. Yikes, what is it doing?

Well, the VarLengthExpand means expand a variable length path, easy enough. So what does the Pruning part that mean? It means we are pruning the source set of nodes at each hop, so we have fewer nodes to expand from. This is nice, but Cypher guarantees that all paths have unique relationships, so it keeps track of all the paths as it goes along. That’s too much work. Since all we want is the distinct neighbors, we can ignore keeping track of the paths and just keep track of the nodes we have already seen. How do we do this? Well with a stored procedure.

We will be using my favorite Java library RoaringBitmap for this trick. We will create 3 of them. One to keep track of all the nodes we have already seen. One to keep track of the nodes we need to traverse, and another to keep track of the nodes we reached in our last hop. So let’s start by adding our starting node to seen:

public Stream<NumberResult> neighbors(@Name("node") Node node, @Name("n") Number n) throws IOException { // Initialize bitmaps for iteration RoaringBitmap seen = new RoaringBitmap(); RoaringBitmap nextA = new RoaringBitmap(); RoaringBitmap nextB = new RoaringBitmap(); seen.add((int) node.getId());

Next we traverse from our starting node out and put all of the nodes found in nextB.

for (Relationship r : node.getRelationships(Direction.OUTGOING)) { nextB.add((int) r.getEndNodeId()); }

We don’t know how far we are going to go since the number of hops is a variable “n” so we will loop. First we prune all of the seen nodes from the set we are going to traverse, in this case nextB. We then add all of them to seen, and we clear nextA which is where the nodes found will go. Then for each node in nextB we traverse:

for(int i = 1; i < n.intValue(); i++) { // next even Hop nextB.andNot(seen); seen.or(nextB); nextA.clear(); for (Integer nodeId : nextB) { for (Relationship r : db.getNodeById((long) nodeId).getRelationships(Direction.OUTGOING)) { nextA.add((int) r.getEndNodeId()); } }

If the value of n is an odd number, we go ahead and traverse again in the same for loop after incrementing i, only the nextA and nextB are reversed:

i++; if (i < n.intValue()) { // next odd Hop nextA.andNot(seen); seen.or(nextA); nextB.clear(); for (Integer nodeId : nextA) { for (Relationship r : db.getNodeById((long) nodeId).getRelationships(Direction.OUTGOING)) { nextB.add((int) r.getEndNodeId()); } } }

At the end, we need to add either nextA or nextB to our seen list…

if((n.intValue() % 2) == 0) { seen.or(nextA); } else { seen.or(nextB); }

… remove our starting node, and then return the cardinality of seen which is our count:

seen.remove((int) node.getId()); return Stream.of(new NumberResult(seen.getCardinality()));

How well does this perform in our test database? Well for 4 hops, it finishes in 3.5 seconds which is better than the 12 we started with. For 5 hops, it finishes in 10 seconds, instead of the minute we had before. If we keep going, we get 6 hops in 16 seconds, 7 hops in 21 seconds, 8 hops in 22 seconds, 9 hops in 23 seconds, 10 hops in 24 seconds… Why does it seem to level off? Well, we are running out of nodes to traverse. Anytime we run into a node we have already seen, we ignore it. Our cypher query doesn’t fare so well as it has to keep track of all the paths and eats up a ton of memory doing so.

Long time readers may remember a similar trick I used at the bottom of this blog post. In that case I was keeping track of the number of distinct nodes found at different distances. If you have a need for this kind of traversal it’s available on github on this repository, but those willing to wait a bit will find them soon added to APOC.