Neo4j is faster than MySQL in performing recursive query

5mysql

A user on StackOverflow was wondering about the performance between Neo4j and MySQL for performing a recursive query. They started with Neo4j performing the query in 240 seconds. Then an optimized cypher query got them down to 40 seconds. Then I got them down to…

Before the reveal, let’s perform an autopsy on the query and see what is going on. The model is a simple (c:Customer)-[f:FRIEND_OF]->(c2:Customer). There are a total of 100k Customer nodes and each has 100 relationships, for a total of 10M relationships. The query is supposed to start at a Customer Node, and traverse out 4 levels, returning the distinct CustomerIds of every node along the way.

The initial query looked like this:

MATCH (c:Customer{CustomerID:99})-[:FRIEND_OF*4]->(cc:Customer)
RETURN DISTINCT cc.CustomerID;

That’s a pretty obvious starting point, but Neo4j is performing more work than it needs to. That Match statement is holding on to whole paths 4 levels out. It doesn’t realize that we just want a single property distinctly from all those nodes.

Let me give you another example. Find all messages created by my friends, or friends of friends, or friends of friends of friends, etc 4 levels out, return their author, the content and order them by date. We could write the query like this:

MATCH (:Person {id:{personId}})-[:KNOWS*1..4]->(friend:Person)<-[:HAS_CREATOR]-(message)
WHERE message.creationDate < {date}
RETURN DISTINCT message, friend
ORDER BY message.creationDate DESC, message.id ASC
LIMIT {top}

But what’s the problem? Well, it’s doing way too much work. Not only is it finding and keeping track of all the KNOWS paths 4 levels out, it is going one further to their messages. That’s a lot of paths, and it could be traversing the HAS_CREATOR relationship over and over again if friends of friends are friends with our friends. Say that 3 times fast. So what’s a better way to write that query:

MATCH (:Person {id:{personId}})-[:KNOWS*1..4]->(friend:Person)
WITH COLLECT (DISTINCT friend) AS friends
UNWIND friends AS friend
MATCH (friend:Person)<-[:HAS_CREATOR]-(message)
WHERE message.creationDate < {date}
WITH DISTINCT message
ORDER BY message.creationDate DESC, message.id ASC
LIMIT {top}
MATCH (friend:Person)<-[:HAS_CREATOR]-(message)
RETURN message, friend

Right. So now we just get the distinct friends and then traverse to the messages of that friend just once. Andres Taylor wanted the query planner to be able to make that optimization on its own. So he and Johan Teleman are addressing this in Neo4j 3.2 with the “PruningVarExpander“. This rewriter will rewrite the query plans of these types of queries:

match (a)-[*1..3]->(b) return distinct b
match (from)-[*1..3]->(to) return count(distinct to)
match (a)-[*1..3]->(b:X) return distinct b
match (a)-[:R*1..3]->(b) with count(*) as count return distinct count
match (a)-[:R*1..3]->(b)-[:T*1..3]->(c) return distinct c
match (a)-[:R*1..3]->(b)-[:T]->(c) return distinct c
match (a) optional match (a)-[:R*1..3]->(b)-[:T]->(c) return distinct c

The common theme is that these queries do not care about the paths along the way, they just care about distinct nodes, properties or counts. However the next query is one we can’t rewrite with this rewriter:

match (a)-[*1..3]->(b) return b, count(*)

Because it is asking for the count of all paths leading to a node “b”, so the paths matter. Since we don’t have 3.2 out just yet (there are alphas available, but I tend to wait until the first milestone comes out before I get my hands dirty) we can rewrite that query like so:

match (c:Customer{CustomerID:99})-[:FRIEND_OF]->(c1)-[:FRIEND_OF]->(c2)
with distinct c2
match (c2)-[:FRIEND_OF]->(c3)
with distinct c3
match (c3)-[:FRIEND_OF]->(cc:Customer)
with distinct cc
return cc.CustomerID;

According to the poster, this updated query took Neo4j from 240 seconds to 40 seconds, which is much better than before, but they also rewrote their SQL query to make use of the same idea and got that down to 24 seconds.

Now, I’ve seen a very similar query before… and I wrote up a blog post about it. Basically there were two ways to speed this up. One by modifying how Neo4j saved temporary unique nodes as it traversed. The other doing what I called the BitSet Dance. Turns out there is a simpler way of doing it, that yields the same performance. I am going to use RoaringBitMaps instead of OpenBitSet but it’s the same idea.

Instead of having an array of bitmaps, I just have 3. One that keeps track globally of all the nodes I’ve seen and two that alternate between newly found nodes, and nodes we are traversing from.

        RoaringBitmap seen = new RoaringBitmap();
        RoaringBitmap nextA = new RoaringBitmap();
        RoaringBitmap nextB = new RoaringBitmap();

We will use RoaringBitmaps “checkedAdd” function to see if the node id had already been seen, and if it wasn’t we’ll add it to one of our next Iterators.

        while (relationshipIterator.hasNext()) {
            c = ops.relationshipCursor(relationshipIterator.next());
            c.next();
            int nodeId = (int)c.get().endNode();
            if(seen.checkedAdd(nodeId)) {
                nextA.add(nodeId);
            }
        }

We do the same thing for every level, now iterating on “nextA” to populate “nextB”.

        iterator = nextA.iterator();
        while (iterator.hasNext()) {
            relationshipIterator = ops.nodeGetRelationships((long)iterator.next(), Direction.OUTGOING);
            while (relationshipIterator.hasNext()) {
                c = ops.relationshipCursor(relationshipIterator.next());
                c.next();
                int nodeId = (int)c.get().endNode();
                if(seen.checkedAdd(nodeId)) {
                    nextB.add(nodeId);
                }
            }
        }

Next iterating on “nextB” to populate “nextA”, etc. Then at the end we want to convert all these node ids into the property that we are interested in. In our case it’s the CustomerID property so we return a stream of them:

        return StreamSupport.stream(seen.spliterator(),false).
                map(nodeId -> {
                    try {
                        return new StringResult((String) ops.nodeGetProperty(nodeId, propertyCustomerIDkey));
                    } catch (EntityNotFoundException e) {
                        e.printStackTrace();
                        return null;
                    }
                });

Let’s try it. We’ll add our stored procedure to a Neo4j 3.1.1 instance and populate it with some data. First we’ll create 100k users with a query you’ve seen before:

WITH ['Jennifer','Michelle','Tanya','Julie','Christie',
      'Sophie','Amanda','Khloe','Sarah','Kaylee'] AS names 
        FOREACH (r IN range(0,100000) | CREATE (:Customer {CustomerID:names[r % size(names)]+r}))

Next we will add the relationships. Run this query 10 times. Each run will create 1 Million relationships randomly.

        MATCH (u1:Customer),(u2:Customer)
        WITH u1,u2
        LIMIT 10000000
        WHERE rand() < 0.1
        CREATE (u1)-[:FRIEND_OF]->(u2);

Now we have our 100k nodes and 10M relationships. So what kind of performance do we get on this 4 year old laptop when we try it?

MATCH (c:Customer {CustomerID:'Jennifer0'}) 
CALL com.maxdemarzi.distinct_network(c) YIELD value 
RETURN value

screen-shot-2017-02-06-at-7-13-46-pm

2.7 seconds. Now that’s what I’m talking about. Eat it MySQL. But don’t take my word for it. Try it yourself. Code is on github as always. Now we could go even faster if we multi-threaded the traversal. I’ve already shown you how to do that, so I’ll leave it as an exercise for the reader.

Also remember that this is a very tiny graph. 1 Million nodes and 10 Million relationships, heading 4 levels deep. Try this with 100 Million nodes, or going 25 levels deep. MySQL, Postgres, Oracle Exadata, all of them will bust and Neo4j will keep on going. Listen to this presentation to hear more.

Tagged , , , , , , , , , , ,

3 thoughts on “Neo4j is faster than MySQL in performing recursive query

  1. steve says:

    Great post, thanks! I was just running these types of queries and noticed this issue myself. I am kind of surprised this hasn’t been addressed until the upcoming 3.2 release given how long Neo4j has been out.

    Anyway, I had a quick question about your code. Take the query you have at the start, which is returning customerIDs from a start node up to depth 4:

    MATCH (c:Customer{CustomerID:99})-[:FRIEND_OF*4]->(cc:Customer)
    RETURN DISTINCT cc.CustomerID;

    Your optimized code is as follows:

    match (c:Customer{CustomerID:99})-[:FRIEND_OF]->(c1)-[:FRIEND_OF]->(c2)
    with distinct c2
    match (c2)-[:FRIEND_OF]->(c3)
    with distinct c3
    match (c3)-[:FRIEND_OF]->(cc:Customer)
    with distinct cc
    return cc.CustomerID;

    However, aren’t these doing different things? It seems to me like the optimized code would only return the customerIDs at depth=4, not up to 4. Am I understanding this wrong?

    • InverseFalcon says:

      You may have misread the first query. The [:FRIEND_OF*4] means it will only look at cc nodes at exactly depth 4, just like the optimized code. If we wanted to match cc from depth 1 up to depth 4 instead, we would use [:FRIEND_OF*..4]

      • steve says:

        Yeah I missed that, good point. If we wanted to return the cc’s from 1 to 4, would the best way to do it be keeping a collection that we continually add to? Then at the end, unwind that collection and return the distinct elements?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: