Multiple origin multiple destination 3 relationships queries for knowledge graphs using Neo4j

The multiple-origin-multiple-destination (MOMD) problem is an NP-Hard problem sometimes seen in logistics planning where paths can stretch out really far. A far simpler problem presents itself when we limit the size of the paths. Now you may be wondering, why would we do that? Well… outside logistics we have plenty of graphs where relevance drops as we get further and further away. Think about an Article on Wikipedia. It has links to many other articles that are relevant, and those have links to other articles that are relevant to them but less relevant to our starting Article, and those have links to other articles that may be relevant to them, but have very little to do with our starting Article. I think if we keep going we end up in Philosophy or something like that.

You’ve experienced this if it’s 3am and you don’t know how you ended up on this dark corner of the internet. So what’s our limit? Well that depends on your data, but if your limit happens to be 3 relationships, then you are in luck. Today I’m going to show you how to write a slightly crazy Cypher stored procedure to find the lowest weighted, shortest path up to 3 relationships away from multiple origin to multiple destinations. Before we begin, take a look at the following video to understand Knowledge Graphs a little better and where this problem presents itself:

The performance of a query in Neo4j depends on how much of the graph you traverse. The less of the graph we look at, the faster we go, regardless of the total size of the graph. There is a small thing to note here though… if you look at a small part of the graph multiple times, it counts as if you are looking at a big part of the graph. So in our query we want to avoid traversing a relationship more than once.

Before I give you the answer, let’s look at a few wrong ways to do this query. First, you might think I’ll just find the best path for each origin-destination pair one at a time. I think the heat death of the universe will arrive before this query finishes. Another approach may be to start with all the origin nodes and traverse outward 3 relationships and see if you run into any destination nodes along the way. This is also a bad idea as you will be exploring areas of the graph that have nothing to do with the destination nodes. So what do we do?

A bi-directional traversal. We start with the origin nodes and jump one hop, keeping track of where we landed as well as where we came from and checking if we landed in any of the destination nodes. Then from the destination nodes, we jump “backwards” (hopefully) toward our origin nodes and keep track of the nodes where we landed and how we got there as the “third” hop. We check to see if any of these are the origin nodes, or the nodes we found in the first hop. Now we can go from the set of nodes in the first or third hop and traverse one last time to see if they meet. You’d want to choose not the smallest set, but the set with the lowest degree. This is a spot where Neo4j can help us. For any node with more than 40 (configurable) relationships it keeps track of the degree of the node. But let’s assume we don’t have a densely connected graph, and just pick one of the sets to traverse.

At this point we are done traversing the graph, but we still need to take the completed paths, look at each set of origin-destination pairs and keep just the best one (measured as the lowest weighted shortest path). So you may be thinking, no problem, I’ll just collect them all and then filter at the end. Wrong, that’s not going to work. You’ll run out of memory. You need to keep just the best path for each pair as you go along. So how do we do that? Well we could use a HashMap and combine the ids of the two nodes as a String key and check it every time we have a connected path. Wrong, you’ll create so many Strings that Michio Kaku will write a book about it. I then tried squeezing two Node ids into a long and using a primitive optimized map. I got stuck here for a little while, but let me tell you the story of how I got past it.

I was in Montreal for a few days visiting family, when while checking Twitter I saw that Daniel Lemire was giving a talk at the University of Montreal.

I’m giving a talk on big data engineering (in French) downtown Montreal, Sept. 13th 10h30, at UQAM’s President Kennedy building, room PK-4610. Come and say hi!

— Daniel Lemire (@lemire) September 6, 2018

Daniel is known as a performance expert, so I figured he’d know what to do. I explained my problem, exchanged a couple of emails and he said that hashing is the wrong approach and I should be using Dictionary coding. Getting a value from a Hash is an O(1) operation, getting a value from an Array is also O(1), so what’s the difference? Well the O(1) is hiding the Constant. The cost of Hashing a value is more than multiplying a number by the size of each record as you would in an array.

A little trivia here folks, that is the way Neo4j works. It uses fixed sized record for nodes and relationships, so if you know the ID of a node or relationship you can jump to it immediately. Some other graph databases need to hash when traversing, and while that works fine when the amount of traversals is small, it will get worse and worse. Mind the constants.

But how do I use an array when the node IDs of my starting and ending node could be any number. I can’t just use a giant array, billions of records long, I’ll run out of memory and be back where I started! This is where the Dictionary comes in. We can’t change the node IDs of Neo4j, but when we make the first hop, instead of keeping track of the origin node id where we came from, we will convert it into a position in an origin list. Then when we make the “third” hop we will not keep track of the destination node, but rather convert it into a position in a destination list. The size of the array where we keep track of our best paths is just the number of origin nodes times the number of destination nodes. So if you have 100 of each, we’re talking about an array of size 10,000.

Maybe it’s time to show some code, here we build the dictionary (and a way to reverse it):

 
        // Build a dictionary for the first and fourth nodes
        Long2IntOpenHashMap fromDictionary = new Long2IntOpenHashMap();
        Long2IntOpenHashMap toDictionary = new Long2IntOpenHashMap();
        Int2LongOpenHashMap reverseFromDictionary = new Int2LongOpenHashMap();
        Int2LongOpenHashMap reverseToDictionary = new Int2LongOpenHashMap();
        int fromCounter = 0;
        for (Node fromNode : fromNodes) {
            fromDictionary.put(fromNode.getId(), fromCounter);
            reverseFromDictionary.put(fromCounter, fromNode.getId());
            fromCounter++;
        }
        int toCounter = 0;
        for (Node toNode : toNodes) {
            toDictionary.put(toNode.getId(), toCounter);
            reverseToDictionary.put(toCounter, toNode.getId());
            toCounter++;
        }

        int forNodeMultiplier = toNodes.size();

We can preload the arrays with the worst weights and lengths:

 
        // Store all values in O(1) access arrays
        double[] values = new double[fromNodes.size() * toNodes.size()];
        int[] lengths = new int[fromNodes.size() * toNodes.size()];
        Object[] bestPaths = new Object[fromNodes.size() * toNodes.size()];
        RoaringBitmap pathsFound = new RoaringBitmap();

        for(int i = 0; i < values.length; i++) {
            values[i] = 9999999.0d;
            lengths[i] = 99;
        }

We get our array location or “key” by multiplying the origin position times the number of destination nodes and adding the position of the destination node.

 
int key = (forNodeMultiplier * fromDictionary.get(nodeId)) + toDictionary.get(toNodeId);

Now we can check the weights and path lengths using an array instead of a hash and get our answer quickly. Now… you may be thinking, is this really necessary? There are some pretty fast performance focused maps in Java. Yes I know. I tried them, but when you have to do 100 Million gets, you won’t beat an Array. As always, the code is on github. I’m not going to go over every detail on here. It can probably be cleaned up a bit and there are more performance savings to be had, but I left my blood, sweat and tears there and I’m sick of looking at it. Send me a pull request if you come up with a faster version. If you have to hold your nose every time you look at Java code (specially mine), then do it. While Cypher has optimizations for 2 hop and shortest path traversals, this kind of 3-hop optimization hasn’t made it in yet. But it’s on the list.

In the meantime, give this stored procedure a shot. In one trial it returned about 4000 out of 85M paths in just 2 seconds. Anyway, Neo4j is amazing fast. If anybody tells you they can’t get their query to perform, send them my way. I love this stuff, and as I tell people: “we can always go faster”.

Update: I just realized, this trick will work finding the best 4 hop paths from 1 origin to many destinations (or vice versa) and best 5 hop paths from 1 origin to 1 destination.

2 thoughts on “Multiple origin multiple destination 3 relationships queries for knowledge graphs using Neo4j”

Gabi Toma says:

September 19, 2018 at 7:58 AM

+1 for referencing Michio Kaku

This Week in Neo4j - GraphConnect Keynote, Graphs4Good, APOC and Graph Algos Releases - Neo4j Graph Database Platform says:

September 22, 2018 at 11:27 PM

[…] De Marzi has written a blog post about the multiple-origin-multiple-destination (MOMD), and how to solve a simpler version of it […]

Max De Marzi

Graphs, Graphs, and nothing but the Graphs