Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte. I would have written a shorter letter, but I did not have the time.Written by Blaise Pascal is often misattributed to Mark Twain. It reminds us to try to be brief. Too many people never learn this.
Valentine’s Day was earlier this week, maybe you took your significant other to dinner, sent flowers or candy to your crush, even bought a card for that special someone. I bet however you didn’t profess your love to your favorite software library. I did. I love Roaring Bitmaps. Like Mariah Carey, I can’t live without it, so I won’t. I added Roaring Bitmaps to RageDB.
The folks who build the database are not the same folks who use the database and that causes problems. It has been my number 1 complaint for the past decade or so. People building features in isolation can’t see the forest for the trees and the end user experience suffers. I ran into this video from Molham Aref where he puts it quite nicely:
As much as we all love graphs, the rest of the world hasn’t quite caught on yet. They are still sending CSV files to each other like some sort of cavemen. We have a few options for dealing with them. One is to convert them to a specific file format and bulk load them into the database as fast as possible. Another is to stream them one row at a time as-is and potentially do some transformations on the fly as needed and turn each row into one or more pieces of data. Today we’re going to go with option 2.
A few years ago I was really angry at the traversal performance I was getting on some slow query. So angry that I wrote a couple thousand lines of C code just to calm myself down. This is how Neo4c came about. This blog post explains the just of it. Neo4c was able to crank out 330 million traversals per second on a single core (because it was single threaded code) for a hand crafted “query” written in C using 32 bit ids (so limited to 4b nodes and 4b relationships). It wasn’t really a fair comparison to Neo4j, but it made me realize there was a lot of performance out there to be had. Let’s see where we are today.
For some crazy reason I thought I could get away with building RageDB without a UI. I was using the “Advanced REST Client” to test functionality, it has served me well for the last few months, but it’s time we added a few coats of paint to our database server and let users interact with the graph visually instead of by squinting at JSON outputs. As a primarily back-end dev, building a front end is a scary proposition. I am barely getting the hang of C++ (yeah right) and now I’m going to have to write a bunch of Javascript and Node and React… no.
Checking if two nodes are directly connected is something you often have to do in a graph. There are a few different ways to implement this feature depending on how the database keeps track of relationships. In Neo4j a double linked list of relationships is kept per node grouped by the relationship type in both the incoming and outgoing directions. To check if two nodes are directly connected, one has to traverse one of the lists (preferably the shortest one) and checking to see if the other node id is included in that list. If we don’t know the relationship type, we have to check all the groups (for dense nodes, or light nodes there are no groups and we check them all anyway).
In Amazon Neptune the SPOG index can be queried twice. Once with the first node in the S position and the second node in the O position, then again with the positions reversed (with the P position being the relationship type). If we don’t know the relationship type we can query the indexes twice per relationship type.
Checking if two nodes are directly connected is similar to checking for set membership, and one trick we could use is a bloom filter and variant data structures. Long time readers will remember this blog post outlining exactly how to do that and achieve 100x faster checks including a “double check” to get around the probabilistic nature of these data structures.
You may have noticed that homes are selling for ridiculously high prices lately. Some are calling it inflation, others see it as a supply chain squeeze the likes of which we’ve never seen before. I’m in the latter camp and I think things will eventually return to normal levels but the “inflationistas” yearn for a simpler time. A return to the Gold Standard of currency which kept inflation in check. Do you know who else wanted to return to the gold standard?
That’s right, the “bad guy” from the G.I. Joe series wanted to burn all paper currency and use Cobra gold currency. Maybe he wasn’t so bad after all? But he did spend most of his time shouting orders at people. He commanded things to change, much in the same way we command our data to change. There is a great post on StackOverflow on this subject, I’ll reproduce the relevant bits here. If you define:
In the previous blog post we started at 32 and managed to reach 724 requests per second. We also ended it saying that I’m going to investigate a way to vectorize it so we can check multiple node properties in batches instead of doing things one at a time. Well, I found a way and the results are nothing short of outrageous.
In our Find method we were looking at a list of property values and comparing them to a single value one at a time. Computer since the early 90s have been able to perform SIMD (Single Instruction Multiple Data) operations and we’re going to take advantage of that.
At some point in your life you will look at the mirror and not recognize the person staring back at you. It’s not only the way you look, but the choices you made and the results of those choices that will seem puzzling. You may decide right then and there to embark in the most important quest of your life. Finding yourself.
Nodes and Relationships don’t get the luxury of having mirrors, but we still need to find them every once in a while. So far all we are able to do is get a node by a label and key or use the id to get a node or relationship. This is great and covers a lot of use cases, but sometimes the primary id is not what we are looking for. We may be looking for all nodes of a type that share a property value equal to something, or greater/less than something else. How can we go about finding the answers to these types of queries in our graph?