As much as we all love graphs, the rest of the world hasn’t quite caught on yet. They are still sending CSV files to each other like some sort of cavemen. We have a few options for dealing with them. One is to convert them to a specific file format and bulk load them into the database as fast as possible. Another is to stream them one row at a time as-is and potentially do some transformations on the fly as needed and turn each row into one or more pieces of data. Today we’re going to go with option 2.
A few years ago I was really angry at the traversal performance I was getting on some slow query. So angry that I wrote a couple thousand lines of C code just to calm myself down. This is how Neo4c came about. This blog post explains the just of it. Neo4c was able to crank out 330 million traversals per second on a single core (because it was single threaded code) for a hand crafted “query” written in C using 32 bit ids (so limited to 4b nodes and 4b relationships). It wasn’t really a fair comparison to Neo4j, but it made me realize there was a lot of performance out there to be had. Let’s see where we are today.
Checking if two nodes are directly connected is something you often have to do in a graph. There are a few different ways to implement this feature depending on how the database keeps track of relationships. In Neo4j a double linked list of relationships is kept per node grouped by the relationship type in both the incoming and outgoing directions. To check if two nodes are directly connected, one has to traverse one of the lists (preferably the shortest one) and checking to see if the other node id is included in that list. If we don’t know the relationship type, we have to check all the groups (for dense nodes, or light nodes there are no groups and we check them all anyway).
In Amazon Neptune the SPOG index can be queried twice. Once with the first node in the S position and the second node in the O position, then again with the positions reversed (with the P position being the relationship type). If we don’t know the relationship type we can query the indexes twice per relationship type.
Checking if two nodes are directly connected is similar to checking for set membership, and one trick we could use is a bloom filter and variant data structures. Long time readers will remember this blog post outlining exactly how to do that and achieve 100x faster checks including a “double check” to get around the probabilistic nature of these data structures.
You may have noticed that homes are selling for ridiculously high prices lately. Some are calling it inflation, others see it as a supply chain squeeze the likes of which we’ve never seen before. I’m in the latter camp and I think things will eventually return to normal levels but the “inflationistas” yearn for a simpler time. A return to the Gold Standard of currency which kept inflation in check. Do you know who else wanted to return to the gold standard?
That’s right, the “bad guy” from the G.I. Joe series wanted to burn all paper currency and use Cobra gold currency. Maybe he wasn’t so bad after all? But he did spend most of his time shouting orders at people. He commanded things to change, much in the same way we command our data to change. There is a great post on StackOverflow on this subject, I’ll reproduce the relevant bits here. If you define:
In the previous blog post we started at 32 and managed to reach 724 requests per second. We also ended it saying that I’m going to investigate a way to vectorize it so we can check multiple node properties in batches instead of doing things one at a time. Well, I found a way and the results are nothing short of outrageous.
In our Find method we were looking at a list of property values and comparing them to a single value one at a time. Computer since the early 90s have been able to perform SIMD (Single Instruction Multiple Data) operations and we’re going to take advantage of that.
At some point in your life you will look at the mirror and not recognize the person staring back at you. It’s not only the way you look, but the choices you made and the results of those choices that will seem puzzling. You may decide right then and there to embark in the most important quest of your life. Finding yourself.
Nodes and Relationships don’t get the luxury of having mirrors, but we still need to find them every once in a while. So far all we are able to do is get a node by a label and key or use the id to get a node or relationship. This is great and covers a lot of use cases, but sometimes the primary id is not what we are looking for. We may be looking for all nodes of a type that share a property value equal to something, or greater/less than something else. How can we go about finding the answers to these types of queries in our graph?
Helene has been testing software since we met in college back in 1999. Today she is the head of QA at S&P Market Intelligence managing hundreds of people. I don’t know how she does it but she is the kind of person that breaks every piece of software she touches… even when she doesn’t want to… like the in-flight entertainment system on a long flight to France. About 15 years ago, I made a terrible mistake and asked her to test a web application I had written. She ripped it apart in minutes. I felt like the worst developer in the world, and I probably wasn’t far off. I never wanted to feel that way again, so I got better at writing tests. Now I’m only ranked second worst, right after that dev who doesn’t test at all.
We decided that our Nodes and Relationships will have Schema in our database. If we create a User Node Type and set an “age” property type as an Integer, then all User nodes will have that property as an Integer. This idea seem simple enough, but what happens if we truly don’t know the value of a property for a node? The user hasn’t told us their age, or for one of many perfectly valid reasons do not know what their age is? Or what if that age property that was once set is deleted? What do we do now?
In Part 5 of this series we talked about how we store properties, but not really how we delete them. In my first design I had “tombstone” (a value indicating this data is no longer here) by type. So for Strings it would be an empty string, for Integers it was the lowest negative value allowed, etc… but what do we do about Booleans? It’s either true or false, it doesn’t lend it self to a tombstone style value… and for the ones that do, does it even make sense?
In the movie “Field of Dreams“, a voice can be heard saying “If you build it, he will come“. For some reason people got that mixed up with “if you build it, they will come” and it became a bit of a trap that many engineers fall for. The myth being that if you build a better contraption everyone will want to use it. But that is not how the world works. You have to win the hearts and minds of the people who may want to use your product… just ask Apple which posted record results yet again.