Back in February, we added the ability to load a CSV file and alter the contents while importing it. We also added Date support to RageDB using a Lua library. This was a masterful job of copy and paste and got us lots of functionality very quickly. When we timed the import for LDBC SNB SF10 it came in at 28 minutes, which wasn’t bad, but wasn’t great. Let’s try to speed that up today.
Typically we want to Reduce, Reuse and Recycle to help the environment. But today we are going to Reduce, Reuse and Recycle the Lua Sandbox Environment to give us two additional sets of permissions. The first is “Read Write” in which a user can read and write to the database but cannot create new types of nodes or relationships or data types. The second is “Read Only” which does what it sounds like.
The idea of using a programming language as the way to write queries against the database makes many security folks hyperventilate. In order to lower their heart rate and slow their breathing we have to limit what the queries can do using a technique known as “sandboxing”. The Sol2 library we are using in RageDB lets us create an “environment” where our queries will run. Let’s see how we go about doing this.
“Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte.” In English: “I would have written a shorter letter, but I did not have the time.” This line was written by Blaise Pascal, but it is often misattributed to Mark Twain. It reminds us to try to be brief. Too many people never learn this.
Valentine’s Day was earlier this week. Maybe you took your significant other to dinner, sent flowers or candy to your crush, or even bought a card for that special someone. I bet, however, you didn’t profess your love to your favorite software library. I did. I love Roaring Bitmaps. Like Mariah Carey, I can’t live without it, so I won’t. I added Roaring Bitmaps to RageDB.
The folks who build the database are not the same folks who use the database and that causes problems. It has been my number 1 complaint for the past decade or so. People building features in isolation can’t see the forest for the trees and the end user experience suffers. I ran into this video from Molham Aref where he puts it quite nicely:
As much as we all love graphs, the rest of the world hasn’t quite caught on yet. They are still sending CSV files to each other like some sort of cavemen. We have a few options for dealing with them. One is to convert them to a specific file format and bulk load them into the database as fast as possible. Another is to stream them one row at a time as-is and potentially do some transformations on the fly as needed and turn each row into one or more pieces of data. Today we’re going to go with option 2.
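To make option 2 a little more concrete, here is a minimal sketch in Python. It is not how RageDB does it (RageDB does this from Lua); the `stream_rows` and `person_and_city` names and the column layout are made up for illustration. The point is just that rows are read lazily, one at a time, and each row can turn into more than one piece of data.

```python
import csv
import io


def stream_rows(source, transform):
    """Read CSV rows lazily and yield every record the transform produces."""
    for row in csv.DictReader(source):
        # A single row may become one or more pieces of data.
        yield from transform(row)


def person_and_city(row):
    """Hypothetical transform: clean up the values and emit a node plus a relationship."""
    name = row["name"].strip()
    city = row["city"].strip().title()
    yield ("node", "Person", name)
    yield ("relationship", "LIVES_IN", name, city)


# An in-memory file stands in for a real CSV on disk.
sample = io.StringIO("name,city\n Ada ,london\nGrace,new york\n")
records = list(stream_rows(sample, person_and_city))
```

Because `stream_rows` is a generator, nothing is held in memory beyond the current row, which is the whole appeal of streaming the file as-is.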
A few years ago I was really angry at the traversal performance I was getting on some slow query. So angry that I wrote a couple thousand lines of C code just to calm myself down. This is how Neo4c came about. This blog post explains the gist of it. Neo4c was able to crank out 330 million traversals per second on a single core (because it was single-threaded code) for a hand-crafted “query” written in C using 32-bit ids (so limited to 4 billion nodes and 4 billion relationships). It wasn’t really a fair comparison to Neo4j, but it made me realize there was a lot of performance out there to be had. Let’s see where we are today.
Checking if two nodes are directly connected is something you often have to do in a graph. There are a few different ways to implement this feature depending on how the database keeps track of relationships. In Neo4j a doubly linked list of relationships is kept per node, grouped by the relationship type in both the incoming and outgoing directions. To check if two nodes are directly connected, one has to traverse one of the lists (preferably the shortest one) and check whether the other node id is included in that list. If we don’t know the relationship type, we have to check all the groups (that is the case for dense nodes; for light nodes there are no groups and we check them all anyway).
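Here is a rough sketch of that idea in Python. It is a simplification and not Neo4j’s actual storage format: plain Python lists stand in for the linked lists, and the `Graph` class and its method names are invented for this example. What it does keep is the grouping by type and direction, and the trick of scanning whichever list is shorter.

```python
from collections import defaultdict


class Graph:
    def __init__(self):
        # node id -> relationship type -> list of node ids on the other end
        self.outgoing = defaultdict(lambda: defaultdict(list))
        self.incoming = defaultdict(lambda: defaultdict(list))

    def add_relationship(self, source, rel_type, target):
        self.outgoing[source][rel_type].append(target)
        self.incoming[target][rel_type].append(source)

    def connected(self, a, b, rel_type=None):
        """True if a relationship a -> b exists (of rel_type, if one is given)."""
        out_groups = self.outgoing.get(a, {})
        in_groups = self.incoming.get(b, {})
        # Without a known type we have to look at every group.
        types = [rel_type] if rel_type is not None else list(out_groups)
        for t in types:
            out_list = out_groups.get(t, [])
            in_list = in_groups.get(t, [])
            # Scan the shorter of the two lists.
            if len(out_list) <= len(in_list):
                if b in out_list:
                    return True
            elif a in in_list:
                return True
        return False

    def connected_either_way(self, a, b, rel_type=None):
        return self.connected(a, b, rel_type) or self.connected(b, a, rel_type)


g = Graph()
g.add_relationship(1, "KNOWS", 2)
g.add_relationship(1, "KNOWS", 3)
g.add_relationship(4, "LIKES", 2)
```

The cost is still linear in the length of the shorter list, which is exactly why a node with millions of relationships makes this check painful.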
In Amazon Neptune the SPOG index can be queried twice. Once with the first node in the S position and the second node in the O position, then again with the positions reversed (with the P position being the relationship type). If we don’t know the relationship type we can query the indexes twice per relationship type.
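A toy version of the same lookup, again in Python. This is not Neptune’s API or its internal layout; `SpogIndex` is a hypothetical class that keeps (subject, predicate, object, graph) tuples in sorted order and uses a binary search for the (S, P, O) prefix. It only shows the shape of the “query twice, once per direction, for each relationship type” pattern.

```python
import bisect


class SpogIndex:
    def __init__(self, quads):
        # Sorted (subject, predicate, object, graph) tuples stand in for the index.
        self.quads = sorted(set(quads))

    def has_prefix(self, s, p, o):
        """Binary search for any quad that starts with (s, p, o)."""
        prefix = (s, p, o)
        i = bisect.bisect_left(self.quads, prefix)
        return i < len(self.quads) and self.quads[i][:3] == prefix

    def connected(self, a, b, rel_types):
        """Two lookups per relationship type: a -> b, then b -> a."""
        for p in rel_types:
            if self.has_prefix(a, p, b) or self.has_prefix(b, p, a):
                return True
        return False


index = SpogIndex([
    ("alice", "KNOWS", "bob", "default"),
    ("carol", "LIKES", "alice", "default"),
])
```

With a known relationship type that is two lookups; without one, it is two lookups multiplied by the number of types you have to try.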
Checking if two nodes are directly connected is similar to checking for set membership, and one trick we could use is a Bloom filter or one of its variant data structures. Longtime readers will remember this blog post outlining exactly how to do that and achieve 100x faster checks, including a “double check” to get around the probabilistic nature of these data structures.
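For anyone who hasn’t seen the trick before, here is a small Python sketch of the general idea. It is not the code from that earlier post: the `BloomFilter` and `ConnectionChecker` classes, the bit-array size and the use of SHA-256 are all assumptions picked to keep the example short. A “no” from the filter is final, while a “maybe” gets double checked against the real data (a plain set stands in for the slower relationship lookup).

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class ConnectionChecker:
    def __init__(self, pairs):
        self.exact = set(pairs)  # stand-in for the slower, authoritative lookup
        self.filter = BloomFilter()
        for a, b in self.exact:
            self.filter.add(f"{a}-{b}")

    def connected(self, a, b):
        if not self.filter.might_contain(f"{a}-{b}"):
            return False  # Bloom filters never give false negatives
        return (a, b) in self.exact  # the double check removes false positives


checker = ConnectionChecker([(1, 2), (2, 3)])
```

Most pairs of nodes in a graph are not connected, so most checks end at the filter and never touch the relationship data at all.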