Death Star Queries in Graph Databases

Star Wars: Episode IV - A New Hope Death Star

In Cypher, we call any unbounded star query a “Death Star” query. You’ll recognize it if you see a star between two brackets in any part of the query:

-[*]-
the deadly pattern of a death star query

The “star” in Cypher means “keep going”, and when it is not bound by a path length -[*..3]- or relationship type(s) -[:KNOWS|FRIENDS*]- it tends to blow up Alderaaning servers. It’s hard to find a valid reason for this query, but its less deadly cousins are very important in graph workloads.

For example when looking at fraud, we may start with a Customer node and ask, which known Fraudulent nodes are within 4 hops away? A Customer HAS an Account that was ACCESSED by a Device that ACCESSED another Account that BELONGS_TO a known Fraudster. A Customer HAS a mailing Address that is very SIMILAR to an Address that BELONGS_TO a Business that is partially OWNED by a known Fraudster. These are just two out of many valid patterns in our graph. Graph databases were designed to handle these kind of queries. The trick is that every node KNOWS its relationships, every node KNOWS how it is connected.

I believe this is what differentiates a Graph Database, from a database with Graph Capabilities. These days you’ll find me at RelationalAI which takes this idea to the next level. What if instead of nodes that know how they are related, what if nodes know Kung Fu? What if nodes where intelligent enough to understand their world? Think about nodes that understand their semantics, the business rules, their limits and capabilities, how they fit into various constructs like Cliques… it will blow your mind, go read that blog post.

But let’s get back to the star queries. Graph Databases keep track of the individual relations of nodes. In Neo4j, they keep track of relationships using a double linked list. In RageDB, I keep track of relationships using two arrays of “Link” objects made up of a relationship id and other node id. Neptune uses a giant “quad” table with 3 or optionally 4 indexes. If we ask the question “How are you connected to your neighbors and what are they?”: MATCH (n {id:1})-[r:*1]-(n2) RETURN type(r), n2

In Neo4j, it would go from the Node to the first Relationship Group which “groups” the relationships of that node by Type and Direction. It would then iterate through these groups and through the linked list to traverse to the other nodes and where it would grab their properties following another single linked list. Neo4j gets slowed down by the random access pattern of jumping around in these linked lists.

In RageDB, we do something similar but use arrays instead of a linked list to find the other nodes without the random access. The shard and type of both the relationships and nodes are embedded in the ids so they are easy to get. However when we want the properties of our neighbor nodes, we need to ask the shard/cpu where they reside to grab the node properties by locations in multiple arrays and then send them back to the shard running the query. In a distributed graph database, this would entail expensive network hops, but RageDB only has to deal with much faster communication between cores inside of a CPU.

In Neptune, we would start by performing two index queries S??? and ??O? where the S and O would be substituted by the starting node id. The SPOG index would handle the first query filtering out any entries that are not Relationships (G > 0) and then for each valid entry it would perform an additional index query to get the Properties of its neighbors. These would be additional SPOG index queries where G is null. The second index query ??O? would be handled by the OSGP index in the same way if it was created. The OSGP index is the optional 4th index in Neptune, off by default. If you don’t have OSGP index enabled then Neptune would have to perform one index query per “Predicate” or Relationship Type in the system to get the incoming relationships against the POGS index. This is fine for a simple graph with a few relationship types, but becomes a problem in complex graph models. Which is a hint about what needs to happen in databases that do not keep track of individual node relationships.

Daniel ten Wolde and friends from CWI just published their paper on DuckPGQ. It builds upon DuckDB and enables it to handle Property Graph Queries. It does this by defining a graph view over existing relational tables. In the paper they give this example:

CREATE PROPERTY GRAPH pg
VERTEX TABLES (
    Student PROPERTIES ( id, name, birthDate ) LABEL Person,
    College PROPERTIES ( id, college ) )
EDGE TABLES (
    know SOURCE Person KEY ( id ) DESTINATION Person KEY( id ) PROPERTIES ( createDate, msgCount ),
    enrol SOURCE Student KEY ( id ) DESTINATION College KEY( id ) PROPERTIES ( classYear ) LABEL studiesAt )

What happens if we were to add a LIKES relationship where a Person could like another Person or could like a College? We would need to create two entries and label one LIKES and the other LIKES_COLLEGE maybe? But what if we add to our graph and allow people to like Classes or Books or Lectures or Comments or Groups or Events or Places or … yikes, that could get ugly. Right now, DuckPGQ will not support having the same label in multiple tables, as element patterns must always bind to a single table.

What if it did? Or what if it allowed star queries? What would it take to answer the question: “How are you connected to your neighbors and what are they?” If we didn’t add anything, it would have to query each edge for every SOURCE or DESTINATION that matched our starting node type. This is similar to the “incoming relationships without OSGP” in Neptune, but now we have a problem in both directions. These queries would translate into SQL UNION clauses where you may end up querying a lot of indexes without getting any results. Think about the nightmare it would be to ask for your neighbors two or three hops out?

How would we fix this? The first thought I had was Roaring Bitmaps. Keep a compressed bitmap for every relationship type in every direction and if a node has that type of relationship mark its id as 1 in that bitmap. When we go to query, we can check the bitmaps and that lets us figure out which tables to query against so we avoid useless index queries. This should be alright for a small set of relationship types, but as they grow it will create a lot of bitmaps to check.

The second thought I had was enumerating the relationship types once per direction and adding 64 bits to every node, setting a bit to true if it had a relationship of that type. This would let us keep up to 31 relationship types (with a tombstone to another mechanism to handle more than 31 types). We could split the incoming and outgoing into two 64 bits entries and keep up to 63 relationship types. Most real world graph databases I’ve run into in the wild have had less than 64 relationship types. There are some domains where thousands of relationship types are required, but this is not common.

Still… its a lot of index queries or table scans if we’re joining. You could of course create a giant materialized view over every relationship with a few indexes, but then you would have reinvented triple stores. Any way you slice it, if your database system does not keep track of the set of individual relationships of a node, then you have not built a “competent” graph database, just a tribute.

3 thoughts on “Death Star Queries in Graph Databases”

Magne says:

February 8, 2023 at 7:12 AM

> If you want to be my hero, find a way to fix this problem: https://maxdemarzi.com/2023/01/09/death-star-queries-in-graph-databases/

Let me (try to) be your hero, Marzi. (Insert favorite reference to famous cheezy pop music song, if you like.)

Couldn’t you use GraphBLAS algorithms, like they do in RedisGraph (which supports Cypher, btw) to fix that problem with “death star” queries?

Those algorithms are based on linear algebra and matrix operations on sparse matrices (which are like compressed bitmaps on speed, re: https://github.com/RoaringBitmap/RoaringBitmap ). The insight is that the adjacency list of a property-graph is actually a matrix, and then you can use linear algebra on it. But it may require the DB is built bottom up with matrices in mind from the start (instead of linked lists like Neo4j does). Maybe your double array approach in RageDB could be made to fit..

I think you’ll find this presentation on GraphBLAS positively mind-blowing, especially from this moment: https://youtu.be/xnez6tloNSQ?t=1531

Such math-based algorithms seem perfect to optimally answer unbounded (death) star queries like “How are you connected to your neighbors and what are they?”

That way, for such queries one doesn’t have to traverse the graph database as a discovery process through what each node “knows about”, but could view and operate on the database from a God-like perspective, similar to table operations in relational databases.

Further reading: https://graphblas.org/

Same comment on HN: https://news.ycombinator.com/item?id=34342371#34519670

- maxdemarzi says:
  
  February 8, 2023 at 8:09 AM
  
  Have you actually tried it? If it sums the matrix multiplication of FRIENDS^2, FRIENDS^3 and FRIENDS^4 to solve Match (n)-[:FRIENDS*2.4]->(n2), how many times will it multiply FRIENDS by itself to answer an unbound star query?
  
  - Magne says:
    
    February 9, 2023 at 9:02 AM
    
    No, I haven’t tried it. I guess you could keep multiplying and merging the FRIENDS(HIP) matrix and inspect the result after each step, and stop at some point where the matrix looks a certain way. Maybe at the point when the resulting matrix is identical to the previous resulting matrix (meaning that all reachable nodes have been reached). But then again, the question becomes: How do you know that even if you can’t get further by trying one more edge, there isn’t a combination of edges that would take you where you want? Iow. that there still exists a path of an unspecified length in the graph that actually does connect a new node with your node (indirect friendship)? That’s probably indeterminable in general (NP-complete?).
    
    Star queries are the graph equivalent of an infinite loop in programming languages (or atleast a while(true) with a break condition). It’s probably why they write:
    
    “Always set an upper limit for your variable length patterns. It is possible to have a query go wild and touch all nodes in a graph by mistake.”
    https://neo4j.com/docs/cypher-cheat-sheet/current/
    
    But in your question you do seem to bound the query. Because you are asking about _immediate neighbors_ `[r:*1]`, i.e. 1-hop ones, not those connected to you indirectly by variable length `[r:*]` :
    
    > If we ask the question “How are you connected to your neighbors and what are they?”: MATCH (n {id:1})-[r:*1]-(n2) RETURN type(r), n2
    
    Which would be trivial to answer in GraphBLAS as far as I can tell, since it would just be the corresponding row from the FRIENDS^1 matrix.

Max De Marzi

Graphs, Graphs, and nothing but the Graphs