Lets take a look at a scenario where you are trying to search for things by their attributes, not their description. They can be users, documents, or any object that could be described by discrete values in multiple dimensions. What does that mean exactly? Well, let me give you an example: searching for a dog. My family includes 2 four legged furry creatures named Tyler and Ronnie. They are my half lab, half golden retrievers. Dogs come in all shapes and sizes, from teacup breeds with adult weights around 5 lbs, to giant Mastiff breeds over 150 lbs. But most people don’t care exactly how much a dog weights, only their general size.
The Dog Size dimension can be described as “X-Small”, “Small”, “Medium”, “Large”, “X-Large” or “XX-Large”. Ronnie and Tyler weight in at around 65 lbs so they fit in that “Large” category. So even thought weight is a “continuous” value, we can convert it into a “discrete” value. If you can do this with how you describe your data, by clustering, bucketing, or grouping similar values together then you can take advantage of this type of search.
There are a couple of ways to approach searching by dimensions. One of them is to simply create a Cartesian product or Cross Join of all of the discrete values in each dimension against each other. For each result, we would create a node and then connect the Things we are searching for to that node with a relationship. Take a look at the video below for a refresher on Cartesian products:
This approach works amazingly well when we only have a few dimensions and the number of discrete values is small. If we have just two dimensions with 3 and 5 discrete values, it means we only have to create 15 nodes. But if we have say 15 dimensions and an average of 10 values in each, then that’s 1 quadrillion nodes… Worst part is that we probably don’t have 1 quadrillion Things to search on, so most of those will be lonely nodes without relationships. Obviously we could just not create the unused combinations. But what happens if we decide to add or modify Dimensions or discrete values… yeah it can get messy. So what’s another way to do it? Let’s say we have 4 dimensions and each one has 4 discrete values. We would create 16 nodes. Four with a Label of Dimension1, each with their own discrete value. If Size was our Dimension1 it would be:
CREATE (s1:Size {key:'Small'}),(s2:Size {key:'Medium'}),(s3:Size {key:'Large'}),(s4:Size {key:'X-Large'})
Then we add a “HAS” relationship from all the customers to one of the four nodes. So it ends up looking like this:
So if we wanted to find all the customers that had a Dimension1 of “d1a”, Dimension2 of “d2c”, Dimension3 of “d3c” and Dimension4 of “d4a”, we could run this query:
Which would find all the discrete values we specified, and then traverse to all the customers related to all four of them. Giving us this result:
So that’s pretty nice, but going back to our heavy example with 15 dimensions and lets say we have millions of customers. Is there a fast way to do this query? Yes, but we have to be a little smarter in how we traverse the graph. One thing we can do is to find the discrete value with the least amount of relationships and gather them into a list. Then we find the discrete value with the next least amount of relationships and gather them into a list. We can AND both lists and whatever remains are our Customers related to both discrete values. We can continue down this path ANDing as we go until we are left with a small final set. This would work… but it doesn’t sound very graphy. It sounds more like what a Relational Database would do using bitmap indexes.
Maybe we can do better with a hybrid approach. We will start the same way, but once our ever reducing lists reaches a certain threshold (say 100k or 50k nodes) we turn it around. Instead of traversing from the Discrete values to the Customers, we traverse from the remaining Customers out to the Discrete values and eliminate any as soon as they relate to a Discrete value we don’t want.
Let’s build ourselves another Cypher Stored Procedure. We’ll call this one dimensionalSearch and take a map parameter as input:
@Description("com.maxdemarzi.dimensionalSearch(map) | Find Distinct Customers by many dimensions") @Procedure(name = "com.maxdemarzi.dimensionalSearch", mode = Mode.READ) public Stream<NodeResult> performDimensionalSearch(@Name("map") Map input) throws EntityNotFoundException {
To call our procedure we would pass in our map like this:
CALL com.maxdemarzi.dimensionalSearch({Dimension1:"d1a",Dimension2:"d2c",Dimension3:"d3c",Dimension4:"d4a"})
The first thing we’ll do is find the dimensional attributes we asked for in the map and put them in to an array as well as a set. Why both? Well because I need to order them, but I also want a fast “contains” and that’s better in a set than an array.
ArrayList<Node> attributes = new ArrayList<>(); HashSet<Node> attributeSet = new HashSet<>(); Iterator it = input.entrySet().iterator(); while (it.hasNext()) { Map.Entry pair = (Map.Entry)it.next(); Node attribute = db.findNode(Label.label((String)pair.getKey()), "value", pair.getValue()); attributes.add(attribute); } attributeSet.addAll(attributes);
I sort the array by degree, with the lowest degree at the beginning of our array:
attributes.sort(Comparator.comparingInt(Node::getDegree));
Now I want to traverse from the first attribute to all the customers related to it and store the result in a list. In this case I am using a RoaringBitmap and storing just the node ids.
RoaringBitmap nodeIds = new RoaringBitmap(); Node firstAttribute = attributes.remove(0); for (Relationship r : firstAttribute.getRelationships(Direction.INCOMING)) { nodeIds.add(((Number)r.getStartNode().getId()).intValue()); }
We next continue traversing from the attribute with the second lowest degree out to Customers and store those in a list. We then AND from our starting list and remove that attribute from our array.
RoaringBitmap nextNodeIds = new RoaringBitmap(); it = attributes.iterator(); while (it.hasNext()){ Node attribute = (Node)it.next(); for (Relationship r : attribute.getRelationships(Direction.INCOMING)) { nextNodeIds.add(((Number)r.getStartNode().getId()).intValue()); } nodeIds.and(nextNodeIds); attributes.remove(attribute); ...
In our while loop, we will break if get to our threshold. For the sample project I put in 100, but in the real application you would use a larger number.
if (nodeIds.getCardinality() < 100 ) { break; }
So now we’ll collect our remaining Customer nodes into a list:
ArrayList<Node> remaining = new ArrayList<>(); nodeIds.forEach((IntConsumer) id -> { remaining.add(db.getNodeById((long)id)); });
…and for each one of these traverse out to the discrete values. If we encounter a value that is not in our starting set, we remove that customer from our remaining list and continue with the next customer. This short circuit will prevent us from having to traverse all of the dimensions as soon as we find an invalid one.
Iterator<Node> iterator = remaining.iterator(); while (iterator.hasNext()) { Node node = iterator.next(); for (Relationship r : node.getRelationships(Direction.OUTGOING)) { if(!attributeSet.contains(r.getEndNode())) { iterator.remove(); break; } } }
Finally we will stream out our NodeResult:
return remaining.stream().map(NodeResult::new);
Source code as always is on github.
By the way, did you know you can also take the cartesian product of two graphs. Dr. Sarada Herke explains below: