Faux Bitmap Indexes in Neo4j Part Two

Last time we introduced the problem of single model, multiple property search queries taking a bit of time in Neo4j. We saw that using composite indexes or using “additional” labels can help us in some situations but not all. I promised you a stored procedure to build fake bitmap indexes could help, so today we’re going to see how to build one.

The way this stored procedure will work is that we will get a map of the boolean logic filter and convert that into a Boolean Expression, extract the Truth Table, simplify this truth table and for each part build a compressed bitmap of the node ids that qualify for that part. Once we have the parts, we will AND/OR them together and whatever is left is our answer. We don’t want to build these compressed bitmaps over and over, so we will cache their results using Caffeine and keep them somewhat up to date depending on our data velocity.

We’ve done work with boolean logic before and are going to reuse that work. So if you don’t recall it, read through this blog post to refresh your memory. TLDR: This and That or This Other Thing and Not That Other Thing becomes (a1 & a2) | (a3 & !a4), we then calculate what a1/a2/a3/a4 is, AND/OR/NOT them and done.

Create a new Maven project in IntelliJ and add the usual Neo4j dependencies. We’re going to be using Neo4j 4.1.x and will need Java 11, so make sure you are setup that way for the project. Besides Neo4j, we will need to add Roaring Bitmap to the pom.xml.

Our result needs to have a list of node and the total size for pagination, so we need to create a new “SizeAndNodeResult” class to store that. This is what our Stored Procedure will “yield”.

 
public class SizeAndNodeResult {
    public final List<Node> nodes;
    public final Long size;

    public SizeAndNodeResult(List<Node> nodes, Long size) {
        this.nodes = nodes;
        this.size = size;
    }
}

We will next create our Procedures.class and add the definition using some default values:

 
    @Procedure(name = "com.maxdemarzi.boolean.filter", mode = Mode.READ)
    @Description("CALL com.maxdemarzi.boolean.filter(label, query, limit, offset)")
    public Stream<SizeAndNodeResult> BooleanFilter(
            @Name(value = "label") String labelName,
            @Name(value = "query") Map<String, Object> query,
            @Name(value = "limit", defaultValue = "50") Long limit,
            @Name(value = "offset", defaultValue = "0") Long offset) {

The end goal is to populate a list of node ids and then create our result, so we’ll keep track of these in a Roaring64NavigableMap since Neo4j Node Ids can be greater than the Integer Limit of the regular RoaringBitmaps.

 
Roaring64NavigableMap combinedNodeIds = new Roaring64NavigableMap();

We need to create our boolean expression from the query map which comes in looking like this:

 
{not:false, and:[ 
   {property: 'status', values: ['Unfulfilled'], not: false}," +
   {property: 'warehouse', values: ['Warehouse 3'], not: false}," +
   {property: 'season', values: ['Fall*'], not: false}," +
   {property: 'online', values: [true], not: true} 
]}

The above needs to become “(0 & 1 & 2 & !3)”… and we need to be able to turn those back into the filters, so we can use a Bidirectional Map to help us. I’ll hide the getFormula method (which probably needs to be cleaned up a bit) but you can take a look at it if you wish… it does use recursion which is the only fancy part about it.

 
MutableBiMap<HashMap<String, Object>, Integer> expressions = new HashBiMap<>();
String formula = getFormula(query, "", expressions);
BiMap<Integer, HashMap<String, Object>> inverse = expressions.inverse();

Once we have our formula, we will create the truth table, clean it up, simplify it and for each valid expression (“a and (b or NOT c)” becomes “a and b” plus “a not c”, so we have two valid expressions to work with.

 
BooleanExpression boEx = new BooleanExpression(formula);
boEx.doTabulationMethod();
boEx.doQuineMcCluskey();
boEx.doPetricksMethod();

for (String path : boEx.getPathExpressions()) {
...

Here we next need to get the expressions we must have, and the expressions we must not have. We will do each separately starting with the must haves, and save their results in an array:

 
ArrayList<Pair<Roaring64NavigableMap, Long>> filters = new ArrayList<>();

For each mustHave, we get a key that represents the Label, the property and the allowed values of that property into a Triple. We don’t want to recalculate these over and over, so we will save them in a cache. Since we allow many values to be allowed for a single property, we will “or” these together and add the node ids that are valid to the filter we created above.

 
            for (String item : mustHave) {
                Map<String, Object> filter = inverse.get(Integer.valueOf(item));
                MutableTriple<Label, String, Object> key = MutableTriple.of(label, (String)filter.get("property"), null);

                // Since the values can be inside an array, we are treating these as belonging to any in the array
                ArrayList<Object> values = (ArrayList<Object>) filter.get("values");
                Roaring64NavigableMap filterValueIds = new Roaring64NavigableMap();
                for (Object value : values) {
                    key.setRight(value);
                    // Join them together
                    Roaring64NavigableMap dimensionValueIds = valueCache.get(key);
                    if (dimensionValueIds != null) {
                        filterValueIds.or(dimensionValueIds);
                    }
                }
                filters.add(Pair.of(filterValueIds, filterValueIds.getLongCardinality()));
            }

We will come back to how the values are retrieved from the valueCache later, but once we have our filters, we will sort them and AND them together like a real boolean index would do.

 
            // Sort bitmaps in Ascending order by cardinality
            filters.sort(Comparator.comparing(Pair::other));

            // Initialize the smallest bitmap as our starting point
            if (filters.size() > 0) {
                nodeIds.or(filters.remove(0).first());
            }

            // AND the rest of the bitmaps
            for (Pair<Roaring64NavigableMap, Long> pair : filters) {
                nodeIds.and(pair.first());
            }

We are going to do the same for the mustNot filters, but instead of “or”-ing them together they will “andNot” from our current nodeIds.

 
nodeIds.andNot(filterValueIds);

At the end of each valid expression we will or the nodeIds into our combined result set”:

 
combinedNodeIds.or(nodeIds);

Finally we will create our SizeAndNodeResult object taking into account the limit and offset, and return that to our user.

 
        size = combinedNodeIds.getLongCardinality();

        results = StreamSupport.stream(Spliterators.spliteratorUnknownSize(
                combinedNodeIds.iterator(), Spliterator.CONCURRENT), false)
                .map(transaction::getNodeById)
                .skip(offset).limit(limit)
                .collect(Collectors.toList());

        return Stream.of(new SizeAndNodeResult(results, size));

That’s all fine and dandy but we haven’t actually figured out how to generate the list of nodeids from the valueCache or even how to set it up. Let’s do that now. We are going to create a LoadingCache that takes a Triple as our key and returns a RoaringBitmap. We can tweak the expire and refresh as needed.

 
    public static final LoadingCache<Triple<Label, String, Object>, Roaring64NavigableMap> valueCache = Caffeine.newBuilder()
            .expireAfterAccess(60, TimeUnit.MINUTES)
            .refreshAfterWrite(10, TimeUnit.MINUTES)
            .build(Procedures::getValues);

Alright, finally to getValues… what does that look like? It’s starts off easy:

 
    static Roaring64NavigableMap getValues(Triple<Label, String, Object> key) {
        Roaring64NavigableMap bitmap = new Roaring64NavigableMap();
        Label label = key.getLeft();
        String property = key.getMiddle();
        Object value = key.getRight();
        String valueAsString = value.toString();

Then we need to figure out what kind of query is it. Is it an EXACT query? Is it a CONTAINS query? Is it a RANGE query?
Let’s figure out the first part. If our “value” looks like “*something*” then it’s a CONTAINS query. If it’s “*something” then it’s a SUFFIX query, and “something*” a PREFIX query. We can start off with EXACT and then change it if we need to.

 
        StringSearchMode ssm = StringSearchMode.EXACT;
        if(valueAsString.startsWith("*")) {
            if(valueAsString.endsWith("*")) {
                ssm = StringSearchMode.CONTAINS;
                valueAsString = valueAsString.substring(1, valueAsString.length() - 1);
            } else {
                ssm = StringSearchMode.SUFFIX;
                valueAsString = valueAsString.substring(1);
            }
        } else if(valueAsString.endsWith("*")) {
            ssm = StringSearchMode.PREFIX;
            valueAsString = valueAsString.substring(0, valueAsString.length() - 1);
        }

Then when we run findNodes, we can pass in the value if EXACT or the string version of the value (without the “*”s) and the StringSearchMode we figured out above:

 
        try (Transaction tx = graph.beginTx()) {
            ResourceIterator<Node> nodes;
            if (ssm.equals(StringSearchMode.EXACT)) {
                nodes = tx.findNodes(label, property, value);
            } else {
                nodes = tx.findNodes(label, property, valueAsString, ssm);
            }
            while(nodes.hasNext()) {
                bitmap.add(nodes.next().getId());
            }
        }
        return bitmap;

That’s the easy part… Range queries are way more painful. We will start off with notation. A square bracket ([ ]) indicates that the range is inclusive on that side; a parenthesis (( )) means it is exclusive. So “(a,b)” means a < x < b, "[a,b]" means a <= x <= b, "(a,b]" means a < x <= b and we have to keep in mind that a or b can be null. We also have to deal with Longs, Doubles and Dates for our range queries… So what's the worst way to deal with this?

If you guessed Regular Expressions, you are correct. It's going to be a crazy ugly string, so I like to break it up into pieces that sorta make sense. We have the brackets or parentheses for the left and the right. Then we have either a number or a date. One or two of these separated by a comma. This is what I ended up with after spending too much time on this java regular expression online tester and crying myself to sleep last night:

 
    private static final String leftBracketOrParen = "[(|\\[]";
    private static final String rightBracketOrParen = "[)|\\]]";
    private static final String numberPattern = "-?\\d*\\.{0,1}\\d+";
    private static final String ISODatePattern = "[0-9]{4}-(((0[13578]|(10|12))-(0[1-9]|[1-2][0-9]|3[0-1]))|(02-(0[1-9]|[1-2][0-9]))|((0[469]|11)-(0[1-9]|[1-2][0-9]|30)))";
    private static final Pattern number = Pattern.compile(numberPattern);
    private static final Pattern numberOrDateRange =  Pattern.compile("(^" + leftBracketOrParen + ")(" + numberPattern + "|" + ISODatePattern + ")?,(" + numberPattern + "|" + ISODatePattern + ")?(" + rightBracketOrParen +")$");

So in our getValues method we check to see if we match this pattern:

 
        Matcher m = numberOrDateRange.matcher(valueAsString);
        if (m.matches()) {
            try (Transaction tx = graph.beginTx()) {
                Value lowerBound = null;
                Value upperBound = null;
                boolean includeUpper = true;
                boolean includeLower = true;

                String from = m.group(2);
                String to = m.group(13);

If we do, we get the group that caught the “from” and “to” values. If they are numbers, we parse them as such, otherwise we go with DateValues.

 
if (from != null) {
    if (number.matcher(from).matches()) {
        lowerBound = Values.numberValue(NumberFormat.getInstance().parse(from));
    } else {
        lowerBound = DateValue.parse(from);
    }
}
if (to != null) {
    if (number.matcher(to).matches()) {
        upperBound = Values.numberValue(NumberFormat.getInstance().parse(to));
    } else {
        upperBound = DateValue.parse(to);
    }
}

Then we check the first and last group for brackets or parenthesis and set the “includes”:

 
if (m.group(1).equals("(")) {
    includeLower = false;
}
if (m.group(24).equals(")")) {
    includeUpper = false;
}

Ok, so far so good. But we have a problem. The Neo4j Java API does not let us run Range Queries. It has been a feature I’ve been requesting forever, but other things always seem to be higher priority. So what do we do? We get low. We go to the lower level API (the Kernel API) and try it from there. This is what it looks like:

 
KernelTransaction ktx = ((InternalTransaction) tx).kernelTransaction();
TokenRead tokenRead = ktx.tokenRead();
SchemaRead schemaRead = ktx.schemaRead();
Read read = ktx.dataRead();
CursorFactory cursors = ktx.cursors();

In the Kernel API everything runs by ids and uses cursors. So we use the label id and the property key id to get the index and build our predicate query:

 
int labelId = tokenRead.nodeLabel(label.name());
int propertyKeyId = tokenRead.propertyKey(property);

LabelSchemaDescriptor schema = SchemaDescriptor.forLabel(labelId, propertyKeyId);
IndexDescriptor indexDescriptor = Iterators.single(schemaRead.index(schema));
IndexReadSession indexSession = read.indexReadSession(indexDescriptor);
IndexQuery.RangePredicate<?> predicate =  IndexQuery.range(propertyKeyId, lowerBound, includeLower, upperBound, includeUpper);

Then to actually get the node ids, we run a cursor over nodeIndexSeek and retrieve our nodeReference which are our node Ids.

 
try (NodeValueIndexCursor cursor = cursors.allocateNodeValueIndexCursor(PageCursorTracer.NULL)) {
    read.nodeIndexSeek(indexSession, cursor, IndexQueryConstraints.unconstrained(), predicate);
    while (cursor.next()) {
        bitmap.add(cursor.nodeReference());
    }
}

Does that look pretty horrific? Yes it does. Would it be much better if the Neo4j Java API had Range queries? Yes it would.
What can you do? Write to your Senators and Representatives and demand Range Queries be added to the Neo4j Java API…and if they won’t listen try your local Neo4j sales person. As always, the code is on github.

Max De Marzi

Graphs, Graphs, and nothing but the Graphs