If you’ve been following my blog for a while, you probably know I like using YourKit and Gatling for testing end-to-end requests in Neo4j. Today, however, we are going to do something a little different: micro-benchmarking a very small piece of code within our Unmanaged Extension, using a Java library called JMH.
In Neo4j 2.2 and up the Java API has been made a little easier to use. While in 2.1 you had to do this to get a node:
```java
final Node node = IteratorUtil.singleOrNull(
        db.findNodesByLabelAndProperty(Labels.Segment, "segmentId", segmentId));
```
Or even worse, if you didn’t know about the IteratorUtil.singleOrNull helper:
```java
final ResourceIterable<Node> nodes =
        db.findNodesByLabelAndProperty(Labels.Segment, "segmentId", segmentId);
final Node node;
try (ResourceIterator<Node> nodeIter = nodes.iterator()) {
    node = (nodeIter.hasNext() ? nodeIter.next() : null);
}
```
Now you can simply do this:
```java
final Node node = db.findNode(Labels.Segment, "segmentId", segmentId);
```
This new helper makes finding a node by a property and Label much easier. Hopefully you’ve created Unique Constraints or Indexes for all properties you’ll use to look nodes up. Now let’s see just how fast this runs. We’re going to create a method in an Unmanaged Extension that will take a list of ids and look them up. We will capture the number it found and use that for our result.
```java
@POST
@Path("/lookup")
public String Lookup(String body, @Context GraphDatabaseService db) throws IOException {
    ArrayList<Long> idsFound = new ArrayList<>();
    HashMap<String, List<Integer>> input = objectMapper.readValue(body, HashMap.class);
    List<Integer> segmentIds = input.get("segmentIds");
    try (Transaction tx = db.beginTx()) {
        for (int segmentId : segmentIds) {
            final Node node = db.findNode(Labels.Segment, "segmentId", segmentId);
            if (node != null) {
                idsFound.add(node.getId());
            }
        }
        tx.success();
    }
    return idsFound.toString();
}
```
Why am I using a list of ids instead of a single lookup? Because I want to show you what happens as two variables change: one, the number of nodes in our index, and two, the number of lookups done in a single method.
```java
@State(Scope.Benchmark)
public class ServiceTest {
    private Service service;
    private GraphDatabaseService db;

    @Param({"10000", "100000", "1000000"})
    public int size;
```
We’ll use @Param and try three different sized graphs, at 10k, 100k, and 1M nodes. We’ll use the size parameter when we create our graph:
```java
public void populate(GraphDatabaseService db) {
    Transaction tx = db.beginTx();
    try {
        for (int loop = 1; loop <= size; loop++) {
            Node identity = db.createNode(Labels.Segment);
            identity.setProperty("segmentId", loop);
            // Commit in batches of 1,000 so we don't build up one giant transaction.
            if (loop % 1_000 == 0) {
                tx.success();
                tx.close();
                tx = db.beginTx();
            }
        }
        tx.success();
    } finally {
        tx.close();
    }
}
```
You’ll notice I commit and start a new transaction every 1,000 nodes created. I do this because we don’t want to commit 1 million nodes all at once; it eats up too much memory. Before each test I set up my graph: I create the database, migrate it (which creates the constraints), and populate it.
```java
@Setup
public void prepare() {
    db = new TestGraphDatabaseFactory().newImpermanentDatabase();
    service = new Service(db);
    service.migrate(db);
    populate(db);
}
```
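The migrate step is what creates the unique constraint on segmentId. The post doesn’t show it, but in Neo4j 2.x Cypher it would amount to something like:

```
CREATE CONSTRAINT ON (segment:Segment) ASSERT segment.segmentId IS UNIQUE;
```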
Now we’re finally ready to create a benchmark. It will warm up 10 times, perform 5 measurements, fork the JVM 3 times, and report the average time per operation.
```java
@Benchmark
@Warmup(iterations = 10)
@Measurement(iterations = 5)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public void measureOneLookup() throws IOException {
    service.Lookup("{\"segmentIds\":[9781]}", db);
}
```
I’m using IntelliJ and running it with the JMH Plugin. After about 5 minutes we get our results:
```
Benchmark                      (size)  Mode  Cnt  Score    Error  Units
ServiceTest.measureOneLookup    10000  avgt   15  0.032  ± 0.000  ms/op
ServiceTest.measureOneLookup   100000  avgt   15  0.035  ± 0.001  ms/op
ServiceTest.measureOneLookup  1000000  avgt   15  0.042  ± 0.001  ms/op
```
So a single lookup completes in 0.032ms to 0.042ms. Practically no time at all. It does get a little slower as our graph gets bigger, but that makes sense. Now let’s try measuring 10 lookups in the same method.
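The ten and thousand lookup benchmarks are identical to the single one except for a longer segmentIds array in the request body. A small hypothetical helper (not from the original project) to generate those bodies might look like:

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class LookupBodies {
    // Build the JSON body {"segmentIds":[1,2,...,n]} that the Lookup method expects.
    static String body(int n) {
        return IntStream.rangeClosed(1, n)
                .mapToObj(String::valueOf)
                .collect(Collectors.joining(",", "{\"segmentIds\":[", "]}"));
    }
}
```

Passing `body(10)` or `body(1000)` to the service gives the larger benchmark payloads.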
```
Benchmark                      (size)  Mode  Cnt  Score    Error  Units
ServiceTest.measureTenLookups   10000  avgt   15  0.102  ± 0.001  ms/op
ServiceTest.measureTenLookups  100000  avgt   15  0.103  ± 0.002  ms/op
ServiceTest.measureTenLookups 1000000  avgt   15  0.112  ± 0.002  ms/op
```
Our numbers moved up, but they are still pretty reasonable. One more time with 1000 lookups in the same method:
```
Benchmark                       (size)  Mode  Cnt  Score    Error  Units
ServiceTest.measure1000Lookups   10000  avgt   15  7.781  ± 0.235  ms/op
ServiceTest.measure1000Lookups  100000  avgt   15  8.223  ± 0.166  ms/op
ServiceTest.measure1000Lookups 1000000  avgt   15  9.273  ± 0.101  ms/op
```
Now we are up to 9ms or so to make these 1000 lookups. In the great majority of use cases, a traversal will start from just one starting node, or a few of them. In the rare case where the traversal needs to look at data from hundreds of nodes, it may be worth the effort to add a cache. So let’s give that a shot and benchmark the same tests to compare.
```java
private static final LoadingCache<Integer, Long> segments = CacheBuilder.newBuilder()
        .maximumSize(1000000)
        .build(new CacheLoader<Integer, Long>() {
            public Long load(Integer segmentId) {
                return getSegmentNodeId(segmentId);
            }
        });

private static Long getSegmentNodeId(Integer segmentId) {
    final Node node = db.findNode(Labels.Segment, "segmentId", segmentId);
    return node.getId();
}
```
Up here we are creating a LoadingCache that takes an Integer as a key and returns a Node ID as the cached value. It is important that the cache is built this way, and not by caching whole nodes. If you cache the whole node, then you have to worry about cache invalidation every time you add, remove, or change a property. By caching just the node ids, we don’t have to worry (assuming the values of our key property are immutable). So let’s take a look at our cached lookup implementation:
```java
@POST
@Path("/cachedlookup")
public String CachedLookup(String body, @Context GraphDatabaseService db)
        throws IOException, ExecutionException {
    ArrayList<Long> idsFound = new ArrayList<>();
    HashMap<String, List<Integer>> input = objectMapper.readValue(body, HashMap.class);
    List<Integer> segmentIds = input.get("segmentIds");
    try (Transaction tx = db.beginTx()) {
        for (int segmentId : segmentIds) {
            final Node node = db.getNodeById(segments.get(segmentId));
            if (node != null) {
                idsFound.add(node.getId());
            }
        }
        tx.success();
    }
    return idsFound.toString();
}
```
As you can see, it uses the node id to simply getNodeById and go from there. So let’s run our benchmarks again and see what we get:
```
Benchmark                           (size)  Mode  Cnt  Score    Error  Units
ServiceTest.measureOneCachedLookup   10000  avgt   15  0.011  ± 0.000  ms/op
ServiceTest.measureOneCachedLookup  100000  avgt   15  0.011  ± 0.000  ms/op
ServiceTest.measureOneCachedLookup 1000000  avgt   15  0.012  ± 0.000  ms/op
```
A single lookup here takes about a third of the time, though both are very small amounts of time. When we bump up to 10 lookups we are almost as fast as a single lookup, and roughly 8x faster than using the index for the lookups.
```
Benchmark                            (size)  Mode  Cnt  Score    Error  Units
ServiceTest.measureTenCachedLookups   10000  avgt   15  0.013  ± 0.000  ms/op
ServiceTest.measureTenCachedLookups  100000  avgt   15  0.013  ± 0.000  ms/op
ServiceTest.measureTenCachedLookups 1000000  avgt   15  0.014  ± 0.000  ms/op
```
Now once we move up to doing 1000 lookups we see a real difference. 9ms vs 0.3ms. That’s 30x better than our original attempt.
```
Benchmark                             (size)  Mode  Cnt  Score    Error  Units
ServiceTest.measure1000CachedLookups   10000  avgt   15  0.323  ± 0.003  ms/op
ServiceTest.measure1000CachedLookups  100000  avgt   15  0.320  ± 0.004  ms/op
ServiceTest.measure1000CachedLookups 1000000  avgt   15  0.316  ± 0.003  ms/op
```
If you’ve seen the Neo4j Internals blog post, you will immediately recognize why it’s so much faster. In the index case we are searching for a node id in a b-tree; in the cache case we are going to it almost directly. This is the same reason Neo4j is so much faster at joins than your RDBMS: it doesn’t have to look up foreign keys in an index, it goes to them directly.
So the takeaway here is: only resort to this type of internal caching if you are performing lots of lookups in your queries and performance is critical. One last thing… if you do delete nodes, make sure to call cache.invalidate(key) just to be safe.
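The id-cache pattern can be sketched with a plain ConcurrentHashMap standing in for the Guava LoadingCache (the class and method names here are illustrative, not the post’s actual code): only ids get cached, the loader runs once per missing key, and deletes invalidate by key.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntFunction;

// Minimal sketch of the id-cache pattern: cache node ids, never whole nodes.
public class IdCache {
    private final Map<Integer, Long> ids = new ConcurrentHashMap<>();
    // In the extension this loader would call db.findNode(...).getId().
    private final IntFunction<Long> loader;

    IdCache(IntFunction<Long> loader) {
        this.loader = loader;
    }

    // Return the cached node id, loading it on first access.
    long get(int segmentId) {
        return ids.computeIfAbsent(segmentId, loader::apply);
    }

    // Call this whenever a Segment node is deleted, mirroring cache.invalidate(key).
    void invalidate(int segmentId) {
        ids.remove(segmentId);
    }
}
```

The loader only runs again for a key after that key has been invalidated, which is exactly the behavior we want when a node is deleted and later recreated.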
As always, code is available on github.
Update: Can we go faster? Yes we can. Let’s add a high performance primitives collection library to our project called Koloboke, from OpenHFT.
```xml
<dependency>
    <groupId>net.openhft</groupId>
    <artifactId>koloboke-api-jdk6-7</artifactId>
    <version>0.6.6</version>
</dependency>
<dependency>
    <groupId>net.openhft</groupId>
    <artifactId>koloboke-impl-jdk6-7</artifactId>
    <version>0.6.6</version>
    <scope>runtime</scope>
</dependency>
```
We’ll create an Integer Long Map that is mutable so we can add and remove entries as needed.
```java
private static IntLongMap segmentMap = HashIntLongMaps.newMutableMap();
```
We’ll add a new method that looks just like the others, except it will use “computeIfAbsent” to either get an existing Node ID for a Segment or find it in the index. If you are using Java 7:
```java
for (int segmentId : segmentIds) {
    final Node node = db.getNodeById(
            segmentMap.computeIfAbsent(segmentId, new GetSegmentNodeId()));
    // ... same null check and idsFound.add as before
}
```
You’ll need to create an IntToLongFunction class:
```java
class GetSegmentNodeId implements IntToLongFunction {
    @Override
    public long applyAsLong(int segment) {
        return db.findNode(Labels.Segment, "segmentId", segment).getId();
    }
}
```
If you are on Java 8 (make sure you use “-jdk8” in your dependencies), you can just use a lambda:
```java
for (int segmentId : segmentIds) {
    final Node node = db.getNodeById(
            segmentMap.computeIfAbsent(segmentId,
                    key -> db.findNode(Labels.Segment, "segmentId", key).getId()));
}
```
How much faster is it…
```
Benchmark                              (size)  Mode  Cnt  Score    Error  Units
ServiceTest.measureOneCachedLookup2     10000  avgt   15  0.004  ± 0.000  ms/op
ServiceTest.measureOneCachedLookup2    100000  avgt   15  0.004  ± 0.000  ms/op
ServiceTest.measureOneCachedLookup2   1000000  avgt   15  0.004  ± 0.000  ms/op

Benchmark                              (size)  Mode  Cnt  Score    Error  Units
ServiceTest.measureTenCachedLookups2    10000  avgt   15  0.006  ± 0.000  ms/op
ServiceTest.measureTenCachedLookups2   100000  avgt   15  0.006  ± 0.000  ms/op
ServiceTest.measureTenCachedLookups2  1000000  avgt   15  0.007  ± 0.000  ms/op

Benchmark                              (size)  Mode  Cnt  Score    Error  Units
ServiceTest.measure1000CachedLookups2   10000  avgt   15  0.231  ± 0.002  ms/op
ServiceTest.measure1000CachedLookups2  100000  avgt   15  0.231  ± 0.003  ms/op
ServiceTest.measure1000CachedLookups2 1000000  avgt   15  0.231  ± 0.005  ms/op
```
…about twice as fast in the single and ten lookup tests, and roughly 40% faster in the 1000 lookup test.
Another Update: This collection is not concurrent, so you’ll have to pre-load all the nodes you are going to cache before using it. If you plan on adding entries to the cache while the system is running, you’ll have to catch any ConcurrentModificationException and retry, which makes things a little messier than the Guava LoadingCache, which handles concurrency for you.
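A minimal pre-load sketch, with a plain HashMap and an in-memory pair source standing in for the Koloboke map and for a transaction that scans every :Segment node (both stand-ins are assumptions, not the post’s code):

```java
import java.util.HashMap;
import java.util.Map;

public class Preload {
    // Fill the id map once at startup, before any concurrent readers touch it.
    // Each pair is {segmentId, nodeId}; in the extension these would come from
    // iterating all :Segment nodes inside a transaction.
    static Map<Integer, Long> preload(Iterable<long[]> pairs) {
        Map<Integer, Long> map = new HashMap<>();
        for (long[] pair : pairs) {
            map.put((int) pair[0], pair[1]);
        }
        return map;
    }
}
```

Once pre-loaded, the map is only read at request time, so the lack of concurrency support in the collection no longer matters.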