About 6 months ago we looked at how to translate a few lines of Cypher into way too much Java code in version 1.9.x. Since then Cypher has changed and I suck a little less at Java, so I wanted to share a few different ways to translate one into the other, just in case you are stuck in a mid-eighties time warp and are paid by the number of lines of code you write per hour.
But first, lemme take a #Selfie... er, let’s make some data. Michael Hunger has a series of blog posts on getting and creating data in Neo4j, so we’ll steal, er, borrow his ideas. Let’s create 100k nodes:
WITH ["Jennifer","Michelle","Tanya","Julie","Christie","Sophie","Amanda","Khloe","Sarah","Kaylee"] AS names
FOREACH (r IN range(0,100000) |
  CREATE (:User {username:names[r % size(names)]+r}))
…and let’s create around 500k relationships between them:
MATCH (u1:User),(u2:User) WITH u1,u2 LIMIT 5000000 WHERE rand() < 0.1 CREATE (u1)-[:FRIENDS]->(u2);
…and let’s not forget to add an index:
CREATE INDEX ON :User(username);
Now when we look at our data we can see:
Now if we wanted to build a recommendation for the top 10 Users that “Michelle1” should be friends with, but isn’t right now, we’d write something like this:
MATCH (me:User {username:'Michelle1'}) -[:FRIENDS]- people -[:FRIENDS]- fof
WHERE NOT(me -[:FRIENDS]- fof)
RETURN fof, COUNT(people) AS friend_count
ORDER BY friend_count DESC
LIMIT 10
…and we’d get an error like this after the 60 second timeout in the Browser window:
Cypher as of 2.0.2 isn’t optimized for this kind of query (it’s coming), so let’s turn to the Java API. First thing we’ll want to do is find a user and then get their friends just to get used to the new Java API methods.
@GET
@Path("/friends/{username}")
public Response getFriends(@PathParam("username") String username,
                           @Context GraphDatabaseService db) throws IOException {
    List<String> results = new ArrayList<String>();
    try ( Transaction tx = db.beginTx() ) {
        final Node user = IteratorUtil.singleOrNull(
                db.findNodesByLabelAndProperty(DynamicLabel.label("User"), "username", username));
        if (user != null) {
            for ( Relationship relationship : user.getRelationships(FRIENDS, Direction.OUTGOING) ) {
                Node friend = relationship.getOtherNode(user);
                results.add((String) friend.getProperty("username"));
            }
        }
    }
    return Response.ok().entity(objectMapper.writeValueAsString(results)).build();
}
Instead of going to the index directly, we are using the findNodesByLabelAndProperty method to find our user. Notice also that everything is wrapped in a try block with a transaction. In 2.0 all interactions with the database have to be inside a transaction. With that out of the way, let’s take a look at getting the top 10 friends of friends who are not my current friends, ordered by the number of mutual friends, in Java:
@GET
@Path("/fofs/{username}")
public Response getFofs(@PathParam("username") String username,
                        @Context GraphDatabaseService db) throws IOException {
    List<Map<String, Object>> results = new ArrayList<>();
    HashMap<Node, MutableInt> fofs = new HashMap<>();
    try ( Transaction tx = db.beginTx() ) {
        final Node user = IteratorUtil.singleOrNull(
                db.findNodesByLabelAndProperty(DynamicLabel.label("User"), "username", username));
        findFofs(fofs, user);
        List<Map.Entry<Node, MutableInt>> fofList = orderFofs(db, fofs);
        returnFofs(results, fofList.subList(0, Math.min(fofList.size(), 10)));
    }
    return Response.ok().entity(objectMapper.writeValueAsString(results)).build();
}
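The body of orderFofs isn't listed in this post, so here is a rough sketch of what a sort-and-truncate step like that could look like. Treat it as an assumption, not the code from the repository: plain String usernames stand in for Nodes, Integer stands in for MutableInt, the names FofRanking and topN are made up for illustration, and it leans on Java 8 comparators.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the ordering step; not the repository's orderFofs.
final class FofRanking {

    // Sort candidates by mutual-friend count, highest first, and keep at most `limit`.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));
        // Same guard as in getFofs: never ask subList for more entries than exist.
        return new ArrayList<>(entries.subList(0, Math.min(entries.size(), limit)));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("Tanya2", 3);
        counts.put("Julie3", 7);
        counts.put("Sophie5", 1);
        for (Map.Entry<String, Integer> entry : topN(counts, 2)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

Sorting the whole map is fine for a few thousand candidates; if the candidate set were much larger than the limit, a bounded priority queue would avoid sorting entries that get thrown away.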
I’ve placed findFofs, orderFofs and returnFofs in their own methods. We’re going to take a look at findFofs first, and I want you to pay attention because there is a glaring bug that I missed the first time I did this and that I am replicating here. See if you can spot it.
private void findFofs(HashMap<Node, MutableInt> fofs, Node user) {
    List<Node> friends = new ArrayList<>();
    if (user != null) {
        getFirstLevelFriends(user, friends);
        getSecondLevelFriends(fofs, user, friends);
    }
}
private void getFirstLevelFriends(Node user, List<Node> friends) {
    for ( Relationship relationship : user.getRelationships(FRIENDS, Direction.BOTH) ) {
        Node friend = relationship.getOtherNode(user);
        friends.add(friend);
    }
}
Now, here is where you really want to pay attention…
private void getSecondLevelFriends(HashMap<Node, MutableInt> fofs, Node user, List<Node> friends) {
    for ( Node friend : friends ) {
        for ( Relationship otherRelationship : friend.getRelationships(FRIENDS, Direction.BOTH) ) {
            Node fof = otherRelationship.getOtherNode(friend);
            if (!user.equals(fof) && !friends.contains(fof)) {
                MutableInt mutableInt = fofs.get(fof);
                if (mutableInt == null) {
                    fofs.put(fof, new MutableInt(1));
                } else {
                    mutableInt.increment();
                }
            }
        }
    }
}
Saw it? Me neither. Let’s test the performance of this endpoint using ApacheBench:
ab -k -c 1 -n 1 'http://127.0.0.1:7474/example/service/fofs/Michelle1'
Our results are WAY better than before: 2.670 seconds vs the timeouts we were seeing before.
Concurrency Level:      1
Time taken for tests:   2.670 seconds
Complete requests:      1
Failed requests:        0
Write errors:           0
Keep-Alive requests:    1
Total transferred:      655 bytes
HTML transferred:       522 bytes
Requests per second:    0.37 [#/sec] (mean)
Time per request:       2670.414 [ms] (mean)
Time per request:       2670.414 [ms] (mean, across all concurrent requests)
Transfer rate:          0.24 [Kbytes/sec] received
That’s a huge improvement, but Neo4j performs millions of traversals per second and can provide real time recommendations… 2.670 seconds just doesn’t sound right. So let’s dig in a little by using YourKit.
YourKit is a Java profiler which we can attach to a running Neo4j server and it’ll let us see what’s going on when we throw a little more load at it than 1 request. It’s not obvious but when you run Neo4j the name it shows up under is “Bootstrapper”. Take a look at the YourKit manual for more details.
ab -k -c 8 -n 800 'http://127.0.0.1:7474/example/service/fofs/Michelle1'
A little while after we start collecting profile information and begin running our test, this pops up:
Oh oh… something is obviously wrong… let’s dig in.
So something in getSecondLevelFriends is wasting time doing what looks like nothing…
private void getSecondLevelFriends(HashMap<Node, MutableInt> fofs, Node user, List<Node> friends) {
    for ( Node friend : friends ) {
        for ( Relationship otherRelationship : friend.getRelationships(FRIENDS, Direction.BOTH) ) {
            Node fof = otherRelationship.getOtherNode(friend);
            if (!user.equals(fof) && !friends.contains(fof)) {
… and there it is. We’re calling contains on a List of Nodes instead of a Set of Nodes, so it scans the whole list on every call instead of going right to the entry. It’s an O(n) vs O(1) type of problem because I used the wrong data structure. So let’s change this to a Set and try it again.
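The changed method isn't listed here, so as a sketch of the idea only: the version below keeps the same loop but holds the first-level friends in a HashSet. An in-memory adjacency map stands in for the Neo4j graph, String usernames stand in for Nodes, and the names FofCounter, link and countFofs are hypothetical, chosen just for this illustration (Java 8 assumed).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustration only: not the Neo4j API and not the code from the repository.
final class FofCounter {

    // Record an undirected FRIENDS link between a and b.
    static void link(Map<String, Set<String>> graph, String a, String b) {
        graph.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        graph.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Count, for each friend-of-friend that is not already a friend, the number of mutual friends.
    static Map<String, Integer> countFofs(Map<String, Set<String>> graph, String user) {
        // HashSet.contains is a hash lookup (O(1) on average);
        // ArrayList.contains walks the list (O(n)) on every call.
        Set<String> friends = new HashSet<>(
                graph.getOrDefault(user, Collections.<String>emptySet()));
        Map<String, Integer> fofs = new HashMap<>();
        for (String friend : friends) {
            for (String fof : graph.getOrDefault(friend, Collections.<String>emptySet())) {
                if (!user.equals(fof) && !friends.contains(fof)) {
                    fofs.merge(fof, 1, Integer::sum);
                }
            }
        }
        return fofs;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> graph = new HashMap<>();
        link(graph, "me", "a");
        link(graph, "me", "b");
        link(graph, "a", "b");
        link(graph, "a", "c");
        link(graph, "b", "c");
        link(graph, "b", "d");
        System.out.println(countFofs(graph, "me"));
    }
}
```

In the extension itself the equivalent change would be declaring friends as a Set<Node> instead of a List<Node>; the inner contains check stays exactly as written.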
ab -k -c 1 -n 1 'http://127.0.0.1:7474/example/service/fofs/Michelle1'
Our results are WAY better than before. 91 milliseconds vs the 2.670 seconds we were taking before, vs the timeout from where we started.
Concurrency Level:      1
Time taken for tests:   0.091 seconds
Complete requests:      1
Failed requests:        0
Write errors:           0
Keep-Alive requests:    1
Total transferred:      655 bytes
HTML transferred:       522 bytes
Requests per second:    10.99 [#/sec] (mean)
Time per request:       91.019 [ms] (mean)
Time per request:       91.019 [ms] (mean, across all concurrent requests)
Transfer rate:          7.03 [Kbytes/sec] received
Let’s try giving it some load:
ab -k -c 8 -n 800 'http://127.0.0.1:7474/example/service/fofs/Michelle1'
… and now we’re getting 55 requests per second of real-time recommendations on my laptop.
Concurrency Level:      8
Time taken for tests:   14.536 seconds
Complete requests:      800
Failed requests:        0
Write errors:           0
Keep-Alive requests:    800
Total transferred:      524000 bytes
HTML transferred:       417600 bytes
Requests per second:    55.04 [#/sec] (mean)
Time per request:       145.361 [ms] (mean)
Time per request:       18.170 [ms] (mean, across all concurrent requests)
Transfer rate:          35.20 [Kbytes/sec] received
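If the two "Time per request" lines look confusing, they are both derived from the same three numbers. The little calculator below is only a sketch for checking the arithmetic (the class and method names are made up): the first figure is how long a single request waited on average with 8 in flight, the second is the total time divided by the number of requests.

```java
// Hypothetical helper for re-deriving the ApacheBench summary numbers above.
final class AbMath {

    // Completed requests divided by wall-clock seconds.
    static double requestsPerSecond(int requests, double seconds) {
        return requests / seconds;
    }

    // "Time per request (mean)": average latency seen by one of the concurrent clients.
    static double meanLatencyMs(int concurrency, int requests, double seconds) {
        return concurrency * seconds * 1000.0 / requests;
    }

    // "Time per request (mean, across all concurrent requests)": wall-clock time per completed request.
    static double meanAcrossAllMs(int requests, double seconds) {
        return seconds * 1000.0 / requests;
    }

    public static void main(String[] args) {
        int concurrency = 8;
        int requests = 800;
        double seconds = 14.536;
        System.out.printf("%.2f req/s%n", requestsPerSecond(requests, seconds));
        System.out.printf("%.2f ms per request%n", meanLatencyMs(concurrency, requests, seconds));
        System.out.printf("%.2f ms across all%n", meanAcrossAllMs(requests, seconds));
    }
}
```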
As always, the full source code is available on GitHub.
One last thing… in Neo4j 2.1… it goes almost twice as fast.
Concurrency Level:      8
Time taken for tests:   8.523 seconds
Complete requests:      800
Failed requests:        0
Write errors:           0
Keep-Alive requests:    800
Total transferred:      524000 bytes
HTML transferred:       417600 bytes
Requests per second:    93.86 [#/sec] (mean)
Time per request:       85.234 [ms] (mean)
Time per request:       10.654 [ms] (mean, across all concurrent requests)
Transfer rate:          60.04 [Kbytes/sec] received
Now that’s Amazing.
This is an amazing blog post.
Is it possible to hint to cypher how to best approach the query, such that one does not need to use unmanaged extensions or the java api?