News Feeds

Ron Burgundy Gets Hungry

Ron Burgundy (in Anchorman) gets Hungry

The “News Feed” is a core feature of social networks like Twitter, Facebook, or Vine (RIP). Let’s take a look at how we could model and implement this in Neo4j. Our social network needs Users (otherwise it would be kinda empty) that FOLLOW each other (otherwise it would not be very social). Those users need to POST some Messages (otherwise it would be boring). Here is our first attempt at a model (using Arrows):

newsfeed-model1

Now the news feed feature needs to have all the messages from all the people we follow to be sorted, newest one on top since the last time we saw the news feed. We can write a Cypher query that looks like this:

MATCH (u:User {id:'maxdemarzi'})-[:FOLLOWS]->(u:User)-[:POSTED]->(m:Message)
WHERE m.date > {last_seen_message_date}
RETURN u, m
ORDER BY m.date DESC

This query is going to work out great in the beginning, but let’s say we end up following 500 people and each posts twice a day. This query is going to have to traverse the POSTED relationship and check the date property of 1000 nodes per day our application is running. After a year it will be at 365k traversals and property checks and our news feed will take around a second to compute. Not good. Let’s see what is another way to model this? According to the documentation we could turn the messages from each User into a Linked List (you remember those from computer science class right?). Our model would then look like this:

newsfeed-model2-updated

In this case we would find the first Message a person we follow posted. Continue down the NEXT relationship until we got to a Message with a date greater than the one we care about. Then we can just keep going down NEXT without having to check the date property anymore. This is an interesting idea, but since we want messages in descending order, wouldn’t it be better to keep the latest Message first and use PREVIOUS instead of NEXT like so:

newsfeed-model3

Now we find the latest message right away and continue down the PREVIOUS relationship until we find a date earlier than the one we care about and we can stop our traversal. This is a pretty good model actually. We can make one small tweak to make it even better. We can update the User node with the date of the last post and check there first before even going to messages. This is better because while on average our users will post twice a day, some may make very few posts and others tons.

newsfeed-model4

(Almost) Nothing is free so this model comes at a cost. Whenever a User posts something we have to delete a relationship (the previous POSTED_LAST), create two new relationships (a new POSTED_LAST and a new PREVIOUS) as well as update our user node property last_post. Normally I don’t like deleting relationships because they leave empty space in your Neo4j store files, but since Neo4j 3.0 we are reusing deleted relationships by default so those empty holes will be filled up later and no more empty space.

Still that’s a lot of creating, updating and deleting. What else can we do? Well if we had Node-Centric Indexes in Neo4j we would go back to the original model and index the date property. This comes at a price since each write would now also write to an index, and we’d lose the whole “index free adjacency” benefit of Neo4j. So what can we do? Well, what if we take the date and make it part of the relationship type like this:

newsfeed-model5

Relationships in Neo4j are stored in a Relationship Chain by Type. See the bottom picture for a visual guide. In a traversal by specifying just the 2 “dated” relationship types we ignore anything else users may have posted in the past. This is limiting our traversal to 2 days worth of updates, or in our case up to 2000 posts. We could write our query like this (dynamically changing the relationship types) and be guaranteed the same level of performance on day 1 and on day 365 (assuming no new FOLLOWS relationships) .

MATCH (u:User {id:'maxdemarzi'})-[:FOLLOWS]->(u:User)
      -[:POSTED_ON_2016_10_28|POSTED_ON_2016_10_29]->(m:Message)
WHERE m.date > {last_seen_message_date}
RETURN u, m
ORDER BY m.date DESC

In Cypher, the “|” (pipe) between two relationship types means “OR” and you can have as many as you want. If you wanted to look at a User’s timeline, you could iterate through these dated relationship types, which is slightly more complex than the PREVIOUS model, but it has the benefit of built in “pagination”.

Now there is another way to model a news feed called “Graphity” by Rene Pickhardt that trades great read performance for somewhat complicated write and update rules.

What do you think? Which model do you like best? Can you come up with an even better one? Let me know in the comments below.

Tagged , , , , , , ,

2 thoughts on “News Feeds

  1. Dave Rader says:

    The Rene Pickhardt site keeps timing out (still available?). As an alternative, how about a date tree with the “recent … previous” chain?

  2. […] wrote about timelines or “news feeds” previously. If you read that blog post, you know the problem with this query. As the number of POSTED and […]

Leave a comment