Personalization with Cypher

You have hopefully seen a TV commercial from Old Spice’s “The Man Your Man Could Smell Like” marketing campaign, and you may have seen some of the more than 100 videos in which Isaiah Mustafa responded to comments made on Twitter. This is a great example of personalization. Today you’ll learn how to bring some personalization to your application, and you won’t need muscles or a horse.

We’re going to dust off the Neoflix project from the beginning of the year and add a few features. It has been updated to work on Neo4j version 1.7 and allows searching for movies that have a quote. Thanks to Jenn Alons and Vince Cima for the bug fixes during WindyCityDB.

Personalization Strategies
When we are looking at an unregistered user (somebody just browsing the site), the item-based recommendation we already built is all we have to go on. Once the user registers and gives us some information about themselves, we can use their properties to recommend movies that other users with similar properties have liked. Once the registered user has rated a few movies, we can stop relying solely on their properties and recommend movies that were highly rated by users with a similar ratings history, and therefore similar taste in movies.

In order to be able to personalize Neoflix, we need to first be able to find a user. If you look at the graph creation, you’ll notice we are indexing the nodes to a “vertices” index and setting a property “type” equal to “User”. Let’s get a few users and see what we have to work with.

START users = node:vertices(type="User") 
RETURN users
LIMIT 5

We’re using Cypher here and this returns a table:

==> +----------------------------------------------------------+
==> | users                                                    |
==> +----------------------------------------------------------+
==> | Node[3902]{type->"User",userId->1,gender->"F",age->1}    |
==> | Node[4065]{type->"User",userId->143,gender->"M",age->18} |
==> | Node[4064]{type->"User",userId->142,gender->"M",age->25} |
==> | Node[4067]{type->"User",userId->145,gender->"M",age->18} |
==> | Node[4066]{type->"User",userId->144,gender->"M",age->25} |
==> +----------------------------------------------------------+

Looks like we have a sampling of users with a reference field userId, but our users also seem to have some additional information. We can see their gender and what seems to be an age range. What is the age distribution of our users?

START users = node:vertices(type = "User") 
RETURN users.age, COUNT(users.age) 
ORDER BY users.age

Cypher has no GROUP BY clause. Returning users.age next to the COUNT aggregate groups the rows by age implicitly, which is the equivalent of a GROUP BY in SQL. Below are our results:

==> +------------------------------+
==> | users.age | COUNT(users.age) |
==> +------------------------------+
==> | 1         | 134              |
==> | 18        | 581              |
==> | 25        | 942              |
==> | 35        | 581              |
==> | 45        | 275              |
==> | 50        | 263              |
==> | 56        | 224              |
==> +------------------------------+

As suspected, these are not their real ages but the lower bound of an age range (1-17, 18-24, 25-34, 35-44, 45-49, 50-55, 56+). We can do the same thing to get the gender distribution of our users.

START users = node:vertices(type = "User") 
RETURN users.gender, COUNT(users.gender) 
ORDER BY users.gender

Which gives us this table:

==> +------------------------------------+
==> | users.gender | COUNT(users.gender) |
==> +------------------------------------+
==> | "F"          | 821                 |
==> | "M"          | 2179                |
==> +------------------------------------+
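
If the implicit grouping feels unfamiliar, it may help to see the same idea outside the database. The following is a minimal Python sketch, not part of Neoflix: the `users` list is made-up sample data shaped like the User nodes above, and `count_by` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical sample records shaped like the User nodes above
# (not rows taken from the Neoflix data set).
users = [
    {"userId": 1, "gender": "F", "age": 1},
    {"userId": 2, "gender": "M", "age": 18},
    {"userId": 3, "gender": "M", "age": 25},
    {"userId": 4, "gender": "F", "age": 25},
    {"userId": 5, "gender": "M", "age": 25},
]


def count_by(records, key):
    """Count records per distinct value of `key`, sorted by that value.

    Roughly mirrors: RETURN users.<key>, COUNT(users.<key>) ORDER BY users.<key>
    """
    counts = Counter(record[key] for record in records)
    return sorted(counts.items())


age_counts = count_by(users, "age")
gender_counts = count_by(users, "gender")
print(age_counts)
print(gender_counts)
```

The value we return next to the count plays the role of the grouping key, just as users.age and users.gender do in the two queries.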

Looking at our numbers, we have roughly 2.7 times as many male sample users as female sample users (2179 vs. 821). Let’s combine these two metrics and add the number of movies rated as well as the number of ratings by each segment.

START users = node:vertices(type = "User") 
MATCH users -[r1:rated]-> movies
RETURN users.gender, users.age, COUNT(DISTINCT users) AS user_cnt, COUNT(DISTINCT movies) AS mov_cnt, COUNT(r1) AS rtg_cnt
ORDER BY users.gender, users.age

and see what we get:

==> +---------------------------------------------------------+
==> | users.gender | users.age | user_cnt | mov_cnt | rtg_cnt |
==> +---------------------------------------------------------+
==> | "F"          | 1         | 42       | 1804    | 5292    |
==> | "F"          | 18        | 153      | 2544    | 23247   |
==> | "F"          | 25        | 238      | 2967    | 39240   |
==> | "F"          | 35        | 157      | 2711    | 23493   |
==> | "F"          | 45        | 94       | 2161    | 9643    |
==> | "F"          | 50        | 77       | 2103    | 9568    |
==> | "F"          | 56        | 60       | 1544    | 5432    |
==> | "M"          | 1         | 92       | 2012    | 12058   |
==> | "M"          | 18        | 428      | 3083    | 74656   |
==> | "M"          | 25        | 704      | 3367    | 140239  |
==> | "M"          | 35        | 424      | 3209    | 74902   |
==> | "M"          | 45        | 181      | 2908    | 28118   |
==> | "M"          | 50        | 186      | 2628    | 26396   |
==> | "M"          | 56        | 164      | 2286    | 15472   |
==> +---------------------------------------------------------+
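
The segment query mixes two kinds of counting: COUNT(DISTINCT ...) for users and movies, and a plain COUNT for ratings. Here is a small Python sketch of that difference, assuming a flat list of made-up rating rows (the data and the name `segment_stats` are illustrative only):

```python
from collections import defaultdict

# Hypothetical rows, one per "rated" relationship:
# (userId, gender, age code, movie title, stars)
rating_rows = [
    (1, "F", 1, "Toy Story (1995)", 5),
    (1, "F", 1, "Aladdin (1992)", 4),
    (2, "F", 1, "Toy Story (1995)", 4),
    (3, "M", 25, "Matrix, The (1999)", 5),
    (3, "M", 25, "Toy Story (1995)", 3),
]


def segment_stats(rows):
    """Return {(gender, age): (user_cnt, mov_cnt, rtg_cnt)} sorted by segment."""
    users = defaultdict(set)    # distinct users per segment
    movies = defaultdict(set)   # distinct movies per segment
    ratings = defaultdict(int)  # every rating counts, duplicates included
    for user_id, gender, age, title, _stars in rows:
        segment = (gender, age)
        users[segment].add(user_id)
        movies[segment].add(title)
        ratings[segment] += 1
    return {
        segment: (len(users[segment]), len(movies[segment]), ratings[segment])
        for segment in sorted(ratings)
    }


stats = segment_stats(rating_rows)
for segment, row in stats.items():
    print(segment, row)
```

Sets collapse repeated users and movies, while the rating counter grows with every row, which is why rtg_cnt is so much larger than the other two columns in the table above.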

To validate our idea of using demographics to improve our recommendations, we need to do a little digging into our data. Let’s start by taking one user and finding out who the similar users are. Let’s try userId 1. If you recall from above, this user is female and in the 1-17 age range. To find similar users, we are going to take the movies this young lady has rated and see which other users have rated the same movies within one star of her rating. This will give us users who love the same movies she does and hate the same movies she does.

START me = node:vertices(userId = "1") 
MATCH me -[r1:rated]-> movies <-[r2:rated]- similar_users
WHERE ABS(r1.stars-r2.stars) <= 1 
RETURN similar_users, COUNT(*) AS cnt
ORDER BY cnt DESC 
LIMIT 10

We’ve added the WHERE clause so that a shared movie only counts when the absolute difference between the two star ratings is at most one.

==> +-----------------------------------------------------------------+
==> | similar_users                                             | cnt |
==> +-----------------------------------------------------------------+
==> | Node[5010]{type->"User",userId->1088,gender->"F",age->1}  | 45  |
==> | Node[5863]{type->"User",userId->1941,gender->"M",age->35} | 45  |
==> | Node[4600]{type->"User",userId->678,gender->"M",age->25}  | 44  |
==> | Node[5995]{type->"User",userId->2073,gender->"F",age->18} | 44  |
==> | Node[4346]{type->"User",userId->424,gender->"M",age->25}  | 43  |
==> | Node[5902]{type->"User",userId->1980,gender->"M",age->35} | 42  |
==> | Node[5527]{type->"User",userId->1605,gender->"F",age->18} | 41  |
==> | Node[5042]{type->"User",userId->1120,gender->"M",age->18} | 40  |
==> | Node[5535]{type->"User",userId->1613,gender->"M",age->18} | 39  |
==> | Node[4937]{type->"User",userId->1015,gender->"M",age->35} | 39  |
==> +-----------------------------------------------------------------+

Looking at this data, we can see she is most similar to user 1088, who shares her demographics, and to user 1941, who is probably the dad of another user who shares her demographics. Whether user-provided data can be trusted complicates things, and we’ll ignore that for now, but think about how a service like Netflix deals with multiple users sharing an account: Mom likes foreign romances, Junior likes action movies, and their wildly dissimilar ratings make a mess of the recommendations.

Let’s combine some queries and take a look at the demographics of similar users to user 1 and the number of matching ratings.

START me = node:vertices(userId = "1") 
MATCH me -[r1:rated]-> movies <-[r2:rated]- similar_users
WHERE ABS(r1.stars-r2.stars) <= 1 
RETURN similar_users.gender, similar_users.age, COUNT(DISTINCT similar_users.userId) AS user_cnt, COUNT(r2) AS rtg_cnt
ORDER BY similar_users.gender, similar_users.age

==> +---------------------------------------------------------------+
==> | similar_users.gender | similar_users.age | user_cnt | rtg_cnt |
==> +---------------------------------------------------------------+
==> | "F"                  | 1                 | 37       | 368     |
==> | "F"                  | 18                | 150      | 1403    |
==> | "F"                  | 25                | 229      | 2074    |
==> | "F"                  | 35                | 150      | 1296    |
==> | "F"                  | 45                | 86       | 559     |
==> | "F"                  | 50                | 74       | 508     |
==> | "F"                  | 56                | 58       | 362     |
==> | "M"                  | 1                 | 88       | 697     |
==> | "M"                  | 18                | 408      | 3747    |
==> | "M"                  | 25                | 677      | 6520    |
==> | "M"                  | 35                | 410      | 3687    |
==> | "M"                  | 45                | 176      | 1385    |
==> | "M"                  | 50                | 178      | 1277    |
==> | "M"                  | 56                | 156      | 838     |
==> +---------------------------------------------------------------+

Let’s compare user 1’s matching rating count with the total number of ratings in each segment. The match ratio is the matching count divided by the segment total, and the last column divides each ratio by the best segment’s ratio.

gender  age  user 1 matches  total ratings  match ratio  fraction of max
F       1    368             5292           0.069538927  1
F       18   1403            23247          0.060351873  0.867886179
F       25   2074            39240          0.05285423   0.760066813
F       35   1296            23493          0.055165368  0.793301983
F       45   559             9643           0.057969512  0.83362678
F       50   508             9568           0.053093645  0.763509706
F       56   362             5432           0.066642121  0.958342671
M       1    697             12058          0.057803948  0.831245898
M       18   3747            74656          0.050190206  0.72175698
M       25   6520            140239         0.04649206   0.668576036
M       35   3687            74902          0.04922432   0.70786712
M       45   1385            28118          0.049256704  0.708332818
M       50   1277            26396          0.048378542  0.695704471
M       56   838             15472          0.054162358  0.778878254
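
The two derived columns are simple arithmetic, and they can be reproduced from the counts in the two result tables above. Here is a short Python sketch; the variable names are our own choice, but the numbers are copied from those tables.

```python
# (gender, age code) -> (ratings matching user 1, total ratings in the segment),
# copied from the similar-user table and the segment table above.
segments = {
    ("F", 1): (368, 5292),
    ("F", 18): (1403, 23247),
    ("F", 25): (2074, 39240),
    ("F", 35): (1296, 23493),
    ("F", 45): (559, 9643),
    ("F", 50): (508, 9568),
    ("F", 56): (362, 5432),
    ("M", 1): (697, 12058),
    ("M", 18): (3747, 74656),
    ("M", 25): (6520, 140239),
    ("M", 35): (3687, 74902),
    ("M", 45): (1385, 28118),
    ("M", 50): (1277, 26396),
    ("M", 56): (838, 15472),
}

# Match ratio: share of a segment's ratings that land within one star of user 1.
ratios = {seg: matching / total for seg, (matching, total) in segments.items()}

# Fraction of max: each ratio relative to the best-matching segment.
best_ratio = max(ratios.values())
relative = {seg: ratio / best_ratio for seg, ratio in ratios.items()}

best_segment = max(ratios, key=ratios.get)
worst_segment = min(ratios, key=ratios.get)

for seg in segments:
    print(seg, round(ratios[seg], 9), round(relative[seg], 9))
print("best:", best_segment, "worst:", worst_segment)
```

Dividing by the segment total matters here: the male 25-34 segment has by far the most matching ratings in absolute terms, yet the lowest ratio.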

Her own demographic segment was the best matching, and the worst was male users in the 25-34 age range. This supports our theory, if only with a sample size of one. We can try other users to see if they corroborate our theory, take an average and see if this is worth implementing. We can also take a look at other user data that may be available and check whether it helps predict what a user will like, but let’s go with this for now.

START users = node:vertices(type = "User") 
MATCH (users)-[rating:rated]->(movies)
WHERE users.age = 1 AND users.gender = "F" AND rating.stars > 3 
RETURN movies.title, COUNT(rating) AS cnt
ORDER BY cnt DESC
LIMIT 10

Assuming a new user registered with the same demographics as user 1, we can give her the following 10 recommended movies:

==> +-------------------------------------------+
==> | movies.title                        | cnt |
==> +-------------------------------------------+
==> | "Toy Story 2 (1999)"                | 19  |
==> | "Toy Story (1995)"                  | 17  |
==> | "Sixth Sense, The (1999)"           | 16  |
==> | "Aladdin (1992)"                    | 15  |
==> | "Shakespeare in Love (1998)"        | 14  |
==> | "Bug's Life, A (1998)"              | 14  |
==> | "Beauty and the Beast (1991)"       | 13  |
==> | "Clueless (1995)"                   | 13  |
==> | "Men in Black (1997)"               | 13  |
==> | "E.T. the Extra-Terrestrial (1982)" | 12  |
==> +-------------------------------------------+

What about a male user aged 25-34?

START users = node:vertices(type = "User") 
MATCH (users)-[rating:rated]->(movies)
WHERE users.age = 25 AND users.gender = "M" AND rating.stars > 3 
RETURN movies.title, COUNT(rating) AS cnt
ORDER BY cnt DESC
LIMIT 10

Assuming a new user registered with these demographics, we can give him the following 10 recommended movies:

==> +---------------------------------------------------------------+
==> | movies.title                                            | cnt |
==> +---------------------------------------------------------------+
==> | "American Beauty (1999)"                                | 388 |
==> | "Star Wars: Episode V - The Empire Strikes Back (1980)" | 370 |
==> | "Star Wars: Episode IV - A New Hope (1977)"             | 366 |
==> | "Terminator 2: Judgment Day (1991)"                     | 328 |
==> | "Silence of the Lambs, The (1991)"                      | 326 |
==> | "Raiders of the Lost Ark (1981)"                        | 323 |
==> | "Matrix, The (1999)"                                    | 320 |
==> | "Saving Private Ryan (1998)"                            | 313 |
==> | "Braveheart (1995)"                                     | 300 |
==> | "Star Wars: Episode VI - Return of the Jedi (1983)"     | 296 |
==> +---------------------------------------------------------------+

That’s a very different list, and it will immediately make for a better user experience if we can get just those two pieces of information from the user.

We can now predict what star rating a brand new user will give a movie before they watch it given their demographics.

START movie = node:vertices(title="Toy Story 2 (1999)"),
      users = node:vertices(type = "User") 
MATCH (users)-[rating:rated]->(movie)
WHERE users.age = 1 AND users.gender = "F"
RETURN AVG(rating.stars)

==> +-------------------+
==> | AVG(rating.stars) |
==> +-------------------+
==> | 4.142857142857143 |
==> +-------------------+
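
Both demographic queries, the top-10 list and the average rating, boil down to filtering by segment and then aggregating. Here is a hedged Python sketch of that logic on made-up rows; `top_movies` and `predicted_stars` are hypothetical helper names, not functions from Neoflix.

```python
from collections import Counter
from statistics import mean

# Hypothetical rows: (gender, age code, movie title, stars)
rows = [
    ("F", 1, "Toy Story 2 (1999)", 5),
    ("F", 1, "Toy Story 2 (1999)", 4),
    ("F", 1, "Toy Story 2 (1999)", 3),
    ("F", 1, "Aladdin (1992)", 4),
    ("M", 25, "Matrix, The (1999)", 5),
    ("M", 25, "Toy Story 2 (1999)", 2),
]


def top_movies(rows, gender, age, limit=10):
    """Movies most often rated above three stars within one segment."""
    liked = Counter(
        title
        for g, a, title, stars in rows
        if g == gender and a == age and stars > 3
    )
    return liked.most_common(limit)


def predicted_stars(rows, gender, age, title):
    """Average rating a segment gave a movie, or None if nobody rated it."""
    stars = [s for g, a, t, s in rows if (g, a, t) == (gender, age, title)]
    return mean(stars) if stars else None


recommended = top_movies(rows, "F", 1)
prediction = predicted_stars(rows, "F", 1, "Toy Story 2 (1999)")
print(recommended)
print(prediction)
```

Returning None for an empty segment is a design choice of this sketch; a real application would probably fall back to the overall average or to the item-based recommendation.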

Once the newly registered user has rated a few movies, we can switch to truly personalized recommendations. Here we are predicting her rating of one movie she hasn’t seen yet.

START me = node:vertices(userId = "1"), 
      movie = node:vertices(title="101 Dalmatians (1961)")
MATCH me -[r1:rated]-> movies <-[r2:rated]- similar_users -[r3:rated]-> movie
WHERE ABS(r1.stars-r2.stars) <= 1 
RETURN AVG(r3.stars)

==> +-------------------+
==> | AVG(r3.stars)     |
==> +-------------------+
==> | 3.600526612751551 |
==> +-------------------+

We can also recommend 10 movies user 1 should see:

START me = node:vertices(userId = "1")
MATCH me -[r1:rated]-> movies <-[r2:rated]- similar_users -[r3:rated]-> new_movies
WHERE ABS(r1.stars-r2.stars) <= 1 AND r3.stars > 3 AND NOT((me)-[:rated]->(new_movies)) 
RETURN new_movies.title, COUNT(r3) AS cnt
ORDER BY cnt DESC
LIMIT 10

This query can take an awfully long time to return since we’re going to hit just about every relationship in the graph. A better way of handling this is to take the top 10 similar users we found earlier and force our traversal to start from just their node IDs.

START me = node:vertices(userId = "1"),
      similar_users = node(5010,5863,4600,5995,4346,5902,5527,5042,5535,4937)
MATCH similar_users -[rating:rated]-> new_movies
WHERE rating.stars > 3 AND NOT((me)-[:rated]->(new_movies)) 
RETURN new_movies.title, COUNT(rating) AS cnt
ORDER BY cnt DESC
LIMIT 10

Our results are now:

==> +------------------------------------------+
==> | new_movies.title                   | cnt |
==> +------------------------------------------+
==> | "Usual Suspects, The (1995)"       | 10  |
==> | "101 Dalmatians (1961)"            | 10  |
==> | "Casablanca (1942)"                | 10  |
==> | "It's a Wonderful Life (1946)"     | 10  |
==> | "Forrest Gump (1994)"              | 10  |
==> | "Elizabeth (1998)"                 | 10  |
==> | "Shawshank Redemption, The (1994)" | 10  |
==> | "American Beauty (1999)"           | 9   |
==> | "Few Good Men, A (1992)"           | 9   |
==> | "Shakespeare in Love (1998)"       | 9   |
==> +------------------------------------------+
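
The two-step approach (first pick the nearest neighbours, then count what they liked) can be sketched in a few lines of Python. This is only an illustration on made-up data, with hypothetical function names and single-letter movie titles, but it shows why limiting the neighbour set keeps the second step small.

```python
from collections import Counter

# Hypothetical ratings: userId -> {movie title: stars}
ratings = {
    1: {"A": 5, "B": 4},
    2: {"A": 5, "B": 4, "C": 5, "D": 4},
    3: {"A": 4, "B": 5, "C": 4, "D": 2},
    4: {"A": 1, "E": 5},
}


def top_similar(ratings, me, k, max_diff=1):
    """Step 1: the k users with the most shared movies rated within max_diff stars."""
    counts = Counter()
    for other, theirs in ratings.items():
        if other == me:
            continue
        for title, stars in ratings[me].items():
            if title in theirs and abs(stars - theirs[title]) <= max_diff:
                counts[other] += 1
    return [user for user, _ in counts.most_common(k)]


def recommend(ratings, me, neighbours, limit=10):
    """Step 2: unseen movies the chosen neighbours rated above three stars."""
    seen = ratings[me]
    counts = Counter(
        title
        for user in neighbours
        for title, stars in ratings[user].items()
        if stars > 3 and title not in seen
    )
    return counts.most_common(limit)


neighbours = top_similar(ratings, me=1, k=2)
recommendations = recommend(ratings, me=1, neighbours=neighbours)
print(neighbours)
print(recommendations)
```

User 4 loved movie "E" but disagrees with user 1 on the one movie they share, so "E" never reaches the recommendation list.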

Now we know how to make Neo4j give us personalized movie recommendations. We’ll add these to our Neoflix application in a future blog post. Just be careful your users don’t play a joke on you.

5 thoughts on “Personalization with Cypher”

  1. Agelos says:

    excellent post as always.

  2. Agelos says:

    May I ask, how do you craft and test your cypher queries against a Neo4j instance? I find the console one-liner to be a big hinder, compared to say the editor approach in most SQL DBs front-ends, where you multi-line-edit,hit-F9,check and repeat. What is the way you’ve found to mimic that with Neo4j?

    • maxdemarzi says:

      The bare-bones basic way I do it, is to bring up something like Notepad and copy and paste when I’m ready to execute.

  3. […] algorithm out of this. We also know how to predict what the start rating of the user will be using personalization, but we want to ask a different question. Can we summarize what people are saying about this […]

  4. Wes Freeman says:

    Now you can merge some of your queries together with WITH instead of passing in lists of start nodes:
    http://docs.neo4j.org/chunked/milestone/query-with.html#with-limit-branching-of-your-path-search

    Just saying.
