A few months ago I wrote up how to use the graph algorithms library to find fraud rings in bank data. The graph algorithms plugin has been a huge hit and received a promotion to a fully supported library with a team of developers, data scientists and product managers behind it. It was partially rewritten and given a fresh name. It is now called the Graph Data Science Library.
We’re going to give that fraud blog post a fresh look as well, change it to use the new library as well as throw more data at it. Please be sure you go back and read the original post right now so it’s fresh in your mind what we are going to do. Make sure you have the latest version of Neo4j Desktop running ( at least version 1.2.5 ), create a new graph with version 3.5.15 and install the plugin:
Then in the configuration Settings, add the following line:
dbms.security.procedures.unrestricted=gds.*
… now start the database and let’s try creating some users. Last time we just created John, Sheila and Karen (and Robert too). This time we are going to create 1 Million users. Run the following script to create them:
WITH ["Jennifer","Michelle","Tanya","Julie","Christie", "Sophie","Amanda","Khloe","Sarah","Kaylee"] AS names FOREACH (r IN range(1,1000000) | CREATE (:User {username:names[r % size(names)] + "-" + r}) )
Oh oh, what happened?
We ran out of memory trying to execute that transaction. Why is that? Well because by default the heap size of a new Neo4j database in Desktop starts at 512MB and only goes up to 1GB. Let’s change that in the Settings file. Scroll down a bit, make this change so we now have 8GB to play with and restart the database.
dbms.memory.heap.initial_size=8G dbms.memory.heap.max_size=8G
Once it finishes restarting run that query again and you should see:
Great. Now we have 1M Users. Let’s keep going and give them all a checking account.
FOREACH (r IN range(1,1000000) | CREATE (:Account {number: r, balance: round(rand() * 1000000) / 100.0, type:"Checking"}) )
…and a savings account too:
FOREACH (r IN range(1,1000000) | CREATE (:Account {number: r, balance: round(rand() * 1000000) / 100.0, type:"Savings"}) )
Well… that created the accounts, we haven’t linked them to the users quite yet. We’re going to assume everyone has 1 checking and 1 savings account for now. Let’s start off by creating a HAS_ACCOUNT relationship from users to their checking accounts.
UNWIND range(1,1000000) AS number MATCH (user), (account) WHERE id(user) = number AND id(account) = number + 1000000 CREATE (user)-[:HAS_ACCOUNT]->(account)
Now do the same but for the savings accounts.
UNWIND range(1,1000000) AS number MATCH (user), (account) WHERE id(user) = number AND id(account) = number + 2000000 CREATE (user)-[:HAS_ACCOUNT]->(account)
You’ll notice I changed the number + 1000000 to be number + 2000000. We know the first 1M nodes are all users, the second 1M nodes are all checking accounts and the third 1 M nodes are all savings accounts. That’s why this quick trick works.
Our data so far looks like this:
Ok… let’s keep going and add some loans. Not everyone has loans, so we’ll just do 500k.
FOREACH (r IN range(1,500000) | CREATE (:Loan {number: r, balance: round(rand() * 1000000) / 100.0, type:"Unsecured Loan"}) );
Let’s throw in some credit cards as well, another 500k of them:
FOREACH (r IN range(1,500000) | CREATE (:Card {number: 4111111111111111 + r, balance: round(rand() * 1000000) / 100.0}) );
…and let’s connect them to our users.
UNWIND range(1,1000000) AS number MATCH (user), (account) WHERE id(user) = number AND id(account) = number + 3000000 CREATE (user)-[:HAS_ACCOUNT]->(account);
Oh, isn’t creating synthetic data so much fun? How about some identification numbers:
FOREACH (r IN range(1,1000000) | CREATE (:Identification {number: r + 100000000, type:"SSN"}) );
That creates 1M social security numbers, let’s connect them to our users as before.
UNWIND range(1,1000000) AS number MATCH (user), (identification) WHERE id(user) = number AND id(identification) = number + 4000000 CREATE (user)-[:HAS_ID]->(identification);
…and let’s break up the monotony by doing something funky to the data. Let’s assume we aren’t perfect in our record keeping and somehow about 1000 of our users ended up with the same social security number. Maybe it was a typo, maybe it was a same name mix up, maybe it was deliberate… who knows, let’s just delete 1000 of those connections and create them to another Identification node.
MATCH (user:User) WITH user ORDER BY rand() LIMIT 1000 MATCH (user)-[r:HAS_ID]->(identification) DELETE r WITH user MATCH (identification) WHERE ID(identification) = 4000000 + round(rand() * 1000000) CREATE (user)-[:HAS_ID]->(identification);
Cool, let’s add some phone numbers next:
FOREACH (r IN range(1,1000000) | CREATE (:Phone {number: r + 5550000000}) );
…and connect those phone numbers to people.
UNWIND range(1,1000000) AS number MATCH (user), (phone) WHERE id(user) = number AND id(phone) = number + 5000000 CREATE (user)-[:HAS_PHONE]->(phone);
Let’s do the same mixing we did with social security numbers, but this time with phone numbers… but let’s up the count a bit and say 10k phone numbers are shared.
MATCH (user:User) WITH user ORDER BY rand() LIMIT 10000 MATCH (user)-[r:HAS_PHONE]->(phone) DELETE r WITH user MATCH (phone) WHERE ID(phone) = 5000000 + round(rand() * 1000000) CREATE (user)-[:HAS_PHONE]->(phone);
Let’s take a peek and see what our data looks like now:
Alright… time for some addresses. With only a million people let’s assume this is a regional bank and only does business in Illinois. So we’ll take a collection of city names from Illinois and make up the rest. Starting the “line1” property with r makes sure our addresses are always unique, so no worries there.
WITH ["Chicago", "Aurora","Rockford","Joliet", "Naperville","Springfield", "Peoria", "Elgin", "Waukegan", "Champaign", "Bloomington", "Decatur", "Evanston", "Wheaton", "Belleville", "Urbana", "Quincy", "Rock Island"] AS cities, ["North", "South", "East", "West", "SouthWest", "SouthEast", "NorthWest", "NorthEast"] AS direction, ["Main", "Park", "Oak", "Pine", "Maple", "Cedar", "Elm", "Washington", "Lake", "Hill", "First", "Second", "Third", "Fourth" + "Fifth"] AS street1, ["Drive", "Lane","Avenue", "Way", "Circle", "Square", "Court", "Road", "Alley", "Fork","Grove", "Heights", "Landing", "Path", "Route", "Trail", "Cove", "Loop", "Bend"] AS street2 FOREACH (r IN range(1,1000000) | CREATE (ad:Address {line1: r + " " + direction[r % size(direction)] + " " + street1[r % size(street1)] + " " + street2[r % size(street2)], city: cities[r % size(cities)], state:"IL", zip: 60400 + (r % size(cities)) % 100 }) )
Let’s connect our users to addresses just like before.
UNWIND range(1,1000000) AS number MATCH (user), (address) WHERE id(user) = number AND id(address) = number + 6000000 CREATE (user)-[:HAS_ADDRESS]->(address);
…and let’s make about 100k of them co-habitants.
MATCH (user:User) WITH user ORDER BY rand() LIMIT 100000 MATCH (user)-[r:HAS_ADDRESS]->(address) DELETE r WITH user MATCH (address) WHERE ID(address) = 6000000 + round(rand() * 1000000) CREATE (user)-[:HAS_ADDRESS]->(address);
What does our data look like now:
We have enough data to connect our users based on their profiles. So finally we get to use the Graph Data Science Library. The first thing we want to do is to load a graph to memory. I am going to be using a Cypher Projection of just the Users and any connections between them. We’ll call this graph “profiles”, the syntax looks like this:
CALL gds.graph.create.cypher( 'profiles', 'MATCH (n:User) RETURN id(n) AS id', 'MATCH (p1:User)-->()<--(p2:User) RETURN id(p1) as source, id(p2) as target' );
So we start with the name, then we give it the query for the nodes, then the query for the relationships. In this case I’m using any and all two hop connections between two users as a valid relationship in our in memory graph.
It takes a few seconds to load this graph into memory and now we can run our Connected Components algorithm like before.
We’ll write our results to a property called “profilesComponentId” based on the components of our “profiles” graph:
CALL gds.wcc.write("profiles", { writeProperty: 'profilesComponentId' }) YIELD nodePropertiesWritten, componentCount;
This also takes a few seconds to run and returns a count of about 900k components.
Typically we are interested in the largest components, so we can run the following query to get the top 25:
MATCH (n:User) WITH n.profilesComponentId AS component, COUNT(*) AS number RETURN component, number ORDER BY number DESC LIMIT 25;
In my case componentId 125816 came out at the top, so let’s see what it contains. Before we do that however, let’s go ahead and create an index on this property which will make the User nodes easier to find.
CREATE INDEX ON :User(profilesComponentId);
Give that a second to finish populating and let’s try finding all the nodes with componentId 125816 and what they are connected to:
MATCH (n:User)-[r]-(n2) WHERE n.profilesComponentId = 125816 RETURN n, r, n2
Here is what it looks like:
Suspicious? Maybe… Maybe not. Synthetic data isn’t nearly as much fun as the real deal. Try it on your data and see if you come up with any very large components… those are probably trouble.
Let’s keep going. Let’s say you have a Mobile app and you keep track of which users accessed their accounts via this app using their phone. Let’s create about 100k ACCESSED relationships from people and their existing phone numbers:
MATCH (phone:Phone) WITH phone ORDER BY rand() LIMIT 100000 MATCH (phone)-[:HAS_PHONE]-(user) CREATE (phone)-[:ACCESSED]->(user);
Now let’s create 10k more ACCESSED relationships from random phone numbers to random users. This will simulate accounts being accessed possibly by people who shouldn’t be accessing those accounts.
MATCH (phone:Phone) WITH phone ORDER BY rand() LIMIT 10000 MATCH (user) WHERE id(user) = ROUND(rand() * 1000000) CREATE (phone)-[:ACCESSED]->(user);
Let’s say our customers are web savvy and using our online banking application. While they are there, we are keeping track of their browser fingerprints. Let’s create 1M of these:
FOREACH (r IN range(1,1000000) | CREATE (fg:Browser {fingerprint: randomUUID()}) );
… and connect them to our users.
UNWIND range(1,1000000) AS number MATCH (user), (fg) WHERE id(user) = number AND id(fg) = number + 7000000 CREATE (user)<-[:ACCESSED]-(fg);
Let’s say we have multiple accounts account being accessed by the same computer. In some cases it’s just a couple sharing a computer, in other cases it could be malicious. Run this next query 3 times:
// Run it 3 times: MATCH (fg:Browser) WITH fg ORDER BY rand() LIMIT 10000 MATCH (user) WHERE id(user) = ROUND(rand() * 1000000) CREATE (fg)-[:ACCESSED]->(user);
Ok. Before we begin exploring this graph of ACCESSED accounts, let’s clean up a bit and remove the “profiles” graph from memory:
CALL gds.graph.drop('profiles') YIELD graphName;
Now we can create our “access” graph using just the “ACCESSED” relationship type between users.
CALL gds.graph.create.cypher( 'access', 'MATCH (n:User) RETURN id(n) AS id', 'MATCH (p1:User)-[:ACCESSED]-()-[:ACCESSED]-(p2:User) RETURN id(p1) as source, id(p2) as target' );
We can write a different component property to our nodes, in this case “accessComponentId”:
CALL gds.wcc.write("access", { writeProperty: 'accessComponentId' }) YIELD nodePropertiesWritten, componentCount;
Just like before, let’s take a look at the top 25:
MATCH (n:User) WITH n.accessComponentId AS component, COUNT(*) AS number RETURN component, number ORDER BY number DESC LIMIT 25;
Don’t forget to create an index on this property to make the nodes easier to find:
CREATE INDEX ON :User(accessComponentId);
Let’s take a look at one connected component:
MATCH (n:User)-[r]-(n2) WHERE n.accessComponentId = 778 RETURN n, r, n2;
Sweet… ok, getting close to the end here folks, we can do it. Let’s drop this graph and create another with everything:
CALL gds.graph.drop('access') YIELD graphName;
We will use Cypher Projection like before, but this time we won’t care about the relationship type or direction of how two User nodes are connected, just the fact that they are:
CALL gds.graph.create.cypher( 'everything', 'MATCH (n:User) RETURN id(n) AS id', 'MATCH (p1:User)--()--(p2:User) RETURN id(p1) as source, id(p2) as target' );
CALL gds.wcc.write("everything", { writeProperty: 'everythingComponentId' }) YIELD nodePropertiesWritten, componentCount;
MATCH (n:User) WITH n.everythingComponentId AS component, COUNT(*) AS number RETURN component, number ORDER BY number DESC LIMIT 25;
Look at those 10 user components… those are interesting!
CREATE INDEX ON :User(everythingComponentId);
…let’s check them out:
MATCH (n:User)-[r]-(n2) WHERE n.everythingComponentId = 105538 RETURN n, r, n2;
Now we’re talking. I have copied and pasted the cypher queries to a gist to make it easier to try this at home and follow along. After that you have to try it on your data. Make sure to increase the heap if you are loading many millions of nodes and relationships. For large customer bases you may want to try this on a server instead of a laptop, and be sure to get in contact if you have any trouble or find any trouble in your data!