Finding Fraud

It’s no secret that one of our hottest use cases lately has been Fraud Detection. A while back we did a webinar talking about some of the ways you could use Neo4j to fight fraud. Watch it, if you haven’t yet. Today I want to augment that webinar with some cypher queries. Let’s see how it works:

In our model, we have Users that use their credit cards to make Transactions at different Merchants. Fraudsters with stolen credit card data test their stolen numbers, make big purchases and then disappear. What we are after is finding either where the credit card data may have been stolen or merchants where stolen credit card data is being tested.

Let’s start with John. He has made some credit card transactions:

Let’s create this data in our graph. We have a user, some merchants and some transactions made at those merchants:

CREATE (john:User {name:"John"})
CREATE (m1:Merchant {name:"Computer Store"})
CREATE (m2:Merchant {name:"Gas Station"})
CREATE (m3:Merchant {name:"Jewelry Store"})
CREATE (m4:Merchant {name:"Furniture Store"})
CREATE (tx1:Transaction:Fraudulent {amount: 2000.00, date:datetime()})
CREATE (tx2:Transaction {amount: 35.00, date:datetime() - duration('P1D')})
CREATE (tx3:Transaction {amount: 25.00, date:datetime() - duration('P2D')})
CREATE (tx4:Transaction {amount: 12.00, date:datetime() - duration('P3D')})
CREATE (tx1)-[:AT_MERCHANT]->(m1)
CREATE (tx2)-[:AT_MERCHANT]->(m2)
CREATE (tx3)-[:AT_MERCHANT]->(m3)
CREATE (tx4)-[:AT_MERCHANT]->(m4)

That’s a good start, but we still need to connect John to his transactions. How could we do that? Well we could just create a MAKES relationship from John to each like so:

CREATE (john)-[:MAKES]->(tx1)
CREATE (john)-[:MAKES]->(tx2)
CREATE (john)-[:MAKES]->(tx3)
CREATE (john)-[:MAKES]->(tx4)

That’s one way to model the relationship and it works great to know every transaction a particular user made. But what if we wanted to get just the most recent transactions? Well, with this model we’d have to traverse all of them, check their date property, sort and retrieve just a few.

// The last week of John's transactions using the MAKES relationships
MATCH p = (n:User {name:"John"})-[:MAKES]->(tx)
WHERE > datetime() - duration('P7D')

The results would look like this:

That seems like a lot of work. What we can do instead is arrange them in a Linked List. Yeah you know, that thing you learned about in class but never really used outside of whiteboard interviews?

We will make a chain of transactions starting from John to his most recent transaction and going back in time like so:

CREATE (john)-[:PREV_TX]->(tx1)
CREATE (tx1)-[:PREV_TX]->(tx2)
CREATE (tx2)-[:PREV_TX]->(tx3)
CREATE (tx3)-[:PREV_TX]->(tx4)

If we wanted to see these transactions using the PREV_TX relationship chain, we could use the magic star (*) character in Cypher to “keep going” from John down the way.

// The last week of John's transactions using the PREV_TX relationships
MATCH p = (n:User {name:"John"})-[:PREV_TX*]->(tx)
WHERE NONE (tx IN nodes(p) 
            WHERE COALESCE(, datetime()) <= datetime() - duration('P7D'))

Alright let’s make this interesting. John notices a large transaction in his statement he didn’t make and calls his credit card company. They mark the Transaction as Fraudulent.

But John is not the only customer. Let’s add Karen, Sheila and Robert as more Users, along with their Transactions and a few more Merchants. I’m not going to put all that cypher here, but you can see it on this gist.

Karen and Sheila also found transactions they did not make on their statement and called in. Unbeknownst to them (and us for now), they had all visited the same Gas Station about two weeks ago. How can we use Neo4j to find out this interesting fact?

Let’s start by finding all the fraudulent transactions in the last week and see which other transactions these users made, up to two weeks ago. We can make use of the PREV_TX relationship chain to go backwards until we find them all.

// All the transactions marked fraudulent in the last week and 
// the transactions that came before them up to two weeks ago.
MATCH p = (fraud:Fraudulent)-[:PREV_TX*]->(tx)
WHERE > datetime() - duration('P7D')
  AND NONE (tx IN nodes(p) WHERE COALESCE(, datetime()) <= datetime() - duration('P14D'))

From these chains of transactions we need to find the common links connecting them together.

We can find out which merchants these transactions occurred at in order to connect them together. We’ll do that and order them by a count of purchases.

// Top 5 common merchants from fraudulent transaction chains up to two weeks ago.
MATCH p = (fraud:Fraudulent)-[:PREV_TX*]->(tx)
WHERE > datetime() - duration('P7D')
  AND NONE (tx IN nodes(p) 
            WHERE COALESCE(, datetime()) <= datetime() - duration('P14D'))
WITH nodes(p) AS transactions
UNWIND transactions AS tx
MATCH (tx)-[:AT_MERCHANT]->(merchant)

There is that Gas Station we were looking for at the top of the list. The fraud could have originated there, or frauders may have used that place to test our stolen credit cards. Let’s dig a little further and find out who else is at risk.

First let’s get the Users we know were affected by fraud. We’ll want to exclude these users since it’s too late for them.

// Find the users who reported fraud the last week
MATCH (fraud:Fraudulent)<-[:MAKES]-(user)
WHERE > datetime() - duration('P7D')
WITH COLLECT(user) AS users

Now, let’s go get their transaction chains:

// Find the transactions that happened before the reported fraud in the last 2 weeks
MATCH p = (fraud:Fraudulent)-[:PREV_TX*]->(tx)
WHERE > datetime() - duration('P7D')
  AND NONE (tx IN nodes(p) 
            WHERE <= datetime() - duration('P14D'))
WITH users, nodes(p) AS transactions

Let’s continue to the merchants involved:

// Find out at which merchant these transactions took place
UNWIND transactions AS tx
MATCH (user)-[:MAKES]->(tx)-[:AT_MERCHANT]->(merchant)

Next let’s change how we counted transactions. Instead of the total number of transactions we will use the count of distinct users who made transactions there. This will make a daily coffee shop transaction equal to one instead of once for each day.

// Use the number of distinct users who made transactions as a score
WITH users, merchant, COUNT(DISTINCT user) AS userCount

Now we’ll find other users who made transactions at those merchants in the last two weeks and filter out the users who already fell victim to fraud.

// Find out who else may be at risk
MATCH (merchant)<-[:AT_MERCHANT]-(tx)<-[:MAKES]-(user)
WHERE > datetime() - duration('P14D')
  AND NOT user IN users

Then we’ll collect our set of potential merchants. Different users may be victims to multiple fraudsters that are operating at the same time, so we’ll group our potential merchants by user.

// Which merchants may have been the first sign of fraud
WITH DISTINCT user, COLLECT({merchant:, 
     userCount: userCount}) AS potentials, 
     AVG(userCount) AS averageuserCount
WITH user, averageuserCount, potentials
UNWIND potentials AS potential

…and get the merchants where an above average number of users made transactions.

// Return those with higher than average 
WITH user, potential, averageuserCount
WHERE potential.userCount > averageuserCount
RETURN user, potential.merchant, potential.userCount
ORDER BY user, potential.userCount DESC

And here is the result:

Surprise! The fraud could have originated at the Gas Station, but it could also have started at the Bowling Alley!

// Surprise, it could have been the Bowling Alley!
MATCH p=(m:Merchant {name:"Bowling Alley"})<-[:AT_MERCHANT]-(tx)<-[:MAKES]-(user) 

We would run the fraud query whenever a new Fraudulent transaction was reported, and would freeze the credit cards of any users who may have also been affected to prevent further loses from occuring. If you want to give it a try yourself, start Neo4j and run this Cypher Script to generate the example data and try the queries!

If you have a moderate amount of transactions to deal with, this query works great, but at large scales there is a performance improvement we can make. Once we have the Merchants involved in the chain of fraudulent transactions and we go find the transactions that occurred with the last 2 weeks we run this MATCH statement.

// Find out who else may be at risk
MATCH (merchant)<-[:AT_MERCHANT]-(tx)<-[:MAKES]-(user)
WHERE > datetime() - duration('P14D')

Do you see the problem? If a merchant has lots of transactions and we have kept a long history, we will be looking at many transactions and their date property. So what can we do to speed this up? Well, we can promote the date property into the AT_MERCHANT relationship, so it becomes AT_MERCHANT_ON_2019_08_19 for example. If you are a regular you’ve seen me use this trick here and there, but if you are new follow those links and watch this explanation. Now in the query we construct we would only traverse from merchants through the 14 relationship types (one for each day we care about) and not care about how much history we have.

MATCH (merchant)<-[:AT_MERCHANT_ON_2019_08_19|

This change allows to also skip looking at the date property of the transaction and we’re off to the races. We’ll also want to incorporate more information to come up with a better score than just counting users. We could look at whether the transaction was online or not, if the credit card was present, if they used a chip, pin, zip code or just magnetic strip for the checkout, the kind of business the merchant is in, and whatever else we think could help.

So go on, and fight fraud with Neo4j.

Tagged , , , , , , , , , , , ,

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: