Online Payment Risk Management with Neo4j


I really like this saying by Corey Lanum: finding the relationships that should not be there.

That is a great use case for Neo4j, and today I want to highlight an example of why. When you purchase something online, the merchant hands off your information to a payment gateway, which processes your actual payment. Before accepting the transaction, the gateway runs it through a series of risk management tests to validate that it is a real transaction and to protect itself from fraud. One of the hardest things for SQL-based systems to do is to cross-check the incoming payment information against existing data, looking for relationships that shouldn’t be there.

For example, given a credit card number, a phone number, an email address, and an IP address, find:

1. How many unique phone numbers, emails, and IP addresses are tied to the given credit card.
2. How many unique credit cards, emails, and IP addresses are tied to the given phone number.
3. How many unique credit cards, phone numbers, and IP addresses are tied to the given email.
4. How many unique credit cards, phone numbers, and emails are tied to the given IP address.

A high number of connections could mean a high potential for fraud. Given that the user is sitting in front of their computer waiting to see whether the merchant accepted their credit card, these queries need to return as fast as possible, and at high volume to handle traffic peaks. So we’re going to build an unmanaged extension to perform this query quickly over the REST API, a data generator to give us something to test against, and a performance test to see just how fast Neo4j can answer these types of queries.
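
Before the unit test, it helps to see the shell these snippets live in. Here is a minimal sketch of the extension class, assuming Neo4j 2.x-era APIs and the bundled Jackson 1.x ObjectMapper; the package name, class name, and /example mount point are my assumptions, inferred from the URL the performance test posts to at the end:

    package com.example;

    import javax.ws.rs.Path;

    import org.codehaus.jackson.map.ObjectMapper;
    import org.neo4j.graphdb.DynamicRelationshipType;
    import org.neo4j.graphdb.RelationshipType;

    // Package and class name are my assumptions. With the package mounted at
    // /example in neo4j-server.properties, the crossReference method defined
    // below ends up answering at /example/service/crossreference.
    @Path("/service")
    public class Service {
        static final ObjectMapper objectMapper = new ObjectMapper();
        static final RelationshipType RELATED = DynamicRelationshipType.withName("RELATED");

        // ... crossReference is built up step by step below ...
    }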

We’ll start with a unit test, so let’s build some data:

            Node cc1 = createNode(db, "1", "cc");
            Node phone1 = createNode(db, "1234567890", "phone");
            Node email1 = createNode(db, "email1@hotmail.com", "email");
            Node ip1 = createNode(db, "1.1.1.1", "ip");
            Node cc2 = createNode(db, "2", "cc");
            ....           

Our createNode method creates a node, sets a property named after the type to the value we passed in, and adds the newly created node to an index for its type:

    private Node createNode(GraphDatabaseService db, String value, String type) {
        // Each type gets its own index: "cc" -> "ccs", "phone" -> "phones", etc.
        Index<Node> index = db.index().forNodes(type + "s");
        Node node = db.createNode();
        // The property key is the type itself, e.g. node.cc = "1"
        node.setProperty(type, value);
        index.add(node, type, value);
        return node;
    }

We’ll also need to create some relationships to tie them together:

cc1.createRelationshipTo(phone1, RELATED);
cc1.createRelationshipTo(email1, RELATED);
cc1.createRelationshipTo(ip1, RELATED);
...

Since we’ll be using this over the REST API, we’ll prepare a request in JSON format, pass it to our crossReference method (which we’ll write next), and check the actual response against our expected value:

    @Test
    public void crossReference1() throws IOException {
        String requestOne;
        requestOne = "{\"cc\" : \"1\","
                + "\"phone\" : \"1234567890\", "
                + "\"email\" : \"email1@hotmail.com\", "
                + "\"ip\" : \"1.1.1.1\"}";

        Response response = service.crossReference(requestOne, db);
        List<HashMap<String,Integer>> actual = objectMapper.readValue((String) response.getEntity(), List.class);
        ...prepare expected value...
        assertEquals(expected, actual);
    }

We’ll expect a JSON POST request with a hash of the four attributes of our payment, and we’ll prepare a result list which will hold our answers:

    @POST
    @Path("/crossreference")
    public Response crossReference(String body, @Context GraphDatabaseService db) throws IOException {
        List<Map<String, AtomicInteger>> results = new ArrayList<Map<String, AtomicInteger>>();
        HashMap input = objectMapper.readValue( body, HashMap.class);

Then we’ll look up the credit card, phone number, email, and IP address in their respective indexes and add the results to a list of nodes:

        ArrayList<Node> nodes = new ArrayList<Node>();
        IndexHits<Node> ccIndex = db.index().forNodes("ccs").get("cc", input.get("cc"));
        IndexHits<Node> phoneIndex = db.index().forNodes("phones").get("phone", input.get("phone"));
        IndexHits<Node> emailIndex = db.index().forNodes("emails").get("email", input.get("email"));
        IndexHits<Node> ipIndex = db.index().forNodes("ips").get("ip", input.get("ip"));
        nodes.add (ccIndex.getSingle());
        nodes.add (phoneIndex.getSingle());
        nodes.add (emailIndex.getSingle());
        nodes.add (ipIndex.getSingle());

For each of the nodes, we’ll start with an empty map of counters and traverse the “RELATED” relationship in both directions, incrementing the counter in our map for the type of node we find on the other end:

        for(Node node : nodes){
            HashMap<String, AtomicInteger> crosses = new HashMap<String, AtomicInteger>();
            crosses.put("ccs", new AtomicInteger(0));
            crosses.put("phones", new AtomicInteger(0));
            crosses.put("emails", new AtomicInteger(0));
            crosses.put("ips", new AtomicInteger(0));
            if(node != null){
                for ( Relationship relationship : node.getRelationships(RELATED, Direction.BOTH) ){
                    Node thing = relationship.getOtherNode(node);
                    // Each node has exactly one property, whose key is its type
                    // ("cc", "phone", ...), so the key plus an "s" names the counter.
                    String type = thing.getPropertyKeys().iterator().next() + "s";
                    crosses.get(type).getAndIncrement();
                }
            }
            results.add(crosses);
        }

Finally we’ll return our results:

        return Response.ok().entity(objectMapper.writeValueAsString(results)).build();

… and that’s it. Seriously. Our results are very simple, since they are meant to be parsed and processed by another method that does the actual risk analysis. In the sample result below, the credit card used was tied to 4 IPs, 7 emails, and 4 phone numbers, which increases the odds that the transaction is fraudulent.

[{"ips":4,"emails":7,"ccs":0,"phones":4}, -- cc returned 4 ips, 7 emails, and 3 phones.
{"ips":1,"emails":1,"ccs":1,"phones":0}, -- phone returned just 1 item for each cross reference check.
{"ips":2,"emails":0,"ccs":4,"phones":3}, -- email returned 2 ips, 4 credit cards and 3 phones.
{"ips":0,"emails":1,"ccs":3,"phones":2}] -- ip returned 3 credit cards and 2 phones.

Now that we have our method and our unit test passing, we need to generate some data. We’ll start at the root of where this data comes from: processed transactions. We’ll create 50,000 transactions, and every 100 transactions we’ll generate some potentially fraudulent data by adding between 1 and 10 additional transactions that share some of the same fields. To make our life easier, we’ll use a random number to represent the hashed credit card number, and use the Faker gem to build realistic data for our other fields:

  require 'faker'

  transactions = File.open("transactions.csv", "a")
  # Header row: CSV.foreach below and the Gatling feeder both expect these column names
  transactions.puts "cc,phone,email,ip"
  50000.times do |t|
    # 7 random digits stand in for a hashed credit card number
    values = [rand.to_s[2..8], Faker::PhoneNumber.short_phone_number, Faker::Internet.email, Faker::Internet.ip_v4_address]
    transactions.puts values.join(",")
    if (t%100 == 0)
      rand(1..10).times do
        # Select 1, 2 or 3 fields to change
        change = [0,1,2,3].sample(rand(1..3))
        newvalues = [rand.to_s[2..8], Faker::PhoneNumber.short_phone_number, Faker::Internet.email, Faker::Internet.ip_v4_address]
        change.each do |c|
          values[c] = newvalues[c]
        end
        transactions.puts values.join(",")
      end
    end
  end
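
Since the extra-row branch fires 500 times (once for every 100 of the 50,000 base transactions) and adds 5.5 extra rows on average, transactions.csv should end up with roughly 52,750 transactions.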

With our transactions.csv file we’ll next extract the unique credit cards, phones, emails and ips into their own files:

  require 'csv'

  CSV.foreach('transactions.csv', :headers => true) do |row|
    ccs.puts row[0]
    phones.puts row[1]
    emails.puts row[2]
    ips.puts row[3]
  end

  %x[awk ' !x[$0]++' ccs.csv > ccs_unique.csv]
  %x[awk ' !x[$0]++' phones.csv > phones_unique.csv]
  %x[awk ' !x[$0]++' emails.csv > emails_unique.csv]
  %x[awk ' !x[$0]++' ips.csv > ips_unique.csv] 
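
(The awk '!x[$0]++' idiom prints a line only the first time it is seen, so each file is deduplicated without sorting.)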

…and we’ll do the same thing for the relationships:

  CSV.foreach('transactions.csv', :headers => true) do |row|
    ccs_to_phones.puts [row[0], row[1], "RELATED"].join("\t")
    ccs_to_emails.puts [row[0], row[2], "RELATED"].join("\t")
    ccs_to_ips.puts [row[0], row[3], "RELATED"].join("\t")
    phones_to_emails.puts [row[1], row[2], "RELATED"].join("\t")
    phones_to_ips.puts [row[1], row[3], "RELATED"].join("\t")
    emails_to_ips.puts [row[2], row[3], "RELATED"].join("\t")
  end  

  %x[awk ' !x[$0]++' ccs_to_phones.csv > ccs_to_phones_unique.csv]
  %x[awk ' !x[$0]++' ccs_to_emails.csv > ccs_to_emails_unique.csv]
  %x[awk ' !x[$0]++' ccs_to_ips.csv > ccs_to_ips_unique.csv]
  %x[awk ' !x[$0]++' phones_to_emails.csv > phones_to_emails_unique.csv]  
  %x[awk ' !x[$0]++' phones_to_ips.csv > phones_to_ips_unique.csv]  
  %x[awk ' !x[$0]++' emails_to_ips.csv > emails_to_ips_unique.csv]  

With our data generated, we are now ready to import it into Neo4j using the Batch Importer. Much has changed since my last blog post about the batch importer: Michael Hunger has made our lives easier by allowing us to look up nodes by an indexed property instead of having to come up with their node ids directly. The emails_unique.csv file now looks like this:

email:string:emails
geo@ferry.name
loy@bednar.com
marques.welch@hesseldach.com
....

The header tells the importer that each line is an “email” property of type “string”, indexed in the “emails” index. We’ll set up our batch.properties file to use all the unique csv files we created, and have it configure our indexes for us as well:

batch_import.nodes_files=ccs_unique.csv,phones_unique.csv,emails_unique.csv,ips_unique.csv
batch_import.rels_files=ccs_to_phones_unique.csv,ccs_to_emails_unique.csv,ccs_to_ips_unique.csv,phones_to_emails_unique.csv,phones_to_ips_unique.csv,emails_to_ips_unique.csv

batch_import.node_index.ccs=exact
batch_import.node_index.phones=exact
batch_import.node_index.emails=exact
batch_import.node_index.ips=exact
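
The relationship files rely on the same index lookup for their start and end columns. Assuming the importer’s property:type:index header syntax applies to relationship files as well, the header of ccs_to_phones_unique.csv would look something like this (tab-separated, with illustrative values):

cc:string:ccs	phone:string:phones	type
1955541	1234567890	RELATED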

Now we can run the batch importer to load our data:

java -server -Xmx4G -jar batch-import-jar-with-dependencies.jar neo4j/data/graph.db

After we configure our unmanaged extension and start the server, we can write our performance test using Gatling, as we’ve done before. We’ll use the transactions.csv file we created earlier as our test data, and send a JSON string containing our values to the URL we set up earlier:

class TestCrossReference extends Simulation {
  val httpConf = httpConfig
    .baseURL("http://localhost:7474")
    .acceptHeader("application/json")

  val testfile = csv("transactions.csv").circular

  val scn = scenario("Cross Reference via Unmanaged Extension")
    .during(30) {
    feed(testfile)
    .exec(
      http("Post Cross Reference Request")
        .post("/example/service/crossreference")
        .body("""{"cc": "${cc}", "phone": "${phone}", "email": "${email}", "ip": "${ip}" }""")
        .check(status.is(200))
      )
      .pause(0 milliseconds, 1 milliseconds)
  }

  setUp(
    scn.users(16).protocolConfig(httpConf)
  )
}
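
Each of the 16 simulated users loops for 30 seconds, feeding the next row of transactions.csv into the POST body (the circular feeder wraps around to the first row when the file runs out) and checking for a 200 response.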

…and drumroll please:

[Gatling results chart for the crossreference simulation]

1,246 requests per second with a mean latency of 11ms on my laptop. As long as your dataset can be held in memory, Neo4j will maintain these numbers regardless of your overall database size, since performance is only affected by the number of relationships traversed in each query. I’ve already shown you how you can scale up; if you need more throughput, a cluster of Neo4j instances can deliver it by scaling out. The code for everything shown here is available on GitHub as always, so please don’t take my word for it: try it out yourself.

