Delivering a Graph Based Search solution to slightly wrong data

oops

When it comes to databases, having good clean data is always important. More so with Graphs which deal with concepts as nodes and their relationships between them. Inevitably, you will run into messy data and have to deal with it. In a lot of the projects our customers work on they are dealing with connecting multiple data sources to get to a “golden record” or single source of truth. A lofty goal, sometimes impossible to achieve, but we can use the relationships of the data to help us come close.

One option is to extract the features (or tags) of a composite object and see if any other object shares most of these features. If that is the case then they are possibly the same object and should be merged instead of creating a new record. A partial subgraph match is something akin to a recommendation engine in Neo4j and pretty trivial to write. Take a look back at a few old blog posts for ideas.

We’re going to do something a little different today. Let’s take a look at how to deal with addresses. The model is very simple. A User has an Address. A User (or rather the user data we have) can have mistakes. Let’s say we get a second source of address data and we have addresses that don’t match. One thing we can do is to fuzzy match the address properties against known addresses of the user and see if any come close.

Let’s create some sample data. Here I have a user and their address:

CREATE (max:User {username:'maxdemarzi'})
CREATE (a1:Address {line1:'175 North Harbor Dr.', city:'Chicago', state:'IL', zip:'60605', zip_plus_4:'1234'})
CREATE (max)-[:HAS_ADDRESS]->(a1)

Now we want to know if an address slated for the same user should be added or merged in with an existing address:

{"address":"176 Norht Harbor Dr. Chicago IL 60601"}

Let’s start by finding the user in question:

Node user = db.findNode(Labels.User, "username", username);

Next we find all the addresses connected to that user. In this case I am traversing the HAS_ADDRESS relationship, getting the properties of the addresses and adding them to a set.

                HashSet<String> addresses = new HashSet<>();

                for (Relationship rel : user.getRelationships(Direction.OUTGOING, RelationshipTypes.HAS_ADDRESS)) {
                    Node address = rel.getEndNode();
                    addresses.add(
                            address.getProperty("line1", "") + " " +
                            address.getProperty("city", "") + " " +
                            address.getProperty("state", "") + " " +
                            address.getProperty("zip", "") + "-" +
                            address.getProperty("zip_plus_4", "")
                    );
                }

I will do a little clean up here and simplify the addresses to make the matching easier.

String potential_address = addressSimplifier.simplify((String) input.get("address"));

We uppercase the strings, and use a regex map to normalize each address. This is what this looks like:

public class AddressSimplifier implements Simplifier {
    @Override
    public String simplify(String pString) {
        String value = pString.toUpperCase();
        Iterator<String> i = commonItems.keySet().iterator();
        while(i.hasNext()) {
            String regex = i.next();
            String newWord = commonItems.get(regex);
            value = value.replaceAll(regex, newWord);
        }
        return value;
    }

    private static HashMap<String, String> commonItems = new HashMap<String,String>() {{
        put(" ST$", " STREET");
        put(" ST ", " STREET ");
        put("^P O| P O|^POST OFFICE| POST OFFICE", "PO");
        ...

We will borrow the simmetrics library(a Java library of similarity and distance metrics e.g. Levenshtein distance and Cosine similarity, etc) to make this easier. In this case the metric we will use is Damerau Levenshtein.

                StringMetric metric = new DamerauLevenshtein();

                for (String address : addresses) {
                    Float score = metric.compare(potential_address, addressSimplifier.simplify(address));
                    if (score > 0.50) {
                        HashMap<String, Object> result = new HashMap<>();
                        result.put("address", address);
                        result.put("score", score);
                        results.add(result);
                    }
                }

Finally we can sort it :

Collections.sort(results, scoreComparator);
return Response.ok().entity(objectMapper.writeValueAsString(results)).build();

and return the results to the user:

[{
    "score": 0.8478261,
    "address": "175 North Harbor Dr. Chicago IL 60605-1234"
  }]

Ok… now I know what you are thinking… dude, that’s pretty lame. I can do that with a relational database or search engine pretty easily. Yes you are right. But we only talked about one user’s addresses. What about the addresses of their friends, their family, their workplace, their organizations, etc? What about those? In Neo4j we can just do this:

                for (Relationship rel : user.getRelationships(Direction.BOTH,
                        RelationshipTypes.KNOWS,
                        RelationshipTypes.BELONGS_TO,
                        RelationshipTypes.WORKS_FOR)) {
                    Node addressOwner = rel.getOtherNode(user);
                    getAddresses(addressOwner, addresses);
                }

…and now we can go from a user to anything that is connected to them, find those addresses and compare them to our target address as well. The full source code of this example is on github as always.

This is the concept behind Graph Based Search. We aren’t searching a dumb index that doesn’t know anything beyond the search string. We are filtering our search to the connected values from one or more starting points not doing a global search. There may be 3000 John Smiths in your dataset, but only 1 happens to be a consultant working for a company that is providing services to one of your clients, and that is the John Smith you want to see at the top of the list. Graph Based Search is smarter search, are you using it?

Tagged , , , , , , , , , ,

2 thoughts on “Delivering a Graph Based Search solution to slightly wrong data

  1. Hi Max,

    As Neo4j embeds Lucene, why not using the LevenshteinDistance class from it?

    • maxdemarzi says:

      Sure that would work as well, but the theme of the blog post is you can do without a search engine, so using it in this case would feel wrong. The comparison function can be one of the many text functions, or it can be a lot more complicated, we can add weights to the relationships, etc.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: