The Real Property Graph

Is not that thing above. That’s a Chart, not a Graph. But anyway…Neo4j is designed to support the property graph model natively. There are a host of other technologies that can bolt-on a “graph layer” of some kind. However it doesn’t make them a graph database. It’s like adding a rear spoiler to a van, sure it may look cool… or ridiculous, but it won’t make it a race car. Don’t fall for it. If you need fast graph queries, use a real graph database. But today we won’t talk about that. Instead we’re going to talk about the real property graph…

…as in Real Estate. I was digging around the other day and found that the Marin County, Florida Property Appraiser hosts their data online. We’re going to import that data and see if anything interesting pops up. There are a couple of ways we could have modeled it, but we’re going with this simple model:

For our schema we’ll make sure Location ids are unique, Owner names are unique and we’ll create a multi-property index on Address.

 
CREATE CONSTRAINT ON (n:Location) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:Owner) ASSERT n.name IS UNIQUE;
CREATE INDEX ON :Address(addr, zip);

We’re going to use LOAD CSV for the import, and in this case the “csv” file is really using “tabs” to separate the fields, so we’ll add a “\t” field terminator clause. Notice below I am using periodic commit. You almost always want to use periodic commit when importing data, so don’t forget. You’ll also see that I’m using MERGE on a property that I created a constraint on above, so it will use the underlying index for fast lookups instead of slow label scans. People always forget to do that and then wonder why imports take so long. Make sure you add an index or constraint to any label-property you use in a merge.

 
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS from 'https://raw.githubusercontent.com/maxdemarzi/property_graph/master/smaller.csv' as row
FIELDTERMINATOR '\t'
MERGE (p:Owner {name: TRIM(row.Owner1)});

Also notice I do only 1 thing in this import. Some folks write 1000 line cypher scripts to import their data in one go instead of realizing you can run over the same file multiple times and make your import simpler. I tend to import all the nodes, one type at a time, then go ahead and import the relationships. In this next import script I import the addresses, and the merge statement is using the multi-property index I created earlier:

 
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS from 'https://raw.githubusercontent.com/maxdemarzi/property_graph/master/smaller.csv' as row
FIELDTERMINATOR '\t'
WITH row WHERE TRIM(row.MailAddr) <> "NO ADDRESS AVAILABLE"
MERGE (a:Address {addr: TRIM(row.MailAddr), zip:TRIM(row.MailZip)})
ON CREATE SET a.city = TRIM(row.MailCity), a.state = TRIM(row.MailStat);

I won’t paste all the import statements here, instead you can take a look at the source code for the rest. So let’s take a look at what the data looks like:

Great, let’s take a look at who owns the most property by value:

 
MATCH (owner)-[:OWNS]->(l:Location)
RETURN owner, SUM(l.total_value) AS values
ORDER BY values DESC
LIMIT 25

No surprise here, the government comes out on top:

 
╒═══════════════════════════════════════════════╤═════════╕
│"owner"                                        │"values" │
╞═══════════════════════════════════════════════╪═════════╡
│{"name":"MARTIN COUNTY"}                       │691971300│
├───────────────────────────────────────────────┼─────────┤
│{"name":"SOUTH FLORIDA WATER MANAGEMENT DISTR"}│375904720│
├───────────────────────────────────────────────┼─────────┤
│{"name":"SCHOOL BOARD OF MARTIN COUNTY"}       │261422190│
├───────────────────────────────────────────────┼─────────┤
│{"name":"MARTIN MEMORIAL MEDICAL CENTER INC"}  │171010740│
├───────────────────────────────────────────────┼─────────┤
│{"name":"FLORIDA POWER & LIGHT COMPANY"}       │150672480│
├───────────────────────────────────────────────┼─────────┤
...

But real estate is not so simple. There are many properties that are individually “owned” by an LLC:

 
MATCH (n:Owner) 
WHERE n.name ENDS WITH "LLC"
RETURN n LIMIT 5

 
╒═══════════════════════════════╕
│"n"                            │
╞═══════════════════════════════╡
│{"name":"1 95 GROUP LLC"}      │
├───────────────────────────────┤
│{"name":"1 KNOWLES ROAD LLC"}  │
├───────────────────────────────┤
│{"name":"100 COLORADO AVE LLC"}│
├───────────────────────────────┤
│{"name":"100 MLK LLC"}         │
├───────────────────────────────┤
│{"name":"100% MANAGEMENT LLC"} │
└───────────────────────────────┘

…and many properties individually owned by a Trust:

 
MATCH (n:Owner) 
WHERE n.name CONTAINS "TRUST"
RETURN n LIMIT 5

 
╒══════════════════════════════════════════════╕
│"n"                                           │
╞══════════════════════════════════════════════╡
│{"name":"(VINCENT FORST) TRUST"}              │
├──────────────────────────────────────────────┤
│{"name":"1116 LAND TRUST"}                    │
├──────────────────────────────────────────────┤
│{"name":"114 SSP LAND TRUST"}                 │
├──────────────────────────────────────────────┤
│{"name":"12 PINECLIFF DRIVE TRUST"}           │
├──────────────────────────────────────────────┤
│{"name":"160 SE ALAMANDA WAY REVOCABLE TRUST"}│
└──────────────────────────────────────────────┘

We also have properties that have multiple owners, so we could be double counting. So one thing we could do is to use the mailing address of the Owner instead of the Owner itself. We can use pattern comprehension to collect the owner names:

 
MATCH (address)<-[:MAILED_AT]-(l:Location)
RETURN address.addr, SUM(l.total_value) AS values, [(l)<-[:OWNS]-(owner) | owner.name] AS owners
ORDER BY values DESC
LIMIT 25

 
╒════════════════════════════════════╤═════════╤═══════════════════════════════════════════════════════════╕
│"address.addr"                      │"values" │"owners"                                                   │
╞════════════════════════════════════╪═════════╪═══════════════════════════════════════════════════════════╡
│"2401 SE MONTEREY RD"               │496738910│["MARTIN COUNTY"]                                          │
├────────────────────────────────────┼─────────┼───────────────────────────────────────────────────────────┤
│"500 E OCEAN BLVD"                  │261422190│["SCHOOL BOARD OF MARTIN COUNTY"]                          │
├────────────────────────────────────┼─────────┼───────────────────────────────────────────────────────────┤
│"1950 RICHMOND RD TR 103"           │171010740│["MARTIN MEMORIAL MEDICAL CENTER INC"]                     │
├────────────────────────────────────┼─────────┼───────────────────────────────────────────────────────────┤
│"700 UNIVERSE BLVD"                 │150672480│["FLORIDA POWER & LIGHT COMPANY"]                          │
├────────────────────────────────────┼─────────┼───────────────────────────────────────────────────────────┤
│"PO BOX 24680"                      │132383300│["SOUTH FLORIDA WATER MANAGEMENT DISTR"]                   │
├────────────────────────────────────┼─────────┼───────────────────────────────────────────────────────────┤
...

What may be more interesting is to connect all the properties, locations and owners that are linked together. This is called “connected components”. Luckily the Neo4j Graph Algorithms plugin provides this functionality via the unionFind algorithm.

We’ll use cypher to create a “graph projection“. The first string is for our node ids. We don’t care what type of node it is, so we get all the nodes with MATCH (n). The second is for our relationships. We don’t really care here whether it’s OWNS or MAILED_AT, so we get them all. All the nodes connected together will have the same “setId”, so we can count the nodeIds and look at the 5 biggest clusters:

 
    CALL algo.unionFind.stream(
	  'MATCH (n) RETURN id(n) as id',
	  'MATCH (n)-->(n2)
	   RETURN id(n) as source, id(n2) as target',
	  {graph:'cypher'}
	) YIELD nodeId, setId
	RETURN  setId, COUNT(nodeId) AS size
    ORDER BY size DESC
    LIMIT 5

 
╒═══════╤══════╕
│"setId"│"size"│
╞═══════╪══════╡
│0      │6060  │
├───────┼──────┤
│20856  │2346  │
├───────┼──────┤
│342    │543   │
├───────┼──────┤
│56513  │406   │
├───────┼──────┤
│3209   │365   │
└───────┴──────┘

These are starting to look interesting. Instead of streaming the result, let’s go ahead and add a property “partition” to our components by removing the .stream from the algorithm:

 
	CALL algo.unionFind(
	  'MATCH (n) RETURN id(n) as id',
	  'MATCH (n)-->(n2)
	   RETURN id(n) as source, id(n2) as target',
	  {graph:'cypher'}
	) YIELD setCount;

Now we can take a look at individual partitions if we want. For example, what is the make up for partition 0?

 
MATCH (n) WHERE n.partition = 0
RETURN labels(n)[0] AS type, COUNT(*)

We have tons of locations, 24 owners and 18 addresses:

 
╒══════════╤══════════╕
│"type"    │"COUNT(*)"│
╞══════════╪══════════╡
│"Location"│6018      │
├──────────┼──────────┤
│"Owner"   │24        │
├──────────┼──────────┤
│"Address" │18        │
└──────────┴──────────┘

Let’s find the partitions with the most Owners:

 
MATCH (n:Owner)
RETURN n.partition, COUNT(*) AS members, COLLECT(DISTINCT n.name)[0..3] AS names
ORDER BY members DESC
LIMIT 5

 
╒═════════════╤═════════╤══════════════════════════════════════════════════════════════════════╕
│"n.partition"│"members"│"names"                                                               │
╞═════════════╪═════════╪══════════════════════════════════════════════════════════════════════╡
│342          │160      │["FAIRWINDS COVE RECREATION ASSOCIATIO","FAIRWINDS COVE III CONDOMINIU│
│             │         │M ASSOC","JOES POINT HOMEOWNERS ASSOCIATION IN"]                      │
├─────────────┼─────────┼──────────────────────────────────────────────────────────────────────┤
│6528         │34       │["16330B55 TRUST","941 TARPON AVENUE TRUST","16330B52 TRUST"]         │
├─────────────┼─────────┼──────────────────────────────────────────────────────────────────────┤
│1455         │25       │["CINNAMON TREE PROPERTY OWNERS ASSOCI","PINECREST LAKES HOMEOWNERS AS│
│             │         │N","LANDINGS AT SEWALLS POINT PROPERTY O"]                            │
├─────────────┼─────────┼──────────────────────────────────────────────────────────────────────┤
│4741         │25       │["RIO SOUTH DIXIE LLC","RIO NORTH DIXIE LLC","RIO WEST DIXIE LLC"]    │
├─────────────┼─────────┼──────────────────────────────────────────────────────────────────────┤
│0            │24       │["MARTIN COUNTY","SUNRISE RESTAURANTS LLC","MARTIN COUNTY COMMUNITY RE│
│             │         │DEVELOPMEN"]                                                          │
└─────────────┴─────────┴──────────────────────────────────────────────────────────────────────┘

That partition 6528 looks interesting, it’s all trusts. Let’s see what mailing addresses they have:

 
MATCH (n:Address) 
WHERE n.partition = 6528
RETURN n

Neat, they all go to the same place.

 
╒══════════════════════════════════════════════════════════════════════╕
│"n"                                                                   │
╞══════════════════════════════════════════════════════════════════════╡
│{"zip":"33403","partition":6528,"state":"FL","addr":"1015 10TH ST","ci│
│ty":"LAKE PARK"}                                                      │
└──────────────────────────────────────────────────────────────────────┘

What does that look like:

 
MATCH (a:Address)<-[:MAILED_AT]-(location)<-[:OWNS]-(owner)
WHERE a.partition = 6528
RETURN a, location, owner

Alright so which partitions own the most real estate by value, and who are their owners:

 
MATCH (l:Location)
WITH  l.partition AS partition, SUM(l.total_value) AS values
ORDER BY values DESC
LIMIT 25
MATCH (o:Owner) WHERE o.partition =  partition
RETURN values, COLLECT(o.name)[0..3] AS owners

…and there we go:

 
╒══════════╤══════════════════════════════════════════════════════════════════════╕
│"values"  │"owners"                                                              │
╞══════════╪══════════════════════════════════════════════════════════════════════╡
│1001127322│["MARTIN COUNTY","SUNRISE RESTAURANTS LLC","MARTIN COUNTY COMMUNITY RE│
│          │DEVELOPMEN"]                                                          │
├──────────┼──────────────────────────────────────────────────────────────────────┤
│299328250 │["STATE OF FLORIDA TIITF","FLORIDA, STATE OF/TIITF REC & PARKS","FLORI│
│          │DA, STATE OF/TIITF"]                                                  │
├──────────┼──────────────────────────────────────────────────────────────────────┤
│261422190 │["SCHOOL BOARD OF MARTIN COUNTY"]                                     │
├──────────┼──────────────────────────────────────────────────────────────────────┤
│191517358 │["MARTIN MEMORIAL HEALTH SYSTEMS","MARTIN MEMORIAL MEDICAL CENTER INC"│
│          │,"COASTAL CARE CORPORATION"]                                          │
├──────────┼──────────────────────────────────────────────────────────────────────┤
...

Looking at “values” of subgraphs in Neo4j is pretty neat.
Can you think of subgraphs in your own datasets? What kind of things will emerge when we combine them together?
Only one way to find out, so grab a copy of Neo4j and get started.

Max De Marzi

Graphs, Graphs, and nothing but the Graphs