Did you know that Chicago is the most important railroad center in North America? Chicago has more lines of track radiating in more directions than from any other city. The windy city has long been the most important interchange point for freight traffic between the nation’s major railroads and it is the hub of Amtrak, the intercity rail passenger system. You may not realize it, but railroad tracks and graph theory have a history together. Back in the mid 1950s the US Military had an interest in finding out how much capacity the Soviet railway network had to move cargo from the Western Soviet Union to Eastern Europe. This lead to the Maximum Flow problem and the Ford–Fulkerson algorithm to solve it.
Today we’re going to keep things simple and just calculate the best rail road paths between two junctions. We’re going to import two files into Neo4j. One that has the junctions, and another that has the tracks between them.
You can download the files at the links above or grab them yourself from here and here.
The junction file looks like:
X,Y,OBJECTID,FRANODEID,COUNTRY,STATE,STFIPS,CTFIPS,STCYFIPS,FRAREGION,BNDRY -122.766750577806476,45.593594410320861,7001,303584,U,OR,41,051,41051,8,0 -122.766664795802114,45.641088785037418,7002,303585,U,OR,41,051,41051,8,0 -122.766608304775175,45.593653799868534,7003,303586,U,OR,41,051,41051,8,0
Once we copy it into our neo4j “import” directory, we can import it into Neo4j with this command:
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM "file:///rail_junctions.csv" AS line WITH line CREATE (j:Junction {object_id:line.OBJECTID, fra_node_id:line.FRANODEID, latitude:toFloat(line.X), longitude:toFloat(line.Y)}) RETURN COUNT(*);
We are ignoring the locations, region and other properties since we won’t need them for this exercise. It should take a few seconds, and once it finishes, we will add our constraint to make sure they are all unique and help us with loading the relationships:
CREATE CONSTRAINT ON (j:Junction) ASSERT j.fra_node_id IS UNIQUE;
The tracks file looks like:
OBJECTID,Shape_Leng,FRAARCID,FRFRANODE,TOFRANODE,STFIPS,CTFIPS,STCYFIPS,STATE,COUNTRY,FRAREGION,RROWNER1,RROWNER2,RROWNER3,TRKRGHTS1,TRKRGHTS2,TRKRGHTS3,TRKRGHTS4,TRKRGHTS5,TRKRGHTS6,TRKRGHTS7,TRKRGHTS8,TRKRGHTS9,SUBDIV,YARD,PASSNGR,STRACNET,TRACKS,NET,MILES,OLD_FRAID,SHAPE_Le_1,Shape__Length 3001,389.277457965,303000,335101,335109,30,015,30015,MT,U,8,BNSF, , , , , , , , , , , , , , , ,0,O,0.241885314395,224225,0.00444164081679,0.004441640345721 3002,199.149285396,303001,352910,352899,38,019,38019,ND,U,8,BNSF, , , , , , , , , , , ,HANNAH, , , ,1,M,0.123745381408,237398,0.00249181134818,0.002491810846554 3003,827.101664766,303002,355354,355353,38,099,38099,ND,U,8,BNSF, , , , , , , , , , , ,HANNAH, , , ,1,M,0.513936119663,232073,0.00753147341863,0.007531473219527
Once we copy it into our neo4j “import” directory, we can import it into Neo4j with this command:
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM "file:///rail_roads.csv" AS line WITH line MATCH (j1:Junction {fra_node_id:line.FRFRANODE}), (j2:Junction {fra_node_id:line.TOFRANODE}) MERGE (j1)-[l:LINKS { object_id:line.OBJECTID, fra_aarc_id:line.FRAARCID, owner:line.RROWNER1, trackage:line.TRKRGHTS1 , tracks: toInteger(line.TRACKS), network:line.NET, miles:toFloat(line.MILES) }]->(j2) RETURN COUNT(*);
It should take 30 seconds to a minute depending on your hardware, but once it finishes we will have all the junction and track data in the united states at our disposal. For example, let’s say we wanted to find the shortest path between a junction in Chicago and one in Milwaukee:
// From Chicago to Milwaukee Shortest Number of Relationships MATCH (chicago:Junction {fra_node_id: "414657"}), (milwaukee:Junction {fra_node_id: "412167"}), p = shortestPath((chicago)-[:LINKS*]-(milwaukee)) RETURN p
This is nice, but we want to avoid any out of service tracks, so let’s add a predicate to our shortest path eliminating any tracks with an “X” value for their network property:
// From Chicago to Milwaukee Shortest Number of Relationships on valid parts of the network MATCH (chicago:Junction {fra_node_id: "414657"}), (milwaukee:Junction {fra_node_id: "412167"}), p = shortestPath((chicago)-[:LINKS*]-(milwaukee)) WHERE NONE( x IN relationships(p) WHERE x.network = "X" ) RETURN p
Neat! But we don’t know how many miles it took to get the shortest path, let’s add that to our query:
// From Chicago to Milwaukee Shortest Number of Relationships on valid parts of the network with mileage MATCH (chicago:Junction {fra_node_id: "414657"}), (milwaukee:Junction {fra_node_id: "412167"}), p = shortestPath((chicago)-[:LINKS*]-(milwaukee)) WHERE NONE( x IN relationships(p) WHERE x.network = "X" ) RETURN p, reduce(totalMiles = 0, n IN relationships(p) | totalMiles + n.miles) AS miles
That’s not that far, lets try to go from Chicago to San Francisco instead:
// From Chicago to San Francisco Shortest Number of Relationships on valid parts of the network with mileage MATCH (chicago:Junction {fra_node_id: "414657"}), (san_francisco:Junction {fra_node_id: "306128"}), p = shortestPath((chicago)-[:LINKS*]-(san_francisco)) WHERE NONE( x IN relationships(p) WHERE x.network = "X" ) RETURN p, reduce(totalMiles = 0, n IN relationships(p) | totalMiles + n.miles) AS miles
That’s close to 2500 miles of rail to go from those two cities. But hold on a second. You may have noticed it is returning the shortestPath in terms of the number of relationships it had to traverse, not the shortest path in terms of miles. Here is where you may be tempted to calculate all the paths and then return the best one(s) ordered by miles, but that would be a terrible idea. There are so many potential paths, we may reach the heat death of the universe before the answer comes back.
What we want to do instead is use Dijkstra’s shortest weighted path algorithm to get the answer. Luckily that’s already built in for us in Neo4j, we just have to make use of it. To try to make it smart we’re going to try a few things. For one, we don’t want to stray too far off target so we will limit the Junctions we visit to be at most 20% further away than the distance of the source junction to the destination junction.
@Procedure(name = "com.maxdemarzi.routes", mode = Mode.READ) @Description("CALL com.maxdemarzi.routes(Node from, Node to, Number limit)") public Stream<WeightedPathResult> routes(@Name("from") Node from, @Name("to") Node to, @Name("limit") Number limit) { // Prevent going to any Junction more than 120% distance away from source to destination double maxCost = getCost(from, to) * 1.2; ValidPathExpander validPaths = new ValidPathExpander(maxCost, latitudes.get(to), longitudes.get(to)); PathFinder<WeightedPath> dijkstra = GraphAlgoFactory.dijkstra(validPaths, CommonEvaluators.doubleCostEvaluator(MILES), limit.intValue() ); ArrayList<WeightedPathResult> results = new ArrayList<>(); for(WeightedPath path : dijkstra.findAllPaths(from, to)) { results.add(new WeightedPathResult(path, path.weight())); } return results.stream(); }
In our Expander, we will make sure we don’t go over any tracks that are out of service just like before.
@Override public Iterable<Relationship> expand(Path path, BranchState branchState) { List<Relationship> rels = new ArrayList<>(); // Make sure we don't get on a track that is out of service: for (Relationship r : path.endNode().getRelationships(RelationshipTypes.LINKS)) { if (!(r.getProperty(NETWORK)).equals(OUT_OF_SERVICE)) {
…and we will make sure we don’t go above the maximum cost we calculated before.
if (getCost(latitudes.get(from), longitudes.get(from), latitude, longitude) <= maxCost) { rels.add(r); valid.add(from.getId()); } else { invalid.add(from.getId()); }
After we compile this stored procedure and add it to our Neo4j instance, we can try it:
MATCH (chicago:Junction {fra_node_id: "414657"}), (milwaukee:Junction {fra_node_id: "412167"}) CALL com.maxdemarzi.routes(chicago, milwaukee, 10) YIELD path, weight RETURN path, weight
That’s a bit hard to digest, how about we just show the length of the paths?
MATCH (chicago:Junction {fra_node_id: "414657"}), (milwaukee:Junction {fra_node_id: "412167"}) CALL com.maxdemarzi.routes(chicago, milwaukee, 10) YIELD path, weight RETURN length(path), weight
Awesome, it figured it out in under 100 milliseconds. Now how about Chicago to San Francisco again?
MATCH (chicago:Junction {fra_node_id: "414657"}), (san_francisco:Junction {fra_node_id: "306128"}) CALL com.maxdemarzi.routes(chicago, san_francisco, 10) YIELD path, weight RETURN length(path), weight
That took about 20 seconds, what if we just want the shortest path?
MATCH (chicago:Junction {fra_node_id: "414657"}), (san_francisco:Junction {fra_node_id: "306128"}) CALL com.maxdemarzi.routes(chicago, san_francisco, 1) YIELD path, weight RETURN length(path), weight
Just under 3 seconds. Pretty neat right? The source code as always is available on github.
Hi Max, thanks for this post. I was wondering how you are matching fra_node_id to actual junction name?
Dima
I haven’t found out how to match fra_node_id and junction name directly. The field STCYFIPS seems to make sense in the context of https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt though.