If you’ve come to any of the Neo4j Data Modeling classes I’ve taught, you must have heard me say “your model depends on both your data and your queries” about a million times. Let us dig into what this means by looking at how one might model airline flight data in Neo4j.
So what is our data… Airports and Flights between them. Let’s start our model with that:
Right away this model feels a bit off. The concept of a Flight is expressed as a relationship, but if we want to connect Customers or Staff to a flight, or say that a flight was REROUTED to another Airport due to weather or any other problem, we can’t. Given some of the queries we imagine for our data, a flight really should be an instance or an event, and thus a Node, so let’s try this model instead:
You may have seen this model before in a graphgist by Nicole White or in the Neo4j Graph Data Modeling book by Mahesh Lal.
It’s not a bad model, but we will have very dense nodes. Think about major hubs like Atlanta, Beijing, Dubai, London Heathrow, or even my local Chicago O’Hare. These would be massive nodes with no quick way to filter routes without checking multiple properties, which will slow our traversal down quite a bit.
Let us use the queries to guide our model. When a user is trying to book a flight, they know where they are starting from, where they want to go, and what day they want to fly. So let’s reimagine our model to introduce the concept of days. There are a variety of ways to model dates and times in a graph, but our queries are telling us we should find a way to limit our traversals to a small subgraph, so we will create nodes to identify the subgraph we want. Each Airport would have 365 AirportDay nodes so we can book and schedule flights up to a year in advance.
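To make the AirportDay idea concrete, here is a minimal in-memory sketch in plain Python rather than Neo4j. The names `airport_days` and `find_day` and the dictionary layout are assumptions made for illustration only. It creates one AirportDay record per day for a year and shows that finding a given day still means checking the date property on each record.

```python
from datetime import date, timedelta

def airport_days(airport_code, first_day, horizon_days=365):
    """Create one AirportDay record per day, starting at first_day."""
    return [
        {"airport": airport_code, "date": first_day + timedelta(days=i), "flights": []}
        for i in range(horizon_days)
    ]

def find_day(days, wanted):
    """Locate a day by checking the date property of each AirportDay in turn.

    Returns the matching record (or None) and how many records were checked.
    """
    checked = 0
    for airport_day in days:
        checked += 1
        if airport_day["date"] == wanted:
            return airport_day, checked
    return None, checked

ord_days = airport_days("ORD", date(2015, 9, 1))
found, checked = find_day(ord_days, date(2015, 9, 3))
```

In this sketch the flights for a day hang off a single small record, but reaching that record from the airport is still a scan over as many as 365 date values.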
We added days, but our model didn’t really improve our queries just yet. We are still checking the date property on all the AirportDay nodes. We can move the date property to the HAS_DAY relationship to save ourselves from having to traverse all the way to the AirportDay node when starting from an Airport, but there is another way:
We are using the date as an actual relationship type, so we could start from an Airport node and quickly jump to the AirportDay node by relationship type without having to check the date property of 365 relationships. This is an important concept to understand when designing your model. The less work the traversal has to do, the faster it will be. Checking a few hundred relationship properties is more expensive than traversing a single relationship type. This, however, raises the question “why start at Airport, when we can just start at AirportDay”? Indeed:
We can use a key like “ORD-1441065600” to quickly get to the Chicago O’Hare Airport for 9/1/2015 via an index and start our traversal there. The key is made up of our departing airport code and the Unix epoch time representation of the day of our flight. I think this is as good as we are going to get for finding our starting point in the traversal. Let’s now start thinking about when our traversal would finish. This is of course once we have found a flight that reaches our intended destination. However, with the current model, in order to check if we reached our destination we have to go through every flight an Airport flies that day, and that is just not good for performance. Imagine your traversal finding flight paths with one stop or two stops via major hubs; that would be quite painful. This is where the best part of working with Neo4j comes in… you have to get creative.
So let’s try something crazy. We know there may be a couple thousand flights at an airport on any day, but very few airports have more than 200 destinations, so let’s add Destination nodes to every AirportDay. We may have 100 flights from Chicago to Atlanta per day, but we should only have to check the destinations once. If we can’t find the destination we want, we can stop traversing through this AirportDay node right away and try a different route. We don’t have to check all the flights of every AirportDay node we encounter in a multi-stop traversal. Most of the time we are booking flights way in advance, but the other travelers who need to book a flight are those who missed their flight, missed their connection, or have to deal with a cancelled flight or some other change in their plans. I am tempted to make destinations an array property and just check there, but I left it as a node because this model has the added benefit that if all flights from Chicago to Atlanta are delayed or cancelled due to weather, we can edit just the Destination node and affect them all. Let the queries guide your model. So our model is now:
I like this model, but maybe we can do a little bit better. Frequent flyers tend to book flights with their favorite airline whenever possible to earn Flight Miles or Points. People also tend to take the return route on the same airline they took to get there. So let’s take the “date as relationship type” idea we saw earlier and try “airline as relationship type”:
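The key format can be sketched in a few lines of Python. The function name `airport_day_key` and the dictionary standing in for the index are hypothetical. The sketch also assumes the day is taken at midnight UTC, which is what the example key 1441065600 corresponds to for 9/1/2015.

```python
from datetime import datetime, timezone

def airport_day_key(airport_code, year, month, day):
    """Build an AirportDay key: airport code plus the epoch seconds of that day.

    Assumption: the day is represented by its midnight in UTC.
    """
    midnight_utc = datetime(year, month, day, tzinfo=timezone.utc)
    return f"{airport_code}-{int(midnight_utc.timestamp())}"

# A plain dict stands in for an index on the AirportDay key (illustration only).
airport_day_index = {
    airport_day_key("ORD", 2015, 9, 1): {"airport": "ORD", "flights": []},
}

# One index lookup gives us the starting point of the traversal.
start = airport_day_index[airport_day_key("ORD", 2015, 9, 1)]
```

With the key computed up front, there is no Airport node to pass through and no date property to compare. The traversal begins directly at the right AirportDay.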
With this new model we can start from an AirportDay, check the Destinations to see if a non-stop or direct flight is available right away. If not, we can look at routes with one hop, and quickly check the Destination nodes of those AirportDays to see if they can reach our destination at all from our hop. If the user is willing to make a two-hop flight, we can then check those the same way. We can check the flights in our preferred Airline order, or even limit our traversal to just the Airlines we choose.
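The traversal described above can be sketched in plain Python. This is not the Neo4j Traversal API, and the nested dictionary layout (AirportDay key, then destination code, then airline, then flights), the flight codes, and the `find_routes` function are all made up for illustration. It checks the destinations of each AirportDay before looking at any flight, walks airlines in the preferred order, and returns the routes with the fewest stops.

```python
def find_routes(graph, start_key, target, airlines, max_stops=1):
    """Return lists of flight codes from start_key to target, fewest stops first.

    graph maps AirportDay key -> {destination code -> {airline -> [(flight, arrival key)]}}.
    airlines lists the allowed airlines in preferred order.
    """
    frontier = [(start_key, [])]  # (AirportDay key, flight codes taken so far)
    for stops in range(max_stops + 1):
        routes, next_frontier = [], []
        for key, path in frontier:
            destinations = graph.get(key, {})
            # Destination check: one lookup, no flights scanned if it fails.
            if target in destinations:
                for airline in airlines:
                    for code, _arrival in destinations[target].get(airline, []):
                        routes.append(path + [code])
            # Only expand through other airports if another stop is allowed.
            if stops < max_stops:
                for dest, by_airline in destinations.items():
                    if dest == target:
                        continue
                    for airline in airlines:
                        for code, arrival in by_airline.get(airline, []):
                            next_frontier.append((arrival, path + [code]))
        if routes:
            return routes
        frontier = next_frontier
    return []

# Hypothetical data: ORD flies to ATL and DEN; DEN flies on to SFO; ATL to MIA.
graph = {
    "ORD-1441065600": {
        "ATL": {"DL": [("DL100", "ATL-1441065600")], "UA": [("UA200", "ATL-1441065600")]},
        "DEN": {"UA": [("UA300", "DEN-1441065600")]},
    },
    "DEN-1441065600": {"SFO": {"UA": [("UA400", "SFO-1441065600")]}},
    "ATL-1441065600": {"MIA": {"DL": [("DL500", "MIA-1441065600")]}},
}

direct = find_routes(graph, "ORD-1441065600", "ATL", ["UA", "DL"])
one_stop = find_routes(graph, "ORD-1441065600", "SFO", ["UA"])
```

Keying the flights by airline plays the role of the airline relationship type: restricting the search to one airline is a dictionary lookup, not a filter over every flight.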
I think this is a good model, but maybe you can come up with a better one. Think about it, and if you do, please let me know. If you want to see many different models for other industries and use cases, be sure to take a look at the GraphGists Collection.
In an upcoming post, I will introduce you to the Neo4j Traversal API and we’ll build a flight search query engine. Subscribe to my blog or follow me on twitter to be notified when it’s published.
Coming up with the right model and iterating through various scenarios like we did above is one of the things we can help you with in our Neo4j Technical Bootcamp, a week-long program where we come on site and help you build a Proof of Concept so you can be confident your Neo4j project will be a success. So if you are on the fence about adding a graph database to your next project, let’s chat. Make the investment. Give us a week and we’ll put you on the right path.
Enjoyed the thought process, albeit new to Neo4j; looking forward to the follow up. Thanks!
[…] Modeling Airline Flights in Neo4j by Max De Marzi […]
Nice post – really like the idea of activating the relationship types. How would this model deal with the common scenario (especially in international flights) where a flight starts on one day and lands on another? Do you build that into the query? Would some notion of trip length (time) as a property in a relationship help? that could move your pointer to the next day when appropriate – of course there’s more crunching as well…
The flights can land on any AirportDay. Next day or crossing the international date line going the other way too.
I’m finally getting into trying to build something similar and I’m realizing that Cypher doesn’t let you do this without using the APOC procedures, which complicates things a bunch. So I’m wondering how this is intrinsically better than using properties in the relationships for the variable information – you’re not really searching based on the properties, you’re just filtering against them. – thanks
excellent example. one of the very few data modeling examples in web showing options and an industry example
[…] In this model, it is easy to understand how many times an :ItemTag was scanned. Over time, a Scanner could easily become a supernode where we have millions of :SCANNED relationships attached to the node. Furthermore, if we want to find common patterns, we have to look at all :SCANNED relationships, order them by a property and then do comparisons. The amount of work increases as the amount of :SCANNED relationships increase. This is similar to the AIRPORT example in Max’s blogpost. […]
Very cool post! I just started learning about graphs, I am working on an airport/flights project and this is very helpful. It’s nice that you are citing other sources, and amazing that all the links work in 2020! Thanks!!
Look at slide 64 of https://www.slideshare.net/maxdemarzi/neo4j-training-modeling for an even better model.
Your articles are amazing and very helpful! Do you have the code for flight search with the model on slide 64? I know it should be very similar to your neoflights repository but I can’t update it so that it will work. Especially the part where HAS_DAY relationship changes to a dynamic date type relationship.