Declarative Query Languages are the Iraq War of Computer Science

It’s Memorial Day weekend in the United States. Some people are staying home, others are observing the holiday quietly and others still are using it as an excuse to party because they have seemed to have forgotten that the entire world is once again at war. At war with a tiny enemy, so small some people think it’s a hoax. The worst part is the enemy is in each other, our friends and neighbors. But Memorial day is not about remembering the wars, but rather remembering the fallen. To remember those who gave all. Whatever you may think of war, all are terrible, some were necessary. I never served, so that’s about all I get to say about that.

About 14 years ago Ted Neward wrote a very long blog post on “The Vietnam of Computer Science”. There is a follow up, and a short summary by Jeff Atwood as well. If you have never read them, I ask you to do so now…and with that, I believe Query Languages are the Iraq War of Computer Science.

What do I mean by that exactly? The Iraq War was based on a lie. Some of you may be too young, others may not know the gist of it. I would point you to wikipedia:

In 2004, the 9/11 Commission said there was no evidence of an operational relationship between the Saddam Hussein regime and al-Qaeda.[70] No stockpiles of WMDs or an active WMD program were ever found in Iraq.[71] Bush administration officials made numerous assertions about a purported Saddam-al-Qaeda relationship and WMDs that were based on sketchy evidence, and which intelligence officials rejected

…and for those who don’t like to read you can watch Vice or Official Secrets or any movie about the Iraq War (which are not 100% accurate but if you aren’t going to read anyway it’s fine).

With that out of the way what is the lie told by Declarative Query Languages?

In languages like SQL or Cypher (for example), developers are supposed to specify what is to be done instead of how to do it.

It doesn’t seem like there is a lie in that statement, but let’s dig in a little bit by going to another blog post from the past, this time from Joe Celko. In his blog post, Joe presents a real world database problem, a table structure, some sample data, and asks members of a SQL Server forum to try to write a query to answer the question while presenting his own solution.

Celko’s SQL Stumper: The Data Warehouse Problem

The problem is to report just those customers who decreased their purchase amounts on their most recent order placed with us. We are trying to get an idea when people are saturated with whatever we are selling.

This was his solution:

SELECT H1.customer_id, ' dropped purchase amount on ',
  FROM DailySalesTotals AS H1
 WHERE H1.order_amt 
       < (SELECT H2.order_amt
            FROM DailySalesTotals AS H2
           WHERE H1.customer_id = H2.customer_id 
             AND H2.order_date 
                 = (SELECT MAX(order_date) 
                      FROM DailySalesTotals AS H3
                     WHERE H1.customer_id = H3.customer_id 
                      AND H1.order_date > H3.order_date
  AND H1.order_date = (SELECT MAX( order_date)
                          FROM DailySalesTotals h4
                         WHERE h4.customer_id = H1.customer_id
 GROUP BY customer_id;

The forum entry that contained the queries and discussion has been lost to time, but there is this comment on that blog post that documents the leader board:

This is the current state of play, timing the execution time with 10,000 rows. Of the speedy routines, we are now getting some routines that are consistently faster than Celko’s. (46-50 ms). We have a bunch who are at or around 60 ms, and then a small band between 100 and 300 ms.

Joe’s promised to take a look at them all at the weekend, Obviously, speed isn’t everything in qualities we are looking for in this competition, so we’ll both be having a careful look, rank all the entries on various qualities and then, hopefully, we can announce a winner on Monday.

  1. 46ms Peso’s GianLuca-copy
  2. 46ms Gianluca Again
  3. 46ms mByther 2
  4. 50ms Gianluca Sartori
  5. 60ms KevRiley
  6. 60ms Paul Ireland
  7. 60ms Kevriley 4
  8. 60ms Celko
  9. 63ms Peso 3
  10. 63ms KevRiley 5
  11. 63ms giancula 3
  12. 106ms Eric Pratley
  13. 110ms Quan 2
  14. 173ms girish Bhat
  15. 280ms RobertFolkerts
  16. 283ms Wm Brewer
  17. 296ms Herman 2
  18. 313ms AndyM
  19. 330ms Peso
  20. 330ms ivan.yong
  21. 330ms Peso 4
  22. 340ms andriy.zabavskyy
  23. 343ms maxdemarzi
  24. 360ms Mark Marinovic
  25. 360ms mByther 1
  26. 376ms John McVay
  27. 390ms Peso 2
  28. 390ms Peter Brinkhaus
  29. 390ms plamen
  30. 423ms Alex Kuznetsov
  31. 436ms Gustavo 2
  32. 453ms back_of_the_fleet
  33. 453ms maxdemarzi
  34. 560ms v_paronov
  35. 610ms Steve Rowland
  36. 610ms eemore-571761
  37. 643ms Chris Howarth
  38. 690ms Herman van Midden
  39. 703ms jfortuny
  40. 736ms Quan_L_Nguyen
  41. 750ms Chris Howarth 2
  42. 763ms Gustavo
  43. 1030ms GSquared
  44. 10766ms Ramesh

I remember this blog post, because I entered two queries. The slow one won some award like “prettiest query” or something like that. It was 10x slower than the fastest entry. But count them. Count the number of entries… there are 44 on that list. So there must exist at least 44 different ways to express “report just those customers who decreased their purchase amounts on their most recent order”…and since there are 30 unique timings on that list then at least 30 different ways for the query planner to execute those declarations of “what to do” instead of “how to do it”. These queries range in performance from 46ms to 10 seconds. Just on 10 thousand rows of data. Can you imagine the range on 10 million rows of data? Some of those ways of declaring “what to do” do not scale. Now expand on this. Imagine you have tens of thousands of users, writing millions of queries. Most falling in the middle range in terms of performance. They “expect” the query planner to understand their intent and create the fastest possible plan for their query and data, but it usually doesn’t. It usually can’t. This is the lie we live with as database developers every day.

You may be tempted to think it’s just a Relational Database problem, but it’s not. We have this problem in Cypher too (and I would venture to guess any declarative query language of any database). There are armies of people made up of every database vendor trying to make their query planners “smarter”, take advantage of more statistics, provide valuable feedback like missing indexes, take “query hints”, etc. But to what end? Just so that developers end up telling it “how to do it” anyway?

So if declarative query languages are the problem what’s the solution?

Actual programming languages. One of the things I love about Neo4j is that I can use the Java API to declare “how to do it” instead of “what to do”. Java is 25 years old and also one of the most used programming languages in the world. While it’s not as concise as using a query language, I get the full power of a programming language to do the job. Including IDEs, debuggers, profilers, test frameworks, high performance libraries, etc. Since it’s on the JVM I could also run Clojure, Kotlin, Scala or any language on the JVM. With GraalVM I should be able to use Neo4j directly from Python, Ruby, Javascript, R, and more.

Tagged , , , , , , , , ,

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: