Hey Database! Go Fine Tune Yourself

Everyone expects AI prices to go down in the long term. But in the short term, we have three things going on. Token prices keep dropping, hurray for that. Subscription fees are going up and dumping their all you can eat plans for volume based pricing. There more you use, the more you pay. I guess that’s fair. Third, hardware component pricing is going up and big companies are borrowing billions to build the greatest and latest AI data centers. What’s going on? Are we in the pets.com era of selling $40 dollars worth of dog food for $20 bucks and making it up in volume? The real question is, how do we close this giant chasm of a value gap?

Molham Aref argues that enterprises must make agents smarter and cheaper. We have to solve two problems at the same time: making agents smart enough to handle real business decisions, and ensuring they are cost-effective enough to scale enterprise-wide. It sounds simple enough on the surface, but… it’s not. I’m going to talk about one of the ways we are doing that. But before I start, about six months ago, Greg Diamos and Naila Farooqui at RelationalAI wrote a blog post “Introducing Superalignment for Relational Databases“. If you haven’t read it yet, please take the time to do it now or you may be a little lost on what follows. There is a line in there people sometimes overlook, even thought it’s literally highlighted in bold:

The training dataset is the database itself.

The high level version is that RelationalAI takes a look at your database, your schema, your data, your business rules, your queries and fine tunes an LLM with your specific attributes. Then when you need to ask your database hard questions, you route them to your smaller smarter model which writes the queries needed to answer them. Because you’ve post-trained the LLM on your specific data, you can stop wasting tokens telling it what the schema is or how it’s connected, or what this field or that field mean, and so on. It saves you a ton of money and it’s more likely to give you good queries that answer your questions correctly. Now here comes the interesting part. One of the things we are doing to fine tune LLMs is generating bad queries.

If the whole point of Text to SQL is to generate GOOD queries that answer the question, why do we need BAD ones? Because there is this fine tuning concept called Direct Preference Optimization. I already asked you to read two blog posts, so I’m not going to ask you to read an academic paper, instead just think of Morpheus from the movie “The Matrix”.

In the movie, Morpheus tells Neo, “You take the blue pill, the story ends… You take the red pill, you stay in Wonderland.” In DPO, it enforces this same binary constraint. DPO completely skips the step of calculating a floating-point “score” for the text. It has to choose between the two options. The model’s active weights represent Neo standing in front of the choice. Neo takes the red pill. By penalizing the blue pill, your network learned to drop its log-probability, making it harder for the model to choose that option in production.

Question: Which users are we able to sell alcohol to in the United States? The correct answer is the red pill, the almost correct answer is the blue pill.

Alright, now you know why we need bad queries. Not invalid queries mind you, the LLMs are way past the stage where you ask them to answer a database question and they give you the recipe for Portuguese Octopus or an invalid SQL query. In fact, we are at the stage where we can ask the LLMs to tell us exactly how they mess up SQL queries and use that information to make our bad queries ourselves.

Obviously I can’t show you exactly what we do at RelationalAI, but I built a Temu version of it using VillageSQL again. Introducing the vsql-corruptor extension. If you missed my previous blog post, VillageSQL is a MySQL change tracking fork that adds the ability to make extensions to the database quite easily.

The way it works is you pass in a SQL query and optionally you can specify which SQL corruption type you want to apply or leave it blank and it will randomly pick one, and you can either pass in a database schema in text or pass in the name of an existing database name in your VillageSQL server and it will get the schema from that database automatically.

SELECT vsql_corrupt(
  'SELECT first_name FROM customers WHERE first_name = \'Alice\'',
  'TYPE_INCOMPATIBILITY',
  'customers:first_name VARCHAR,id INT'
);
-- Returns: SELECT first_name FROM customers WHERE (first_name = 1234)

That’s pretty neat. Or you can create a database with some tables:

CREATE DATABASE test_corrupt_db;
USE test_corrupt_db;

CREATE TABLE customers (
id INT PRIMARY KEY,
first_name VARCHAR(50),
last_name VARCHAR(50)
);
CREATE TABLE orders (
id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
amount DECIMAL(10,2)
);

Then use the name of the database in your query and get the same result. This comes in handy when you have a ton of tables and columns and don’t want to be bothered with writing out the schema:

SELECT vsql_corrupt(
'SELECT first_name FROM customers WHERE first_name = \'Alice\'', 
'TYPE_INCOMPATIBILITY',
 'test_corrupt_db');

It adds just two functions to your database. One is the vsql_corrupt function which triggers it and the second is vsql_schema_cache_ready. We need this because we are going to cache the database schemas in memory on the database instead of querying it every time you ask for a SQL query to be corrupted. You can see these in vsql_corruptor.cc. Now we are going to be smarter than trying to edit the text of a query via some regex magic. Instead we are going to parse and convert that SQL query to an AST using the Hyrise SQL Parser. Then in our Corruption Engine, we will make modifications to the query depending on each type of corruption. Here is an example that screws up and puts in the wrong aggregation.

if (type == CorruptionType::WRONG_AGGREGATION) {
        if (expr->type == hsql::kExprFunctionRef) {
            if (schema::caseInsensitiveCompare(expr->name, "MAX")) {
                free(expr->name);
                expr->name = strdup("MIN");
            } else if (schema::caseInsensitiveCompare(expr->name, "MIN")) {
                free(expr->name);
                expr->name = strdup("MAX");
            } else if (schema::caseInsensitiveCompare(expr->name, "SUM")) {
                free(expr->name);
                expr->name = strdup("AVG");
            } else if (schema::caseInsensitiveCompare(expr->name, "AVG")) {
                free(expr->name);
                expr->name = strdup("SUM");
            }
        }

There are 37 ways LLMs have been known to corrupt SQL queries accounted for in this extension. In the real world, there are countless ways to combine these 37 and more to make a great big mess of a SQL query. For Java folks out there, you can build the same thing using SQLancer and Calcite. That’s what I originally did before asking Gemini to vibe code this into C++ and stick it on a VillageSQL extension. Hey that’s the second VillageSQL post, are you shilling for them now? No man, I’m doing it for the love of the game. Now on to an example:

{"prompt": "List all customers and their order IDs, including customers who haven't placed any orders yet.",
 "chosen": "SELECT c.customer_name, o.order_id FROM customers c LEFT JOIN orders o ON c.customer_id = o.customer_id;", 
 "rejected": "SELECT c.customer_name, o.order_id FROM customers c INNER JOIN orders o ON c.customer_id = o.customer_id;"}
{"prompt": "Find the total sales amount for the 'Electronics' department. Note: Department names in the DB are capitalized.", 
  "chosen": "SELECT SUM(amount) FROM sales WHERE department = 'Electronics';", 
  "rejected": "SELECT SUM(amount) FROM sales WHERE department = 'electronics';"}
{"prompt": "Show the names of the 3 oldest employees.", 
  "chosen": "SELECT employee_name FROM employees ORDER BY hire_date ASC LIMIT 3;", 
  "rejected": "SELECT employee_name FROM employees ORDER BY hire_date DESC LIMIT 3;"}
{"prompt": "Get a list of unique cities where we have active stores.", 
  "chosen": "SELECT DISTINCT city FROM stores WHERE status = 'active';", 
  "rejected": "SELECT city FROM stores WHERE status = 'active' GROUP BY city, store_id;"}

What does the model learn from these question, good query, bad query triplet?

Join Selection (Pair 1): The rejected query uses an INNER JOIN, which executes perfectly but incorrectly filters out customers without orders. The chosen query correctly uses a LEFT JOIN to satisfy the prompt’s condition.
Case Sensitivity (Pair 2): Both queries run, but the rejected one returns a NULL or 0 result because it uses a lowercase string on a case-sensitive data.
Logic Reversal (Pair 3): Sorting by DESC instead of ASC runs flawlessly but returns the 3 newest employees instead of the oldest.
Granularity & Duplication (Pair 4): The rejected query groups by store_id, which means it will output duplicate city names if a city has multiple stores. The chosen query correctly uses DISTINCT to return a true unique list of cities.

Anyway, once you have your user questions, good queries and bad queries, you can use TRL which supports DPO to train your model or whatever library you prefer. Give it a try for your personal projects. If you want to do this for your Enterprise data, contact RelationalAI for a far more sophisticated approach.

Max De Marzi

Graphs, Graphs, and nothing but the Graphs