Generate formal grammars in GBNF format for specific SQLite databases for use with text-to-sql applications with llama.cpp
I want to do more RAG-type things with local data, but sometimes the data is widely varied and structured, and doesn't lend itself well towards pure semantic embeddings. Woudln't it be nice if our RAGs could retrieve local data intelligently? Enter, text-to-SQL models!
Text-to-SQL is great (in theory), but there are a couple of sticking points that make it still difficult:
-
Hallucinations are still a big problem. Providing the full database schema as part of the prompt can help, but this is not a perfect solution, and LLMs will still suggest invalid tables, columns, functions, and syntax from time to time. A standard practice is to run the query as-returned by the LLM, look for errors, and if there are errors, feed them back into the LLM (along with the original prompt and generated query) to get the LLM to attempt to fix its own bug. This is cool, but it would be nice if we could avoid this step as much as possible.
-
Prompt injection and database safety. Presumably you are querying your database with a user that only has read-only access, right? Even so, accidents do happen, and it would be nice to limit the queries generable by the LLM to be SELECT statements-only. But when nesting queries and chaining them together with semicolons and whatnot, it can be difficult to ensure that a little Bobby Tables isn't sneaking into our queries maliciously.
In late 2023, llama.cpp added a fantastic feature where the tokens generated by the LLM can be constrained to only choose amongst tokens that are valid as-defined by a formal grammar specification file. The system simply declines to consider any tokens that don't meet the specification file, which is a very speedy and efficient way of weeding out invalid generations -- you never generate them in the first place!
Example GBNF files are provided for things like JSON, chess moves, c programs, etc. Super cool!
But can we use this for our purposes? Yes, we can!
By generating a GBNF specification for SQL queries, we can ensure that all queries generated by the LLM are syntactically correct.
Further, by adapting this generalized SQL grammar to our specific SQLite database, we can restrict the tables and columns accessed to those that are defined in our database. This should reduce the problem of hallucinated tables and columns in our generated queries.
sqlite_select.gbnf
can be used out-of-the-box with zero modification to improve the quality of your LLM-generated SQL statements.
If you want to restrict your generated SQL to your specific database, you can run sqlite2gbnf.py, point it at your local .sqlite file, and it will auto-populate an amended version of sqlite_select.gbnf with information specific to your database schema.
This is obviously intended to work with local models that can do text-to-SQL tasks. Some candidates one might consider are:
Though as the field of interest in text-to-SQL models grows, this list should continue to grow.
./gbnf-validator yeschat.ai.gbnf tests/sql1.txt
Input string is valid according to the grammar.
./gbnf-validator yeschat.ai.gbnf tests/sql2.txt
Input string is valid according to the grammar.
./gbnf-validator yeschat.ai.gbnf tests/sql3.txt
Input string is valid according to the grammar.
./gbnf-validator yeschat.ai.gbnf tests/sql4.txt
Input string is valid according to the grammar.
./gbnf-validator yeschat.ai.gbnf tests/sql5.txt
Input string is valid according to the grammar.
./gbnf-validator yeschat.ai.gbnf tests/sql6.txt
Input string is valid according to the grammar.
./gbnf-validator yeschat.ai.gbnf tests/sql7.txt
Input string is valid according to the grammar.
./gbnf-validator yeschat.ai.gbnf tests/sql8.txt
Input string is valid according to the grammar.
./gbnf-validator yeschat.ai.gbnf tests/sql9.txt
Input string is invalid according to the grammar.
Error: Unexpected character '
' at position 28
Input string:
SELECT order_id, order_date,
CASE
WHEN DATE_PART('month', order_date) BETWEEN 1 AND 3 THEN 'Q1'
WHEN DATE_PART('month', order_date) BETWEEN 4 AND 6 THEN 'Q2'
WHEN DATE_PART('month', order_date) BETWEEN 7 AND 9 THEN 'Q3'
ELSE 'Q4'
END AS quarter
FROM orders
./gbnf-validator yeschat.ai.gbnf tests/sql10.txt
Input string is invalid according to the grammar.
Error: Unexpected character '
' at position 58
Input string:
SELECT e.employee_id, e.first_name, e.last_name, e.salary,
(SELECT AVG(salary) FROM employees WHERE department_id = e.department_id) as dept_avg_salary,
RANK() OVER (PARTITION BY e.department_id ORDER BY e.salary DESC) as salary_rank
FROM employees e