CFG stands for context-free grammar: the model's decoding is constrained so it can only emit SQL this schema allows. Every prompt runs twice — with the grammar (CFG) and without (no CFG). 21 answerable prompts + 10 out-of-scope prompts, each in both modes, = 62 trials. New to the vocabulary? The About page defines every term and explains how the grading works.
0/62 trials
Nothing has run yet — hit Run all (or any row) and this fills in as results land.
Where constraints matter most
Constrained correctness by tier
Each trial is graded on three axes: whether the query executes, whether its result matches a live ClickHouse run of the reference SQL, and whether it stays inside the schema whitelist. Eval 5 has no section of its own: it is the CFG / no CFG comparison in every row here and in the scoreboard.
The schema has no weather, drivers, vehicles, lat/long, or PII. The constrained path must decline via cannot_answer; the unconstrained baseline shows what fabrication looks like.
Does the query return the right answer?
21 prompts across easy / medium / hard tiers, including an adversarial slice engineered to tempt schema drift. The reference query executes live for each trial, and the model's result set is diffed against the reference result set, not against the SQL text. Any query that returns the same result passes, however it is written.
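The result-set diff described above can be sketched roughly as follows. This is a minimal illustration, not the grader's actual code: the function names are hypothetical, and the choices to ignore row order by default and to round floats are assumptions the page does not confirm.

```python
from collections import Counter


def normalize_row(row, ndigits=6):
    """Round floats so tiny numeric noise does not fail the diff (assumed tolerance)."""
    return tuple(round(v, ndigits) if isinstance(v, float) else v for v in row)


def results_match(reference_rows, model_rows, ordered=False):
    """Compare two result sets by content, never by SQL text.

    ordered=False treats the rows as a multiset, so a query without
    ORDER BY can still pass; ordered=True also requires the same row order.
    """
    ref = [normalize_row(r) for r in reference_rows]
    got = [normalize_row(r) for r in model_rows]
    if ordered:
        return ref == got
    return Counter(ref) == Counter(got)
```

Under these assumptions, two differently written queries pass as long as they return the same rows, while a query that returns one extra or one missing row fails.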
Does every generated query run without errors?
Every constrained output must execute without error. An execution failure is a grammar failure — the CFG accepted a token sequence ClickHouse rejected.
Does the SQL only use columns and functions that actually exist?
Every identifier is validated against the live schema whitelist. Hallucinated columns and functions are structurally impossible under the grammar.
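As a rough picture of what a whitelist check involves, the sketch below scans a query for identifiers and reports any that are not allowed. Everything in it is an assumption made for illustration: the column, table, and function names are hypothetical stand-ins for the real 13-column schema, and a production check would more likely walk a parsed SQL tree than use regular expressions. This simple version would also flag column aliases.

```python
import re

# Hypothetical whitelist; the real schema's columns are not listed on this page.
ALLOWED_COLUMNS = {"pickup_datetime", "fare_amount", "trip_distance"}
ALLOWED_TABLES = {"trips"}
ALLOWED_FUNCTIONS = {"count", "sum", "avg"}
KEYWORDS = {
    "select", "from", "where", "group", "by", "order", "limit",
    "and", "or", "not", "as", "asc", "desc", "having",
}

IDENT = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")
STRING = re.compile(r"'(?:[^'\\]|\\.)*'")


def unknown_identifiers(sql):
    """Return the identifiers in sql that are outside the whitelist."""
    stripped = STRING.sub("''", sql)  # ignore words inside string literals
    unknown = set()
    for match in IDENT.finditer(stripped):
        name = match.group(0)
        if name.lower() in KEYWORDS:
            continue
        # An identifier directly followed by "(" is treated as a function call.
        is_call = stripped[match.end():].lstrip().startswith("(")
        if is_call:
            ok = name.lower() in ALLOWED_FUNCTIONS
        else:
            ok = name in ALLOWED_COLUMNS or name in ALLOWED_TABLES
        if not ok:
            unknown.add(name)
    return unknown
```

A query passes this axis when the returned set is empty. A phantom column such as mta_tax would be reported as long as it is missing from the whitelist.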
Does the model decline questions the data can't answer?
Out-of-scope prompts must be declined via cannot_answer. Half rely on phantom columns: real NYC-TLC fields (mta_tax, VendorID) that are absent from this 13-column subset, so the prompts look answerable. The baseline doesn't decline: it answers anyway with a degenerate query (SELECT 0, WHERE 1=0) that renders as a confident wrong number.
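One way to picture the refusal check is a small classifier over the model's raw output. The sketch below is hypothetical: it assumes the decline arrives as the literal text cannot_answer (the page does not say how it is encoded), and the two degenerate patterns are just the examples named above, not an exhaustive list.

```python
import re

# Patterns for the degenerate queries mentioned above (illustrative only).
DEGENERATE_PATTERNS = [
    re.compile(r"^\s*select\s+0\s*;?\s*$", re.IGNORECASE),
    re.compile(r"\bwhere\s+1\s*=\s*0\b", re.IGNORECASE),
]


def classify_out_of_scope(output):
    """Label a response to an out-of-scope prompt.

    Returns "declined", "degenerate", or "answered"; only "declined"
    would count as a pass on this axis.
    """
    text = output.strip()
    if text.lower() == "cannot_answer":  # assumed sentinel form
        return "declined"
    if any(p.search(text) for p in DEGENERATE_PATTERNS):
        return "degenerate"
    return "answered"
```

Keeping "degenerate" apart from "answered" makes it possible to report how often the baseline produces a placeholder query rather than a query over invented columns.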
Same prompts, with and without the grammar — the two result columns in every row.
This comparison is not a separate section: every prompt runs twice, and the results appear as the CFG / no CFG columns in each row and as the paired scoreboard bars. On clean prompts a strong base model already writes correct SQL, so both modes hit 100%. The two modes diverge on the adversarial and phantom-column slices, which test schema grounding and refusal, the failure modes the CFG forecloses by construction.