Methodology

One eval per way the system can break.

1 · Execution correctness

Does the query return the right answer?

21 prompts across easy / medium / hard tiers — including an adversarial slice engineered to tempt schema drift. The reference query executes live per trial; the model's result set is diffed against that answer — not the SQL text. Any semantically equivalent query passes.
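As a rough illustration of grading on result sets rather than SQL text, the sketch below compares two lists of rows as multisets. The function names, the float rounding, and the choice to ignore row order by default are assumptions made for this example, not details confirmed by the page.

```python
from collections import Counter


def normalize_row(row, ndigits=6):
    """Round floats so tiny numeric noise does not cause a mismatch (assumed tolerance)."""
    return tuple(round(v, ndigits) if isinstance(v, float) else v for v in row)


def result_sets_match(reference_rows, candidate_rows, ordered=False):
    """Compare result sets, not SQL text.

    With ordered=False (an assumption for this sketch) rows are compared as a
    multiset, so two queries that return the same rows in a different order match.
    """
    ref = [normalize_row(r) for r in reference_rows]
    cand = [normalize_row(r) for r in candidate_rows]
    if ordered:
        return ref == cand
    return Counter(ref) == Counter(cand)
```

Under this scheme two differently written queries pass as long as their rows agree; a prompt that asks for a specific ordering would presumably need `ordered=True`.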

2 · SQL validity

Does every generated query run without errors?

Every constrained output must execute without error. An execution failure is a grammar failure — the CFG accepted a token sequence ClickHouse rejected.

3 · Schema adherence

Does the SQL only use columns and functions that actually exist?

Every identifier is validated against the live schema whitelist. Hallucinated columns and functions are structurally impossible under the grammar.
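A whitelist check of this kind might look roughly like the sketch below. It is a simplification under stated assumptions: the column, table, and function names are hypothetical placeholders (the page does not list the 13 columns), and a regex tokenizer stands in for whatever parser the real validator uses.

```python
import re

# Hypothetical whitelist: placeholder names, not the actual 13-column schema.
ALLOWED_COLUMNS = {"fare_amount", "trip_distance", "passenger_count"}
ALLOWED_TABLES = {"trips"}
ALLOWED_FUNCTIONS = {"count", "sum", "avg", "min", "max"}
SQL_KEYWORDS = {
    "select", "from", "where", "group", "by", "order",
    "limit", "as", "and", "or", "not", "asc", "desc",
}

IDENTIFIER = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")
STRING_LITERAL = re.compile(r"'(?:[^'\\]|\\.)*'")


def off_schema_identifiers(sql):
    """Return identifiers in `sql` that are not keywords or whitelisted names."""
    # Blank out string literals so quoted text is not mistaken for identifiers.
    stripped = STRING_LITERAL.sub("''", sql)
    unknown = []
    for match in IDENTIFIER.finditer(stripped):
        name = match.group(0)
        lowered = name.lower()
        if lowered in SQL_KEYWORDS or lowered in ALLOWED_FUNCTIONS:
            continue
        if name in ALLOWED_COLUMNS or name in ALLOWED_TABLES:
            continue
        unknown.append(name)
    return unknown
```

For example, `off_schema_identifiers("SELECT sum(mta_tax) FROM trips")` flags `mta_tax`. This sketch would also flag column aliases, which a real validator would need to track separately.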

4 · Refusal

Does the model decline questions the data can't answer?

Out-of-scope prompts must be declined via cannot_answer. Half of them target phantom columns: real NYC-TLC fields (mta_tax, VendorID) that are absent from this 13-column subset, so the questions look answerable. The baseline doesn't decline: it answers anyway with a degenerate query (SELECT 0, WHERE 1=0) that renders as a confident wrong number.

5 · CFG vs no-CFG head-to-head

Same prompts, with and without the grammar — the two result columns in every row.

Not a separate section: every prompt runs twice, and the comparison appears as the CFG / no CFG columns in each row and as the paired scoreboard bars. On clean prompts a strong base model already gets the SQL right, so both modes hit 100%. The modes separate on the adversarial and phantom-column slices, which test schema grounding and refusal, the failure modes the CFG forecloses by construction.
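The paired comparison amounts to computing the same rate per slice for each mode. A minimal sketch of that aggregation follows; the `(slice, mode, passed)` tuple layout and the function name are assumptions for illustration.

```python
from collections import defaultdict


def paired_rates(results):
    """Aggregate pass rates per (slice, mode).

    `results` is an iterable of (slice_name, mode, passed) tuples, where mode
    is "cfg" or "no cfg". The tuple layout is hypothetical.
    """
    totals = defaultdict(lambda: [0, 0])  # (slice, mode) -> [passed, run]
    for slice_name, mode, passed in results:
        bucket = totals[(slice_name, mode)]
        bucket[0] += int(passed)
        bucket[1] += 1
    return {key: passed / run for key, (passed, run) in totals.items()}
```

Reading the `("adversarial", "cfg")` and `("adversarial", "no cfg")` entries side by side gives one pair of scoreboard bars.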

Grammar-constrained vs free-form text-to-SQL

CFG stands for context-free grammar: the model's decoding is constrained so it can only emit SQL this schema allows. Every prompt runs twice — with the grammar (CFG) and without (no CFG). 21 answerable prompts + 10 out-of-scope prompts, each in both modes, = 62 trials. New to the vocabulary? The About page defines every term and explains how the grading works.
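The trial count follows from crossing every prompt with both modes. The sketch below enumerates that matrix; the tuple layout and names are illustrative assumptions, while the counts come from the text above.

```python
from itertools import product

ANSWERABLE = 21      # answerable prompts, per the text above
OUT_OF_SCOPE = 10    # out-of-scope prompts, per the text above
MODES = ("cfg", "no cfg")


def build_trials(n_answerable=ANSWERABLE, n_out_of_scope=OUT_OF_SCOPE):
    """Cross every prompt with both decoding modes."""
    prompts = [("answerable", i) for i in range(n_answerable)]
    prompts += [("out_of_scope", i) for i in range(n_out_of_scope)]
    return [(kind, idx, mode) for (kind, idx), mode in product(prompts, MODES)]


trials = build_trials()
print(len(trials))  # (21 + 10) * 2 = 62
```

The same enumeration splits into 42 answerable trials and 20 out-of-scope trials.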

What the result labels mean

pass: ran on ClickHouse, matched the reference answer, and stayed inside the schema.
wrong answer: ran, but the result didn't match the reference answer.
didn't execute: ClickHouse rejected the generated SQL.
off-schema: an answerable question, but the SQL referenced columns or functions the schema doesn't have.
declined: the model refused via cannot_answer — the correct outcome for out-of-scope questions.
answered anyway: an out-of-scope question got a confident fabricated answer instead of a refusal.
off-schema SQL: the fabricated answer also used invented columns — fabrication plus schema drift.
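The labels above can be read as a small decision procedure over the grading axes. The sketch below is one possible encoding: the `Trial` fields and the precedence among failure labels are assumptions. The page does not name a label for an answerable question that gets declined, so the sketch raises an error for that case rather than guessing.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    answerable: bool         # is the question in scope for the schema?
    declined: bool           # did the model refuse via cannot_answer?
    executed: bool           # did ClickHouse run the SQL without error?
    matches_reference: bool  # did the result set match the reference answer?
    on_schema: bool          # did the SQL stay inside the schema whitelist?


def label(trial):
    """Map one graded trial to a result label (precedence is an assumption)."""
    if not trial.answerable:
        if trial.declined:
            return "declined"
        if not trial.on_schema:
            return "off-schema SQL"
        return "answered anyway"
    if trial.declined:
        # Not covered by the label list; left undefined in this sketch.
        raise ValueError("no label defined for a declined answerable question")
    if not trial.executed:
        return "didn't execute"
    if not trial.on_schema:
        return "off-schema"
    if not trial.matches_reference:
        return "wrong answer"
    return "pass"
```

For instance, `label(Trial(False, False, True, False, True))` returns `"answered anyway"`: an out-of-scope question that produced schema-valid SQL instead of a refusal.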

CFG vs no-CFG, live

0/62 trials run; the scoreboard is empty until results land. Each metric is reported separately for cfg and no cfg.

Executes on ClickHouse
Schema-grounded SQL
Matches the reference answer
Declines out-of-scope questions

Where constraints matter most

Adversarial prompts: correct answer
Adversarial prompts: schema-grounded
Phantom columns: declined, not answered

Constrained correctness by tier

easy / medium / hard

eval 1 result correctness · eval 2 SQL validity · eval 3 schema adherence

Labelled cases

Each trial is graded on three axes: it executes, matches a live ClickHouse run of the reference SQL, and stays inside the schema whitelist. Eval 5 has no section of its own — it's the CFG / no CFG comparison in every row here and in the scoreboard.

0/42 run

eval 4 refusal

Out-of-scope questions

The schema has no weather, drivers, vehicles, lat/long, or PII. The constrained path must decline via cannot_answer; the unconstrained baseline shows what fabrication looks like.

0/20 run
