CFG Eval

Methodology

One eval per way the system can break.

1 · Execution correctness

Does the query return the right answer?

21 prompts across easy, medium, and hard tiers, including an adversarial slice engineered to tempt schema drift. The reference query executes live on every trial, and the model's result set is diffed against the reference result set, not against the SQL text. Any query that returns the same result passes, however it is written.
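The result-set diff described above can be sketched as follows. This is a minimal illustration, not the harness's actual comparison: the order-insensitive multiset comparison, the float rounding, and the names `normalize_row` and `results_match` are all assumptions.

```python
from collections import Counter
from typing import Any, Iterable, Sequence


def normalize_row(row: Sequence[Any], ndigits: int = 6) -> tuple:
    """Round floats so tiny numeric noise does not count as a mismatch (assumed policy)."""
    return tuple(round(v, ndigits) if isinstance(v, float) else v for v in row)


def results_match(
    model_rows: Iterable[Sequence[Any]],
    reference_rows: Iterable[Sequence[Any]],
    ndigits: int = 6,
) -> bool:
    """Compare two result sets as multisets of rows, ignoring row order.

    Only the returned rows are compared; the SQL text is never inspected.
    """
    model = Counter(normalize_row(r, ndigits) for r in model_rows)
    reference = Counter(normalize_row(r, ndigits) for r in reference_rows)
    return model == reference
```

A multiset comparison keeps duplicate rows significant while ignoring ordering. For prompts whose answer depends on ORDER BY, a harness would need an order-sensitive comparison instead.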

2 · SQL validity

Does every generated query run without errors?

Every constrained output must execute without error. An execution failure counts as a grammar failure: the CFG accepted a token sequence that ClickHouse rejected.

3 · Schema adherence

Does the SQL only use columns and functions that actually exist?

Every identifier is validated against the live schema whitelist. Hallucinated columns and functions are structurally impossible under the grammar.
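A whitelist check of this kind might look like the sketch below. The table name `trips`, the three column names, and the function list are hypothetical stand-ins, since the real 13-column schema is not listed here. The regex tokenizer is a deliberate simplification (it would flag column aliases, for instance), where a real validator would walk a parsed query.

```python
import re

# Hypothetical whitelist: placeholders, not the real 13-column schema.
ALLOWED_TABLES = {"trips"}
ALLOWED_COLUMNS = {"fare_amount", "passenger_count", "trip_distance"}
ALLOWED_FUNCTIONS = {"avg", "count", "sum"}
SQL_KEYWORDS = {
    "select", "from", "where", "group", "by", "order", "limit",
    "and", "or", "not", "asc", "desc",
}

_STRING_LITERAL = re.compile(r"'(?:[^'\\]|\\.)*'")
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")


def off_schema_identifiers(sql: str) -> set[str]:
    """Return every identifier in `sql` that is not on the whitelist."""
    # Blank out string literals so their contents are not read as identifiers.
    stripped = _STRING_LITERAL.sub("''", sql)
    unknown: set[str] = set()
    for match in _IDENTIFIER.finditer(stripped):
        name = match.group(0)
        if name.lower() in SQL_KEYWORDS:
            continue
        # An identifier directly followed by "(" is treated as a function call.
        is_call = stripped[match.end():].lstrip().startswith("(")
        allowed = ALLOWED_FUNCTIONS if is_call else ALLOWED_COLUMNS | ALLOWED_TABLES
        if name not in allowed:
            unknown.add(name)
    return unknown
```

An empty return value means the query stayed inside the whitelist. Any returned name is a candidate hallucinated column or function.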

4 · Refusal

Does the model decline questions the data can't answer?

Out-of-scope prompts must be declined via cannot_answer. Half of them ask about phantom columns: real NYC-TLC fields (mta_tax, VendorID) that are absent from this 13-column subset, so the questions look answerable. The baseline doesn't decline. It answers anyway with a degenerate query (SELECT 0, WHERE 1=0) that renders as a confident wrong number.
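To make the degenerate-query failure concrete, here is a small sketch that flags the two shapes named above. Reading SELECT 0 and WHERE 1=0 as two separate patterns is an assumption, as is the helper name `looks_degenerate`. Nothing here is claimed to be part of the actual grader.

```python
import re

# Patterns for the two degenerate shapes mentioned in the text (assumed reading).
_DEGENERATE_PATTERNS = [
    re.compile(r"^\s*select\s+0\s*;?\s*$", re.IGNORECASE),  # the whole query is SELECT 0
    re.compile(r"\bwhere\s+1\s*=\s*0\b", re.IGNORECASE),    # always-false filter
]


def looks_degenerate(sql: str) -> bool:
    """True if the query is a constant or always-empty stand-in for a real answer."""
    return any(pattern.search(sql) for pattern in _DEGENERATE_PATTERNS)
```

Such a query executes cleanly and can even stay inside the schema, which is why it reads as an answer rather than a refusal.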

5 · CFG vs no-CFG head-to-head

Same prompts, with and without the grammar — the two result columns in every row.

Not a separate section: every prompt runs twice, and this comparison is the CFG / no CFG columns in each row plus the paired scoreboard bars. On clean prompts a strong base model already writes correct SQL, so both modes hit 100%. The grammar separates the two modes on the adversarial and phantom-column slices, which test schema grounding and refusal, the failure modes the CFG forecloses by construction.

Grammar-constrained vs free-form text-to-SQL

CFG stands for context-free grammar: the model's decoding is constrained so that it can only emit SQL this schema allows. Every prompt runs twice, once with the grammar (CFG) and once without (no CFG). With 21 answerable prompts and 10 out-of-scope prompts, that gives (21 + 10) × 2 = 62 trials. New to the vocabulary? The About page defines every term and explains how the grading works.
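One way to picture "can only emit SQL this schema allows" is a grammar whose terminals are the schema's own identifiers. The sketch below builds a toy grammar of that shape and enumerates the finite set of queries it derives. The EBNF-like notation, the single SELECT shape, and the table, column, and function names are illustrative assumptions, far smaller than any real grammar.

```python
from collections.abc import Iterable


def build_grammar(table: str, columns: Iterable[str], functions: Iterable[str]) -> str:
    """Render a toy grammar whose only identifiers come from the whitelist."""
    def alternatives(names: Iterable[str]) -> str:
        return " | ".join(f'"{name}"' for name in sorted(names))

    rules = [
        'query: "SELECT " expr " FROM " table',
        'expr: column | function "(" column ")"',
        f"column: {alternatives(columns)}",
        f"function: {alternatives(functions)}",
        f'table: "{table}"',
    ]
    return "\n".join(rules)


def derivable_queries(table: str, columns: Iterable[str], functions: Iterable[str]) -> set[str]:
    """Enumerate every string the toy grammar above can derive."""
    columns = list(columns)
    exprs = set(columns) | {f"{fn}({col})" for fn in functions for col in columns}
    return {f"SELECT {expr} FROM {table}" for expr in exprs}
```

Because a column such as mta_tax never appears as a terminal, no derivation can contain it. That is the sense in which a hallucinated identifier is ruled out by construction rather than caught after the fact.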

What the result labels mean
pass: ran on ClickHouse, matched the reference answer, and stayed inside the schema.
wrong answer: ran, but the result didn't match the reference answer.
didn't execute: ClickHouse rejected the generated SQL.
off-schema: an answerable question, but the SQL referenced columns or functions the schema doesn't have.
declined: the model refused via cannot_answer, the correct outcome for out-of-scope questions.
answered anyway: an out-of-scope question got a confident fabricated answer instead of a refusal.
off-schema SQL: the fabricated answer also used invented columns (fabrication plus schema drift).
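The labels above can be read as a small decision procedure. The sketch below is one plausible ordering, not the grader's confirmed logic: the precedence between "didn't execute", "off-schema", and "wrong answer" is an assumption, as are the function and parameter names.

```python
def label_trial(
    *,
    answerable: bool,
    declined: bool,
    executed: bool,
    in_schema: bool,
    matches_reference: bool,
) -> str:
    """Map the graded facts about one trial to a single result label."""
    if declined:
        # The model refused via cannot_answer.
        return "declined"
    if not answerable:
        # Out-of-scope question that was answered instead of refused.
        return "answered anyway" if in_schema else "off-schema SQL"
    # Answerable question: check the three grading axes in an assumed order.
    if not executed:
        return "didn't execute"
    if not in_schema:
        return "off-schema"
    if not matches_reference:
        return "wrong answer"
    return "pass"
```

For example, an out-of-scope prompt answered with an invented column gets "off-schema SQL", while an answerable prompt whose SQL runs on whitelisted columns but returns different rows gets "wrong answer".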

CFG vs no-CFG, live

0/62 trials

Nothing has run yet — hit Run all (or any row) and this fills in as results land.

Metric                             cfg   no cfg
Executes on ClickHouse             -     -
Schema-grounded SQL                -     -
Matches the reference answer       -     -
Declines out-of-scope questions    -     -

Where constraints matter most

Metric                                     cfg   no cfg
Adversarial prompts: correct answer        -     -
Adversarial prompts: schema-grounded       -     -
Phantom columns: declined, not answered    -     -

Constrained correctness by tier

Tier      Result
easy      -
medium    -
hard      -
eval 1 · result correctness, eval 2 · SQL validity, eval 3 · schema adherence

Labelled cases

Each trial is graded on three axes: whether the SQL executes, whether its result matches a live ClickHouse run of the reference SQL, and whether it stays inside the schema whitelist. Eval 5 has no section of its own; it is the CFG / no CFG comparison in every row here and in the scoreboard.

0/42 run
eval 4 · refusal

Out-of-scope questions

The schema has no weather, drivers, vehicles, lat/long, or PII. The constrained path must decline via cannot_answer; the unconstrained baseline shows what fabrication looks like.

0/20 run
