Build text-to-SQL eval runner and deterministic scorer
Problem
Build a Dataface text-to-SQL eval harness that runs agent/model prompts against the cleaned benchmark and scores outputs without requiring live warehouse execution. It uses deterministic checks for structural correctness and LLM-as-judge for semantic equivalence, while keeping the generation interface generic enough to support different schema-tooling strategies.
Context
Repo boundary
Everything lives in Dataface under apps/evals/sql/. The cleaned benchmark (task 2) is checked into apps/evals/data/. Runner, scorer, backend factories, and types live in apps/evals/sql/. Results go to apps/evals/output/sql/.
All eval types (SQL, catalog discovery, agent/dashboard) share one apps/evals/ directory with a unified CLI entry point, shared data, and shared leaderboard dashboards. See the leaderboard task for the full layout.
The only external input is the raw dataset JSONL from cto-research, which is already imported by the cleaning step (task 2). The runner has no dependency on cto-research at runtime.
If other teams want to compare their agents against the same benchmark, they can use the same raw JSONL and the same documented scoring rubric — they don't need to import the Dataface runner. The sharing contract is the benchmark file and the rubric, not shared code.
No warehouse execution needed — by design, not as a shortcut
The eval runner scores generated SQL against gold SQL. It does not run queries against a live warehouse. This is the permanent design:
- The gold SQL is synthetic (LLM-generated from schema metadata), but the benchmark-prep task should at least validate it with sqlglot BigQuery parsing and, when credentials are available, optional BigQuery dry-run validation.
- Result-set comparison is expensive, brittle, and would require data fixtures that don't exist.
- Structural + semantic scoring answers the questions we care about: does the model produce valid, grounded, intent-matching SQL given this context?
Scoring architecture: hybrid deterministic + LLM-as-judge
Two scoring layers, applied in sequence. No AST-based intent comparison — replaced entirely by LLM-as-judge which is simpler to build and more accurate for structurally-different-but-semantically-equivalent SQL.
Layer 1: Deterministic structural checks (fast, cheap, exact)
- parse_ok: SQL parse success via sqlglot — binary pass/fail.
- grounding_ok: optional grounding validation via a pluggable resolver that can answer "does this table/column exist?" using whatever source that backend relies on (profiled catalog, filtered tool output, INFORMATION_SCHEMA, MCP, checked-in memory, etc.).
- Pre-filters: if a case fails parse or grounding, skip the LLM judge and score it as a structural failure.
Layer 2: LLM-as-judge for semantic equivalence (the hard part)
- Given the question, the gold SQL, and the generated SQL — are they answering the same question?
- Handles what AST comparison cannot: structurally different but semantically equivalent SQL (subquery vs JOIN, WHERE vs HAVING, CTE vs inline), ambiguous questions with multiple valid interpretations, and partial credit for complex queries.
- Scoring output: categorical equivalence field plus free-text rationale.
Why not pure LLM-as-judge? The deterministic layer is instant and perfectly reliable for what it checks. No reason to spend an LLM call on "does this SQL parse?"
Scoring model
Per-case output record:
{
  "case_id": "...",
  "question": "...",
  "gold_sql": "...",
  "generated_sql": "...",
  "parse_ok": true/false,
  "grounding_ok": true/false/null,
  "hallucinated_tables": [...],
  "hallucinated_columns": [...],
  "equivalence": "equivalent" | "partially_equivalent" | "different" | "skipped",
  "rationale": "...",
  "pass": true/false,
  "backend": "...",
  "backend_metadata": { ... },
  "schema": "...",
  "complexity": "...",
  "category": "..."
}
backend is the function identifier (e.g. "raw_llm", "dataface", "cto_research.agent:generate"). backend_metadata is an arbitrary dict the generation function can optionally return alongside the SQL — model name, prompt version, context level, latency, token count, whatever the backend wants to log. The runner passes it through to the output untouched. This keeps the runner generic while still capturing rich per-backend reporting dimensions.
To support this, the generation function can optionally return a richer result:
@dataclass
class GenerationResult:
    sql: str
    metadata: dict[str, Any] = field(default_factory=dict)

# GenerateFn can return either a plain str or a GenerationResult
GenerateFn = Callable[[BenchmarkCase], Awaitable[str | GenerationResult]]
Returning a plain string still works — the runner just treats metadata as empty. Backends that want richer reporting return GenerationResult with whatever metadata they want tracked.
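How the runner could normalize the two return shapes, as a minimal sketch: the GenerationResult dataclass mirrors the one above, and normalize_result is a hypothetical helper name, not part of the actual harness.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class GenerationResult:
    sql: str
    metadata: dict[str, Any] = field(default_factory=dict)


def normalize_result(result: "str | GenerationResult") -> GenerationResult:
    # A plain string becomes a GenerationResult with empty metadata,
    # so downstream scoring only ever deals with one shape
    if isinstance(result, str):
        return GenerationResult(sql=result)
    return result
```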
grounding_ok=null means grounding was not evaluated for that backend/run. pass is the headline binary: true if parse_ok, no enabled grounding check failed, and equivalence in (equivalent, partially_equivalent). This is what the leaderboard primarily shows.
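The headline pass rule above reduces to a small pure function. This is a sketch of that rule (compute_pass is an assumed name), not the shipped code:

```python
def compute_pass(parse_ok: bool, grounding_ok, equivalence: str) -> bool:
    # grounding_ok is True/False/None; None means "not evaluated",
    # which never fails a case
    if not parse_ok or grounding_ok is False:
        return False
    return equivalence in ("equivalent", "partially_equivalent")
```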
Aggregate metrics per run:
- pass_rate: fraction of cases where pass=true
- parse_rate: fraction where parse_ok=true
- grounding_rate: fraction where grounding_ok=true among cases where grounding was evaluated
- equivalence_rate: fraction where equivalence=equivalent (strict, excluding partial)
- All sliceable by the reporting dimensions.
No weighted composite score. The individual fields are more useful than a single number — you can see whether a model fails at parsing, grounding, or intent, and drill into each.
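The aggregate metrics above can be computed directly from the per-case records. A sketch follows, using the field names from the output record; summarize is a hypothetical helper, and slicing by reporting dimensions would layer on top of it.

```python
def summarize(records: list[dict]) -> dict:
    n = len(records)
    # grounding_rate only counts cases where grounding was actually evaluated
    evaluated = [r for r in records if r["grounding_ok"] is not None]
    return {
        "pass_rate": sum(r["pass"] for r in records) / n,
        "parse_rate": sum(r["parse_ok"] for r in records) / n,
        "grounding_rate": (
            sum(r["grounding_ok"] for r in evaluated) / len(evaluated)
            if evaluated else None
        ),
        # strict: partial equivalence does not count here
        "equivalence_rate": sum(r["equivalence"] == "equivalent" for r in records) / n,
    }
```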
Generation interface (maximally generic)
The runner takes a generation function — a black box that accepts a benchmark case and returns SQL. The function owns everything: context assembly, LLM calls, tool use, multi-step reasoning, whatever. The runner doesn't know or care what happens inside.
Do not bake a standard schema-metadata payload into BenchmarkCase. The benchmark case carries the question and benchmark labels; the backend decides how to fetch or construct schema context. One backend might call Dataface's profiled catalog, another might query INFORMATION_SCHEMA, another might call an MCP tool, another might use a checked-in memory file, and another might use no schema tool at all.
@dataclass
class BenchmarkCase:
    case_id: str
    question: str
    gold_sql: str
    schema: str
    complexity: str
    category: str
    metadata: dict[str, Any] = field(default_factory=dict)

GenerateFn = Callable[[BenchmarkCase], Awaitable[str]]
That's it. The function takes a BenchmarkCase, returns a SQL string. Everything else — which model, what prompt, how much schema context to include, whether to use tool-calling agents, whether to do chain-of-thought, whether to call an external API, and how schema information is obtained — is the function's problem. The runner just calls it and scores the output.
This means the cto-research team (or anyone else) can write their own function, wrap their entire agent pipeline inside it, and plug it into the runner with zero changes to the eval harness. They don't need to match Dataface's context format or LLM client.
Backend construction: factory, not one-function-per-experiment
You don't want to write generate_sql_gpt4o_full_context, generate_sql_claude_names_only, etc. — those are near-identical functions that differ by one or two config values. But you also don't want the runner itself to know about model/provider/context-level, because that breaks the black box for external teams.
The answer: a factory that produces GenerateFn closures. The runner stays dumb. The factory is a convenience for Dataface's built-in backends and can accept a generic context/tooling strategy rather than one standard schema blob.
def make_raw_llm_backend(model: str = "gpt-4o", provider: str = "openai") -> GenerateFn:
    async def generate(case: BenchmarkCase) -> str | GenerationResult:
        sql = await call_llm(model, provider, system_prompt, case.question, ...)
        return GenerationResult(sql=sql, metadata={"model": model, "provider": provider})
    return generate

def make_dataface_backend(
    model: str = "gpt-4o",
    context_config: str = "full",
    context_provider: ContextProvider | None = None,
) -> GenerateFn:
    async def generate(case: BenchmarkCase) -> str | GenerationResult:
        sql = await dataface_generate_sql(
            case.question,
            model=model,
            context_config=context_config,
            context_provider=context_provider,
        )
        return GenerationResult(
            sql=sql,
            metadata={"model": model, "context_config": context_config},
        )
    return generate
The CLI passes --model and --provider to the factory, not to the runner. The runner receives the constructed function and never sees those args:
python -m apps.evals sql --backend raw_llm --model gpt-4o --provider openai ...
python -m apps.evals sql --backend dataface --model claude-sonnet --context-level names_only ...
python -m apps.evals sql --backend path.to.module:function_name ...
For built-in backends, --model and --provider are standardized convenience args that the CLI routes to the factory. For external backends (path.to.module:function_name), these are ignored — external functions are already fully configured by whoever wrote them. The external team can use env vars, their own config files, or just hardcode values.
This gives us:
- No function proliferation — one make_raw_llm_backend handles all model/provider combos
- Runner stays a black box — it never sees model/provider, just the constructed function
- Two standard CLI args (--model, --provider) that cover 90% of experiments without extra ceremony
- External teams unaffected — their functions are self-contained, the standard args don't apply
- Metadata flows automatically — the factory puts model/provider into GenerationResult.metadata, which shows up in the leaderboard
- Schema-tool experiments stay possible — built-in factories can swap context providers, filter fields, or disable tools entirely without changing the runner contract
Why just --model and --provider? Those are the two dimensions you'll vary most often and they're universal across LLM backends. Other knobs (context level, prompt version, temperature, etc.) are backend-specific and belong in the factory, not the CLI. You can always add more standard args later if a pattern emerges, but starting with just two keeps it clean.
External backends (others bring their own):
Any team can write a function that conforms to GenerateFn and point the runner at it:
# In cto-research or wherever — fully self-contained
async def cto_research_agent(case: BenchmarkCase) -> str | GenerationResult:
sql = await their_agent_pipeline(case.question)
return GenerationResult(
sql=sql,
metadata={"model": "their-model", "pipeline": "v2", "tools_used": True},
)
CLI:
python -m apps.evals sql --backend path.to.module:cto_research_agent ...
Runner design
- Inputs: benchmark JSONL path (defaults to apps/evals/data/benchmark.jsonl), backend (built-in name or module:function), optional --model and --provider (routed to built-in backend factories, ignored for external), optional filters (--schema, --complexity, --category, --limit). The runner constructs the GenerateFn via the factory, then treats it as opaque. Types (BenchmarkCase, GenerationResult, GenerateFn) live in apps/evals/sql/types.py. Backend factories live in apps/evals/sql/backends.py.
- Generation step: for each benchmark case, call the generation function with the full BenchmarkCase. The function returns SQL.
- Scoring step: apply layer 1, then layer 2 if layer 1 passes.
- Outputs: per-case JSONL results in apps/evals/output/sql/ (gitignored), plus a per-run summary JSON with aggregate metrics sliced by reporting dimensions.
- Concurrency: async with configurable concurrency limit (like cto-research's max_concurrent pattern).
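The concurrency bullet is the standard semaphore-bounded gather pattern. A sketch under assumed names (run_cases and its parameters are illustrative):

```python
import asyncio


async def run_cases(cases, generate, score, max_concurrent: int = 8):
    # Bound fan-out with a semaphore, in the spirit of
    # cto-research's max_concurrent pattern
    sem = asyncio.Semaphore(max_concurrent)

    async def run_one(case):
        async with sem:
            sql = await generate(case)
            return score(case, sql)

    # gather preserves input order, so results line up with cases
    return await asyncio.gather(*(run_one(c) for c in cases))
```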
Patterns to steal from cto-research
The cto-research eval harness (context_catalog/aisql_tuning/eval_harness.py) has battle-tested patterns worth porting:
- RecordingHooks for capturing tool calls, timing, and success/failure per case
- Per-question markdown reports with full conversation traces (useful for debugging failures)
- asyncio.gather with semaphore for concurrent eval execution
- Summary JSON with breakdowns by complexity
The search eval scorer (context_catalog/evals/search_eval/scoring.py) has clean dataclass patterns:
- QueryScore, RecordScore, EvalReport dataclasses
- _breakdown() helper that groups records by any key function and computes per-group metrics
- This is exactly what the summary/reporting layer needs
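The _breakdown() pattern described above amounts to a group-then-aggregate helper. A sketch using the same per-case record shape (breakdown is an assumed name, not the cto-research function):

```python
from collections import defaultdict
from typing import Callable, Hashable


def breakdown(records: list[dict], key: Callable[[dict], Hashable]) -> dict:
    # Group records by an arbitrary key function, then compute
    # per-group counts and pass rates
    groups: dict = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return {
        k: {"count": len(v), "pass_rate": sum(r["pass"] for r in v) / len(v)}
        for k, v in groups.items()
    }
```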
Reporting dimensions
- model / agent
- prompt version
- schema
- complexity
- category
Context-pack ablation and schema-tool ablation are first-class reporting dimensions. Since the generation function owns its own context assembly, ablation is done by parameterizing the backend factory: change the context provider, field filter, layer scope, or tool availability, then emit those choices via metadata. The runner doesn't need to know about the details.
Dependencies
- Depends on task 2 (cleaned benchmark artifact) for input data in apps/evals/data/.
- sqlglot is needed as a new dependency — add to Dataface's pyproject.toml.
- Downstream: task 4 (catalog discovery evals) derives from the same benchmark but uses a different scorer. The eval leaderboard dashboards consume runner output from apps/evals/output/sql/. Shared reporting utilities (breakdown/aggregation helpers) go in apps/evals/shared/reporting.py.
Possible Solutions
Recommended: generic runner + backend factories - Keep the runner agnostic to model/provider/context assembly, accept a BenchmarkCase -> str | GenerationResult generator, and let backend factories encapsulate prompt/context strategy. This keeps the harness reusable for Dataface and external teams.
Alternative: hard-code a single Dataface generation path into the runner - rejected because it would couple eval orchestration to one prompt stack and make ablations awkward.
Plan
- Add apps/evals/sql/ with shared types, backend factories, deterministic grounding, hybrid scorer, and async runner.
- Add a Typer CLI entry point that resolves built-in or external generator backends.
- Add shared aggregation helpers under apps/evals/shared/ for leaderboard-ready summaries.
- Add sqlglot to the project dependencies.
- Add focused eval tests and verify with just ci.
Implementation Progress
Runner and scoring
- Added BenchmarkCase, GenerationResult, BackendSpec, JudgeSpec, and ScoredCaseResult types.
- Implemented backend factories for raw_llm, dataface, and module:function callables.
- Implemented deterministic parsing and optional grounding checks with sqlglot.
- Implemented LLM-as-judge scoring for semantic equivalence.
- Kept the runner generic around BenchmarkCase -> str | GenerationResult and no live warehouse execution.
CLI and reporting
- Added dft evals sql command wiring.
- Added shared breakdown/summarize helpers for run summaries.
- Added eval output directory ignore rule at apps/evals/output/.
Validation
- Focused eval tests passed: uv run pytest tests/evals/sql tests/evals/test_evals_cli.py tests/evals/test_prep_benchmark.py -q
- Repo tool registry regression test passed: uv run pytest tests/core/test_ai_tools.py -q
- Full repo gate passed: just ci
Follow-up fix
- Removed the extra information_schema handler alias from dataface/ai/tools.py so the tool registry matches ALL_TOOLS again.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A for browser QA. Validation: ran the eval test slice against canary/fixture inputs, verified JSONL output schema and summary generation, checked that parse failures skip the judge, and confirmed the runner emits structured result records.
Review Feedback
- Review executed via just review on the rebased branch after just ci passed.
- Verdict: APPROVED with notes.
- Notes: external backend import paths and SQL response normalization are acceptable for an internal eval harness, but should be revisited if this surface ever becomes user-facing. Summary metrics intentionally treat partially_equivalent as a pass and rely on equivalence_rate for strict success visibility.
- [x] Review cleared