Build text-to-SQL eval runner and deterministic scorer
Problem
Build a Dataface text-to-SQL eval harness that runs agent/model prompts against the cleaned benchmark and scores outputs without requiring live warehouse execution. It uses deterministic checks for structural correctness and LLM-as-judge for semantic equivalence, while keeping the generation interface generic enough to support different schema-tooling strategies.
Context
Repo boundary
Everything lives in Dataface under apps/evals/sql/. The cleaned benchmark (task 2) is checked into apps/evals/data/. Runner, scorer, backend factories, and types live in apps/evals/sql/. Results go to apps/evals/output/sql/.
All eval types (SQL, catalog discovery, agent/dashboard) share one apps/evals/ directory with a unified CLI entry point, shared data, and shared leaderboard dashboards. See the leaderboard task for the full layout.
The only external input is the raw dataset JSONL from cto-research, which is already imported by the cleaning step (task 2). The runner has no dependency on cto-research at runtime.
If other teams want to compare their agents against the same benchmark, they can use the same raw JSONL and the same documented scoring rubric — they don't need to import the Dataface runner. The sharing contract is the benchmark file and the rubric, not shared code.
No warehouse execution needed — by design, not as a shortcut
The eval runner scores generated SQL against gold SQL. It does not run queries against a live warehouse. This is the permanent design:
- The gold SQL is synthetic (LLM-generated from schema metadata), but the benchmark-prep task should at least validate it with sqlglot BigQuery parsing and, when credentials are available, optional BigQuery dry-run validation.
- Result-set comparison is expensive, brittle, and would require data fixtures that don't exist.
- Structural + semantic scoring answers the questions we care about: does the model produce valid, grounded, intent-matching SQL given this context?
Scoring architecture: hybrid deterministic + LLM-as-judge
Two scoring layers, applied in sequence. No AST-based intent comparison — replaced entirely by LLM-as-judge which is simpler to build and more accurate for structurally-different-but-semantically-equivalent SQL.
Layer 1: Deterministic structural checks (fast, cheap, exact)
- parse_ok: SQL parse success via sqlglot — binary pass/fail.
- grounding_ok: optional grounding validation via a pluggable resolver that can answer "does this table/column exist?" using whatever source that backend relies on (profiled catalog, filtered tool output, INFORMATION_SCHEMA, MCP, checked-in memory, etc.).
- Pre-filters: if a case fails parse or grounding, skip the LLM judge and score it as a structural failure.
Layer 2: LLM-as-judge for semantic equivalence (the hard part)
- Given the question, the gold SQL, and the generated SQL — are they answering the same question?
- Handles what AST comparison cannot: structurally different but semantically equivalent SQL (subquery vs JOIN, WHERE vs HAVING, CTE vs inline), ambiguous questions with multiple valid interpretations, and partial credit for complex queries.
- Scoring output: categorical equivalence field plus free-text rationale.
Why not pure LLM-as-judge? The deterministic layer is instant and perfectly reliable for what it checks. No reason to spend an LLM call on "does this SQL parse?"
Scoring model
Per-case output record:
{
  "case_id": "...",
  "question": "...",
  "gold_sql": "...",
  "generated_sql": "...",
  "parse_ok": true/false,
  "grounding_ok": true/false/null,
  "hallucinated_tables": [...],
  "hallucinated_columns": [...],
  "equivalence": "equivalent" | "partially_equivalent" | "different" | "skipped",
  "rationale": "...",
  "pass": true/false,
  "backend": "...",
  "backend_metadata": { ... },
  "schema": "...",
  "complexity": "...",
  "category": "..."
}
backend is the function identifier (e.g. "raw_llm", "dataface", "cto_research.agent:generate"). backend_metadata is an arbitrary dict the generation function can optionally return alongside the SQL — model name, prompt version, context level, latency, token count, whatever the backend wants to log. The runner passes it through to the output untouched. This keeps the runner generic while still capturing rich per-backend reporting dimensions.
To support this, the generation function can optionally return a richer result:
@dataclass
class GenerationResult:
    sql: str
    metadata: dict[str, Any] = field(default_factory=dict)

# GenerateFn can return either a plain str or a GenerationResult
GenerateFn = Callable[[BenchmarkCase], Awaitable[str | GenerationResult]]
Returning a plain string still works — the runner just treats metadata as empty. Backends that want richer reporting return GenerationResult with whatever metadata they want tracked.
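How the runner could normalize the two return shapes, as a minimal sketch: the GenerationResult dataclass mirrors the one above, and normalize_result is a hypothetical helper name, not part of the actual harness.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class GenerationResult:
    sql: str
    metadata: dict[str, Any] = field(default_factory=dict)


def normalize_result(result: "str | GenerationResult") -> GenerationResult:
    # A plain string becomes a GenerationResult with empty metadata,
    # so downstream scoring only ever deals with one shape
    if isinstance(result, str):
        return GenerationResult(sql=result)
    return result
```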
grounding_ok=null means grounding was not evaluated for that backend/run. pass is the headline binary: true if parse_ok, no enabled grounding check failed, and equivalence in (equivalent, partially_equivalent). This is what the leaderboard primarily shows.
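The headline pass rule above reduces to a small pure function. This is a sketch of that rule (compute_pass is an assumed name), not the shipped code:

```python
def compute_pass(parse_ok: bool, grounding_ok, equivalence: str) -> bool:
    # grounding_ok is True/False/None; None means "not evaluated",
    # which never fails a case
    if not parse_ok or grounding_ok is False:
        return False
    return equivalence in ("equivalent", "partially_equivalent")
```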
Aggregate metrics per run:
- pass_rate: fraction of cases where pass=true
- parse_rate: fraction where parse_ok=true
- grounding_rate: fraction where grounding_ok=true among cases where grounding was evaluated
- equivalence_rate: fraction where equivalence=equivalent (strict, excluding partial)
- All sliceable by the reporting dimensions.
No weighted composite score. The individual fields are more useful than a single number — you can see whether a model fails at parsing, grounding, or intent, and drill into each.
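The aggregate metrics above can be computed directly from the per-case records. A sketch follows, using the field names from the output record; summarize is a hypothetical helper, and slicing by reporting dimensions would layer on top of it.

```python
def summarize(records: list[dict]) -> dict:
    n = len(records)
    # grounding_rate only counts cases where grounding was actually evaluated
    evaluated = [r for r in records if r["grounding_ok"] is not None]
    return {
        "pass_rate": sum(r["pass"] for r in records) / n,
        "parse_rate": sum(r["parse_ok"] for r in records) / n,
        "grounding_rate": (
            sum(r["grounding_ok"] for r in evaluated) / len(evaluated)
            if evaluated else None
        ),
        # strict: partial equivalence does not count here
        "equivalence_rate": sum(r["equivalence"] == "equivalent" for r in records) / n,
    }
```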
Generation interface (maximally generic)
The runner takes a generation function — a black box that accepts a benchmark case and returns SQL. The function owns everything: context assembly, LLM calls, tool use, multi-step reasoning, whatever. The runner doesn't know or care what happens inside.
Do not bake a standard schema-metadata payload into BenchmarkCase. The benchmark case carries the question and benchmark labels; the backend decides how to fetch or construct schema context. One backend might call Dataface's profiled catalog, another might query INFORMATION_SCHEMA, another might call an MCP tool, another might use a checked-in memory file, and another might use no schema tool at all.
@dataclass
class BenchmarkCase:
    case_id: str
    question: str
    gold_sql: str
    schema: str
    complexity: str
    category: str
    metadata: dict[str, Any] = field(default_factory=dict)

GenerateFn = Callable[[BenchmarkCase], Awaitable[str]]
That's it. The function takes a BenchmarkCase, returns a SQL string. Everything else — which model, what prompt, how much schema context to include, whether to use tool-calling agents, whether to do chain-of-thought, whether to call an external API, and how schema information is obtained — is the function's problem. The runner just calls it and scores the output.
This means the cto-research team (or anyone else) can write their own function, wrap their entire agent pipeline inside it, and plug it into the runner with zero changes to the eval harness. They don't need to match Dataface's context format or LLM client.
Backend construction: factory, not one-function-per-experiment
You don't want to write generate_sql_gpt4o_full_context, generate_sql_claude_names_only, etc. — those are near-identical functions that differ by one or two config values. But you also don't want the runner itself to know about model/provider/context-level, because that breaks the black box for external teams.
The answer: a factory that produces GenerateFn closures. The runner stays dumb. The factory is a convenience for Dataface's built-in backends and can accept a generic context/tooling strategy rather than one standard schema blob.
def make_raw_llm_backend(model: str = "gpt-4o", provider: str = "openai") -> GenerateFn:
    async def generate(case: BenchmarkCase) -> str | GenerationResult:
        sql = await call_llm(model, provider, system_prompt, case.question, ...)
        return GenerationResult(sql=sql, metadata={"model": model, "provider": provider})
    return generate

def make_dataface_backend(
    model: str = "gpt-4o",
    context_config: str = "full",
    context_provider: ContextProvider | None = None,
) -> GenerateFn:
    async def generate(case: BenchmarkCase) -> str | GenerationResult:
        sql = await dataface_generate_sql(
            case.question,
            model=model,
            context_config=context_config,
            context_provider=context_provider,
        )
        return GenerationResult(
            sql=sql,
            metadata={"model": model, "context_config": context_config},
        )
    return generate
The CLI passes --model and --provider to the factory, not to the runner. The runner receives the constructed function and never sees those args:
python -m apps.evals sql --backend raw_llm --model gpt-4o --provider openai ...
python -m apps.evals sql --backend dataface --model claude-sonnet --context-level names_only ...
python -m apps.evals sql --backend path.to.module:function_name ...
For built-in backends, --model and --provider are standardized convenience args that the CLI routes to the factory. For external backends (path.to.module:function_name), these are ignored — external functions are already fully configured by whoever wrote them. The external team can use env vars, their own config files, or just hardcode values.
This gives us:
- No function proliferation — one make_raw_llm_backend handles all model/provider combos
- Runner stays a black box — it never sees model/provider, just the constructed function
- Two standard CLI args (--model, --provider) that cover 90% of experiments without extra ceremony
- External teams unaffected — their functions are self-contained, the standard args don't apply
- Metadata flows automatically — the factory puts model/provider into GenerationResult.metadata, which shows up in the leaderboard
- Schema-tool experiments stay possible — built-in factories can swap context providers, filter fields, or disable tools entirely without changing the runner contract
Why just --model and --provider? Those are the two dimensions you'll vary most often and they're universal across LLM backends. Other knobs (context level, prompt version, temperature, etc.) are backend-specific and belong in the factory, not the CLI. You can always add more standard args later if a pattern emerges, but starting with just two keeps it clean.
External backends (others bring their own):
Any team can write a function that conforms to GenerateFn and point the runner at it:
# In cto-research or wherever — fully self-contained
async def cto_research_agent(case: BenchmarkCase) -> str | GenerationResult:
sql = await their_agent_pipeline(case.question)
return GenerationResult(
sql=sql,
metadata={"model": "their-model", "pipeline": "v2", "tools_used": True},
)
CLI:
python -m apps.evals sql --backend path.to.module:cto_research_agent ...
Runner design
- Inputs: benchmark JSONL path (defaults to apps/evals/data/benchmark.jsonl), backend (built-in name or module:function), optional --model and --provider (routed to built-in backend factories, ignored for external), optional filters (--schema, --complexity, --category, --limit). The runner constructs the GenerateFn via the factory, then treats it as opaque. Types (BenchmarkCase, GenerationResult, GenerateFn) live in apps/evals/sql/types.py. Backend factories live in apps/evals/sql/backends.py.
- Generation step: for each benchmark case, call the generation function with the full BenchmarkCase. The function returns SQL.
- Scoring step: apply layer 1, then layer 2 if layer 1 passes.
- Outputs: per-case JSONL results in apps/evals/output/sql/ (gitignored), plus a per-run summary JSON with aggregate metrics sliced by reporting dimensions.
- Concurrency: async with configurable concurrency limit (like cto-research's max_concurrent pattern).
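The concurrency bullet is the standard semaphore-bounded gather pattern. A sketch under assumed names (run_cases and its parameters are illustrative):

```python
import asyncio


async def run_cases(cases, generate, score, max_concurrent: int = 8):
    # Bound fan-out with a semaphore, in the spirit of
    # cto-research's max_concurrent pattern
    sem = asyncio.Semaphore(max_concurrent)

    async def run_one(case):
        async with sem:
            sql = await generate(case)
            return score(case, sql)

    # gather preserves input order, so results line up with cases
    return await asyncio.gather(*(run_one(c) for c in cases))
```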
Patterns to steal from cto-research
The cto-research eval harness (context_catalog/aisql_tuning/eval_harness.py) has battle-tested patterns worth porting:
- RecordingHooks for capturing tool calls, timing, and success/failure per case
- Per-question markdown reports with full conversation traces (useful for debugging failures)
- asyncio.gather with semaphore for concurrent eval execution
- Summary JSON with breakdowns by complexity
The search eval scorer (context_catalog/evals/search_eval/scoring.py) has clean dataclass patterns:
- QueryScore, RecordScore, EvalReport dataclasses
- _breakdown() helper that groups records by any key function and computes per-group metrics
- This is exactly what the summary/reporting layer needs
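The _breakdown() pattern described above amounts to a group-then-aggregate helper. A sketch using the same per-case record shape (breakdown is an assumed name, not the cto-research function):

```python
from collections import defaultdict
from typing import Callable, Hashable


def breakdown(records: list[dict], key: Callable[[dict], Hashable]) -> dict:
    # Group records by an arbitrary key function, then compute
    # per-group counts and pass rates
    groups: dict = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return {
        k: {"count": len(v), "pass_rate": sum(r["pass"] for r in v) / len(v)}
        for k, v in groups.items()
    }
```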
Reporting dimensions
- model / agent
- prompt version
- schema
- complexity
- category
Context-pack ablation and schema-tool ablation are first-class reporting dimensions. Since the generation function owns its own context assembly, ablation is done by parameterizing the backend factory: change the context provider, field filter, layer scope, or tool availability, then emit those choices via metadata. The runner doesn't need to know about the details.
Dependencies
- Depends on task 2 (cleaned benchmark artifact) for input data in apps/evals/data/.
- sqlglot is needed as a new dependency — add to Dataface's pyproject.toml.
- Downstream: task 4 (catalog discovery evals) derives from the same benchmark but uses a different scorer. The eval leaderboard dashboards consume runner output from apps/evals/output/sql/. Shared reporting utilities (breakdown/aggregation helpers) go in apps/evals/shared/reporting.py.
Possible Solutions
Recommended: generic runner + backend factories - Keep the runner agnostic to model/provider/context assembly, accept a BenchmarkCase -> str | GenerationResult generator, and let backend factories encapsulate prompt/context strategy. This keeps the harness reusable for Dataface and external teams.
Alternative: hard-code a single Dataface generation path into the runner - rejected because it would couple eval orchestration to one prompt stack and make ablations awkward.
Plan
- Add apps/evals/sql/ with shared types, backend factories, deterministic grounding, hybrid scorer, and async runner.
- Add a Typer CLI entry point that resolves built-in or external generator backends.
- Add shared aggregation helpers under apps/evals/shared/ for leaderboard-ready summaries.
- Add sqlglot to the project dependencies.
- Add focused eval tests and verify with just ci.
Implementation Progress
Runner and scoring
- Added BenchmarkCase, GenerationResult, BackendSpec, JudgeSpec, and ScoredCaseResult types.
- Implemented backend factories for raw_llm, dataface, and module:function callables.
- Implemented deterministic parsing and optional grounding checks with sqlglot.
- Implemented LLM-as-judge scoring for semantic equivalence.
- Kept the runner generic around BenchmarkCase -> str | GenerationResult and no live warehouse execution.
CLI and reporting
- Added dft evals sql command wiring.
- Added shared breakdown/summarize helpers for run summaries.
- Added eval output directory ignore rule at apps/evals/output/.
Validation
- Focused eval tests passed: uv run pytest tests/evals/sql tests/evals/test_evals_cli.py tests/evals/test_prep_benchmark.py -q
- Repo tool registry regression test passed: uv run pytest tests/core/test_ai_tools.py -q
- Full repo gate passed: just ci
Follow-up fix
- Removed the extra information_schema handler alias from dataface/ai/tools.py so the tool registry matches ALL_TOOLS again.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A for browser QA. Validation: ran the eval test slice against canary/fixture inputs, verified JSONL output schema and summary generation, checked that parse failures skip the judge, and confirmed the runner emits structured result records.
Review Feedback
- Review executed via just review on the rebased branch after just ci passed.
- Verdict: APPROVED with notes.
- Notes: external backend import paths and SQL response normalization are acceptable for an internal eval harness, but should be revisited if this surface ever becomes user-facing. Summary metrics intentionally treat partially_equivalent as a pass and rely on equivalence_rate for strict success visibility.
- [x] Review cleared