Spec
Goal
Add public text-to-SQL benchmarks to apps/evals in a way that:
- reuses the current eval runner and dashboard architecture where possible
- preserves benchmark-specific semantics instead of flattening everything into one misleading score
- makes external runs auditable by benchmark, split, dialect, scorer, and environment
Primary success criteria
- At least two external benchmarks can be run from the local eval system with reproducible baseline commands.
- External benchmark runs land in a durable artifact path and appear in the leaderboard with explicit provenance.
- Internal and external runs can be compared side-by-side without pretending they are identical tasks.
- The adapter work improves the modularity of the internal eval system instead of becoming one-off glue code.
Phase 1 benchmark set
Required
- BIRD mini-dev
- Spider 2.0-Lite
Second wave
- LiveSQLBench open development releases
Explicitly deferred
- Spider 2.0-Snow
- Spider 2.0-DBT
- BIRD-Interact
- hidden-test submission automation
Architectural constraints
Reuse the existing eval engine
The integration should prefer:
- benchmark-specific loaders
- benchmark-specific scorer wrappers
- benchmark-specific environment profiles
It should not fork apps/evals into unrelated per-benchmark stacks.
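A minimal sketch of what "benchmark-specific pieces behind one shared runner" could look like, assuming a registry-style adapter. All names (`BenchmarkAdapter`, `register`, the field names) are illustrative, not the existing eval-engine API:

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical adapter contract: each benchmark supplies its own loader,
# scorer wrapper, and environment profile, but plugs into one shared
# runner instead of forking apps/evals per benchmark.

@dataclass
class BenchmarkAdapter:
    name: str
    load_cases: Callable[[str], Iterable[dict]]  # split -> benchmark cases
    score: Callable[[dict, str], dict]           # (case, predicted_sql) -> result
    environment: str                             # e.g. "sqlite-local"

REGISTRY: dict[str, BenchmarkAdapter] = {}

def register(adapter: BenchmarkAdapter) -> None:
    REGISTRY[adapter.name] = adapter

# A BIRD mini-dev adapter would register roughly like this
# (stub callables stand in for the real loader and scorer):
register(BenchmarkAdapter(
    name="bird-mini-dev",
    load_cases=lambda split: [],
    score=lambda case, sql: {"correct": False},
    environment="sqlite-local",
))
```

The point of the registry shape is that adding Spider 2.0-Lite later is one more `register` call, not a second runner.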
Preserve benchmark provenance
Every run should record at least:
- benchmark name
- benchmark version
- split
- dialect
- scorer type
- environment type
- whether the run is official-comparable, approximate, or local-only
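The minimum record above could be enforced as a small frozen schema serialized next to each run's artifacts. Field names and the version string below are assumptions for illustration:

```python
from dataclasses import dataclass, asdict
from typing import Literal

# Sketch of the minimum provenance record attached to every run.
# Field names are assumptions, not the existing artifact schema.

@dataclass(frozen=True)
class RunProvenance:
    benchmark: str          # e.g. "spider-2.0-lite"
    benchmark_version: str  # pinned dataset release
    split: str              # e.g. "dev"
    dialect: str            # e.g. "sqlite", "bigquery"
    scorer: str             # e.g. "execution-accuracy"
    environment: str        # e.g. "local-docker"
    comparability: Literal["official-comparable", "approximate", "local-only"]

prov = RunProvenance(
    benchmark="spider-2.0-lite",
    benchmark_version="pinned-release",  # placeholder pin
    split="dev",
    dialect="sqlite",
    scorer="execution-accuracy",
    environment="local-docker",
    comparability="approximate",
)
record = asdict(prov)  # written alongside the run's durable artifacts
```

Making the record frozen and typed keeps "official-comparable vs approximate vs local-only" a mandatory, queryable field rather than a dashboard footnote.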
Separate normalization from benchmark-specific truth
We want a common internal contract, but not at the cost of lying about what a benchmark means. If a benchmark has special settings or caveats, those should survive in metadata and dashboard slices.
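One way to hold both sides of that tension is a normalizer that maps every benchmark onto a common shape while passing benchmark-specific settings through verbatim. The structure below is a sketch, not the current internal contract:

```python
# Sketch: normalize results into one common shape, but keep
# benchmark-specific caveats in a passthrough field so dashboard
# slices can still see them. Shape is an assumption.

def normalize_result(benchmark: str, raw: dict) -> dict:
    return {
        "benchmark": benchmark,
        "score": raw["score"],                  # common numeric field
        "metric": raw["metric"],                # what the score means
        "benchmark_meta": raw.get("meta", {}),  # caveats survive verbatim
    }

# Example: a BIRD-style run scored with external knowledge enabled;
# that setting stays queryable instead of being flattened away.
r = normalize_result("bird-mini-dev", {
    "score": 0.41,
    "metric": "execution-accuracy",
    "meta": {"external_knowledge": True},
})
```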
Expected code surfaces
apps/evals/sql/ for runner/backends/scorer integration
apps/evals/data/ or adjacent benchmark-specific data roots
apps/evals/faces/ for provenance-aware dashboards
tests/evals/ for loader, scoring, and dashboard coverage
Validation strategy
Each benchmark integration should prove:
- benchmark cases load successfully
- a smoke baseline run writes durable artifacts
- the leaderboard reads those artifacts
- provenance fields are visible and queryable
- the local run mode is clearly documented as official-comparable or not
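The checklist above could be exercised by a smoke test that inspects a run's artifact directory. The artifact filename, JSON layout, and helper name here are assumptions about the eventual artifact format:

```python
import json
from pathlib import Path

# Smoke-test sketch: after a baseline run, assert that durable
# artifacts exist and that provenance fields are present and valid.
# "run.json" and its layout are hypothetical, not the real format.

def smoke_check(artifact_dir: Path) -> None:
    run_file = artifact_dir / "run.json"
    assert run_file.exists(), "baseline run must write durable artifacts"
    run = json.loads(run_file.read_text())
    for field in ("benchmark", "split", "dialect", "scorer", "comparability"):
        assert field in run["provenance"], f"missing provenance field: {field}"
    assert run["provenance"]["comparability"] in (
        "official-comparable", "approximate", "local-only"
    )
```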