Spec
Goal
Add public text-to-SQL benchmarks to apps/evals in a way that:
- reuses the current eval runner and dashboard architecture where possible
- preserves benchmark-specific semantics instead of flattening everything into one misleading score
- makes external runs auditable by benchmark, split, dialect, scorer, and environment
Primary success criteria
- At least two external benchmarks can be run from the local eval system with reproducible baseline commands.
- External benchmark runs land in a durable artifact path and appear in the leaderboard with explicit provenance.
- Internal and external runs can be compared side-by-side without pretending they are identical tasks.
- The adapter work improves the modularity of the internal eval system instead of becoming one-off glue code.
Phase 1 benchmark set
Required
- BIRD mini-dev
- Spider 2.0-Lite
Second wave
- LiveSQLBench open development releases
Explicitly deferred
- Spider 2.0-Snow
- Spider 2.0-DBT
- BIRD-Interact
- hidden-test submission automation
Architectural constraints
Reuse the existing eval engine
The integration should prefer:
- benchmark-specific loaders
- benchmark-specific scorer wrappers
- benchmark-specific environment profiles
It should not fork apps/evals into unrelated per-benchmark stacks.
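A minimal sketch of what "benchmark-specific pieces behind one shared runner" could look like, assuming a registry-style adapter. All names (`BenchmarkAdapter`, `register`, the field names) are illustrative, not the existing eval-engine API:

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical adapter contract: each benchmark supplies its own loader,
# scorer wrapper, and environment profile, but plugs into one shared
# runner instead of forking apps/evals per benchmark.

@dataclass
class BenchmarkAdapter:
    name: str
    load_cases: Callable[[str], Iterable[dict]]  # split -> benchmark cases
    score: Callable[[dict, str], dict]           # (case, predicted_sql) -> result
    environment: str                             # e.g. "sqlite-local"

REGISTRY: dict[str, BenchmarkAdapter] = {}

def register(adapter: BenchmarkAdapter) -> None:
    REGISTRY[adapter.name] = adapter

# A BIRD mini-dev adapter would register roughly like this
# (stub callables stand in for the real loader and scorer):
register(BenchmarkAdapter(
    name="bird-mini-dev",
    load_cases=lambda split: [],
    score=lambda case, sql: {"correct": False},
    environment="sqlite-local",
))
```

The point of the registry shape is that adding Spider 2.0-Lite later is one more `register` call, not a second runner.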
Preserve benchmark provenance
Every run should record at least:
- benchmark name
- benchmark version
- split
- dialect
- scorer type
- environment type
- whether the run is official-comparable, approximate, or local-only
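The minimum record above could be enforced as a small frozen schema serialized next to each run's artifacts. Field names and the version string below are assumptions for illustration:

```python
from dataclasses import dataclass, asdict
from typing import Literal

# Sketch of the minimum provenance record attached to every run.
# Field names are assumptions, not the existing artifact schema.

@dataclass(frozen=True)
class RunProvenance:
    benchmark: str          # e.g. "spider-2.0-lite"
    benchmark_version: str  # pinned dataset release
    split: str              # e.g. "dev"
    dialect: str            # e.g. "sqlite", "bigquery"
    scorer: str             # e.g. "execution-accuracy"
    environment: str        # e.g. "local-docker"
    comparability: Literal["official-comparable", "approximate", "local-only"]

prov = RunProvenance(
    benchmark="spider-2.0-lite",
    benchmark_version="pinned-release",  # placeholder pin
    split="dev",
    dialect="sqlite",
    scorer="execution-accuracy",
    environment="local-docker",
    comparability="approximate",
)
record = asdict(prov)  # written alongside the run's durable artifacts
```

Making the record frozen and typed keeps "official-comparable vs approximate vs local-only" a mandatory, queryable field rather than a dashboard footnote.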
Separate normalization from benchmark-specific truth
We want a common internal contract, but not at the cost of lying about what a benchmark means. If a benchmark has special settings or caveats, those should survive in metadata and dashboard slices.
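One way to hold both sides of that tension is a normalizer that maps every benchmark onto a common shape while passing benchmark-specific settings through verbatim. The structure below is a sketch, not the current internal contract:

```python
# Sketch: normalize results into one common shape, but keep
# benchmark-specific caveats in a passthrough field so dashboard
# slices can still see them. Shape is an assumption.

def normalize_result(benchmark: str, raw: dict) -> dict:
    return {
        "benchmark": benchmark,
        "score": raw["score"],                  # common numeric field
        "metric": raw["metric"],                # what the score means
        "benchmark_meta": raw.get("meta", {}),  # caveats survive verbatim
    }

# Example: a BIRD-style run scored with external knowledge enabled;
# that setting stays queryable instead of being flattened away.
r = normalize_result("bird-mini-dev", {
    "score": 0.41,
    "metric": "execution-accuracy",
    "meta": {"external_knowledge": True},
})
```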
Expected code surfaces
apps/evals/sql/ for runner/backends/scorer integration
apps/evals/data/ or adjacent benchmark-specific data roots
apps/evals/faces/ for provenance-aware dashboards
tests/evals/ for loader, scoring, and dashboard coverage
Validation strategy
Each benchmark integration should prove:
- benchmark cases load successfully
- a smoke baseline run writes durable artifacts
- the leaderboard reads those artifacts
- provenance fields are visible and queryable
- the local run mode is clearly documented as official-comparable or not
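The checklist above could be exercised by a smoke test that inspects a run's artifact directory. The artifact filename, JSON layout, and helper name here are assumptions about the eventual artifact format:

```python
import json
from pathlib import Path

# Smoke-test sketch: after a baseline run, assert that durable
# artifacts exist and that provenance fields are present and valid.
# "run.json" and its layout are hypothetical, not the real format.

def smoke_check(artifact_dir: Path) -> None:
    run_file = artifact_dir / "run.json"
    assert run_file.exists(), "baseline run must write durable artifacts"
    run = json.loads(run_file.read_text())
    for field in ("benchmark", "split", "dialect", "scorer", "comparability"):
        assert field in run["provenance"], f"missing provenance field: {field}"
    assert run["provenance"]["comparability"] in (
        "official-comparable", "approximate", "local-only"
    )
```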