External Text-to-SQL Benchmarks and SOTA Calibration
Objective
Integrate public text-to-SQL benchmarks such as BIRD and Spider 2.0 into apps/evals, normalize their artifacts and provenance, and use them as external calibration alongside the internal dbt benchmark.
Why this initiative exists
The internal dbt/Fivetran benchmark is still the right fast loop for Dataface. It is cheap, controllable, and directly tied to the schemas and failure modes we care about. But it is not sufficient on its own if we want to make credible claims about quality beyond our own data, or to compare our stack against the strongest public text-to-SQL systems.
This initiative exists to add that missing layer:
- public benchmark integration as external calibration
- benchmark-aware provenance so runs are comparable and auditable
- external leaderboard slices that sit beside, not instead of, the internal benchmark
Milestone placement
This initiative belongs in M2, not M1. M1 only needs a usable internal analyst workflow. External benchmark integration becomes important once we are trying to harden quality, compare against the outside world, and decide whether retrieval, planning, repair, and richer context are meaningfully closing the SOTA gap.
Scope
In scope
- selecting the first external benchmarks to adopt
- defining normalized internal contracts for external benchmark cases and results
- integrating a first-wave pair of public benchmarks into apps/evals
- extending leaderboard dashboards with explicit benchmark provenance
- documenting baseline execution workflows and constraints
Out of scope
- claiming official leaderboard parity immediately
- integrating every public benchmark at once
- hidden-test submission automation as a first milestone
- interactive/agentic benchmark harnesses before the simpler query-level integrations work
Tasks
- Plan external text-to-SQL benchmark adoption order and constraints — P1 — Decide the phase-1 benchmark set, environment assumptions, and adoption order
- Define normalized external benchmark case and result contracts for apps/evals — P1 — Add the internal contract and provenance model that external adapters will share
- Add BIRD mini-dev adapter and runner support — P1 — First public benchmark integration with reproducible local baseline runs
- Add Spider 2.0 Lite adapter and runner support — P1 — Enterprise-scale benchmark integration for long-context calibration
- Add external benchmark provenance dashboards and cross-benchmark slices — P1 — Show benchmark, split, dialect, and scorer provenance in the leaderboard
- Add LiveSQLBench adapter and release-tracking workflow — P2 — Second-wave dynamic benchmark integration with release-awareness
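The adapter tasks above share one shape: each adapter turns a benchmark release into normalized case records, and a common runner consumes any adapter through that shared interface. A rough sketch under that assumption, with all names, fields, and the sample record hypothetical:

```python
from typing import Iterator, Protocol

class BenchmarkAdapter(Protocol):
    """Interface every external benchmark adapter would implement (hypothetical)."""
    name: str

    def load_cases(self) -> Iterator[dict]:
        """Yield normalized case records carrying benchmark/split/dialect provenance."""
        ...

class BirdMiniDevAdapter:
    name = "bird-mini-dev"

    def load_cases(self) -> Iterator[dict]:
        # A real adapter would parse the downloaded BIRD mini-dev artifacts;
        # this hard-coded record stands in for the parsed output.
        yield {
            "benchmark": self.name,
            "split": "mini-dev",
            "case_id": "bird-0001",
            "dialect": "sqlite",
            "question": "How many users signed up in 2023?",
        }

def run_adapter(adapter: BenchmarkAdapter) -> list[dict]:
    # The runner depends only on the protocol, never on a specific benchmark,
    # so Spider 2.0 Lite and LiveSQLBench adapters slot in without runner changes.
    return list(adapter.load_cases())
```

Structural typing via `Protocol` keeps adapters decoupled: a second-wave LiveSQLBench adapter only needs the same `name` and `load_cases` surface.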
Dependency graph
benchmark adoption plan
↓
normalized external contracts
↓
├── BIRD mini-dev adapter
├── Spider 2.0 Lite adapter
└── LiveSQLBench workflow
↓
cross-benchmark provenance dashboards
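The final node in the graph, cross-benchmark provenance dashboards, reduces to grouping scored results by their provenance fields and computing per-slice accuracy. A minimal sketch, assuming illustrative record fields rather than any existing apps/evals schema:

```python
from collections import defaultdict

def provenance_slices(results: list[dict]) -> dict[tuple, float]:
    """Accuracy per (benchmark, split, scorer) slice; field names are illustrative."""
    totals: dict[tuple, list[int]] = defaultdict(lambda: [0, 0])  # [n, hits]
    for r in results:
        key = (r["benchmark"], r["split"], r["scorer"])
        totals[key][0] += 1
        totals[key][1] += int(r["correct"])
    return {key: hits / n for key, (n, hits) in totals.items()}

# Hypothetical scored results from two benchmark runs.
results = [
    {"benchmark": "bird-mini-dev", "split": "dev", "scorer": "execution-match", "correct": True},
    {"benchmark": "bird-mini-dev", "split": "dev", "scorer": "execution-match", "correct": False},
    {"benchmark": "spider2-lite", "split": "lite", "scorer": "execution-match", "correct": True},
]
slices = provenance_slices(results)
# slices[("bird-mini-dev", "dev", "execution-match")] == 0.5
```

Keeping the scorer in the slice key is what stops an execution-match number from being silently compared against a string-match number across benchmarks.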
Relationship to existing initiatives
- Builds on Benchmark-Driven Text-to-SQL and Discovery Evals, which creates the internal eval system.
- Feeds AI Quality Experimentation and Context Optimization, which uses the measurement system once it exists.
- Provides external calibration for retrieval, planning, repair, and context work coming from context-catalog-nimble.