External Text-to-SQL Benchmarks and SOTA Calibration
Objective
Integrate public text-to-SQL benchmarks such as BIRD and Spider 2.0 into apps/evals, normalize their artifacts and provenance, and use them as external calibration alongside the internal dbt benchmark.
Why this initiative exists
The internal dbt/Fivetran benchmark is still the right fast loop for Dataface. It is cheap, controllable, and directly tied to the schemas and failure modes we care about. But it is not sufficient on its own if we want to make credible claims about quality beyond our own data, or to compare our stack against the strongest public text-to-SQL systems.
This initiative exists to add that missing layer:
- public benchmark integration as external calibration
- benchmark-aware provenance so runs are comparable and auditable
- external leaderboard slices that sit beside, not instead of, the internal benchmark
Milestone placement
This initiative belongs in M2, not M1. M1 only needs a usable internal analyst workflow. External benchmark integration becomes important once we are trying to harden quality, compare against the outside world, and decide whether retrieval, planning, repair, and richer context are meaningfully closing the SOTA gap.
Scope
In scope
- selecting the first external benchmarks to adopt
- defining normalized internal contracts for external benchmark cases and results
- integrating a first-wave pair of public benchmarks into apps/evals
- extending leaderboard dashboards with explicit benchmark provenance
- documenting baseline execution workflows and constraints
Out of scope
- claiming official leaderboard parity immediately
- integrating every public benchmark at once
- hidden-test submission automation as a first milestone
- interactive/agentic benchmark harnesses before the simpler query-level integrations work
Tasks
- Plan external text-to-SQL benchmark adoption order and constraints — P1 — Decide the phase-1 benchmark set, environment assumptions, and adoption order
- Define normalized external benchmark case and result contracts for apps/evals — P1 — Add the internal contract and provenance model that external adapters will share
- Add BIRD mini-dev adapter and runner support — P1 — First public benchmark integration with reproducible local baseline runs
- Add Spider 2.0 Lite adapter and runner support — P1 — Enterprise-scale benchmark integration for long-context calibration
- Add external benchmark provenance dashboards and cross-benchmark slices — P1 — Show benchmark, split, dialect, and scorer provenance in the leaderboard
- Add LiveSQLBench adapter and release-tracking workflow — P2 — Second-wave dynamic benchmark integration with release-awareness
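The adapter tasks above share one shape: each adapter turns a benchmark release into normalized case records, and a common runner consumes any adapter through that shared interface. A rough sketch under that assumption, with all names, fields, and the sample record hypothetical:

```python
from typing import Iterator, Protocol

class BenchmarkAdapter(Protocol):
    """Interface every external benchmark adapter would implement (hypothetical)."""
    name: str

    def load_cases(self) -> Iterator[dict]:
        """Yield normalized case records carrying benchmark/split/dialect provenance."""
        ...

class BirdMiniDevAdapter:
    name = "bird-mini-dev"

    def load_cases(self) -> Iterator[dict]:
        # A real adapter would parse the downloaded BIRD mini-dev artifacts;
        # this hard-coded record stands in for the parsed output.
        yield {
            "benchmark": self.name,
            "split": "mini-dev",
            "case_id": "bird-0001",
            "dialect": "sqlite",
            "question": "How many users signed up in 2023?",
        }

def run_adapter(adapter: BenchmarkAdapter) -> list[dict]:
    # The runner depends only on the protocol, never on a specific benchmark,
    # so Spider 2.0 Lite and LiveSQLBench adapters slot in without runner changes.
    return list(adapter.load_cases())
```

Structural typing via `Protocol` keeps adapters decoupled: a second-wave LiveSQLBench adapter only needs the same `name` and `load_cases` surface.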
Dependency graph
benchmark adoption plan
↓
normalized external contracts
↓
├── BIRD mini-dev adapter
├── Spider 2.0 Lite adapter
└── LiveSQLBench workflow
↓
cross-benchmark provenance dashboards
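The final node in the graph, cross-benchmark provenance dashboards, reduces to grouping scored results by their provenance fields and computing per-slice accuracy. A minimal sketch, assuming illustrative record fields rather than any existing apps/evals schema:

```python
from collections import defaultdict

def provenance_slices(results: list[dict]) -> dict[tuple, float]:
    """Accuracy per (benchmark, split, scorer) slice; field names are illustrative."""
    totals: dict[tuple, list[int]] = defaultdict(lambda: [0, 0])  # [n, hits]
    for r in results:
        key = (r["benchmark"], r["split"], r["scorer"])
        totals[key][0] += 1
        totals[key][1] += int(r["correct"])
    return {key: hits / n for key, (n, hits) in totals.items()}

# Hypothetical scored results from two benchmark runs.
results = [
    {"benchmark": "bird-mini-dev", "split": "dev", "scorer": "execution-match", "correct": True},
    {"benchmark": "bird-mini-dev", "split": "dev", "scorer": "execution-match", "correct": False},
    {"benchmark": "spider2-lite", "split": "lite", "scorer": "execution-match", "correct": True},
]
slices = provenance_slices(results)
# slices[("bird-mini-dev", "dev", "execution-match")] == 0.5
```

Keeping the scorer in the slice key is what stops an execution-match number from being silently compared against a string-match number across benchmarks.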
Relationship to existing initiatives
- Builds on Benchmark-Driven Text-to-SQL and Discovery Evals, which creates the internal eval system.
- Feeds AI Quality Experimentation and Context Optimization, which uses the measurement system once it exists.
- Provides external calibration for retrieval, planning, repair, and context work coming from context-catalog-nimble.