---
type: initiative
slug: external-text-to-sql-benchmarks-and-sota-calibration
title: External Text-to-SQL Benchmarks and SOTA Calibration
workstream: mcp-analyst-agent
owner: data-ai-engineer-architect
status: planned
milestone: m2-internal-adoption-design-partners
---
{{ initiative_progress_bar("mcp-analyst-agent", "external-text-to-sql-benchmarks-and-sota-calibration") }}
Integrate public text-to-SQL benchmarks such as BIRD and Spider 2.0 into apps/evals, normalize their artifacts and provenance, and use them as external calibration alongside the internal dbt benchmark.
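To make "normalize their artifacts and provenance" concrete, here is a minimal Python sketch of what a shared record contract could look like. The `NormalizedExample` and `Provenance` names and their exact fields are illustrative assumptions, not an existing apps/evals schema.

```python
from dataclasses import dataclass, field


@dataclass
class Provenance:
    """Where an example came from upstream (hypothetical shape)."""
    source_benchmark: str  # e.g. "bird-mini-dev" or "spider-2.0-lite"
    source_split: str      # upstream split name, e.g. "dev"
    source_id: str         # the example's id in the upstream release
    source_version: str    # upstream release tag or commit


@dataclass
class NormalizedExample:
    """One text-to-SQL example in a shared contract (hypothetical)."""
    question: str           # natural-language question
    gold_sql: str           # reference SQL from the upstream benchmark
    database_id: str        # database/schema the query runs against
    dialect: str            # e.g. "sqlite", "bigquery", "snowflake"
    provenance: Provenance  # retained so dashboards can slice by source
    metadata: dict = field(default_factory=dict)  # benchmark-specific extras
```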
The internal dbt/Fivetran benchmark is still the right fast loop for Dataface. It is cheap, controllable, and directly tied to the schemas and failure modes we care about. But it is not enough if we want credible claims about external quality or want to compare our stack against the strongest public text-to-SQL systems.
This initiative exists to add that missing layer: external benchmarks that calibrate our internal results against the strongest public text-to-SQL systems.
This initiative belongs in M2, not M1. M1 only needs a usable internal analyst workflow. External benchmark integration becomes important once we are trying to harden quality, compare against the outside world, and decide whether retrieval, planning, repair, and richer context are meaningfully closing the SOTA gap.
```text
apps/evals benchmark adoption plan
        ↓
normalized external contracts
        ↓
  ├── BIRD mini-dev adapter
  ├── Spider 2.0-Lite adapter
  └── LiveSQLBench workflow
        ↓
cross-benchmark provenance dashboards
```
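Each adapter in the diagram could then implement one small shared interface. The sketch below reuses the hypothetical `NormalizedExample` and `Provenance` types from above; the `BenchmarkAdapter` protocol, the `mini_dev.json` file name, and the JSON keys are assumptions about the upstream BIRD release layout, not verified apps/evals code.

```python
import json
from pathlib import Path
from typing import Iterable, Protocol


class BenchmarkAdapter(Protocol):
    """Hypothetical shared interface: map a benchmark's native
    release files into the normalized contract."""
    name: str

    def load(self, root: Path) -> Iterable[NormalizedExample]: ...


class BirdMiniDevAdapter:
    """Sketch of a BIRD mini-dev adapter (file name and keys assumed)."""
    name = "bird-mini-dev"

    def load(self, root: Path) -> Iterable[NormalizedExample]:
        raw = json.loads((root / "mini_dev.json").read_text())
        for item in raw:
            yield NormalizedExample(
                question=item["question"],
                gold_sql=item["SQL"],
                database_id=item["db_id"],
                dialect="sqlite",
                provenance=Provenance(
                    source_benchmark=self.name,
                    source_split="mini-dev",
                    source_id=str(item["question_id"]),
                    source_version="unknown",  # take from release metadata
                ),
            )
```

Keeping dialect and provenance on every record is what would let the cross-benchmark dashboards slice results by source without adapter-specific logic.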