

---
type: initiative
slug: external-text-to-sql-benchmarks-and-sota-calibration
title: External Text-to-SQL Benchmarks and SOTA Calibration
workstream: mcp-analyst-agent
owner: data-ai-engineer-architect
status: planned
milestone: m2-internal-adoption-design-partners
---


# External Text-to-SQL Benchmarks and SOTA Calibration

{{ initiative_progress_bar("mcp-analyst-agent", "external-text-to-sql-benchmarks-and-sota-calibration") }}

## Objective

Integrate public text-to-SQL benchmarks such as BIRD and Spider 2.0 into apps/evals, normalize their artifacts and provenance, and use them as external calibration alongside the internal dbt benchmark.
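To make "normalize their artifacts and provenance" concrete, here is a minimal sketch of what a normalized external benchmark case could look like. The field names and the `from_bird_row` mapping are illustrative assumptions, not the actual apps/evals schema; the BIRD key names follow its published dev-set layout and should be verified against the upstream release.

```python
from dataclasses import dataclass

# Hypothetical normalized contract for external benchmark cases.
# Field names are illustrative, not the real apps/evals schema.
@dataclass(frozen=True)
class ExternalBenchmarkCase:
    source: str      # e.g. "bird-mini-dev", "spider2-lite"
    case_id: str     # stable id within the source benchmark
    db_id: str       # database/schema the question targets
    question: str    # natural-language question
    gold_sql: str    # reference SQL from the benchmark
    source_url: str  # provenance: where the artifact was fetched from

def from_bird_row(row: dict) -> ExternalBenchmarkCase:
    """Map a BIRD-style dev row onto the normalized contract.

    Key names (question_id, db_id, question, SQL) follow BIRD's
    public dev.json layout; adjust if the upstream format changes.
    """
    return ExternalBenchmarkCase(
        source="bird-mini-dev",
        case_id=str(row["question_id"]),
        db_id=row["db_id"],
        question=row["question"],
        gold_sql=row["SQL"],
        source_url="https://bird-bench.github.io/",
    )
```

Keeping `source` and `source_url` on every case is what lets downstream reports attribute each score to a specific benchmark release rather than an undifferentiated pool of examples.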

## Why this initiative exists

The internal dbt/Fivetran benchmark is still the right fast loop for Dataface. It is cheap, controllable, and directly tied to the schemas and failure modes we care about. But it is not enough if we want credible claims about external quality or want to compare our stack against the strongest public text-to-SQL systems.

This initiative adds that missing layer: an external calibration signal from public benchmarks, run through the same evaluation harness as the internal dbt benchmark.

## Milestone placement

This initiative belongs in M2, not M1. M1 only needs a usable internal analyst workflow. External benchmark integration becomes important once we are trying to harden quality, compare against the outside world, and decide whether retrieval, planning, repair, and richer context are meaningfully closing the SOTA gap.

## Scope

### In scope

### Out of scope

## Tasks

## Dependency graph

```text
benchmark adoption plan
  ↓
normalized external contracts
  ↓
├── BIRD mini-dev adapter
├── Spider 2.0-Lite adapter
└── LiveSQLBench workflow
  ↓
cross-benchmark provenance dashboards
```
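The graph's fan-out from "normalized external contracts" into per-benchmark adapters can be sketched as one shared interface that each benchmark implements, so the dashboards consume a single stream regardless of source. The `BenchmarkAdapter` protocol and the adapter class below are hypothetical names under that assumption, not existing apps/evals code.

```python
from typing import Iterable, Protocol

class BenchmarkAdapter(Protocol):
    """One interface per external benchmark (BIRD mini-dev,
    Spider 2.0-Lite, LiveSQLBench), so downstream provenance
    dashboards consume a single normalized case stream."""
    name: str

    def load_cases(self) -> Iterable[dict]:
        """Yield normalized case dicts (source, question, gold_sql, db_id)."""
        ...

class BirdMiniDevAdapter:
    """Illustrative adapter over already-parsed BIRD-style rows."""
    name = "bird-mini-dev"

    def __init__(self, rows: list[dict]):
        self._rows = rows

    def load_cases(self) -> Iterable[dict]:
        for row in self._rows:
            # Stamp every case with its source so cross-benchmark
            # dashboards can attribute scores to a specific benchmark.
            yield {
                "source": self.name,
                "question": row["question"],
                "gold_sql": row["SQL"],
                "db_id": row["db_id"],
            }
```

Because the protocol is structural, adding Spider 2.0-Lite or LiveSQLBench later means writing one more class with the same two members, with no changes to the dashboard side.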

## Relationship to existing initiatives