Dataface Tasks

External Text-to-SQL Benchmarks and SOTA Calibration

Planned · M2 — Internal Adoption + Design Partners · 0 / 6 (0%)

Objective

Integrate public text-to-SQL benchmarks such as BIRD and Spider 2.0 into apps/evals, normalize their artifacts and provenance, and use them as external calibration alongside the internal dbt benchmark.

Why this initiative exists

The internal dbt/Fivetran benchmark is still the right fast loop for Dataface. It is cheap, controllable, and directly tied to the schemas and failure modes we care about. But it is not enough if we want to make credible claims about external quality or to compare our stack against the strongest public text-to-SQL systems.

This initiative exists to add that missing layer:

  • public benchmark integration as external calibration
  • benchmark-aware provenance so runs are comparable and auditable
  • external leaderboard slices that sit beside, not instead of, the internal benchmark
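
One way to read the "benchmark-aware provenance" bullet: every eval run carries enough pinned fields to say exactly which benchmark, split, and revisions produced it, so leaderboard slices stay comparable across time. A minimal sketch, assuming a Python eval harness; all names and fields here are hypothetical, not a settled schema:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkProvenance:
    # Illustrative fields only; the real contract is still to be designed.
    benchmark: str         # e.g. "BIRD"
    split: str             # e.g. "mini-dev"
    dataset_revision: str  # pin of the upstream benchmark artifact
    harness_commit: str    # pin of our apps/evals harness

    def run_key(self) -> str:
        """Deterministic key so identical setups group together on dashboards."""
        raw = "|".join(
            [self.benchmark, self.split, self.dataset_revision, self.harness_commit]
        )
        return hashlib.sha256(raw.encode()).hexdigest()[:12]

tag = BenchmarkProvenance("BIRD", "mini-dev", "rev-abc123", "deadbeef")
```

Hashing the pinned fields into a short key is one cheap way to make "same setup" an exact, auditable notion rather than a judgment call.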

Milestone placement

This initiative belongs in M2, not M1. M1 only needs a usable internal analyst workflow. External benchmark integration becomes important once we are trying to harden quality, compare against the outside world, and decide whether retrieval, planning, repair, and richer context are meaningfully closing the SOTA gap.

Scope

In scope

  • selecting the first external benchmarks to adopt
  • defining normalized internal contracts for external benchmark cases and results
  • integrating a first wave of public benchmarks into apps/evals
  • extending leaderboard dashboards with explicit benchmark provenance
  • documenting baseline execution workflows and constraints

Out of scope

  • claiming official leaderboard parity immediately
  • integrating every public benchmark at once
  • hidden-test submission automation as a first milestone
  • interactive/agentic benchmark harnesses before the simpler query-level integrations work

Tasks

Dependency graph

benchmark adoption plan
  ↓
normalized external contracts
  ↓
├── BIRD mini-dev adapter
├── Spider 2.0-Lite adapter
└── LiveSQLBench workflow
       ↓
cross-benchmark provenance dashboards

Relationship to existing initiatives