tasks/workstreams/mcp-analyst-agent/initiatives/external-text-to-sql-benchmarks-and-sota-calibration/spec.md

Spec

Goal

Add public text-to-SQL benchmarks to apps/evals in a way that satisfies the success criteria below.

Primary success criteria

  1. At least two external benchmarks can be run from the local eval system with reproducible baseline commands.
  2. External benchmark runs land in a durable artifact path and appear in the leaderboard with explicit provenance.
  3. Internal and external runs can be compared side-by-side without pretending they are identical tasks.
  4. The adapter work improves the modularity of the internal eval system instead of becoming one-off glue code.
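One way to keep the adapter work modular rather than one-off glue is a small shared contract that every external benchmark implements. The names below (`BenchmarkCase`, `BenchmarkAdapter`, the field names) are illustrative assumptions, not decided interfaces:

```python
from dataclasses import dataclass, field
from typing import Iterable, Protocol


@dataclass
class BenchmarkCase:
    """One text-to-SQL example in the shared internal shape."""
    case_id: str
    question: str
    gold_sql: str
    # Benchmark-specific caveats survive here instead of being flattened away.
    extra: dict = field(default_factory=dict)


class BenchmarkAdapter(Protocol):
    """Contract an external benchmark must satisfy to run in the eval engine."""
    name: str

    def load_cases(self) -> Iterable[BenchmarkCase]: ...
    def provenance(self) -> dict: ...
```

Because internal tasks can implement the same protocol, internal and external runs stay comparable in shape without being conflated as identical tasks.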

Phase 1 benchmark set

Required

Second wave

Explicitly deferred

Architectural constraints

Reuse the existing eval engine

The integration should prefer:

It should not fork apps/evals into unrelated per-benchmark stacks.
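A registry is one sketch of how to avoid per-benchmark forks: every benchmark plugs into the single shared engine by name. The function and variable names here are hypothetical:

```python
# Hypothetical adapter registry: external benchmarks register themselves and
# run through the one shared engine instead of forking per-benchmark stacks.
ADAPTERS: dict = {}


def register_benchmark(name: str):
    """Class decorator that makes an adapter discoverable by name."""
    def wrap(cls):
        ADAPTERS[name] = cls
        return cls
    return wrap


def run_benchmark(name: str) -> list:
    """Look up the adapter and hand its cases to the shared engine."""
    adapter = ADAPTERS[name]()          # fails loudly for unknown benchmarks
    return list(adapter.load_cases())   # stand-in for the real engine call
```

Adding a benchmark then means adding one adapter module, not a new stack.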

Preserve benchmark provenance

Every run should record at least:
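The spec does not enumerate the fields here; as one hedged illustration, a run-level provenance record might carry the benchmark identity, data version, harness commit, scoring mode, and an explicit comparability flag (all field names are assumptions):

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class RunProvenance:
    """Hypothetical run-level provenance record; fields are illustrative."""
    benchmark: str             # canonical name of the external benchmark
    dataset_version: str       # release tag or commit of the benchmark data
    split: str                 # which official split was run
    harness_commit: str        # commit of apps/evals that produced the run
    scoring_mode: str          # e.g. execution accuracy vs. exact match
    official_comparable: bool  # whether settings match the official protocol
    started_at: str = ""

    def to_record(self) -> dict:
        d = asdict(self)
        d["started_at"] = d["started_at"] or datetime.now(timezone.utc).isoformat()
        return d
```

Storing this alongside every artifact is what lets the leaderboard show explicit provenance instead of bare scores.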

Separate normalization from benchmark-specific truth

We want a common internal contract, but not at the cost of misrepresenting what a benchmark actually measures. If a benchmark has special settings or caveats, those should survive in run metadata and dashboard slices.
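A minimal sketch of this separation, assuming hypothetical field names: normalization maps the benchmark-native record into the shared shape, and anything it does not recognize lands in metadata instead of being dropped.

```python
def normalize_case(raw: dict, benchmark: str) -> dict:
    """Map a benchmark-native record into the shared internal shape.

    Field names are illustrative; the point is that unrecognized,
    benchmark-specific fields survive in metadata rather than being dropped.
    """
    mapped = {"question": "question", "sql": "gold_sql", "db_id": "database"}
    case = {"benchmark": benchmark}
    case.update({v: raw[k] for k, v in mapped.items() if k in raw})
    case["metadata"] = {k: v for k, v in raw.items() if k not in mapped}
    return case
```

Dashboard slices can then filter on `metadata` without the common contract having to pretend those caveats do not exist.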

Expected code surfaces

Validation strategy

Each benchmark integration should prove:

  1. benchmark cases load successfully
  2. a smoke baseline run writes durable artifacts
  3. the leaderboard reads those artifacts
  4. provenance fields are visible and queryable
  5. the local run mode is clearly documented as either comparable to the official evaluation protocol or not
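Points 2 through 4 above can be smoke-checked mechanically. A minimal validation sketch, assuming a hypothetical `run.json` artifact layout and field names:

```python
import json
from pathlib import Path


def smoke_check(artifact_dir: Path) -> dict:
    """Verify a smoke run wrote a durable artifact whose provenance
    fields are present and queryable. Paths and fields are hypothetical."""
    run_file = artifact_dir / "run.json"
    assert run_file.exists(), "smoke run wrote no durable artifact"
    run = json.loads(run_file.read_text())
    prov = run.get("provenance", {})
    for key in ("benchmark", "dataset_version", "official_comparable"):
        assert key in prov, f"provenance missing {key!r}"
    return run
```

Running this after each benchmark's smoke baseline gives a cheap, repeatable gate before the leaderboard ever reads the artifacts.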