Decisions

D1. Treat external benchmarks as calibration, not replacement

Status: accepted

The internal dbt benchmark remains the primary fast loop. External benchmarks calibrate its results against public SOTA; they do not replace it.

D2. Adopt two phase-1 benchmarks before broader expansion

Status: accepted

Phase 1 focuses on:

BIRD mini-dev
Spider 2.0-Lite

This pairing provides one broadly recognized hard benchmark (BIRD mini-dev) and one enterprise-style benchmark (Spider 2.0-Lite) before taking on more dynamic benchmark families.

D3. Make provenance explicit in the result contract

Status: accepted

Runs must carry benchmark name, split, version, dialect, scorer, and environment provenance so leaderboard comparisons are honest and reviewable.
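
As an illustration only, the provenance fields listed above could be carried by a small immutable record that rejects empty values. This is a minimal sketch, not the project's actual result contract: the class name `RunProvenance`, the field names, the `to_json` helper, and all example values are assumptions introduced here.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class RunProvenance:
    """Hypothetical provenance record attached to one benchmark run."""

    benchmark: str    # benchmark name
    split: str        # evaluated split
    version: str      # benchmark or dataset version
    dialect: str      # SQL dialect the run targeted
    scorer: str       # scoring method used
    environment: str  # where and how the run was executed

    def __post_init__(self) -> None:
        # Reject runs with missing provenance so every result stays reviewable.
        missing = [
            f.name
            for f in fields(self)
            if getattr(self, f.name) is None or not str(getattr(self, f.name)).strip()
        ]
        if missing:
            raise ValueError(f"missing provenance fields: {', '.join(missing)}")

    def to_json(self) -> str:
        # Stable key order makes stored results easy to diff during review.
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values for illustration only; not real run metadata.
example = RunProvenance(
    benchmark="bird-mini-dev",
    split="dev",
    version="placeholder-version",
    dialect="sqlite",
    scorer="placeholder-scorer",
    environment="placeholder-environment",
)
print(example.to_json())
```

Failing at construction time, rather than when results are compared, keeps incomplete runs out of any stored result set in this sketch.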

D4. Defer dynamic and interactive benchmarks to follow-on work

Status: accepted

LiveSQLBench is a second-wave integration. Interactive and hosted benchmark settings are explicitly deferred until query-level integrations are stable.
