tasks/workstreams/mcp-analyst-agent/initiatives/external-text-to-sql-benchmarks-and-sota-calibration/research.md

Research

Why external benchmarks matter

The internal dbt benchmark answers "are we improving on our own schemas and cases?" External benchmarks answer "are we improving relative to the public text-to-SQL frontier?" Both are useful, and they serve different purposes.

Candidate benchmarks

BIRD

Spider 2.0-Lite

LiveSQLBench

Internal references

Main design thesis

The current apps/evals architecture is already close to what external benchmark adapters need. The likely work is:

not a wholesale rewrite.
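To make "external benchmark adapter" concrete, here is a minimal sketch of one possible shape. Every name in it (`EvalCase`, `BenchmarkAdapter`, `InMemoryAdapter`, and the `question` / `sql` row keys) is hypothetical and invented for illustration. None of it is taken from apps/evals or from any benchmark's actual file format.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Mapping, Protocol


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical common case shape that an eval runner could consume."""
    case_id: str
    question: str
    reference_sql: str
    source: str  # name of the benchmark the case came from


class BenchmarkAdapter(Protocol):
    """Hypothetical interface: one adapter per external benchmark."""
    name: str

    def cases(self) -> Iterator[EvalCase]: ...


class InMemoryAdapter:
    """Toy adapter over already-loaded rows (assumed keys: 'question', 'sql')."""

    def __init__(self, name: str, rows: Iterable[Mapping[str, str]]) -> None:
        self.name = name
        self._rows = list(rows)

    def cases(self) -> Iterator[EvalCase]:
        # Tag every case with its source so downstream code can tell
        # benchmarks apart instead of treating all cases as interchangeable.
        for index, row in enumerate(self._rows):
            yield EvalCase(
                case_id=f"{self.name}:{index}",
                question=row["question"],
                reference_sql=row["sql"],
                source=self.name,
            )
```

Under these assumptions, supporting another benchmark would mean writing another adapter class, and the runner would depend only on `EvalCase`.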

Key risk

The biggest integration risk is not data loading. It is accidentally flattening benchmark-specific semantics into generic "pass rate" numbers that look comparable but are not. Provenance, meaning a record of which benchmark and scoring rule produced each number, needs to be first-class from day one.
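One way to picture first-class provenance is a result record that cannot exist without it, plus a summary step that only pools results with identical provenance. The sketch below is an assumption-laden illustration, not the apps/evals design. The names `Provenance`, `ScoredRun`, and `summarize`, and the fields `benchmark`, `split`, and `metric`, are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance key attached to every score."""
    benchmark: str  # which benchmark produced the number
    split: str      # which subset of that benchmark was run
    metric: str     # which scoring rule was applied


@dataclass(frozen=True)
class ScoredRun:
    provenance: Provenance
    passed: int
    total: int

    def __post_init__(self) -> None:
        # Reject empty or inconsistent counts up front.
        if self.total <= 0 or not 0 <= self.passed <= self.total:
            raise ValueError("need 0 <= passed <= total and total > 0")


def summarize(runs: Iterable[ScoredRun]) -> Dict[Provenance, float]:
    """Return one pass rate per provenance; never pool across provenance."""
    counts: Dict[Provenance, List[int]] = defaultdict(lambda: [0, 0])
    for run in runs:
        counts[run.provenance][0] += run.passed
        counts[run.provenance][1] += run.total
    return {prov: passed / total for prov, (passed, total) in counts.items()}
```

Because the summary is keyed by `Provenance`, there is no single global pass rate to misread. Comparing two benchmarks requires looking at two separately labeled numbers.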