Research
Why external benchmarks matter
The internal dbt benchmark answers "are we improving on our own schemas and cases?" External benchmarks answer "are we improving relative to the public text-to-SQL frontier?" Both are useful; they answer different questions.
Candidate benchmarks
BIRD
- Strong public benchmark for hard one-shot text-to-SQL on large, messy, real-world databases
- Good first target because it has strong public visibility and manageable subsets such as mini-dev
- Useful for testing hallucination reduction, schema linking quality, and benchmark portability
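As a concreteness check on the "manageable subsets" point, a BIRD mini-dev loader is small. This is a hedged sketch: the field names (`question_id`, `db_id`, `question`, `evidence`, `SQL`) follow BIRD's public dev JSON format, but the `BirdCase` type and `load_bird` function are hypothetical names, not part of any existing code here.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class BirdCase:
    question_id: int
    db_id: str
    question: str
    evidence: str  # BIRD's external-knowledge hint, often needed for correct SQL
    gold_sql: str


def load_bird(path: Path) -> list[BirdCase]:
    """Load a BIRD dev / mini-dev JSON file into eval cases."""
    raw = json.loads(path.read_text())
    return [
        BirdCase(
            question_id=item["question_id"],
            db_id=item["db_id"],
            question=item["question"],
            evidence=item.get("evidence", ""),
            gold_sql=item["SQL"],
        )
        for item in raw
    ]
```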
Spider 2.0-Lite
- Strong enterprise-style benchmark for large-schema, long-context text-to-SQL
- More realistic than classic Spider
- Useful for testing whether retrieval, narrowing, and planning survive a harder public workload
LiveSQLBench
- Interesting because it is dynamic, release-based, and explicitly contamination-aware
- Best treated as a second-wave target, because it adds release/version workflow complexity on top of the adapter work
Internal references
ai_notes/research/TEXT_TO_SQL_EVALS_LANDSCAPE.md
ai_notes/research/TEXT_TO_SQL_SOTA_METHODS.md
ai_notes/research/SOTA/dataface-gap-analysis.md
Main design thesis
The current apps/evals architecture is already close to what external benchmark adapters need. The likely work is:
- loaders
- scorer wrappers
- provenance fields
- dashboard slices

That is adapter work, not a wholesale rewrite.
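The adapter surface implied by that list can be sketched as a small protocol. This is a sketch under assumptions: `EvalCase` and `BenchmarkAdapter` are hypothetical names, not the actual apps/evals types, and the key point is only that provenance fields live on the case itself rather than being bolted on at reporting time.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass
class EvalCase:
    case_id: str
    question: str
    gold_sql: str
    # Provenance travels with every case, not just the aggregate score.
    benchmark: str = "internal"
    benchmark_version: str = "n/a"


class BenchmarkAdapter(Protocol):
    """One adapter per external benchmark: a loader plus a scorer wrapper."""

    name: str

    def load(self) -> Iterable[EvalCase]:
        """Yield cases tagged with this benchmark's provenance."""
        ...

    def score(self, case: EvalCase, predicted_sql: str) -> bool:
        """Apply the benchmark's own notion of correctness."""
        ...
```

Dashboard slices then fall out of grouping results by the provenance fields, rather than requiring new plumbing per benchmark.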
Key risk
The biggest integration risk is not data loading. It is accidentally flattening benchmark-specific semantics into generic "pass rate" numbers that look comparable but are not. Provenance needs to be first-class from day one.
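One way to make that flattening structurally impossible is to key every aggregate by benchmark, version, and metric, with no path to a single global number. A minimal sketch, with hypothetical names (`ResultKey`, `aggregate`); the metric names in the comments are illustrative:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultKey:
    benchmark: str  # e.g. "bird-mini-dev" vs "spider2-lite"
    version: str    # release or split identifier
    metric: str     # e.g. execution accuracy vs exact match


def aggregate(results: list[tuple[ResultKey, bool]]) -> dict[ResultKey, float]:
    """Pass rates keyed by (benchmark, version, metric).

    Deliberately offers no single combined number: results under
    different keys are never averaged together, so scores that are
    not semantically comparable cannot be silently flattened.
    """
    buckets: dict[ResultKey, list[bool]] = defaultdict(list)
    for key, passed in results:
        buckets[key].append(passed)
    return {k: sum(v) / len(v) for k, v in buckets.items()}
```

The dashboard then has to choose a key to display, which keeps the "which benchmark, which version, which metric" question visible instead of hidden behind one pass rate.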