Research
Why external benchmarks matter
The internal dbt benchmark answers "are we improving on our own schemas and cases?" External benchmarks answer "are we improving relative to the public text-to-SQL frontier?" Both are useful; they answer different questions.
Candidate benchmarks
BIRD
- Strong public benchmark for hard one-shot text-to-SQL on large, messy, real-world databases
- Good first target because it has strong public visibility and manageable subsets such as mini-dev
- Useful for testing hallucination reduction, schema linking quality, and benchmark portability
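As a concreteness check on the "manageable subsets" point, a BIRD mini-dev loader is small. This is a hedged sketch: the field names (`question_id`, `db_id`, `question`, `evidence`, `SQL`) follow BIRD's public dev JSON format, but the `BirdCase` type and `load_bird` function are hypothetical names, not part of any existing code here.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class BirdCase:
    question_id: int
    db_id: str
    question: str
    evidence: str  # BIRD's external-knowledge hint, often needed for correct SQL
    gold_sql: str


def load_bird(path: Path) -> list[BirdCase]:
    """Load a BIRD dev / mini-dev JSON file into eval cases."""
    raw = json.loads(path.read_text())
    return [
        BirdCase(
            question_id=item["question_id"],
            db_id=item["db_id"],
            question=item["question"],
            evidence=item.get("evidence", ""),
            gold_sql=item["SQL"],
        )
        for item in raw
    ]
```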
Spider 2.0-Lite
- Strong enterprise-style benchmark for large-schema, long-context text-to-SQL
- More realistic than classic Spider
- Useful for testing whether retrieval, narrowing, and planning survive a harder public workload
LiveSQLBench
- Interesting because it is dynamic, release-based, and explicitly contamination-aware
- Best treated as a second-wave target, because it adds release/version workflow complexity on top of the adapter work
Internal references
ai_notes/research/TEXT_TO_SQL_EVALS_LANDSCAPE.md
ai_notes/research/TEXT_TO_SQL_SOTA_METHODS.md
ai_notes/research/SOTA/dataface-gap-analysis.md
Main design thesis
The current apps/evals architecture is already close to what external benchmark adapters need. The likely work is:
- loaders
- scorer wrappers
- provenance fields
- dashboard slices

That is adapter work, not a wholesale rewrite.
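The adapter surface implied by that list can be sketched as a small protocol. This is a sketch under assumptions: `EvalCase` and `BenchmarkAdapter` are hypothetical names, not the actual apps/evals types, and the key point is only that provenance fields live on the case itself rather than being bolted on at reporting time.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass
class EvalCase:
    case_id: str
    question: str
    gold_sql: str
    # Provenance travels with every case, not just the aggregate score.
    benchmark: str = "internal"
    benchmark_version: str = "n/a"


class BenchmarkAdapter(Protocol):
    """One adapter per external benchmark: a loader plus a scorer wrapper."""

    name: str

    def load(self) -> Iterable[EvalCase]:
        """Yield cases tagged with this benchmark's provenance."""
        ...

    def score(self, case: EvalCase, predicted_sql: str) -> bool:
        """Apply the benchmark's own notion of correctness."""
        ...
```

Dashboard slices then fall out of grouping results by the provenance fields, rather than requiring new plumbing per benchmark.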
Key risk
The biggest integration risk is not data loading. It is accidentally flattening benchmark-specific semantics into generic "pass rate" numbers that look comparable but are not. Provenance needs to be first-class from day one.
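One way to make that flattening structurally impossible is to key every aggregate by benchmark, version, and metric, with no path to a single global number. A minimal sketch, with hypothetical names (`ResultKey`, `aggregate`); the metric names in the comments are illustrative:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultKey:
    benchmark: str  # e.g. "bird-mini-dev" vs "spider2-lite"
    version: str    # release or split identifier
    metric: str     # e.g. execution accuracy vs exact match


def aggregate(results: list[tuple[ResultKey, bool]]) -> dict[ResultKey, float]:
    """Pass rates keyed by (benchmark, version, metric).

    Deliberately offers no single combined number: results under
    different keys are never averaged together, so scores that are
    not semantically comparable cannot be silently flattened.
    """
    buckets: dict[ResultKey, list[bool]] = defaultdict(list)
    for key, passed in results:
        buckets[key].append(passed)
    return {k: sum(v) / len(v) for k, v in buckets.items()}
```

The dashboard then has to choose a key to display, which keeps the "which benchmark, which version, which metric" question visible instead of hidden behind one pass rate.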