Plan external text-to-SQL benchmark adoption order and constraints
Problem
Decide which public benchmarks to adopt first, what environments and licenses they require, and what the phase-1 integration sequence should be for apps/evals.
Context
- The repo now has a working internal eval runner and leaderboard under `apps/evals/`.
- Research notes in `ai_notes/research/TEXT_TO_SQL_EVALS_LANDSCAPE.md` and `ai_notes/research/TEXT_TO_SQL_SOTA_METHODS.md` identify BIRD mini-dev and Spider 2.0-Lite as the highest-value first public integrations.
- External benchmarks differ in licensing, local environment assumptions, dialect support, and what "official" scoring means.
- If we skip the planning pass, adapter tasks will likely make inconsistent choices about dataset storage, provenance, and what counts as a faithful local run.
Possible Solutions
- Recommended: write a short adoption plan that selects the phase-1 benchmark pair and records its constraints explicitly. Document environment/licensing/scoring assumptions and define what "good enough for local calibration" means before implementation starts.
Why this is recommended:
- avoids parallel task drift
- gives downstream adapter tasks a clear target
- makes it easier to explain why some benchmark settings are deferred
- Let each adapter task decide its own environment and fidelity rules.
Trade-off: faster to start coding, but high risk of inconsistent local modes and misleading leaderboard comparisons.
- Plan all public benchmarks up front, including interactive and hosted settings.
Trade-off: comprehensive, but likely too broad for the first milestone and slows down useful integrations.
Plan
- Compare candidate benchmarks by:
  - local setup difficulty
  - licensing/access friction
  - public visibility
  - usefulness for Dataface-specific failure modes
- Confirm the phase-1 adoption order:
  - BIRD mini-dev
  - Spider 2.0-Lite
- Record benchmark-specific caveats:
  - what local execution mode is available
  - what official scorer behavior we can and cannot replicate
  - what counts as local-only vs leaderboard-comparable
- Update the initiative docs with the adoption sequence and constraints.
- Leave LiveSQLBench, hosted Spider 2.0 settings, and interactive benchmarks as explicit second-wave work.
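The caveat-recording step above could be kept machine-readable so adapter tasks consume one source of truth instead of re-deciding constraints. A minimal sketch of such a registry; every name, field, and value below is hypothetical and does not reflect the actual apps/evals/ schema or the final recorded caveats:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkAdoption:
    """One external benchmark plus the constraints recorded during planning.

    Field names are illustrative only; the real schema is decided by the
    adoption plan this task produces.
    """
    name: str
    phase: int                    # 1 = first wave, 2 = explicitly deferred
    local_execution: str          # what a faithful local run looks like
    scorer_caveats: str           # official-scorer behavior we cannot replicate
    leaderboard_comparable: bool  # False => results are local calibration only


# Placeholder entries for the phase-1 pair named in the plan.
PHASE_1 = [
    BenchmarkAdoption(
        name="bird_mini_dev",
        phase=1,
        local_execution="TBD: record the supported local mode here",
        scorer_caveats="TBD: note scorer behavior we cannot replicate",
        leaderboard_comparable=False,
    ),
    BenchmarkAdoption(
        name="spider_2_lite",
        phase=1,
        local_execution="TBD: record the supported local mode here",
        scorer_caveats="TBD: note hosted settings we defer",
        leaderboard_comparable=False,
    ),
]


def adoption_order(entries):
    """Return benchmark names sorted by phase, then name, for the docs."""
    return [e.name for e in sorted(entries, key=lambda e: (e.phase, e.name))]
```

Keeping `leaderboard_comparable` as an explicit boolean makes the local-only vs leaderboard-comparable distinction impossible to leave implicit in adapter code.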
Implementation Progress
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - planning/task-definition work.
Review Feedback
- [ ] Review cleared