Plan external text-to-SQL benchmark adoption order and constraints
Problem
Decide which public benchmarks to adopt first, what environments and licenses they require, and what the phase-1 integration sequence should be for apps/evals.
Context
- The repo now has a working internal eval runner and leaderboard under `apps/evals/`.
- Research notes in `ai_notes/research/TEXT_TO_SQL_EVALS_LANDSCAPE.md` and `ai_notes/research/TEXT_TO_SQL_SOTA_METHODS.md` identify BIRD mini-dev and Spider 2.0-Lite as the highest-value first public integrations.
- External benchmarks differ in licensing, local environment assumptions, dialect support, and what "official" scoring means.
- If we skip the planning pass, adapter tasks will likely make inconsistent choices about dataset storage, provenance, and what counts as a faithful local run.
Possible Solutions
- Recommended: write a short adoption plan that selects the phase-1 benchmark pair and records its constraints explicitly. Document environment/licensing/scoring assumptions and define what "good enough for local calibration" means before implementation starts.
Why this is recommended:
- avoids parallel task drift
- gives downstream adapter tasks a clear target
- makes it easier to explain why some benchmark settings are deferred
- Let each adapter task decide its own environment and fidelity rules.
Trade-off: faster to start coding, but high risk of inconsistent local modes and misleading leaderboard comparisons.
- Plan all public benchmarks up front, including interactive and hosted settings.
Trade-off: comprehensive, but likely too broad for the first milestone and slows down useful integrations.
Plan
- Compare candidate benchmarks by:
  - local setup difficulty
  - licensing/access friction
  - public visibility
  - usefulness for Dataface-specific failure modes
- Confirm the phase-1 adoption order:
  - BIRD mini-dev
  - Spider 2.0-Lite
- Record benchmark-specific caveats:
  - what local execution mode is available
  - what official scorer behavior we can and cannot replicate
  - what counts as local-only vs leaderboard-comparable
- Update the initiative docs with the adoption sequence and constraints.
- Leave LiveSQLBench, hosted Spider 2.0 settings, and interactive benchmarks as explicit second-wave work.
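The caveat-recording step above could be kept machine-readable so adapter tasks consume one source of truth instead of re-deciding constraints. A minimal sketch of such a registry; every name, field, and value below is hypothetical and does not reflect the actual apps/evals/ schema or the final recorded caveats:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkAdoption:
    """One external benchmark plus the constraints recorded during planning.

    Field names are illustrative only; the real schema is decided by the
    adoption plan this task produces.
    """
    name: str
    phase: int                    # 1 = first wave, 2 = explicitly deferred
    local_execution: str          # what a faithful local run looks like
    scorer_caveats: str           # official-scorer behavior we cannot replicate
    leaderboard_comparable: bool  # False => results are local calibration only


# Placeholder entries for the phase-1 pair named in the plan.
PHASE_1 = [
    BenchmarkAdoption(
        name="bird_mini_dev",
        phase=1,
        local_execution="TBD: record the supported local mode here",
        scorer_caveats="TBD: note scorer behavior we cannot replicate",
        leaderboard_comparable=False,
    ),
    BenchmarkAdoption(
        name="spider_2_lite",
        phase=1,
        local_execution="TBD: record the supported local mode here",
        scorer_caveats="TBD: note hosted settings we defer",
        leaderboard_comparable=False,
    ),
]


def adoption_order(entries):
    """Return benchmark names sorted by phase, then name, for the docs."""
    return [e.name for e in sorted(entries, key=lambda e: (e.phase, e.name))]
```

Keeping `leaderboard_comparable` as an explicit boolean makes the local-only vs leaderboard-comparable distinction impossible to leave implicit in adapter code.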
Implementation Progress
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - planning/task-definition work.
Review Feedback
- [ ] Review cleared