Dataface Tasks

Plan external text-to-SQL benchmark adoption order and constraints

IDMCP_ANALYST_AGENT-PLAN_EXTERNAL_TEXT_TO_SQL_BENCHMARK_ADOPTION_ORDER_AND_CONSTRAINTS
Status: not_started
Priority: p1
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: external-text-to-sql-benchmarks-and-sota-calibration

Problem

Decide which public benchmarks to adopt first, what environments and licenses they require, and what the phase-1 integration sequence should be for apps/evals.

Context

  • The repo now has a working internal eval runner and leaderboard under apps/evals/.
  • Research notes in ai_notes/research/TEXT_TO_SQL_EVALS_LANDSCAPE.md and ai_notes/research/TEXT_TO_SQL_SOTA_METHODS.md identify BIRD mini-dev and Spider 2.0-Lite as the highest-value first public integrations.
  • External benchmarks differ in licensing, local environment assumptions, dialect support, and what "official" scoring means.
  • If we skip the planning pass, adapter tasks will likely make inconsistent choices about dataset storage, provenance, and what counts as a faithful local run.

Possible Solutions

  1. Recommended: write a short adoption plan that selects the phase-1 benchmark pair and records constraints explicitly. Choose a first-wave benchmark set, document environment/licensing/scoring assumptions, and define what "good enough for local calibration" means before implementation starts.

Why this is recommended:

  • avoids parallel task drift
  • gives downstream adapter tasks a clear target
  • makes it easier to explain why some benchmark settings are deferred
  2. Let each adapter task decide its own environment and fidelity rules.

Trade-off: faster to start coding, but high risk of inconsistent local modes and misleading leaderboard comparisons.

  3. Plan all public benchmarks up front, including interactive and hosted settings.

Trade-off: comprehensive, but likely too broad for the first milestone and slows down useful integrations.

Plan

  1. Compare candidate benchmarks by:
     • local setup difficulty
     • licensing/access friction
     • public visibility
     • usefulness for Dataface-specific failure modes
  2. Confirm the phase-1 adoption order:
     • BIRD mini-dev
     • Spider 2.0-Lite
  3. Record benchmark-specific caveats:
     • what local execution mode is available
     • what official scorer behavior we can and cannot replicate
     • what counts as local-only vs leaderboard-comparable
  4. Update the initiative docs with the adoption sequence and constraints.
  5. Leave LiveSQLBench, hosted Spider 2.0 settings, and interactive benchmarks as explicit second-wave work.
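
The per-benchmark caveats above could be captured in a small machine-readable record alongside the initiative docs, so adapter tasks inherit consistent fidelity rules. A minimal sketch, assuming no such schema exists yet — every field name and value below is illustrative, not an existing convention in apps/evals/:

```python
from dataclasses import dataclass, field

# Hypothetical constraints record for one phase-1 benchmark.
# Field names and values are illustrative assumptions, not a real schema.
@dataclass
class BenchmarkConstraints:
    name: str
    local_execution_mode: str    # how a faithful local run is performed
    scorer_replicable: bool      # can official scorer behavior be matched locally?
    leaderboard_comparable: bool # False means results are local-calibration only
    caveats: list[str] = field(default_factory=list)

# Phase-1 adoption order from the plan; specifics below are placeholders.
PHASE_1 = [
    BenchmarkConstraints(
        name="BIRD mini-dev",
        local_execution_mode="local SQLite execution",
        scorer_replicable=True,
        leaderboard_comparable=False,
        caveats=["treat scores as local calibration only"],
    ),
    BenchmarkConstraints(
        name="Spider 2.0-Lite",
        local_execution_mode="locally runnable subset",
        scorer_replicable=False,
        leaderboard_comparable=False,
        caveats=["hosted settings deferred to second wave"],
    ),
]

adoption_order = [b.name for b in PHASE_1]
```

Keeping the record as data rather than prose would let the eval runner refuse to label a run "leaderboard-comparable" unless the benchmark's constraints allow it.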

Implementation Progress

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - planning/task-definition work.

Review Feedback

  • [ ] Review cleared