Dataface Tasks

Add BIRD mini-dev adapter and runner support

ID: MCP_ANALYST_AGENT-ADD_BIRD_MINI_DEV_ADAPTER_AND_RUNNER_SUPPORT
Status: not_started
Priority: p1
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: external-text-to-sql-benchmarks-and-sota-calibration

Problem

Integrate BIRD mini-dev into apps/evals with a loader, benchmark-specific runner settings, and reproducible local baseline execution.

Context

  • BIRD mini-dev is the most practical first public benchmark because it is recognized, challenging, and still manageable in a local dev workflow.
  • The internal eval runner already supports benchmark execution, artifact writing, and leaderboard display, but it does not yet know how to load external benchmark cases or record public-benchmark provenance.
  • We want the first adapter to prove the architecture without overcommitting to every official benchmark nuance on day one.

Possible Solutions

  1. Recommended: add a benchmark-specific BIRD loader on top of the shared external contract and run it through the existing local eval loop. Map BIRD mini-dev cases into the shared case contract, attach any BIRD-specific metadata, and keep execution inside the current apps/evals flow.

Why this is recommended:

  • proves the adapter architecture cheaply
  • yields a strong external benchmark quickly
  • keeps artifact generation and dashboards consistent
  2. Wrap the entire official BIRD harness as an opaque subprocess first.

Trade-off: closer to official behavior, but it teaches the internal eval system less and makes provenance harder to inspect.

  3. Delay BIRD until after a more complex benchmark lands.

Trade-off: loses the easiest high-value public calibration target.
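The recommended loader-on-shared-contract approach could be sketched roughly as follows. The `EvalCase` dataclass and `load_bird_mini_dev` function are illustrative assumptions about the shared external contract, not the actual apps/evals API; the raw record keys (`question`, `db_id`, `SQL`, `evidence`) follow the published BIRD dev JSON layout but should be verified against the mini-dev files.

```python
import json
from dataclasses import dataclass, field

# Hypothetical shared external-case contract; the real apps/evals
# contract may differ in field names and required metadata.
@dataclass
class EvalCase:
    case_id: str
    question: str
    db_id: str
    gold_sql: str
    metadata: dict = field(default_factory=dict)

def load_bird_mini_dev(path: str) -> list[EvalCase]:
    """Map raw BIRD mini-dev records into the shared case contract."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    cases = []
    for i, rec in enumerate(raw):
        cases.append(EvalCase(
            case_id=f"bird-mini-dev-{i}",
            question=rec["question"],
            db_id=rec["db_id"],
            gold_sql=rec["SQL"],
            metadata={
                # Provenance fields the plan calls for: benchmark
                # identity, split, and schema identity.
                "benchmark": "bird-mini-dev",
                "split": "mini_dev",
                "evidence": rec.get("evidence", ""),
            },
        ))
    return cases
```

Keeping the mapping this thin is what makes option 1 cheap: BIRD-specific knowledge stays in one loader, and everything downstream sees only the shared contract.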

Plan

  1. Add dataset-loading code for BIRD mini-dev and map cases into the shared external contract.
  2. Record BIRD-specific metadata needed for auditing, such as benchmark split, schema identity, and scorer mode.
  3. Add a reproducible local baseline command for a small reference backend/model pair.
  4. Write smoke tests that prove:
       • cases load
       • the run completes
       • artifacts land in the expected run directory
       • the leaderboard can display the run with BIRD provenance
  5. Document what is faithful to official BIRD scoring and what remains an approximation in local mode.
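The smoke tests in step 4 could take roughly this shape. `run_benchmark` here is a stand-in for the real apps/evals runner, and the artifact layout is an assumption; only the four assertions mirror what the plan actually requires.

```python
from pathlib import Path

def run_benchmark(cases: list[dict], run_dir: Path) -> dict:
    """Stand-in for the real runner: writes one artifact per case
    and returns a result with public-benchmark provenance attached."""
    run_dir.mkdir(parents=True, exist_ok=True)
    for case in cases:
        (run_dir / f"{case['case_id']}.json").write_text("{}")
    return {"status": "completed", "provenance": {"benchmark": "bird-mini-dev"}}

def test_bird_smoke(tmp_path: Path) -> None:
    # 1. cases load
    cases = [{"case_id": "bird-mini-dev-0"}]
    assert cases
    # 2. the run completes
    run_dir = tmp_path / "runs" / "bird"
    result = run_benchmark(cases, run_dir)
    assert result["status"] == "completed"
    # 3. artifacts land in the expected run directory
    assert (run_dir / "bird-mini-dev-0.json").exists()
    # 4. the leaderboard can display the run with BIRD provenance
    assert result["provenance"]["benchmark"] == "bird-mini-dev"
```

Written as a pytest-style function, this doubles as the step-4 checklist: each numbered assertion corresponds to one bullet, so a failure points directly at the broken guarantee.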

Implementation Progress

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A: implementation task, not a browser-flow task.

Review Feedback

  • [ ] Review cleared