Add BIRD mini-dev adapter and runner support
Problem
Integrate BIRD mini-dev into apps/evals with a loader, benchmark-specific runner settings, and reproducible local baseline execution.
Context
- BIRD mini-dev is the most practical first public benchmark because it is recognized, challenging, and still manageable in a local dev workflow.
- The internal eval runner already supports benchmark execution, artifact writing, and leaderboard display, but it does not yet know how to load external benchmark cases or record public-benchmark provenance.
- We want the first adapter to prove the architecture without overcommitting to every official benchmark nuance on day one.
Possible Solutions
- Recommended: add a benchmark-specific BIRD loader on top of the shared external contract and run it through the existing local eval loop.
Load BIRD mini-dev into the shared case contract, add any BIRD-specific metadata, and keep execution inside the current apps/evals flow.
Why this is recommended:
- proves the adapter architecture cheaply
- yields a strong external benchmark quickly
- keeps artifact generation and dashboards consistent
- Wrap the entire official BIRD harness as an opaque subprocess first.
Trade-off: closer to official behavior, but it teaches us less about the internal eval system and makes provenance harder to inspect.
- Delay BIRD until after a more complex benchmark lands.
Trade-off: loses the easiest high-value public calibration target.
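The recommended loader path can be sketched as below. `ExternalCase`, its fields, and the `load_bird_mini_dev` name are assumptions about the shared external contract, not real apps/evals types, and the BIRD record fields (`question_id`, `db_id`, `question`, `SQL`, `difficulty`) should be verified against the actual mini-dev file before relying on them.

```python
"""Sketch of a BIRD mini-dev loader mapping raw cases into a shared
external-case contract. Names and fields are assumptions, not the
real apps/evals types."""
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ExternalCase:
    # Hypothetical shared contract: id, input, expected output, plus
    # free-form metadata for benchmark-specific provenance.
    case_id: str
    input_text: str
    expected: str
    metadata: dict = field(default_factory=dict)


def load_bird_mini_dev(path: Path) -> list[ExternalCase]:
    """Map BIRD mini-dev JSON records into ExternalCase objects."""
    records = json.loads(path.read_text(encoding="utf-8"))
    cases = []
    for rec in records:
        cases.append(
            ExternalCase(
                case_id=f"bird-minidev-{rec['question_id']}",
                input_text=rec["question"],
                expected=rec["SQL"],
                metadata={
                    "benchmark": "bird-mini-dev",
                    "split": "mini_dev",
                    "db_id": rec["db_id"],  # schema identity for auditing
                    "difficulty": rec.get("difficulty"),
                    "scorer_mode": "local-approximate",  # assumed label
                },
            )
        )
    return cases
```

Keeping the BIRD-specific fields inside `metadata` (rather than widening the contract) is what lets later adapters reuse the same case shape unchanged.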
Plan
- Add dataset-loading code for BIRD mini-dev and map cases into the shared external contract.
- Record BIRD-specific metadata needed for auditing, such as benchmark split, schema identity, and scorer mode.
- Add a reproducible local baseline command for a small reference backend/model pair.
- Write smoke tests that prove:
  - cases load
  - the run completes
  - artifacts land in the expected run directory
  - the leaderboard can display the run with BIRD provenance
- Document what is faithful to official BIRD scoring and what remains an approximation in local mode.
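The artifact and provenance steps in the plan can be sketched as a minimal write helper. The run-directory layout, file name, and every field below are assumptions about apps/evals, not its real API; the point is that provenance (benchmark, split, scorer mode) travels with the artifact so the leaderboard and the smoke tests can read it back.

```python
"""Minimal sketch of writing a run artifact with BIRD provenance into a
run directory. Layout and field names are assumptions, not the real
apps/evals artifact schema."""
import json
import time
from pathlib import Path


def write_run_artifact(runs_dir: Path, run_id: str, results: list[dict]) -> Path:
    """Write per-case results plus public-benchmark provenance under runs_dir/run_id."""
    run_dir = runs_dir / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    artifact = {
        "run_id": run_id,
        "benchmark": "bird-mini-dev",        # provenance for the leaderboard
        "split": "mini_dev",
        "scorer_mode": "local-approximate",  # documents the approximation
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "results": results,
    }
    out = run_dir / "run.json"
    out.write_text(json.dumps(artifact, indent=2), encoding="utf-8")
    return out
```

A smoke test can then assert the file exists in the expected directory and that the provenance fields round-trip through JSON.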
Implementation Progress
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - implementation task with no browser flow to exercise.
Review Feedback
- [ ] Review cleared