Add BIRD mini-dev adapter and runner support
Problem
Integrate BIRD mini-dev into apps/evals with a loader, benchmark-specific runner settings, and reproducible local baseline execution.
Context
- BIRD mini-dev is the most practical first public benchmark because it is recognized, challenging, and still manageable in a local dev workflow.
- The internal eval runner already supports benchmark execution, artifact writing, and leaderboard display, but it does not yet know how to load external benchmark cases or record public-benchmark provenance.
- We want the first adapter to prove the architecture without overcommitting to every official benchmark nuance on day one.
Possible Solutions
- Recommended: add a benchmark-specific BIRD loader on top of the shared external contract and run it through the existing local eval loop.
Load BIRD mini-dev into the shared case contract, add any BIRD-specific metadata, and keep execution inside the current apps/evals flow.
Why this is recommended:
- proves the adapter architecture cheaply
- yields a strong external benchmark quickly
- keeps artifact generation and dashboards consistent
- Wrap the entire official BIRD harness as an opaque subprocess first.
Trade-off: closer to official behavior, but it teaches us less about the internal eval system and makes provenance harder to inspect.
- Delay BIRD until after a more complex benchmark lands.
Trade-off: loses the easiest high-value public calibration target.
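The recommended loader path can be sketched as below. `ExternalCase`, its fields, and the `load_bird_mini_dev` name are assumptions about the shared external contract, not real apps/evals types, and the BIRD record fields (`question_id`, `db_id`, `question`, `SQL`, `difficulty`) should be verified against the actual mini-dev file before relying on them.

```python
"""Sketch of a BIRD mini-dev loader mapping raw cases into a shared
external-case contract. Names and fields are assumptions, not the
real apps/evals types."""
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ExternalCase:
    # Hypothetical shared contract: id, input, expected output, plus
    # free-form metadata for benchmark-specific provenance.
    case_id: str
    input_text: str
    expected: str
    metadata: dict = field(default_factory=dict)


def load_bird_mini_dev(path: Path) -> list[ExternalCase]:
    """Map BIRD mini-dev JSON records into ExternalCase objects."""
    records = json.loads(path.read_text(encoding="utf-8"))
    cases = []
    for rec in records:
        cases.append(
            ExternalCase(
                case_id=f"bird-minidev-{rec['question_id']}",
                input_text=rec["question"],
                expected=rec["SQL"],
                metadata={
                    "benchmark": "bird-mini-dev",
                    "split": "mini_dev",
                    "db_id": rec["db_id"],  # schema identity for auditing
                    "difficulty": rec.get("difficulty"),
                    "scorer_mode": "local-approximate",  # assumed label
                },
            )
        )
    return cases
```

Keeping the BIRD-specific fields inside `metadata` (rather than widening the contract) is what lets later adapters reuse the same case shape unchanged.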
Plan
- Add dataset-loading code for BIRD mini-dev and map cases into the shared external contract.
- Record BIRD-specific metadata needed for auditing, such as benchmark split, schema identity, and scorer mode.
- Add a reproducible local baseline command for a small reference backend/model pair.
- Write smoke tests that prove:
  - cases load
  - the run completes
  - artifacts land in the expected run directory
  - the leaderboard can display the run with BIRD provenance
- Document what is faithful to official BIRD scoring and what remains an approximation in local mode.
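The artifact and provenance steps in the plan can be sketched as a minimal write helper. The run-directory layout, file name, and every field below are assumptions about apps/evals, not its real API; the point is that provenance (benchmark, split, scorer mode) travels with the artifact so the leaderboard and the smoke tests can read it back.

```python
"""Minimal sketch of writing a run artifact with BIRD provenance into a
run directory. Layout and field names are assumptions, not the real
apps/evals artifact schema."""
import json
import time
from pathlib import Path


def write_run_artifact(runs_dir: Path, run_id: str, results: list[dict]) -> Path:
    """Write per-case results plus public-benchmark provenance under runs_dir/run_id."""
    run_dir = runs_dir / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    artifact = {
        "run_id": run_id,
        "benchmark": "bird-mini-dev",        # provenance for the leaderboard
        "split": "mini_dev",
        "scorer_mode": "local-approximate",  # documents the approximation
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "results": results,
    }
    out = run_dir / "run.json"
    out.write_text(json.dumps(artifact, indent=2), encoding="utf-8")
    return out
```

A smoke test can then assert the file exists in the expected directory and that the provenance fields round-trip through JSON.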
Implementation Progress
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - implementation task with no browser flow to exercise.
Review Feedback
- [ ] Review cleared