Add external benchmark provenance dashboards and cross-benchmark slices
Problem
Extend the eval leaderboard so external benchmark runs are comparable by benchmark, split, dialect, scorer, and environment provenance instead of appearing as opaque SQL runs.
Context
- The current eval leaderboard was built around internal SQL eval runs and mostly slices by backend, model, and context level.
- Once public benchmarks arrive, those slices are not enough. A BIRD run and a Spider 2.0-Lite run should not appear as if they are the same task just because they both produce SQL.
- Provenance needs to be first-class in both the raw artifacts and the Dataface queries/faces layered on top of them.
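As an illustrative sketch of what "first-class provenance" means here (the field names are assumptions for illustration, not the project's actual artifact schema), a run would carry a structured provenance record rather than an opaque label:

```python
from dataclasses import dataclass

# Hypothetical provenance record; field names are illustrative placeholders,
# not the real artifact schema.
@dataclass(frozen=True)
class BenchmarkProvenance:
    benchmark: str    # e.g. "bird", "spider2_lite"
    split: str        # e.g. "dev", "test"
    version: str      # benchmark release identifier
    dialect: str      # SQL dialect the benchmark targets
    scorer: str       # e.g. "execution_accuracy"
    environment: str  # e.g. "local", "official-harness"

bird = BenchmarkProvenance("bird", "dev", "2024-06", "sqlite",
                           "execution_accuracy", "local")
spider = BenchmarkProvenance("spider2_lite", "dev", "1.0", "bigquery",
                             "execution_accuracy", "local")

# Two SQL-producing runs stay distinguishable by provenance instead of
# collapsing into "both produce SQL".
assert bird != spider
```

With records like these in the raw artifacts, the query layer can filter and group on each field directly.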
Possible Solutions
- Recommended: add explicit benchmark provenance columns and dashboard slices to the existing leaderboard project.
Keep the same `apps/evals` project, but extend the artifact queries and boards so benchmark, split, version, dialect, scorer, and environment are visible dimensions.
Why this is recommended:
- preserves one place to inspect eval quality
- keeps internal and external runs comparable without collapsing them together
- makes hidden assumptions visible
- Build a separate dashboard project only for external benchmarks.
Trade-off: simpler queries at first, but creates a fragmented review experience.
- Hide benchmark provenance inside one long run label string.
Trade-off: fast, but makes filtering, grouping, and auditing much worse.
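To make the last trade-off concrete, here is a small sketch (run labels, field names, and scores are invented for illustration) of why structured provenance columns keep grouping trivial while a packed label string forces every consumer to parse it:

```python
from collections import defaultdict

# Structured rows: grouping by (benchmark, split) is a plain key lookup.
rows = [
    {"benchmark": "bird", "split": "dev", "score": 0.41},
    {"benchmark": "bird", "split": "dev", "score": 0.44},
    {"benchmark": "spider2_lite", "split": "dev", "score": 0.18},
]
by_family = defaultdict(list)
for row in rows:
    by_family[(row["benchmark"], row["split"])].append(row["score"])

# Packed labels: the same grouping requires a fragile string-splitting
# convention that every dashboard, filter, and audit must reimplement.
labels = ["bird/dev/sqlite/exec_acc/local",
          "spider2_lite/dev/bigquery/exec_acc/local"]
parsed = [dict(zip(["benchmark", "split", "dialect", "scorer", "environment"],
                   label.split("/")))
          for label in labels]
```

The structured form also survives values that contain the separator character, which the label convention does not.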
Plan
- Extend the query layer in `apps/evals/faces/_sql_eval_queries.yml` to expose benchmark provenance fields.
- Update overview/leaderboard faces to slice by benchmark family and show local-vs-official comparability clearly.
- Add smoke fixture coverage in `tests/evals/test_leaderboard_dft.py` for mixed internal and external run artifacts.
- Ensure empty-state behavior still works when some benchmark families have no runs.
- Add at least one dashboard view that answers "how are we doing on public benchmarks by benchmark and split?" without manual artifact inspection.
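The last two plan items can be sketched together: a tiny aggregation (the artifact shape and family names are assumptions, not the real query layer) that summarizes runs by benchmark and split and degrades gracefully when a family has no runs yet:

```python
# Hypothetical run artifacts; the real shape lives in the leaderboard's
# artifact queries, so treat every field here as a placeholder.
runs = [
    {"benchmark": "bird", "split": "dev", "score": 0.44},
    {"benchmark": "bird", "split": "dev", "score": 0.40},
    {"benchmark": "spider2_lite", "split": "dev", "score": 0.18},
]
known_families = ["bird", "spider2_lite", "spider1"]  # spider1: no runs yet

def summarize(runs, families):
    """Mean score per (benchmark, split); families with no runs get an
    explicit empty-state entry instead of silently disappearing."""
    buckets = {}
    for run in runs:
        key = (run["benchmark"], run["split"])
        buckets.setdefault(key, []).append(run["score"])
    summary = {key: sum(s) / len(s) for key, s in buckets.items()}
    for family in families:
        if not any(benchmark == family for benchmark, _ in summary):
            summary[(family, "-")] = None  # empty state: "no runs yet"
    return summary

summary = summarize(runs, known_families)
```

The `None` entry is the part the empty-state plan item cares about: a dashboard built on this summary can render "no runs" for `spider1` rather than omitting the row.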
Implementation Progress
QA Exploration
- [ ] QA exploration completed (or N/A for non-UI tasks)
Review Feedback
- [ ] Review cleared