Add external benchmark provenance dashboards and cross-benchmark slices
Problem
Extend the eval leaderboard so external benchmark runs are comparable by benchmark, split, dialect, scorer, and environment provenance instead of appearing as opaque SQL runs.
Context
- The current eval leaderboard was built around internal SQL eval runs and mostly slices by backend, model, and context level.
- Once public benchmarks arrive, those slices are not enough. A BIRD run and a Spider 2.0-Lite run should not appear as if they are the same task just because they both produce SQL.
- Provenance needs to be first-class in both the raw artifacts and the Dataface queries/faces layered on top of them.
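As an illustrative sketch of what "first-class provenance" means here (the field names are assumptions for illustration, not the project's actual artifact schema), a run would carry a structured provenance record rather than an opaque label:

```python
from dataclasses import dataclass

# Hypothetical provenance record; field names are illustrative placeholders,
# not the real artifact schema.
@dataclass(frozen=True)
class BenchmarkProvenance:
    benchmark: str    # e.g. "bird", "spider2_lite"
    split: str        # e.g. "dev", "test"
    version: str      # benchmark release identifier
    dialect: str      # SQL dialect the benchmark targets
    scorer: str       # e.g. "execution_accuracy"
    environment: str  # e.g. "local", "official-harness"

bird = BenchmarkProvenance("bird", "dev", "2024-06", "sqlite",
                           "execution_accuracy", "local")
spider = BenchmarkProvenance("spider2_lite", "dev", "1.0", "bigquery",
                             "execution_accuracy", "local")

# Two SQL-producing runs stay distinguishable by provenance instead of
# collapsing into "both produce SQL".
assert bird != spider
```

With records like these in the raw artifacts, the query layer can filter and group on each field directly.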
Possible Solutions
- Recommended: add explicit benchmark provenance columns and dashboard slices to the existing leaderboard project.
Keep the same `apps/evals` project, but extend the artifact queries and boards so benchmark, split, version, dialect, scorer, and environment are visible dimensions.
Why this is recommended:
- preserves one place to inspect eval quality
- keeps internal and external runs comparable without collapsing them together
- makes hidden assumptions visible
- Build a separate dashboard project only for external benchmarks.
Trade-off: simpler queries at first, but creates a fragmented review experience.
- Hide benchmark provenance inside one long run label string.
Trade-off: fast, but makes filtering, grouping, and auditing much worse.
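To make the last trade-off concrete, here is a small sketch (run labels, field names, and scores are invented for illustration) of why structured provenance columns keep grouping trivial while a packed label string forces every consumer to parse it:

```python
from collections import defaultdict

# Structured rows: grouping by (benchmark, split) is a plain key lookup.
rows = [
    {"benchmark": "bird", "split": "dev", "score": 0.41},
    {"benchmark": "bird", "split": "dev", "score": 0.44},
    {"benchmark": "spider2_lite", "split": "dev", "score": 0.18},
]
by_family = defaultdict(list)
for row in rows:
    by_family[(row["benchmark"], row["split"])].append(row["score"])

# Packed labels: the same grouping requires a fragile string-splitting
# convention that every dashboard, filter, and audit must reimplement.
labels = ["bird/dev/sqlite/exec_acc/local",
          "spider2_lite/dev/bigquery/exec_acc/local"]
parsed = [dict(zip(["benchmark", "split", "dialect", "scorer", "environment"],
                   label.split("/")))
          for label in labels]
```

The structured form also survives values that contain the separator character, which the label convention does not.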
Plan
- Extend the query layer in `apps/evals/faces/_sql_eval_queries.yml` to expose benchmark provenance fields.
- Update overview/leaderboard faces to slice by benchmark family and show local-vs-official comparability clearly.
- Add smoke fixture coverage in `tests/evals/test_leaderboard_dft.py` for mixed internal and external run artifacts.
- Ensure empty-state behavior still works when some benchmark families have no runs.
- Add at least one dashboard view that answers "how are we doing on public benchmarks by benchmark and split?" without manual artifact inspection.
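The last two plan items can be sketched together: a tiny aggregation (the artifact shape and family names are assumptions, not the real query layer) that summarizes runs by benchmark and split and degrades gracefully when a family has no runs yet:

```python
# Hypothetical run artifacts; the real shape lives in the leaderboard's
# artifact queries, so treat every field here as a placeholder.
runs = [
    {"benchmark": "bird", "split": "dev", "score": 0.44},
    {"benchmark": "bird", "split": "dev", "score": 0.40},
    {"benchmark": "spider2_lite", "split": "dev", "score": 0.18},
]
known_families = ["bird", "spider2_lite", "spider1"]  # spider1: no runs yet

def summarize(runs, families):
    """Mean score per (benchmark, split); families with no runs get an
    explicit empty-state entry instead of silently disappearing."""
    buckets = {}
    for run in runs:
        key = (run["benchmark"], run["split"])
        buckets.setdefault(key, []).append(run["score"])
    summary = {key: sum(s) / len(s) for key, s in buckets.items()}
    for family in families:
        if not any(benchmark == family for benchmark, _ in summary):
            summary[(family, "-")] = None  # empty state: "no runs yet"
    return summary

summary = summarize(runs, known_families)
```

The `None` entry is the part the empty-state plan item cares about: a dashboard built on this summary can render "no runs" for `spider1` rather than omitting the row.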
Implementation Progress
QA Exploration
- [ ] QA exploration completed (or N/A for non-UI tasks)
Review Feedback
- [ ] Review cleared