Dataface Tasks

Add external benchmark provenance dashboards and cross-benchmark slices

ID: MCP_ANALYST_AGENT-ADD_EXTERNAL_BENCHMARK_PROVENANCE_DASHBOARDS_AND_CROSS_BENCHMARK_SLICES
Status: not_started
Priority: p1
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: external-text-to-sql-benchmarks-and-sota-calibration

Problem

External benchmark runs currently appear on the eval leaderboard as opaque SQL runs. Extend the leaderboard so runs are comparable by benchmark, split, dialect, scorer, and environment provenance.
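As a concrete sketch of the provenance dimensions named above, a run artifact could carry a record like the following. The field names and values here are illustrative assumptions, not the actual artifact schema:

```python
from dataclasses import dataclass, asdict

# Hypothetical provenance record; field names are illustrative, not the
# real artifact schema used by apps/evals.
@dataclass(frozen=True)
class BenchmarkProvenance:
    benchmark: str    # e.g. "bird", "spider2-lite", or "internal"
    split: str        # e.g. "dev", "test"
    version: str      # benchmark release or version tag
    dialect: str      # SQL dialect the benchmark targets
    scorer: str       # scoring method used for this run
    environment: str  # e.g. "local" vs "official" harness

run = BenchmarkProvenance(
    benchmark="bird", split="dev", version="2024-06",
    dialect="sqlite", scorer="execution-accuracy", environment="local",
)
print(asdict(run))
```

Making each dimension an explicit field (rather than part of a label string) is what lets the dashboards filter and group on them directly.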

Context

  • The current eval leaderboard was built around internal SQL eval runs and mostly slices by backend, model, and context level.
  • Once public benchmarks arrive, those slices are not enough. A BIRD run and a Spider 2.0-Lite run should not appear as if they are the same task just because they both produce SQL.
  • Provenance needs to be first-class in both the raw artifacts and the Dataface queries/faces layered on top of them.

Possible Solutions

  1. Recommended: add explicit benchmark provenance columns and dashboard slices to the existing leaderboard project. Keep the same apps/evals project, but extend the artifact queries and boards so benchmark, split, version, dialect, scorer, and environment are visible dimensions.

Why this is recommended:

  • Preserves one place to inspect eval quality.
  • Keeps internal and external runs comparable without collapsing them together.
  • Makes hidden assumptions visible.
  2. Build a separate dashboard project only for external benchmarks.

Trade-off: simpler queries at first, but creates a fragmented review experience.

  3. Hide benchmark provenance inside one long run label string.

Trade-off: fast, but makes filtering, grouping, and auditing much worse.
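To make the last trade-off concrete, here is a minimal sketch (with invented run records and an invented label format) contrasting filtering on explicit provenance columns against re-parsing a packed run label:

```python
# Invented run records; real artifacts carry more fields.
runs = [
    {"benchmark": "bird", "split": "dev", "score": 0.61},
    {"benchmark": "spider2-lite", "split": "test", "score": 0.33},
]

# With explicit columns, filtering is a direct lookup:
bird_dev = [r for r in runs
            if r.get("benchmark") == "bird" and r.get("split") == "dev"]

# With a packed label, every consumer must re-parse the string and agree
# on field order, which is fragile to audit and group by:
def parse_label(label: str) -> dict:
    benchmark, split, dialect, scorer, env, score = label.split("|")
    return {"benchmark": benchmark, "split": split, "score": float(score)}

packed = parse_label("bird|dev|sqlite|exec-acc|local|0.61")
```

Any change to the label format silently breaks every parser, which is why explicit columns are the recommended shape.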

Plan

  1. Extend the query layer in apps/evals/faces/_sql_eval_queries.yml to expose benchmark provenance fields.
  2. Update overview/leaderboard faces to slice by benchmark family and show local-vs-official comparability clearly.
  3. Add smoke fixture coverage in tests/evals/test_leaderboard_dft.py for mixed internal and external run artifacts.
  4. Ensure empty-state behavior still works when some benchmark families have no runs.
  5. Add at least one dashboard view that answers "how are we doing on public benchmarks by benchmark and split?" without manual artifact inspection.
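The grouping that the plan's dashboard view and smoke fixture would exercise can be sketched as follows. The fixture data and helper name are hypothetical, but the shape matches the intent: mix internal and external runs, aggregate by benchmark and split, and handle the empty state:

```python
from collections import defaultdict

# Hypothetical mixed fixture: internal and external run artifacts.
runs = [
    {"benchmark": "internal-sql", "split": "eval", "score": 0.78},
    {"benchmark": "bird", "split": "dev", "score": 0.61},
    {"benchmark": "bird", "split": "dev", "score": 0.63},
    {"benchmark": "spider2-lite", "split": "test", "score": 0.33},
]

def scores_by_benchmark_and_split(runs):
    """Aggregate mean score per (benchmark, split), the way the
    'public benchmarks by benchmark and split' view would.
    Returns {} when a benchmark family has no runs (empty state)."""
    groups = defaultdict(list)
    for r in runs:
        groups[(r["benchmark"], r["split"])].append(r["score"])
    return {k: sum(v) / len(v) for k, v in groups.items()}

table = scores_by_benchmark_and_split(runs)
empty = scores_by_benchmark_and_split([])  # no runs yet: must not error
```

A smoke test along these lines covers both step 3 (mixed artifacts) and step 4 (empty-state behavior) without touching the real face layer.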

Implementation Progress

QA Exploration

  • [ ] QA exploration completed (or N/A for non-UI tasks)

Review Feedback

  • [ ] Review cleared