Dataface Tasks

Persist eval runs by default and add a quick eval dashboard serve command

ID: MCP_ANALYST_AGENT-PERSIST_EVAL_RUNS_BY_DEFAULT_AND_ADD_A_QUICK_EVAL_DASHBOARD_SERVE_COMMAND
Status: completed
Priority: p2
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: benchmark-driven-text-to-sql-and-discovery-evals
Completed by: dave
Completed: 2026-03-22

Problem

Persist eval runs by default instead of writing transient output, and add a one-command local entrypoint for browsing the eval leaderboard project.

Context

The current eval system works, but the ergonomics are wrong in two ways:

  1. Eval outputs default to transient apps/evals/output/... directories. In practice, that means real experiment results often live only in temporary worktrees unless someone manually preserves them.
  2. The eval leaderboard is browsable with cd apps/evals && uv run dft serve, but there is no repo-level shortcut such as just evals-serve, so the dashboards are harder to discover than they should be.

This matters because evals are becoming a historical quality record, not just a one-off local tool:

  • model comparisons should remain comparable over time
  • judge/scorer changes may require re-reading older run outputs
  • benchmark-cleaning and scope changes should be auditable against prior runs

The repo already has the right ingredients:

  • apps/evals/dataface.yml for the leaderboard project
  • apps/evals/faces/ for dashboards over eval outputs
  • apps/evals/sql/runner.py and apps/evals/catalog/runner.py for output paths

What is missing is the product decision to treat committed runs as the default, plus a first-class local serve shortcut.
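The serve shortcut described above could be a one-line recipe in the repo's justfile. This is a sketch of the proposed just evals-serve target; the recipe name comes from this task, not from anything already in the repo:

```just
# Serve the eval leaderboard dashboards locally (proposed recipe)
evals-serve:
    cd apps/evals && uv run dft serve
```

The recipe body is exactly the existing manual invocation, so adopting it changes discoverability without changing behavior.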

Possible Solutions

  1. Recommended: change evals to write committed baseline runs under a stable apps/evals/runs/ tree by default, and add a repo shortcut such as just evals-serve.
     • Pros: historical runs become durable and easy to compare; the leaderboard has an obvious stable data source; users get a simple command to browse results.
     • Cons: repo churn will grow until pruning/retention rules are added.
  2. Keep transient output as the default, but add a separate “promote this run” workflow for results worth saving.
     • Pros: lower repo churn.
     • Cons: easy to lose important runs; more operational friction.
  3. Leave results transient and only persist summaries elsewhere.
     • Pros: least git noise.
     • Cons: loses exact per-case artifacts and weakens historical re-evaluation.

Plan

  1. Introduce a stable committed output root for eval runs, likely apps/evals/runs/sql/ and apps/evals/runs/catalog/.
  2. Update SQL and catalog eval runners so the default output path writes there instead of transient apps/evals/output/....
  3. Keep scratch/override support via explicit --output-dir so ad hoc local experimentation is still possible.
  4. Update the eval leaderboard project queries/faces to read from the committed runs path by default.
  5. Add a repo-level shortcut such as just evals-serve that serves the leaderboard project from apps/evals/.
  6. Document the intended workflow: run evals, inspect the leaderboard, and prune unwanted runs later if needed.
  7. Leave retention/pruning as follow-on work instead of blocking this shift in defaults.

Implementation Progress

  • Task created to capture the new default policy: persist all runs first, prune later.
  • No implementation has started yet.

QA Exploration

  • [ ] QA exploration completed (or N/A for non-UI tasks)

Review Feedback

  • [ ] Review cleared