Persist eval runs by default and add a quick eval dashboard serve command
Problem
Make eval runs durable by default instead of transient output, and add a one-command local entrypoint for browsing the eval leaderboard project.
Context
The current eval system works, but the ergonomics are wrong in two ways:
- Eval outputs default to transient `apps/evals/output/...` directories. In practice, that means real experiment results often live only in temporary worktrees unless someone manually preserves them.
- The eval leaderboard is browsable with `cd apps/evals && uv run dft serve`, but there is no repo-level shortcut such as `just evals-serve`, so the dashboards are harder to discover than they should be.
This matters because evals are becoming a historical quality record, not just a one-off local tool:
- model comparisons should be comparable over time
- judge/scorer changes may require re-reading older run outputs
- benchmark-cleaning and scope changes should be auditable against prior runs
The repo already has the right ingredients:
- `apps/evals/dataface.yml` for the leaderboard project
- `apps/evals/faces/` for dashboards over eval outputs
- `apps/evals/sql/runner.py` and `apps/evals/catalog/runner.py` for output paths
What is missing is the product decision to treat committed runs as the default, plus a first-class local serve shortcut.
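The serve shortcut half of this is small; a sketch of what it could look like, assuming a `justfile` at the repo root and the recipe name `evals-serve` (neither is confirmed by the repo):

```just
# Hypothetical repo-root justfile recipe; the name `evals-serve` is assumed.
evals-serve:
    cd apps/evals && uv run dft serve
```

This only wraps the existing `cd apps/evals && uv run dft serve` invocation behind a discoverable top-level command.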
Possible Solutions
- Recommended: change evals to write committed baseline runs under a stable `apps/evals/runs/` tree by default, and add a repo shortcut such as `just evals-serve`.
  - Pros: historical runs become durable and easy to compare; the leaderboard has an obvious stable data source; users get a simple command to browse results.
  - Cons: repo churn will grow until pruning/retention rules are added.
- Keep transient output as the default, but add a separate "promote this run" workflow for results worth saving.
  - Pros: lower repo churn.
  - Cons: easy to lose important runs; more operational friction.
- Leave results transient and only persist summaries elsewhere.
  - Pros: least git noise.
  - Cons: loses exact per-case artifacts and weakens historical re-evaluation.
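For contrast, the "promote this run" alternative would need a small helper that copies a transient run into the durable tree. A minimal sketch, assuming the transient and committed roots named above (the helper name and layout are illustrative, not the repo's API):

```python
import shutil
from pathlib import Path

# Assumed roots for illustration of option 2 ("promote this run").
TRANSIENT_ROOT = Path("apps/evals/output")
DURABLE_ROOT = Path("apps/evals/runs")

def promote_run(run_name: str) -> Path:
    """Copy one transient run directory into the committed runs tree."""
    src = TRANSIENT_ROOT / run_name
    dst = DURABLE_ROOT / run_name
    if not src.is_dir():
        raise FileNotFoundError(f"no transient run at {src}")
    # dirs_exist_ok lets a promotion be re-run after a partial copy.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst
```

Even this small sketch shows the friction the cons mention: every run worth keeping requires a deliberate extra step, and anything not promoted is lost with the worktree.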
Plan
- Introduce a stable committed output root for eval runs, likely `apps/evals/runs/sql/` and `apps/evals/runs/catalog/`.
- Update SQL and catalog eval runners so the default output path writes there instead of transient `apps/evals/output/...`.
- Keep scratch/override support via an explicit `--output-dir` so ad hoc local experimentation is still possible.
- Update the eval leaderboard project queries/faces to read from the committed runs path by default.
- Add a repo-level shortcut such as `just evals-serve` that serves the leaderboard project from `apps/evals/`.
- Document the intended workflow:
  - run evals
  - inspect the leaderboard
  - prune unwanted runs later if needed
- Leave retention/pruning as follow-on work instead of blocking this shift in defaults.
Implementation Progress
- Task created to capture the new default policy: persist all runs first, prune later.
- No implementation has started yet.
QA Exploration
- [ ] QA exploration completed (or N/A for non-UI tasks)
Review Feedback
- [ ] Review cleared