Persist eval runs by default and add a quick eval dashboard serve command
Problem
Make eval runs durable by default instead of transient output, and add a one-command local entrypoint for browsing the eval leaderboard project.
Context
The current eval system works, but the ergonomics are wrong in two ways:
- Eval outputs default to transient `apps/evals/output/...` directories. In practice, that means real experiment results often live only in temporary worktrees unless someone manually preserves them.
- The eval leaderboard is browsable with `cd apps/evals && uv run dft serve`, but there is no repo-level shortcut such as `just evals-serve`, so the dashboards are harder to discover than they should be.
This matters because evals are becoming a historical quality record, not just a one-off local tool:
- model comparisons should be comparable over time
- judge/scorer changes may require re-reading older run outputs
- benchmark-cleaning and scope changes should be auditable against prior runs
The repo already has the right ingredients:
- `apps/evals/dataface.yml` for the leaderboard project
- `apps/evals/faces/` for dashboards over eval outputs
- `apps/evals/sql/runner.py` and `apps/evals/catalog/runner.py` for output paths
What is missing is the product decision to treat committed runs as the default, plus a first-class local serve shortcut.
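The serve shortcut half of this is small; a sketch of what it could look like, assuming a `justfile` at the repo root and the recipe name `evals-serve` (neither is confirmed by the repo):

```just
# Hypothetical repo-root justfile recipe; the name `evals-serve` is assumed.
evals-serve:
    cd apps/evals && uv run dft serve
```

This only wraps the existing `cd apps/evals && uv run dft serve` invocation behind a discoverable top-level command.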
Possible Solutions
- Recommended: change evals to write committed baseline runs under a stable `apps/evals/runs/` tree by default, and add a repo shortcut such as `just evals-serve`.
  - Pros: historical runs become durable and easy to compare; the leaderboard has an obvious stable data source; users get a simple command to browse results.
  - Cons: repo churn will grow until pruning/retention rules are added.
- Keep transient output as the default, but add a separate "promote this run" workflow for results worth saving.
  - Pros: lower repo churn.
  - Cons: easy to lose important runs; more operational friction.
- Leave results transient and only persist summaries elsewhere.
  - Pros: least git noise.
  - Cons: loses exact per-case artifacts and weakens historical re-evaluation.
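For contrast, the "promote this run" alternative would need a small helper that copies a transient run into the durable tree. A minimal sketch, assuming the transient and committed roots named above (the helper name and layout are illustrative, not the repo's API):

```python
import shutil
from pathlib import Path

# Assumed roots for illustration of option 2 ("promote this run").
TRANSIENT_ROOT = Path("apps/evals/output")
DURABLE_ROOT = Path("apps/evals/runs")

def promote_run(run_name: str) -> Path:
    """Copy one transient run directory into the committed runs tree."""
    src = TRANSIENT_ROOT / run_name
    dst = DURABLE_ROOT / run_name
    if not src.is_dir():
        raise FileNotFoundError(f"no transient run at {src}")
    # dirs_exist_ok lets a promotion be re-run after a partial copy.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst
```

Even this small sketch shows the friction the cons mention: every run worth keeping requires a deliberate extra step, and anything not promoted is lost with the worktree.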
Plan
- Introduce a stable committed output root for eval runs, likely `apps/evals/runs/sql/` and `apps/evals/runs/catalog/`.
- Update SQL and catalog eval runners so the default output path writes there instead of transient `apps/evals/output/...`.
- Keep scratch/override support via an explicit `--output-dir` so ad hoc local experimentation is still possible.
- Update the eval leaderboard project queries/faces to read from the committed runs path by default.
- Add a repo-level shortcut such as `just evals-serve` that serves the leaderboard project from `apps/evals/`.
- Document the intended workflow:
  - run evals
  - inspect the leaderboard
  - prune unwanted runs later if needed
- Leave retention/pruning as follow-on work instead of blocking this shift in defaults.
Implementation Progress
- Task created to capture the new default policy: persist all runs first, prune later.
- No implementation has started yet.
QA Exploration
- [ ] QA exploration completed (or N/A for non-UI tasks)
Review Feedback
- [ ] Review cleared