Dataface Tasks

Add latest regression delta dashboards for text-to-SQL evals

ID: MCP_ANALYST_AGENT-ADD_LATEST_REGRESSION_DELTA_DASHBOARDS_FOR_TEXT_TO_SQL_EVALS
Status: not_started
Priority: p1
Milestone: m3-public-launch
Owner: data-ai-engineer-architect

Problem

Leaderboard summaries are helpful, but they still make it too easy to miss concrete regressions. A new run can improve average pass rate while breaking a cluster of previously working cases, and the team only notices later by manually comparing artifacts.

We need regression-delta dashboards that show which cases changed between runs, not just aggregate scores.

Context

The eval project already stores run outputs and summaries, and the leaderboard reads them via DuckDB. That makes run-over-run comparison feasible without building a separate analysis system.

Useful delta views include:

  • newly broken cases
  • newly fixed cases
  • cases whose failure taxonomy changed
  • backend-specific regressions by schema or category

This task depends on having stable run identifiers and, ideally, better failure taxonomy fields so deltas are not just "pass flipped."
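As a sketch of the first two delta views, assuming case-level results are stored with a run id, case id, and pass flag (the table and column names below are illustrative, not the actual artifact schema, and sqlite3 stands in for DuckDB since the SQL is portable):

```python
import sqlite3

# Illustrative schema; the real eval artifacts and column names may differ.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE results (run_id TEXT, case_id TEXT, passed INTEGER)")
con.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("run-a", "case-1", 1), ("run-a", "case-2", 0), ("run-a", "case-3", 1),
        ("run-b", "case-1", 0), ("run-b", "case-2", 1), ("run-b", "case-3", 1),
    ],
)

# Newly broken: passing in the baseline run, failing in the current run.
newly_broken = con.execute(
    """
    SELECT cur.case_id
    FROM results AS cur
    JOIN results AS base ON base.case_id = cur.case_id
    WHERE cur.run_id = ? AND base.run_id = ?
      AND base.passed = 1 AND cur.passed = 0
    """,
    ("run-b", "run-a"),
).fetchall()

# Newly fixed: the same join with the pass condition inverted.
newly_fixed = con.execute(
    """
    SELECT cur.case_id
    FROM results AS cur
    JOIN results AS base ON base.case_id = cur.case_id
    WHERE cur.run_id = ? AND base.run_id = ?
      AND base.passed = 0 AND cur.passed = 1
    """,
    ("run-b", "run-a"),
).fetchall()

print(newly_broken)  # [('case-1',)]
print(newly_fixed)   # [('case-2',)]
```

The same self-join pattern extends to "unchanged failures" by requiring both runs to fail, which is why stable run identifiers matter for this task.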

Possible Solutions

  1. Recommended: add run-over-run delta queries and dashboards to the eval project. Compare a selected run against the previous run or a named baseline and surface concrete case-level changes.

Why this is recommended:

  • directly supports iteration
  • makes regressions visible immediately
  • builds on existing artifact and dashboard plumbing

  2. Rely on manual diffing of JSONL files.

Trade-off: workable for one-off analysis, but too painful for repeated use.

  3. Add only summary deltas without case-level visibility.

Trade-off: better than nothing, but still not enough to explain what changed.

Plan

  1. Define the baseline comparison mode, most likely the latest run versus the previous comparable run.
  2. Add DuckDB queries that identify:
     - newly fixed cases
     - newly broken cases
     - unchanged failures
     - taxonomy shifts when available
  3. Add one or more eval faces that make those deltas easy to scan.
  4. Make sure comparisons are scoped sensibly by backend/model/context condition.
  5. Document how to use the delta views when evaluating retrieval or generation changes.
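To make the taxonomy-shift view from step 2 concrete, a minimal sketch, again with invented table and column names and sqlite3 standing in for DuckDB: cases that fail in both runs but whose failure taxonomy label changed, with the join scoped by backend as step 4 suggests.

```python
import sqlite3

# Illustrative schema with a failure-taxonomy column; real field names may differ.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE results (run_id TEXT, backend TEXT, case_id TEXT, "
    "passed INTEGER, taxonomy TEXT)"
)
con.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?)",
    [
        ("run-a", "duckdb", "case-1", 0, "wrong_join"),
        ("run-b", "duckdb", "case-1", 0, "missing_filter"),
        ("run-a", "duckdb", "case-2", 0, "wrong_join"),
        ("run-b", "duckdb", "case-2", 0, "wrong_join"),
    ],
)

# Taxonomy shifts: still failing, but failing differently. Joining on backend
# as well as case id keeps comparisons within one backend/model condition.
shifts = con.execute(
    """
    SELECT cur.case_id, base.taxonomy AS old_taxonomy, cur.taxonomy AS new_taxonomy
    FROM results AS cur
    JOIN results AS base
      ON base.case_id = cur.case_id AND base.backend = cur.backend
    WHERE cur.run_id = ? AND base.run_id = ?
      AND cur.passed = 0 AND base.passed = 0
      AND cur.taxonomy <> base.taxonomy
    """,
    ("run-b", "run-a"),
).fetchall()

print(shifts)  # [('case-1', 'wrong_join', 'missing_filter')]
```

A face built on a query like this would surface "pass flipped" and "failure changed shape" as distinct signals, which is the point of the taxonomy fields this task depends on.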

Success criteria

  • the eval project can show case-level regression deltas between runs
  • iteration work can point to newly fixed and newly broken cases explicitly
  • regressions are easier to catch than with aggregate charts alone

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - eval dashboard planning task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared