Dataface Tasks

Add latest regression delta dashboards for text-to-SQL evals

ID: MCP_ANALYST_AGENT-ADD_LATEST_REGRESSION_DELTA_DASHBOARDS_FOR_TEXT_TO_SQL_EVALS
Status: not_started
Priority: p1
Milestone: m3-public-launch
Owner: data-ai-engineer-architect

Problem

Leaderboard summaries are helpful, but they still make it too easy to miss concrete regressions. A new run can improve average pass rate while breaking a cluster of previously working cases, and the team only notices later by manually comparing artifacts.

We need regression-delta dashboards that show which cases changed between runs, not just aggregate scores.

Context

The eval project already stores run outputs and summaries, and the leaderboard reads them via DuckDB. That makes run-over-run comparison feasible without building a separate analysis system.

Useful delta views include:

  • newly broken cases
  • newly fixed cases
  • cases whose failure taxonomy changed
  • backend-specific regressions by schema or category

This task depends on having stable run identifiers and, ideally, better failure taxonomy fields so deltas are not just "pass flipped."
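As a sketch of the first two delta views, assuming case-level results are stored with a run id, case id, and pass flag (the table and column names below are illustrative, not the actual artifact schema, and sqlite3 stands in for DuckDB since the SQL is portable):

```python
import sqlite3

# Illustrative schema; the real eval artifacts and column names may differ.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE results (run_id TEXT, case_id TEXT, passed INTEGER)")
con.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("run-a", "case-1", 1), ("run-a", "case-2", 0), ("run-a", "case-3", 1),
        ("run-b", "case-1", 0), ("run-b", "case-2", 1), ("run-b", "case-3", 1),
    ],
)

# Newly broken: passing in the baseline run, failing in the current run.
newly_broken = con.execute(
    """
    SELECT cur.case_id
    FROM results AS cur
    JOIN results AS base ON base.case_id = cur.case_id
    WHERE cur.run_id = ? AND base.run_id = ?
      AND base.passed = 1 AND cur.passed = 0
    """,
    ("run-b", "run-a"),
).fetchall()

# Newly fixed: the same join with the pass condition inverted.
newly_fixed = con.execute(
    """
    SELECT cur.case_id
    FROM results AS cur
    JOIN results AS base ON base.case_id = cur.case_id
    WHERE cur.run_id = ? AND base.run_id = ?
      AND base.passed = 0 AND cur.passed = 1
    """,
    ("run-b", "run-a"),
).fetchall()

print(newly_broken)  # [('case-1',)]
print(newly_fixed)   # [('case-2',)]
```

The same self-join pattern extends to "unchanged failures" by requiring both runs to fail, which is why stable run identifiers matter for this task.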

Possible Solutions

  1. Recommended: add run-over-run delta queries and dashboards to the eval project. Compare a selected run against the previous run or a named baseline and surface concrete case-level changes.

Why this is recommended:

  • directly supports iteration
  • makes regressions visible immediately
  • builds on existing artifact and dashboard plumbing

  2. Rely on manual diffing of JSONL files.

Trade-off: workable for one-off analysis, but too painful for repeated use.

  3. Add only summary deltas without case-level visibility.

Trade-off: better than nothing, but still not enough to explain what changed.

Plan

  1. Define the baseline comparison mode, most likely the latest run versus the previous comparable run.
  2. Add DuckDB queries that identify:
     - newly fixed cases
     - newly broken cases
     - unchanged failures
     - taxonomy shifts when available
  3. Add one or more eval faces that make those deltas easy to scan.
  4. Make sure comparisons are scoped sensibly by backend/model/context condition.
  5. Document how to use the delta views when evaluating retrieval or generation changes.
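To make the taxonomy-shift view from step 2 concrete, a minimal sketch, again with invented table and column names and sqlite3 standing in for DuckDB: cases that fail in both runs but whose failure taxonomy label changed, with the join scoped by backend as step 4 suggests.

```python
import sqlite3

# Illustrative schema with a failure-taxonomy column; real field names may differ.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE results (run_id TEXT, backend TEXT, case_id TEXT, "
    "passed INTEGER, taxonomy TEXT)"
)
con.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?)",
    [
        ("run-a", "duckdb", "case-1", 0, "wrong_join"),
        ("run-b", "duckdb", "case-1", 0, "missing_filter"),
        ("run-a", "duckdb", "case-2", 0, "wrong_join"),
        ("run-b", "duckdb", "case-2", 0, "wrong_join"),
    ],
)

# Taxonomy shifts: still failing, but failing differently. Joining on backend
# as well as case id keeps comparisons within one backend/model condition.
shifts = con.execute(
    """
    SELECT cur.case_id, base.taxonomy AS old_taxonomy, cur.taxonomy AS new_taxonomy
    FROM results AS cur
    JOIN results AS base
      ON base.case_id = cur.case_id AND base.backend = cur.backend
    WHERE cur.run_id = ? AND base.run_id = ?
      AND cur.passed = 0 AND base.passed = 0
      AND cur.taxonomy <> base.taxonomy
    """,
    ("run-b", "run-a"),
).fetchall()

print(shifts)  # [('case-1', 'wrong_join', 'missing_filter')]
```

A face built on a query like this would surface "pass flipped" and "failure changed shape" as distinct signals, which is the point of the taxonomy fields this task depends on.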

Success criteria

  • the eval project can show case-level regression deltas between runs
  • iteration work can point to newly fixed and newly broken cases explicitly
  • regressions are easier to catch than with aggregate charts alone

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - eval dashboard planning task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared