Dataface Tasks

Build text-to-SQL failure taxonomy and slice dashboard

ID: MCP_ANALYST_AGENT-BUILD_TEXT_TO_SQL_FAILURE_TAXONOMY_AND_SLICE_DASHBOARD
Status: not_started
Priority: p1
Milestone: m3-public-launch
Owner: data-ai-engineer-architect

Problem

The current leaderboard tells us whether a run passed, parsed, grounded, or matched the expected answer, but that is still too coarse for day-to-day iteration. A change can improve one category of failure while making another worse, and a single pass-rate number hides that shape completely.

We need a standard failure taxonomy so we can answer questions like:

  • are we mostly failing on wrong joins or wrong filters
  • did retrieval reduce wrong-table failures but increase wrong-aggregation failures
  • are regressions concentrated in time-grain or ordering mistakes

Without that taxonomy, the team will keep reasoning from anecdotes instead of from repeatable slices.

Context

The eval stack already emits structured results under apps/evals/ and the leaderboard project already reads those artifacts into Dataface via DuckDB. That means the infrastructure for surfacing additional dimensions already exists.

Likely raw signals already available or easy to derive include:

  • parse success/failure
  • grounding failures and hallucinated tables/columns
  • equivalence verdicts and rationales
  • benchmark metadata such as schema, complexity, and category

What is missing is a shared classification layer that maps these signals into a stable set of failure types the dashboards can group and compare over time.
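A minimal sketch of what that classification layer could look like, assuming a priority-ordered rule list over the signals above (the enum values and signal field names are illustrative, not a final taxonomy):

```python
from enum import Enum
from typing import Optional


class FailureType(str, Enum):
    # Illustrative subset of the taxonomy; the full set is defined in the Plan.
    PARSE = "syntax_parse"
    WRONG_TABLE = "wrong_table"
    WRONG_COLUMN = "wrong_column"
    WRONG_AGGREGATION = "wrong_aggregation"
    OTHER = "other"


def classify(signals: dict) -> Optional[FailureType]:
    """Deterministic, priority-ordered mapping from raw eval signals to
    one primary failure type. Returns None for passing cases."""
    if signals.get("passed"):
        return None
    if not signals.get("parsed", True):
        return FailureType.PARSE
    if signals.get("hallucinated_tables"):
        return FailureType.WRONG_TABLE
    if signals.get("hallucinated_columns"):
        return FailureType.WRONG_COLUMN
    if "aggregation" in signals.get("equivalence_rationale", ""):
        return FailureType.WRONG_AGGREGATION
    return FailureType.OTHER
```

The priority ordering is the point: every non-passing case gets exactly one primary bucket, so counts across buckets sum to the total failure count.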

Possible Solutions

  1. Recommended: define a small deterministic failure taxonomy and surface it in eval outputs plus dashboards. Add a classification pass that tags each non-passing case with one primary failure type and, where useful, secondary tags such as join/filter/grain ambiguity. Then expose those tags in the leaderboard faces.

Why this is recommended:

  • fast to reason about
  • easy to compare across runs
  • avoids making dashboard consumers reverse-engineer failure modes from raw text
  2. Keep only freeform rationales and inspect failures manually.

Trade-off: flexible, but not queryable. This does not scale once there are many runs.

  3. Use an LLM judge to classify all failures.

Trade-off: possibly richer, but noisier and harder to trust. Deterministic-first tagging is a better baseline.
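Whichever option wins, the persisted record under the recommended approach can stay small: one primary tag plus optional secondary tags. A hypothetical shape (field names are assumptions, not the existing types.py schema):

```python
from dataclasses import dataclass, field


@dataclass
class FailureTags:
    # Hypothetical per-case record persisted into results artifacts.
    primary: str                 # exactly one taxonomy bucket per failed case
    secondary: list[str] = field(default_factory=list)  # e.g. ambiguity notes


tags = FailureTags(primary="wrong_join", secondary=["filter_ambiguity"])
```

Keeping primary single-valued is what makes the dashboard slices additive; secondary tags absorb the genuinely ambiguous cases without muddying the counts.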

Plan

  1. Define a compact taxonomy, likely including:
     • syntax/parse
     • wrong table
     • wrong column
     • wrong join
     • wrong filter
     • wrong aggregation
     • wrong grouping/time grain
     • wrong ordering/limit
  2. Add a shared classifier in the eval pipeline that assigns those tags based on existing scorer and grounding outputs.
  3. Persist the primary failure type and optional secondary tags into results artifacts.
  4. Add leaderboard queries and charts that slice runs by failure type and compare backend/context variants across those slices.
  5. Validate that the taxonomy is stable enough to be useful before expanding it.

Files likely involved

  • apps/evals/sql/scorer.py
  • apps/evals/sql/runner.py
  • apps/evals/sql/types.py
  • apps/evals/faces/_sql_eval_queries.yml
  • eval leaderboard face files under apps/evals/faces/

Success criteria

  • every failed case can be grouped into a failure taxonomy bucket
  • the leaderboard can slice by failure type
  • follow-up experiments can compare changes by failure shape, not only by top-line pass rate

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - eval taxonomy and dashboard planning task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared