Dataface Tasks

Build text-to-SQL failure taxonomy and slice dashboard

ID: MCP_ANALYST_AGENT-BUILD_TEXT_TO_SQL_FAILURE_TAXONOMY_AND_SLICE_DASHBOARD
Status: not_started
Priority: p1
Milestone: m3-public-launch
Owner: data-ai-engineer-architect

Problem

The current leaderboard tells us whether a run passed, parsed, grounded, or matched the expected answer, but that is still too coarse for day-to-day iteration. A change can improve one category of failure while making another worse, and a single pass-rate number hides that shape completely.

We need a standard failure taxonomy so we can answer questions like:

  • are we mostly failing on wrong joins or wrong filters
  • did retrieval reduce wrong-table failures but increase wrong-aggregation failures
  • are regressions concentrated in time-grain or ordering mistakes

Without that taxonomy, the team will keep reasoning from anecdotes instead of from repeatable slices.

Context

The eval stack already emits structured results under apps/evals/ and the leaderboard project already reads those artifacts into Dataface via DuckDB. That means the infrastructure for surfacing additional dimensions already exists.

Likely raw signals already available or easy to derive include:

  • parse success/failure
  • grounding failures and hallucinated tables/columns
  • equivalence verdicts and rationales
  • benchmark metadata such as schema, complexity, and category

What is missing is a shared classification layer that maps these signals into a stable set of failure types the dashboards can group and compare over time.
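A minimal sketch of what that classification layer could look like, assuming a priority-ordered rule list over the signals above (the enum values and signal field names are illustrative, not a final taxonomy):

```python
from enum import Enum
from typing import Optional


class FailureType(str, Enum):
    # Illustrative subset of the taxonomy; the full set is defined in the Plan.
    PARSE = "syntax_parse"
    WRONG_TABLE = "wrong_table"
    WRONG_COLUMN = "wrong_column"
    WRONG_AGGREGATION = "wrong_aggregation"
    OTHER = "other"


def classify(signals: dict) -> Optional[FailureType]:
    """Deterministic, priority-ordered mapping from raw eval signals to
    one primary failure type. Returns None for passing cases."""
    if signals.get("passed"):
        return None
    if not signals.get("parsed", True):
        return FailureType.PARSE
    if signals.get("hallucinated_tables"):
        return FailureType.WRONG_TABLE
    if signals.get("hallucinated_columns"):
        return FailureType.WRONG_COLUMN
    if "aggregation" in signals.get("equivalence_rationale", ""):
        return FailureType.WRONG_AGGREGATION
    return FailureType.OTHER
```

The priority ordering is the point: every non-passing case gets exactly one primary bucket, so counts across buckets sum to the total failure count.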

Possible Solutions

  1. Recommended: define a small deterministic failure taxonomy and surface it in eval outputs plus dashboards. Add a classification pass that tags each non-passing case with one primary failure type and, where useful, secondary tags such as join/filter/grain ambiguity. Then expose those tags in the leaderboard faces.

Why this is recommended:

  • fast to reason about
  • easy to compare across runs
  • avoids making dashboard consumers reverse-engineer failure modes from raw text
  2. Keep only freeform rationales and inspect failures manually.

Trade-off: flexible, but not queryable. This does not scale once there are many runs.

  3. Use an LLM judge to classify all failures.

Trade-off: possibly richer, but noisier and harder to trust. Deterministic-first tagging is a better baseline.
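Whichever option wins, the persisted record under the recommended approach can stay small: one primary tag plus optional secondary tags. A hypothetical shape (field names are assumptions, not the existing types.py schema):

```python
from dataclasses import dataclass, field


@dataclass
class FailureTags:
    # Hypothetical per-case record persisted into results artifacts.
    primary: str                 # exactly one taxonomy bucket per failed case
    secondary: list[str] = field(default_factory=list)  # e.g. ambiguity notes


tags = FailureTags(primary="wrong_join", secondary=["filter_ambiguity"])
```

Keeping primary single-valued is what makes the dashboard slices additive; secondary tags absorb the genuinely ambiguous cases without muddying the counts.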

Plan

  1. Define a compact taxonomy, likely including:
     • syntax/parse
     • wrong table
     • wrong column
     • wrong join
     • wrong filter
     • wrong aggregation
     • wrong grouping/time grain
     • wrong ordering/limit
  2. Add a shared classifier in the eval pipeline that assigns those tags based on existing scorer and grounding outputs.
  3. Persist the primary failure type and optional secondary tags into results artifacts.
  4. Add leaderboard queries and charts that slice runs by failure type and compare backend/context variants across those slices.
  5. Validate that the taxonomy is stable enough to be useful before expanding it.

Files likely involved

  • apps/evals/sql/scorer.py
  • apps/evals/sql/runner.py
  • apps/evals/sql/types.py
  • apps/evals/faces/_sql_eval_queries.yml
  • eval leaderboard face files under apps/evals/faces/

Success criteria

  • every failed case can be grouped into a failure taxonomy bucket
  • the leaderboard can slice by failure type
  • follow-up experiments can compare changes by failure shape, not only by top-line pass rate

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - eval taxonomy and dashboard planning task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared