Dataface Tasks

Add retrieval-side gold labels to text-to-SQL benchmark

ID: MCP_ANALYST_AGENT-ADD_RETRIEVAL_SIDE_GOLD_LABELS_TO_TEXT_TO_SQL_BENCHMARK
Status: not_started
Priority: p1
Milestone: m3-public-launch
Owner: data-ai-engineer-architect

Problem

The current benchmark is good enough to score final SQL quality, but it is still weak at explaining why a retrieval or context-narrowing change helped or hurt. Right now a run can regress because retrieval missed the right table, because isolation dropped the right column, or because generation failed after retrieval did the right thing. Without retrieval-side gold labels, all of those failure modes collapse into the same downstream SQL score.

We need a benchmark representation that can answer narrower questions such as:

  • did the retriever surface the right tables in the top results
  • did the bundle keep the right columns
  • did the planner identify the right semantic targets even before SQL generation

That turns retrieval and planning quality into first-class metrics instead of forcing every improvement to prove itself only through final SQL equivalence.

Context

The existing text-to-SQL eval work under apps/evals/ already gives us cleaned benchmark artifacts, deterministic scoring, leaderboard dashboards, and new retrieval-oriented tasks under context-catalog-nimble. That means we do not need a second benchmark system. We need to extend the existing benchmark schema.

The likely benchmark records already contain:

  • question text
  • gold SQL
  • schema/category metadata

What is missing is retrieval-oriented supervision such as:

  • expected tables
  • expected columns
  • expected joins or relationship edges when known
  • expected metric/dimension/filter targets when the question semantics are clear

The labels do not need to encode one perfect canonical reasoning path. They need to be useful enough for recall-style metrics and bundle-inclusion checks.
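As a concrete sketch, a single annotated case might carry a payload like the one below. The field names (`gold_tables`, `gold_columns`, `gold_joins`, `gold_targets`) and the sample question are illustrative assumptions, not a settled contract:

```python
# Hypothetical retrieval-side gold labels for one benchmark case.
# Field names and values are illustrative; the real contract would be
# standardized in the benchmark schema (e.g. apps/evals/sql/types.py).
case_labels = {
    "question": "Total revenue by region last quarter?",
    "gold_sql": "SELECT region, SUM(revenue) FROM orders GROUP BY region",
    # Retrieval-side supervision:
    "gold_tables": ["orders"],
    "gold_columns": ["orders.region", "orders.revenue"],
    "gold_joins": [],  # relationship edges, when known
    "gold_targets": {  # semantic hints, only when question semantics are clear
        "metrics": ["revenue"],
        "dimensions": ["region"],
        "filters": ["last_quarter"],
    },
}
```

Labels at this granularity are enough to drive recall-style metrics without claiming to encode a canonical reasoning path.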

Possible Solutions

  1. Recommended: add lightweight gold retrieval annotations directly to benchmark cases. Extend each case with fields such as gold_tables, gold_columns, and optional semantic target hints. Keep them small, explicit, and easy to consume from both retrieval evals and planning evals.

Why this is the right first step:

  • stays inside the existing benchmark artifact
  • makes retrieval metrics easy to compute
  • supports both bundle evaluation and planning evaluation
  • avoids inventing a second label store
  2. Infer retrieval correctness from final SQL only.

Trade-off: cheapest in the short term, but it hides too much. A correct final SQL query does not prove the retriever was good, and a wrong final SQL query does not prove the retriever was bad.

  3. Build a separate retrieval benchmark detached from the SQL benchmark.

Trade-off: cleaner in theory, but it duplicates curation effort and makes it harder to compare retrieval and final SQL results on the same cases.
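Under option 1, the case type in the benchmark loader could grow optional fields with empty defaults so un-annotated cases keep loading unchanged. The `BenchmarkCase` name and its base fields below are assumptions about the current artifact, shown only to illustrate the shape of the extension:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BenchmarkCase:
    # Existing fields (assumed from the current benchmark artifact).
    question: str
    gold_sql: str
    category: Optional[str] = None
    # New, optional retrieval-side annotations. Empty defaults mean
    # existing un-annotated cases deserialize without any changes.
    gold_tables: list[str] = field(default_factory=list)
    gold_columns: list[str] = field(default_factory=list)
    gold_joins: list[tuple[str, str]] = field(default_factory=list)
```

Keeping the new fields optional is what lets the eval stack read labeled and unlabeled cases through the same code path.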

Plan

  1. Review the current benchmark schema and decide the minimal annotation fields that are worth standardizing.
  2. Add retrieval-side fields to the benchmark artifact contract, likely in the benchmark loader/types and any preparation scripts under apps/evals/sql/.
  3. Start with the highest-signal labels:
       • gold tables
       • gold columns
       • optional metric/dimension/filter targets when obvious
  4. Backfill those labels for a meaningful benchmark slice first rather than trying to annotate everything at once.
  5. Add helper utilities that compute:
       • top-k table hit
       • top-k column hit
       • bundle inclusion rate for gold tables/columns
  6. Surface the new fields and metrics in eval outputs so later retrieval and planning tasks can depend on them.
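The helper utilities in step 5 could start as two small recall-style functions. Function names and the empty-gold convention are illustrative choices, not an existing API:

```python
def top_k_hit(ranked: list[str], gold: list[str], k: int) -> float:
    """Fraction of gold items that appear in the top-k ranked results."""
    if not gold:
        # Vacuously correct; real code might instead skip unlabeled cases.
        return 1.0
    top = set(ranked[:k])
    return sum(1 for g in gold if g in top) / len(gold)


def bundle_inclusion_rate(bundle: set[str], gold: list[str]) -> float:
    """Fraction of gold tables/columns retained in the context bundle."""
    if not gold:
        return 1.0
    return sum(1 for g in gold if g in bundle) / len(gold)
```

For example, if the retriever ranks `["orders", "customers", "payments"]` and the gold tables are `["orders", "payments"]`, then `top_k_hit(..., k=2)` is 0.5: one of the two gold tables made the top two.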

Files likely involved

  • apps/evals/sql/types.py
  • apps/evals/sql/runner.py
  • apps/evals/sql/cli.py
  • benchmark artifact preparation code and data files under apps/evals/
  • retrieval/planning eval code that will consume the labels

Success criteria

  • benchmark cases can carry retrieval-side gold labels
  • the eval stack can read those labels without breaking existing runs
  • at least one retrieval-oriented metric can be computed from real benchmark cases

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - benchmark schema and eval task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared