Dataface Tasks

Add retrieval-side gold labels to text-to-SQL benchmark

ID: MCP_ANALYST_AGENT-ADD_RETRIEVAL_SIDE_GOLD_LABELS_TO_TEXT_TO_SQL_BENCHMARK
Status: not_started
Priority: p1
Milestone: m3-public-launch
Owner: data-ai-engineer-architect

Problem

The current benchmark is good enough to score final SQL quality, but it is still weak at explaining why a retrieval or context-narrowing change helped or hurt. Right now a run can regress because retrieval missed the right table, because isolation dropped the right column, or because generation failed after retrieval did the right thing. Without retrieval-side gold labels, all of those failure modes collapse into the same downstream SQL score.

We need a benchmark representation that can answer narrower questions such as:

  • did the retriever surface the right tables in the top results
  • did the bundle keep the right columns
  • did the planner identify the right semantic targets even before SQL generation

That turns retrieval and planning quality into first-class metrics instead of forcing every improvement to prove itself only through final SQL equivalence.

Context

The existing text-to-SQL eval work under apps/evals/ already gives us cleaned benchmark artifacts, deterministic scoring, leaderboard dashboards, and new retrieval-oriented tasks under context-catalog-nimble. That means we do not need a second benchmark system. We need to extend the existing benchmark schema.

The likely benchmark records already contain:

  • question text
  • gold SQL
  • schema/category metadata

What is missing is retrieval-oriented supervision such as:

  • expected tables
  • expected columns
  • expected joins or relationship edges when known
  • expected metric/dimension/filter targets when the question semantics are clear

The labels do not need to encode one perfect canonical reasoning path. They need to be useful enough for recall-style metrics and bundle-inclusion checks.
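As a concrete sketch, a single annotated case might carry a payload like the one below. The field names (`gold_tables`, `gold_columns`, `gold_joins`, `gold_targets`) and the sample question are illustrative assumptions, not a settled contract:

```python
# Hypothetical retrieval-side gold labels for one benchmark case.
# Field names and values are illustrative; the real contract would be
# standardized in the benchmark schema (e.g. apps/evals/sql/types.py).
case_labels = {
    "question": "Total revenue by region last quarter?",
    "gold_sql": "SELECT region, SUM(revenue) FROM orders GROUP BY region",
    # Retrieval-side supervision:
    "gold_tables": ["orders"],
    "gold_columns": ["orders.region", "orders.revenue"],
    "gold_joins": [],  # relationship edges, when known
    "gold_targets": {  # semantic hints, only when question semantics are clear
        "metrics": ["revenue"],
        "dimensions": ["region"],
        "filters": ["last_quarter"],
    },
}
```

Labels at this granularity are enough to drive recall-style metrics without claiming to encode a canonical reasoning path.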

Possible Solutions

  1. Recommended: add lightweight gold retrieval annotations directly to benchmark cases. Extend each case with fields such as gold_tables, gold_columns, and optional semantic target hints. Keep them small, explicit, and easy to consume from both retrieval evals and planning evals.

Why this is the right first step:

  • stays inside the existing benchmark artifact
  • makes retrieval metrics easy to compute
  • supports both bundle evaluation and planning evaluation
  • avoids inventing a second label store
  2. Infer retrieval correctness from final SQL only.

Trade-off: cheapest in the short term, but it hides too much. A correct final SQL query does not prove the retriever was good, and a wrong final SQL query does not prove the retriever was bad.

  3. Build a separate retrieval benchmark detached from the SQL benchmark.

Trade-off: cleaner in theory, but it duplicates curation effort and makes it harder to compare retrieval and final SQL results on the same cases.
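Under option 1, the case type in the benchmark loader could grow optional fields with empty defaults so un-annotated cases keep loading unchanged. The `BenchmarkCase` name and its base fields below are assumptions about the current artifact, shown only to illustrate the shape of the extension:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BenchmarkCase:
    # Existing fields (assumed from the current benchmark artifact).
    question: str
    gold_sql: str
    category: Optional[str] = None
    # New, optional retrieval-side annotations. Empty defaults mean
    # existing un-annotated cases deserialize without any changes.
    gold_tables: list[str] = field(default_factory=list)
    gold_columns: list[str] = field(default_factory=list)
    gold_joins: list[tuple[str, str]] = field(default_factory=list)
```

Keeping the new fields optional is what lets the eval stack read labeled and unlabeled cases through the same code path.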

Plan

  1. Review the current benchmark schema and decide the minimal annotation fields that are worth standardizing.
  2. Add retrieval-side fields to the benchmark artifact contract, likely in the benchmark loader/types and any preparation scripts under apps/evals/sql/.
  3. Start with the highest-signal labels:
       • gold tables
       • gold columns
       • optional metric/dimension/filter targets when obvious
  4. Backfill those labels for a meaningful benchmark slice first rather than trying to annotate everything at once.
  5. Add helper utilities that compute:
       • top-k table hit
       • top-k column hit
       • bundle inclusion rate for gold tables/columns
  6. Surface the new fields and metrics in eval outputs so later retrieval and planning tasks can depend on them.
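The helper utilities in step 5 could start as two small recall-style functions. Function names and the empty-gold convention are illustrative choices, not an existing API:

```python
def top_k_hit(ranked: list[str], gold: list[str], k: int) -> float:
    """Fraction of gold items that appear in the top-k ranked results."""
    if not gold:
        # Vacuously correct; real code might instead skip unlabeled cases.
        return 1.0
    top = set(ranked[:k])
    return sum(1 for g in gold if g in top) / len(gold)


def bundle_inclusion_rate(bundle: set[str], gold: list[str]) -> float:
    """Fraction of gold tables/columns retained in the context bundle."""
    if not gold:
        return 1.0
    return sum(1 for g in gold if g in bundle) / len(gold)
```

For example, if the retriever ranks `["orders", "customers", "payments"]` and the gold tables are `["orders", "payments"]`, then `top_k_hit(..., k=2)` is 0.5: one of the two gold tables made the top two.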

Files likely involved

  • apps/evals/sql/types.py
  • apps/evals/sql/runner.py
  • apps/evals/sql/cli.py
  • benchmark artifact preparation code and data files under apps/evals/
  • retrieval/planning eval code that will consume the labels

Success criteria

  • benchmark cases can carry retrieval-side gold labels
  • the eval stack can read those labels without breaking existing runs
  • at least one retrieval-oriented metric can be computed from real benchmark cases

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - benchmark schema and eval task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared