Add retrieval-side gold labels to text-to-SQL benchmark
Problem
The current benchmark is good enough to score final SQL quality, but it is still weak at explaining why a retrieval or context-narrowing change helped or hurt. Right now a run can regress because retrieval missed the right table, because isolation dropped the right column, or because generation failed after retrieval did the right thing. Without retrieval-side gold labels, all of those failure modes collapse into the same downstream SQL score.
We need a benchmark representation that can answer narrower questions such as:
- did the retriever surface the right tables in the top results
- did the bundle keep the right columns
- did the planner identify the right semantic targets even before SQL generation
That turns retrieval and planning quality into first-class metrics instead of forcing every improvement to prove itself only through final SQL equivalence.
Context
The existing text-to-SQL eval work under apps/evals/ already gives us cleaned benchmark artifacts, deterministic scoring, leaderboard dashboards, and new retrieval-oriented tasks under context-catalog-nimble. That means we do not need a second benchmark system. We need to extend the existing benchmark schema.
The likely benchmark records already contain:
- question text
- gold SQL
- schema/category metadata
What is missing is retrieval-oriented supervision such as:
- expected tables
- expected columns
- expected joins or relationship edges when known
- expected metric/dimension/filter targets when the question semantics are clear
The labels do not need to encode one perfect canonical reasoning path. They need to be useful enough for recall-style metrics and bundle-inclusion checks.
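Concretely, an annotated case could be as small as a few extra keys on the existing record. A minimal sketch is below; the field names (gold_tables, gold_columns, gold_targets) and the sample values are illustrative, not a settled contract:

```python
# Hypothetical benchmark case with retrieval-side gold labels.
# All gold_* field names are assumptions for illustration.
case = {
    "question": "Total revenue by region for 2023",
    "gold_sql": "SELECT region, SUM(revenue) FROM sales "
                "WHERE year = 2023 GROUP BY region",
    "category": "aggregation",
    # Retrieval-side supervision: small, explicit, recall-friendly.
    "gold_tables": ["sales"],
    "gold_columns": ["sales.region", "sales.revenue", "sales.year"],
    # Optional semantic target hints, only when question semantics are clear.
    "gold_targets": {
        "metric": ["revenue"],
        "dimension": ["region"],
        "filter": ["year"],
    },
}
```

Cases without clear semantics would simply omit the optional keys rather than carry a guessed canonical reasoning path.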
Possible Solutions
- Recommended: add lightweight gold retrieval annotations directly to benchmark cases.
Extend each case with fields such as gold_tables, gold_columns, and optional semantic target hints. Keep them small, explicit, and easy to consume from both retrieval evals and planning evals.
Why this is the right first step:
- stays inside the existing benchmark artifact
- makes retrieval metrics easy to compute
- supports both bundle evaluation and planning evaluation
- avoids inventing a second label store
- Infer retrieval correctness from final SQL only.
Trade-off: cheapest in the short term, but it hides too much. A correct final SQL query does not prove the retriever was good, and a wrong final SQL query does not prove the retriever was bad.
- Build a separate retrieval benchmark detached from the SQL benchmark.
Trade-off: cleaner in theory, but it duplicates curation effort and makes it harder to compare retrieval and final SQL results on the same cases.
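Under the recommended option, the schema change could land in the loader types roughly like this. This is a sketch under assumptions: the actual class names and fields in apps/evals/sql/types.py may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GoldRetrieval:
    """Optional retrieval-side supervision for one benchmark case.

    Names and shapes here are illustrative, not the real contract.
    """
    tables: list[str] = field(default_factory=list)
    # Columns in "table.column" form so they disambiguate across tables.
    columns: list[str] = field(default_factory=list)
    # Relationship edges (table_a, table_b), only when known.
    joins: list[tuple[str, str]] = field(default_factory=list)
    # Metric/dimension/filter hints, only when question semantics are clear.
    targets: Optional[dict[str, list[str]]] = None


@dataclass
class BenchmarkCase:
    question: str
    gold_sql: str
    category: Optional[str] = None
    # Absent on un-annotated cases, so existing runs keep working.
    gold_retrieval: Optional[GoldRetrieval] = None
```

Making gold_retrieval optional with a None default is what lets the label rollout be incremental: the loader accepts both annotated and legacy cases.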
Plan
- Review the current benchmark schema and decide the minimal annotation fields that are worth standardizing.
- Add retrieval-side fields to the benchmark artifact contract, likely in the benchmark loader/types and any preparation scripts under apps/evals/sql/.
- Start with the highest-signal labels:
  - gold tables
  - gold columns
  - optional metric/dimension/filter targets when obvious
- Backfill those labels for a meaningful benchmark slice first rather than trying to annotate everything at once.
- Add helper utilities that compute:
  - top-k table hit
  - top-k column hit
  - bundle inclusion rate for gold tables/columns
- Surface the new fields and metrics in eval outputs so later retrieval and planning tasks can depend on them.
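The helper utilities in the plan reduce to set arithmetic over the gold labels. A sketch, with assumed function names and signatures:

```python
def topk_hit(retrieved: list[str], gold: list[str], k: int) -> float:
    """Fraction of gold items present in the top-k retrieved list.

    Works for tables or columns alike; recall-style by design.
    """
    if not gold:
        return 1.0  # nothing to find counts as a hit
    top = set(retrieved[:k])
    return sum(1 for g in gold if g in top) / len(gold)


def bundle_inclusion(bundle: set[str], gold: list[str]) -> float:
    """Fraction of gold tables/columns kept in the final context bundle."""
    if not gold:
        return 1.0
    return sum(1 for g in gold if g in bundle) / len(gold)


# Example: retriever ranked sales first, orders second, users third.
table_recall = topk_hit(["sales", "orders", "users"], ["sales", "orders"], k=2)
kept = bundle_inclusion({"sales.region"}, ["sales.region", "sales.revenue"])
```

Returning 1.0 for an empty gold list is one reasonable convention (an un-annotated case never penalizes the run); the alternative is to skip such cases entirely when aggregating.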
Files likely involved
- apps/evals/sql/types.py
- apps/evals/sql/runner.py
- apps/evals/sql/cli.py
- benchmark artifact preparation code and data files under apps/evals/
- retrieval/planning eval code that will consume the labels
Success criteria
- benchmark cases can carry retrieval-side gold labels
- the eval stack can read those labels without breaking existing runs
- at least one retrieval-oriented metric can be computed from real benchmark cases
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - benchmark schema and eval task.
Review Feedback
No review feedback yet.
- [ ] Review cleared