Compare question bundle variants for text-to-SQL evals
Problem
Once the first retrieval-vs-full-context comparison exists, the next question is not just "does narrowing help?" but "which parts of the narrowed bundle help?" If we do not separate bundle ingredients, later retrieval work will become guesswork.
We need an experiment that compares different bundle shapes and additions so the team can learn whether SQL quality benefits more from:
- relationship edges
- value hints
- planner-oriented summaries
- richer explanations of why a table was included
Context
The M2 retrieval work already introduces:
- a searchable schema corpus
- a question-scoped bundle path
- a direct bundle-vs-full-context A/B task
That provides the baseline needed for a more structured follow-up experiment. This task should not become a giant factorial sweep. It should compare a small number of interpretable bundle variants using the same benchmark slice and same generation configuration.
Possible Solutions
- Recommended: run a small controlled bundle-variant matrix against the same benchmark slice.
Compare the full-context baseline and the default bundle against a few targeted variants such as
bundle + relationships, bundle + value hints, and bundle + plan hints.
Why this is recommended:
- isolates the value of specific context ingredients
- keeps the results easy to explain
- directly informs which retrieval follow-ups are worth implementing
- Change bundle structure ad hoc and rely on intuition.
Trade-off: faster initially, but poor learning. It becomes hard to tell which ingredient actually moved quality.
- Test many variants at once with a large experiment matrix.
Trade-off: more complete, but too much noise for a first comparison. Start smaller.
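The recommended option can be pinned down as data before any runner exists. The sketch below is illustrative only: `BundleVariant`, the field names, and the ingredient labels are assumptions for this task, not an existing config format.

```python
# Hypothetical declaration of the small variant matrix described above.
# Field names and ingredient labels are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class BundleVariant:
    name: str                 # condition label used in result tables
    use_bundle: bool = True   # False = full-context baseline
    extras: tuple = ()        # ingredients layered on the default bundle


VARIANTS = (
    BundleVariant("full_context", use_bundle=False),
    BundleVariant("bundle_default"),
    BundleVariant("bundle_relationships", extras=("relationships",)),
    BundleVariant("bundle_value_hints", extras=("value_hints",)),
    BundleVariant("bundle_plan_hints", extras=("plan_hints",)),
)
```

Keeping each variant a one-line declaration makes it cheap to review exactly which ingredient each condition adds, which is the point of avoiding an ad-hoc or factorial design.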
Plan
- Pick a stable benchmark slice and lock model/judge/backend settings.
- Define a small variant set, likely:
  - full context
  - default bundle
  - bundle plus relationships
  - bundle plus value hints
  - bundle plus planner-oriented summaries
- Run each condition with the same prompt family and benchmark slice.
- Compare:
  - pass/equivalence/grounding metrics
  - prompt-size reduction
  - retrieval-side metrics when gold labels exist
- Summarize which bundle ingredients are worth keeping by default and which should remain optional or deferred.
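The comparison step of the plan can be sketched as a small aggregation over per-question result rows. This is a sketch under assumptions: the row shape (`condition`, `passed`, `prompt_tokens`) and the condition name `full_context` are hypothetical, and the real harness would also carry equivalence/grounding and retrieval metrics.

```python
# Hypothetical summary step: one row per (condition, question), aggregated
# into pass rate and prompt-size reduction vs. the full-context baseline.
from statistics import mean


def summarize(results):
    """Aggregate result rows like
    {"condition": str, "passed": bool, "prompt_tokens": int}
    into one summary dict per condition."""
    by_condition = {}
    for row in results:
        by_condition.setdefault(row["condition"], []).append(row)

    # Prompt-size reduction is measured against the locked baseline.
    baseline_tokens = mean(
        r["prompt_tokens"] for r in by_condition["full_context"]
    )
    summary = {}
    for name, rows in by_condition.items():
        tokens = mean(r["prompt_tokens"] for r in rows)
        summary[name] = {
            "pass_rate": mean(1.0 if r["passed"] else 0.0 for r in rows),
            "prompt_reduction": 1.0 - tokens / baseline_tokens,
        }
    return summary
```

Because every condition shares the same benchmark slice and generation settings, a table built from this summary directly answers which ingredient moved pass rate and at what prompt-size cost.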
Explicit anti-goals
- no giant experiment wave
- no model comparison drift during the bundle comparison
- no search-speed work
Success criteria
- the team can point to specific bundle ingredients that help or hurt
- retrieval follow-up tasks become evidence-driven rather than speculative
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - eval comparison task.
Review Feedback
No review feedback yet.
- [ ] Review cleared