Compare text-to-SQL evals with question-aware retrieval vs full-context prompting
Problem
After the retrieval CLI and bundle integration land, run paired local evals comparing question-aware retrieval-and-isolation against the current full-context baseline. One condition uses the retrieval tool/bundle path; the other uses the full schema/info path with no retrieval step. The task should capture the command matrix, output artifacts, and comparison criteria.
Context
This task depends on the retrieval setup work landing first:
- the local corpus/search/bundle CLI must exist
- the text-to-SQL backend/eval layer must be able to consume the resulting question-scoped bundle
The comparison should stay simple and direct:
- With retrieval tool/bundle: the generator gets only the question-scoped isolated bundle
- Without retrieval tool/bundle: the generator gets the current full schema/info context with no narrowing step
This task is not about proving the retrieval is elegant or fast. It is about proving whether even a simple narrowing layer improves downstream SQL quality enough to be worth keeping.
Possible Solutions
- Recommended: paired A/B eval runs on the same benchmark slices. Run the same benchmark subset, same model, same judge, and same prompt family under two conditions: retrieved-and-isolated context versus the full-context baseline. Compare both retrieval-specific and downstream SQL metrics.
Why this is recommended:
- isolates the value of narrowing itself
- keeps the experiment easy to explain
- avoids mixing retrieval effects with model or prompt changes
- Fold the retrieval condition into general model/context ablations and infer the result indirectly.
Trade-off: possible, but harder to interpret. This specific retrieval-vs-full-context question deserves its own direct comparison.
- Compare only retrieval metrics such as hit rate without running text-to-SQL evals.
Trade-off: not enough. The initiative exists to help the generator, so downstream SQL quality must be part of the verdict.
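To make the recommended paired design concrete, the two conditions can be captured as a small run matrix. Everything below is an illustrative sketch: the `text2sql-eval` entry point, its flags, and the condition names are assumptions, not an existing CLI.

```python
# Hypothetical sketch of the two-condition run matrix for the paired A/B eval.
# CLI name, flags, and condition labels are illustrative assumptions.

CONDITIONS = {
    "full_context": {
        "context_source": "full_schema",      # baseline: full schema/info, no narrowing
        "retrieval": False,
    },
    "retrieved_bundle": {
        "context_source": "question_bundle",  # question-scoped isolated bundle
        "retrieval": True,
    },
}

def build_run_command(condition: str, benchmark_slice: str, out_dir: str) -> list[str]:
    """Assemble one eval invocation; only the context source differs per condition."""
    cfg = CONDITIONS[condition]
    cmd = [
        "text2sql-eval",                      # hypothetical eval entry point
        "--benchmark", benchmark_slice,
        "--context-source", cfg["context_source"],
        "--out", f"{out_dir}/{condition}",
    ]
    if cfg["retrieval"]:
        cmd.append("--use-retrieval-bundle")
    return cmd
```

Keeping both commands generated from one matrix makes it easy to record the exact command pair alongside the run artifacts.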
Plan
- Choose a stable benchmark slice or canary set.
- Lock the generation configuration:
  - same model
  - same judge
  - same backend except for the context source
- Run condition A: full-context baseline.
- Run condition B: question-aware retrieval/bundle context.
- Capture run artifacts and note the exact commands used.
- Compare:
  - pass rate
  - parse rate
  - equivalence rate
  - grounding failures
  - prompt/context size reduction when available
  - bundle inclusion of gold tables/columns when available
- Summarize whether the simple retrieval path is clearly better, neutral, or worse than full-context prompting.
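The comparison step above can be sketched as a small helper that diffs the two runs' metrics and emits a coarse verdict. The metric names follow the list above, but the run-summary dict format and the 1-point tolerance band for "neutral" are assumptions.

```python
# Minimal sketch of the paired comparison step. The input dicts are assumed
# per-run metric summaries (e.g. parsed from each run directory's results file).

def compare_runs(
    baseline: dict,
    retrieval: dict,
    metrics: tuple = ("pass_rate", "parse_rate", "equivalence_rate"),
) -> dict:
    """Return per-metric deltas (retrieval minus baseline) and a coarse verdict."""
    deltas = {
        m: round(retrieval[m] - baseline[m], 4)
        for m in metrics
        if m in baseline and m in retrieval
    }
    # Verdict rule (assumed): judge on pass rate with a +/-0.01 tolerance band.
    d = deltas.get("pass_rate", 0.0)
    verdict = "better" if d > 0.01 else "worse" if d < -0.01 else "neutral"
    return {"deltas": deltas, "verdict": verdict}
```

A summary like this feeds directly into the "clearly better / neutral / worse" call the plan asks for, while keeping the raw deltas around for the written comparison.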
Expected outputs
- paired run directories under the eval output path
- a short comparison summary
- explicit recommendation on whether to keep pushing the retrieval path
Explicit anti-goals
- no search-speed benchmarking
- no retrieval optimization sweep
- no broad factorial experiment matrix before this direct A/B exists
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - eval comparison task.
Review Feedback
- [ ] Review cleared