Compare text-to-SQL evals with question-aware retrieval vs full-context prompting
Problem
After the retrieval CLI and bundle integration land, run paired local evals comparing question-aware retrieval-and-isolation against the current full-context baseline. One condition uses the retrieval tool/bundle path; the other uses the full schema/info path with no retrieval step. The task should capture the command matrix, output artifacts, and comparison criteria.
Context
This task depends on the retrieval setup work landing first:
- the local corpus/search/bundle CLI must exist
- the text-to-SQL backend/eval layer must be able to consume the resulting question-scoped bundle
The comparison should stay simple and direct:
- With retrieval tool/bundle: the generator gets only the question-scoped isolated bundle
- Without retrieval tool/bundle: the generator gets the current full schema/info context with no narrowing step
This task is not about proving the retrieval is elegant or fast. It is about proving whether even a simple narrowing layer improves downstream SQL quality enough to be worth keeping.
Possible Solutions
- Recommended: paired A/B eval runs on the same benchmark slices. Run the same benchmark subset, same model, same judge, and same prompt family under two conditions: retrieved-and-isolated context versus the full-context baseline. Compare both retrieval-specific and downstream SQL metrics.
Why this is recommended:
- isolates the value of narrowing itself
- keeps the experiment easy to explain
- avoids mixing retrieval effects with model or prompt changes
- Fold the retrieval condition into general model/context ablations and infer the result indirectly.
Trade-off: possible, but harder to interpret. This specific retrieval-vs-full-context question deserves its own direct comparison.
- Compare only retrieval metrics such as hit rate without running text-to-SQL evals.
Trade-off: not enough. The initiative exists to help the generator, so downstream SQL quality must be part of the verdict.
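To make the recommended paired design concrete, the two conditions can be captured as a small run matrix. Everything below is an illustrative sketch: the `text2sql-eval` entry point, its flags, and the condition names are assumptions, not an existing CLI.

```python
# Hypothetical sketch of the two-condition run matrix for the paired A/B eval.
# CLI name, flags, and condition labels are illustrative assumptions.

CONDITIONS = {
    "full_context": {
        "context_source": "full_schema",      # baseline: full schema/info, no narrowing
        "retrieval": False,
    },
    "retrieved_bundle": {
        "context_source": "question_bundle",  # question-scoped isolated bundle
        "retrieval": True,
    },
}

def build_run_command(condition: str, benchmark_slice: str, out_dir: str) -> list[str]:
    """Assemble one eval invocation; only the context source differs per condition."""
    cfg = CONDITIONS[condition]
    cmd = [
        "text2sql-eval",                      # hypothetical eval entry point
        "--benchmark", benchmark_slice,
        "--context-source", cfg["context_source"],
        "--out", f"{out_dir}/{condition}",
    ]
    if cfg["retrieval"]:
        cmd.append("--use-retrieval-bundle")
    return cmd
```

Keeping both commands generated from one matrix makes it easy to record the exact command pair alongside the run artifacts.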
Plan
- Choose a stable benchmark slice or canary set.
- Lock the generation configuration:
  - same model
  - same judge
  - same backend except for the context source
- Run condition A: full-context baseline.
- Run condition B: question-aware retrieval/bundle context.
- Capture run artifacts and note the exact commands used.
- Compare:
  - pass rate
  - parse rate
  - equivalence rate
  - grounding failures
  - prompt/context size reduction when available
  - bundle inclusion of gold tables/columns when available
- Summarize whether the simple retrieval path is clearly better, neutral, or worse than full-context prompting.
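The comparison step above can be sketched as a small helper that diffs the two runs' metrics and emits a coarse verdict. The metric names follow the list above, but the run-summary dict format and the 1-point tolerance band for "neutral" are assumptions.

```python
# Minimal sketch of the paired comparison step. The input dicts are assumed
# per-run metric summaries (e.g. parsed from each run directory's results file).

def compare_runs(
    baseline: dict,
    retrieval: dict,
    metrics: tuple = ("pass_rate", "parse_rate", "equivalence_rate"),
) -> dict:
    """Return per-metric deltas (retrieval minus baseline) and a coarse verdict."""
    deltas = {
        m: round(retrieval[m] - baseline[m], 4)
        for m in metrics
        if m in baseline and m in retrieval
    }
    # Verdict rule (assumed): judge on pass rate with a +/-0.01 tolerance band.
    d = deltas.get("pass_rate", 0.0)
    verdict = "better" if d > 0.01 else "worse" if d < -0.01 else "neutral"
    return {"deltas": deltas, "verdict": verdict}
```

A summary like this feeds directly into the "clearly better / neutral / worse" call the plan asks for, while keeping the raw deltas around for the written comparison.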
Expected outputs
- paired run directories under the eval output path
- a short comparison summary
- explicit recommendation on whether to keep pushing the retrieval path
Explicit anti-goals
- no search-speed benchmarking
- no retrieval optimization sweep
- no broad factorial experiment matrix before this direct A/B exists
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - eval comparison task.
Review Feedback
- [ ] Review cleared