Compare question bundle variants for text-to-SQL evals
Problem
Once the first retrieval-vs-full-context comparison exists, the next question is not just "does narrowing help?" but "which parts of the narrowed bundle help?" If we do not separate bundle ingredients, later retrieval work will become guesswork.
We need an experiment that compares different bundle shapes and additions so the team can learn whether SQL quality benefits more from:
- relationship edges
- value hints
- planner-oriented summaries
- richer explanations of why a table was included
Context
The M2 retrieval work already introduces:
- a searchable schema corpus
- a question-scoped bundle path
- a direct bundle-vs-full-context A/B task
That provides the baseline needed for a more structured follow-up experiment. This task should not become a giant factorial sweep. It should compare a small number of interpretable bundle variants using the same benchmark slice and same generation configuration.
Possible Solutions
- Recommended: run a small controlled bundle-variant matrix against the same benchmark slice.
Compare the full-context baseline and the default bundle against a few targeted variants such as
bundle + relationships, bundle + value hints, and bundle + plan hints.
Why this is recommended:
- isolates the value of specific context ingredients
- keeps the results easy to explain
- directly informs which retrieval follow-ups are worth implementing
- Change bundle structure ad hoc and rely on intuition.
Trade-off: faster initially, but poor learning. It becomes hard to tell which ingredient actually moved quality.
- Test many variants at once with a large experiment matrix.
Trade-off: more complete, but too much noise for a first comparison. Start smaller.
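The recommended option can be pinned down as data before any runner exists. The sketch below is illustrative only: `BundleVariant`, the field names, and the ingredient labels are assumptions for this task, not an existing config format.

```python
# Hypothetical declaration of the small variant matrix described above.
# Field names and ingredient labels are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class BundleVariant:
    name: str                 # condition label used in result tables
    use_bundle: bool = True   # False = full-context baseline
    extras: tuple = ()        # ingredients layered on the default bundle


VARIANTS = (
    BundleVariant("full_context", use_bundle=False),
    BundleVariant("bundle_default"),
    BundleVariant("bundle_relationships", extras=("relationships",)),
    BundleVariant("bundle_value_hints", extras=("value_hints",)),
    BundleVariant("bundle_plan_hints", extras=("plan_hints",)),
)
```

Keeping each variant a one-line declaration makes it cheap to review exactly which ingredient each condition adds, which is the point of avoiding an ad-hoc or factorial design.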
Plan
- Pick a stable benchmark slice and lock model/judge/backend settings.
- Define a small variant set, likely:
  - full context
  - default bundle
  - bundle plus relationships
  - bundle plus value hints
  - bundle plus planner-oriented summaries
- Run each condition with the same prompt family and benchmark slice.
- Compare:
  - pass/equivalence/grounding metrics
  - prompt-size reduction
  - retrieval-side metrics when gold labels exist
- Summarize which bundle ingredients are worth keeping by default and which should remain optional or deferred.
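The comparison step of the plan can be sketched as a small aggregation over per-question result rows. This is a sketch under assumptions: the row shape (`condition`, `passed`, `prompt_tokens`) and the condition name `full_context` are hypothetical, and the real harness would also carry equivalence/grounding and retrieval metrics.

```python
# Hypothetical summary step: one row per (condition, question), aggregated
# into pass rate and prompt-size reduction vs. the full-context baseline.
from statistics import mean


def summarize(results):
    """Aggregate result rows like
    {"condition": str, "passed": bool, "prompt_tokens": int}
    into one summary dict per condition."""
    by_condition = {}
    for row in results:
        by_condition.setdefault(row["condition"], []).append(row)

    # Prompt-size reduction is measured against the locked baseline.
    baseline_tokens = mean(
        r["prompt_tokens"] for r in by_condition["full_context"]
    )
    summary = {}
    for name, rows in by_condition.items():
        tokens = mean(r["prompt_tokens"] for r in rows)
        summary[name] = {
            "pass_rate": mean(1.0 if r["passed"] else 0.0 for r in rows),
            "prompt_reduction": 1.0 - tokens / baseline_tokens,
        }
    return summary
```

Because every condition shares the same benchmark slice and generation settings, a table built from this summary directly answers which ingredient moved pass rate and at what prompt-size cost.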
Explicit anti-goals
- no giant experiment wave
- no model comparison drift during the bundle comparison
- no search-speed work
Success criteria
- the team can point to specific bundle ingredients that help or hurt
- retrieval follow-up tasks become evidence-driven rather than speculative
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - eval comparison task.
Review Feedback
No review feedback yet.
- [ ] Review cleared