Dataface Tasks

Compare question bundle variants for text-to-SQL evals

ID: MCP_ANALYST_AGENT-COMPARE_QUESTION_BUNDLE_VARIANTS_FOR_TEXT_TO_SQL_EVALS
Status: not_started
Priority: p1
Milestone: m3-public-launch
Owner: data-ai-engineer-architect

Problem

Once the first retrieval-vs-full-context comparison exists, the next question is not just "does narrowing help?" but "which parts of the narrowed bundle help?" If we do not separate bundle ingredients, later retrieval work will become guesswork.

We need an experiment that compares different bundle shapes and additions so the team can learn whether SQL quality benefits more from:

  • relationship edges
  • value hints
  • planner-oriented summaries
  • richer explanations of why a table was included
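The ingredients above can be thought of as independent toggles on top of the default bundle. A minimal sketch, assuming a hypothetical `BundleIngredients` config (the name and fields are illustrative, not taken from the codebase):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BundleIngredients:
    relationships: bool = False      # relationship edges between tables
    value_hints: bool = False        # example values to help ground literals
    plan_summaries: bool = False     # planner-oriented summaries
    inclusion_reasons: bool = False  # richer "why this table" explanations

# e.g. the "bundle + value hints" variant
variant = BundleIngredients(value_hints=True)
```

Keeping each ingredient a separate flag is what lets a later experiment attribute quality changes to one ingredient at a time.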

Context

The M2 retrieval work already introduces:

  • a searchable schema corpus
  • a question-scoped bundle path
  • a direct bundle-vs-full-context A/B task

That provides the baseline needed for a more structured follow-up experiment. This task should not become a giant factorial sweep. It should compare a small number of interpretable bundle variants using the same benchmark slice and same generation configuration.

Possible Solutions

  1. Recommended: run a small controlled bundle-variant matrix against the same benchmark slice. Compare the full-context baseline and the default bundle against a few targeted variants such as bundle + relationships, bundle + value hints, and bundle + plan hints.

Why this is recommended:

  • isolates the value of specific context ingredients
  • keeps the results easy to explain
  • directly informs which retrieval follow-ups are worth implementing
  2. Change bundle structure ad hoc and rely on intuition.

Trade-off: faster initially, but yields little learning. It becomes hard to tell which ingredient actually moved quality.

  3. Test many variants at once with a large experiment matrix.

Trade-off: more complete, but too much noise for a first comparison. Start smaller.
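To make the trade-off between options 1 and 3 concrete: a full factorial sweep over four bundle ingredients already produces 16 conditions, while the recommended targeted set stays at five. A quick illustrative sketch (condition names are examples, not fixed identifiers):

```python
from itertools import product

# Four candidate bundle ingredients from the problem statement.
INGREDIENTS = ["relationships", "value_hints", "plan_summaries", "inclusion_reasons"]

# Option 3: full factorial sweep -> every on/off combination.
factorial_conditions = list(product([False, True], repeat=len(INGREDIENTS)))

# Option 1 (recommended): baseline, default bundle, plus one ingredient at a time.
targeted_conditions = [
    "full_context",
    "default_bundle",
    "bundle+relationships",
    "bundle+value_hints",
    "bundle+plan_summaries",
]

print(len(factorial_conditions), len(targeted_conditions))  # 16 5
```

Five interpretable conditions are enough to rank individual ingredients; interactions between ingredients can wait for a follow-up once the first-order effects are known.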

Plan

  1. Pick a stable benchmark slice and lock model/judge/backend settings.
  2. Define a small variant set, likely:
       • full context
       • default bundle
       • bundle plus relationships
       • bundle plus value hints
       • bundle plus planner-oriented summaries
  3. Run each condition with the same prompt family and benchmark slice.
  4. Compare:
       • pass/equivalence/grounding metrics
       • prompt-size reduction
       • retrieval-side metrics when gold labels exist
  5. Summarize which bundle ingredients are worth keeping by default and which should remain optional or deferred.
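Steps 3–5 amount to a small aggregation over per-condition results. A hedged sketch of what that summary could look like — `summarize` and the result fields (`passed`, `prompt_tokens`) are hypothetical stand-ins for whatever the real eval harness records:

```python
from statistics import mean

def summarize(results_by_condition, baseline="full_context"):
    """Aggregate pass rate and prompt-size reduction per condition.

    results_by_condition maps a condition name to a list of per-question
    results, each a dict with "passed" (bool) and "prompt_tokens" (int).
    Prompt-size reduction is measured against the full-context baseline.
    """
    base_tokens = mean(r["prompt_tokens"] for r in results_by_condition[baseline])
    summary = {}
    for cond, results in results_by_condition.items():
        tokens = mean(r["prompt_tokens"] for r in results)
        summary[cond] = {
            "pass_rate": mean(1.0 if r["passed"] else 0.0 for r in results),
            "prompt_size_reduction": 1.0 - tokens / base_tokens,
        }
    return summary
```

Because every condition runs on the same benchmark slice with the same prompt family, a simple per-condition table like this is enough to decide which ingredients become defaults and which stay optional.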

Explicit anti-goals

  • no giant experiment wave
  • no model comparison drift during the bundle comparison
  • no search-speed work

Success criteria

  • the team can point to specific bundle ingredients that help or hurt
  • retrieval follow-up tasks become evidence-driven rather than speculative

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - eval comparison task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared