Dataface Tasks

Iterate on question-aware retrieval with interface and result experiments

ID: CONTEXT_CATALOG_NIMBLE-ITERATE_ON_QUESTION_AWARE_RETRIEVAL_WITH_INTERFACE_AND_RESULT_EXPERIMENTS
Status: not_started
Priority: p2
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: question-aware-schema-retrieval-and-narrowing

Problem

After the initial A/B eval comparing retrieval versus full-context prompting, run a small set of follow-up experiments to improve the question-aware retrieval path. Focus on interface and result quality experiments such as different search outputs, bundle shapes, ranking heuristics, and isolation policies rather than speed or indexing optimization.

Context

This task should only start after the direct A/B comparison exists. The goal is not to optimize blindly; it is to use the first comparison to identify where the simple retrieval path is weak and then run a few targeted experiments.

The most likely iteration axes are:

  • search interface: what form of output best helps the agent decide what to use
  • bundle shape: what the generator should actually see after isolation
  • ranking heuristics: table-first vs column-first vs relationship-aware boosts
  • isolation policy: how aggressively to trim tables, columns, and descriptions
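As a sketch of how these four axes could be parameterized so variants are easy to enumerate and compare, a minimal config object (all field names and values here are hypothetical, not an existing interface):

```python
from dataclasses import dataclass

# Hypothetical sketch: each iteration axis from the list above becomes one
# field of an experiment-variant config, so a variant is a single record.
@dataclass(frozen=True)
class RetrievalVariant:
    search_output: str   # e.g. "ids_only" vs "ids_with_why_matched"
    bundle_shape: str    # e.g. "summaries_only" vs "columns_and_relationships"
    ranking: str         # e.g. "table_first", "column_first", "relationship_boosted"
    isolation: str       # e.g. "narrow" vs "medium"

# The simple retrieval path currently in place would be one such variant.
baseline = RetrievalVariant(
    search_output="ids_only",
    bundle_shape="summaries_only",
    ranking="table_first",
    isolation="narrow",
)
print(baseline.ranking)  # → table_first
```

Freezing the dataclass keeps variants hashable, so they can key a results table when comparing eval runs.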

This task should stay aligned with the M2 philosophy:

  • no performance tuning
  • no fancy indexing work
  • no embedding/vector detour unless the simple approach clearly fails
  • small experiments that improve usefulness for the agent

Possible Solutions

  1. Recommended: run a small post-A/B experiment matrix over interface and result-shaping. Use the baseline comparison results to choose a few targeted retrieval experiments, then compare them on the same eval slice. Focus on things the agent actually experiences: ranked result format, bundle composition, and simple heuristic changes.

Why this is recommended:

  • keeps experimentation tied to observed failures
  • improves the usefulness of the retriever without changing the basic architecture
  • avoids premature optimization
  2. Keep changing the retriever ad hoc without running structured comparisons.

Trade-off: faster in the moment, but it becomes hard to learn which changes actually help.

  3. Jump immediately to a new retrieval architecture such as embeddings or a dedicated service.

Trade-off: too big for a follow-up iteration task and not aligned with the current M2 simplicity constraint.

Plan

  1. Review the first A/B comparison and identify the dominant failure modes.
  2. Pick a small experiment matrix, for example:
     - result list shape: terse vs richer explanations
     - bundle composition: table summaries only vs selected columns + relationships
     - ranking heuristic: exact-name heavy vs description/role boosted
     - isolation policy: narrow vs medium bundle sizes
  3. Run the same eval slice across those variants.
  4. Compare downstream SQL quality and note which retrieval presentation/shape helps most.
  5. Keep the winning variant if it materially helps; otherwise keep the simpler version.
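The experiment matrix in step 2 can be enumerated mechanically before pruning it down to a handful of runs. A minimal sketch, assuming the axis names and values from the plan (the full cross product is only illustrative; in practice only a few cells tied to observed failures would be run):

```python
from itertools import product

# Hypothetical sketch: axes mirror step 2 of the plan. The actual values
# chosen should come from the failure modes seen in the first A/B run.
axes = {
    "result_list_shape": ["terse", "richer_explanations"],
    "bundle_composition": ["table_summaries_only", "columns_plus_relationships"],
    "ranking_heuristic": ["exact_name_heavy", "description_boosted"],
    "isolation_policy": ["narrow", "medium"],
}

# Cross product of all axis values, each variant as a named dict.
variants = [dict(zip(axes, values)) for values in product(*axes.values())]
print(len(variants))  # → 16; a real matrix would keep only a few cells
```

Enumerating first and then pruning makes it explicit which cells were skipped, which keeps the experiment set tied to the A/B evidence rather than ad hoc tinkering.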

Example experiment ideas

  • Search output includes only ranked IDs vs IDs plus short "why matched" text
  • Bundle keeps top 2 tables vs top 4 tables
  • Bundle keeps only top columns per table vs all columns from retained tables
  • Relationship edges included vs omitted
  • Table-name-heavy ranking vs description-heavy ranking
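Several of the ideas above ("top 2 vs top 4 tables", "top columns vs all columns") are just parameters of a bundle-trimming step. A minimal sketch of such an isolation helper, assuming a bundle is a ranked list of (table, ranked_columns) pairs (the structure and function name are assumptions, not the existing code):

```python
# Hypothetical sketch of the "top N tables / top M columns" isolation policy.
# ranked_tables: list of (table_name, columns) pairs, best match first.
def trim_bundle(ranked_tables, max_tables=2, max_columns_per_table=None):
    trimmed = []
    for table, columns in ranked_tables[:max_tables]:
        if max_columns_per_table is not None:
            # Columns are assumed to be pre-ranked by relevance as well.
            columns = columns[:max_columns_per_table]
        trimmed.append((table, columns))
    return trimmed

ranked = [
    ("orders", ["id", "user_id", "total", "created_at"]),
    ("users", ["id", "email", "name"]),
    ("payments", ["id", "order_id"]),
]
print(trim_bundle(ranked, max_tables=2, max_columns_per_table=2))
# → [('orders', ['id', 'user_id']), ('users', ['id', 'email'])]
```

Expressing the policy as two knobs means the "narrow vs medium" experiment is a parameter sweep rather than a code change.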

Explicit anti-goals

  • no speed benchmarking
  • no index optimization task
  • no large experiment wave disconnected from the first A/B evidence
  • no replacing the simple retrieval architecture during M2

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - retrieval experiment planning/eval task.

Review Feedback

  • [ ] Review cleared