Dataface Tasks

Compare text-to-SQL evals with question-aware retrieval vs full-context prompting

ID: CONTEXT_CATALOG_NIMBLE-COMPARE_TEXT_TO_SQL_EVALS_WITH_QUESTION_AWARE_RETRIEVAL_VS_FULL_CONTEXT_PROMPTING
Status: not_started
Priority: p1
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: question-aware-schema-retrieval-and-narrowing

Problem

After the retrieval CLI and bundle integration land, run paired local evals comparing question-aware retrieval-and-isolation against the current full-context baseline. One condition uses the retrieval tool/bundle path; the other uses the full schema/info path with no retrieval step. The task should capture the command matrix, output artifacts, and comparison criteria.

Context

This task depends on the retrieval setup work landing first:

  • the local corpus/search/bundle CLI must exist
  • the text-to-SQL backend/eval layer must be able to consume the resulting question-scoped bundle

The comparison should stay simple and direct:

  1. With retrieval tool/bundle: the generator gets only the question-scoped isolated bundle
  2. Without retrieval tool/bundle: the generator gets the current full schema/info context with no narrowing step

This task is not about proving the retrieval is elegant or fast. It is about proving whether even a simple narrowing layer improves downstream SQL quality enough to be worth keeping.
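The two conditions above can be pinned down as a small config sketch so they stay symmetric except for the context source. All names here are illustrative assumptions, not the actual CLI or backend interface:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCondition:
    """One arm of the paired comparison; field names are hypothetical."""
    name: str
    context_source: str  # "retrieval_bundle" or "full_schema"
    uses_retrieval: bool


# Condition with retrieval: the generator gets only the
# question-scoped isolated bundle.
RETRIEVAL_CONDITION = EvalCondition(
    name="question_aware_retrieval",
    context_source="retrieval_bundle",
    uses_retrieval=True,
)

# Baseline condition: the generator gets the current full
# schema/info context with no narrowing step.
BASELINE_CONDITION = EvalCondition(
    name="full_context_baseline",
    context_source="full_schema",
    uses_retrieval=False,
)

# Everything else (benchmark slice, model, judge, prompt family)
# must be held identical across both conditions, so any metric
# delta is attributable to narrowing alone.
```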

Possible Solutions

  1. Recommended: paired A/B eval runs on the same benchmark slices. Run the same benchmark subset, same model, same judge, and same prompt family under two conditions: retrieved-and-isolated context versus the full-context baseline. Compare both retrieval-specific and downstream SQL metrics.

Why this is recommended:

  • isolates the value of narrowing itself
  • keeps the experiment easy to explain
  • avoids mixing retrieval effects with model or prompt changes
  2. Fold the retrieval condition into general model/context ablations and infer the result indirectly.

Trade-off: possible, but harder to interpret. This specific retrieval-vs-full-context question deserves its own direct comparison.

  3. Compare only retrieval metrics such as hit rate without running text-to-SQL evals.

Trade-off: not enough. The initiative exists to help the generator, so downstream SQL quality must be part of the verdict.

Plan

  1. Choose a stable benchmark slice or canary set.
  2. Lock the generation configuration:
     • same model
     • same judge
     • same backend except for context source
  3. Run condition A: full-context baseline.
  4. Run condition B: question-aware retrieval/bundle context.
  5. Capture run artifacts and note the exact commands used.
  6. Compare:
     • pass rate
     • parse rate
     • equivalence rate
     • grounding failures
     • prompt/context size reduction when available
     • bundle inclusion of gold tables/columns when available
  7. Summarize whether the simple retrieval path is:
     • clearly better
     • neutral
     • or worse than full-context prompting
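Steps 6 and 7 could be reduced to a small comparison helper like the sketch below. The metric names mirror the list above; the noise margin and the use of pass rate as the headline metric are assumptions for illustration, not agreed-upon cutoffs:

```python
def compare_runs(baseline: dict, retrieval: dict,
                 margin: float = 0.02) -> dict:
    """Compute per-metric deltas (retrieval minus baseline) and a verdict.

    Each input maps metric name -> rate in [0, 1], e.g.
    {"pass_rate": 0.71, "parse_rate": 0.98, "equivalence_rate": 0.66}.
    `margin` is a hypothetical noise threshold below which a
    difference is called neutral.
    """
    deltas = {m: retrieval[m] - baseline[m]
              for m in baseline if m in retrieval}
    # Use pass rate as the headline metric for the verdict
    # (an assumption; grounding failures etc. still get deltas).
    headline = deltas.get("pass_rate", 0.0)
    if headline > margin:
        verdict = "clearly better"
    elif headline < -margin:
        verdict = "worse"
    else:
        verdict = "neutral"
    return {"deltas": deltas, "verdict": verdict}
```

A run where the retrieval condition lifts pass rate from 0.70 to 0.75 would come back as "clearly better"; a shift within the margin reads as "neutral", which per step 7 is itself a valid verdict against keeping the path.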

Expected outputs

  • paired run directories under the eval output path
  • a short comparison summary
  • explicit recommendation on whether to keep pushing the retrieval path

Explicit anti-goals

  • no search-speed benchmarking
  • no retrieval optimization sweep
  • no broad factorial experiment matrix before this direct A/B exists

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - eval comparison task.

Review Feedback

  • [ ] Review cleared