Dataface Tasks

Add deterministic candidate selection baselines for bounded text-to-SQL

IDMCP_ANALYST_AGENT-ADD_DETERMINISTIC_CANDIDATE_SELECTION_BASELINES_FOR_BOUNDED_TEXT_TO_SQL
Status: not_started
Priority: p1
Milestone: m3-public-launch
Owner: data-ai-engineer-architect

Problem

The bounded backend only becomes trustworthy if we understand how winners are selected. If we jump straight to more complex selection logic, we risk adding complexity without first proving that simple heuristics are insufficient.

We need explicit deterministic baselines for candidate selection before adding heavier judge-based or model-based selectors.

Context

The bounded stack already plans to generate multiple candidates, validate them, optionally repair them, and pick a winner. The existing validation signals already provide useful selection features:

  • parse success
  • grounding failures
  • hallucinated tables/columns
  • repair count
  • planner overlap (once planner evals exist)

That means we can benchmark selection strategies without inventing new infrastructure first.

Possible Solutions

  1. Recommended: benchmark a small set of deterministic selection heuristics first. Compare strategies such as first valid parse, fewest grounding errors, least repaired, and best planner overlap before considering a heavier selector.

Why this is recommended:

  • cheap to implement
  • easy to explain
  • tells us whether more complex selection is even necessary
  2. Add an LLM judge to select winners immediately.

Trade-off: potentially stronger, but more expensive and harder to debug before simple baselines are exhausted.

  3. Hard-code one heuristic and never compare alternatives.

Trade-off: simplest, but risks locking in a weak selector by accident.
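Each heuristic from the recommended option is just a deterministic function over a candidate's validation signals. A minimal, self-contained sketch (candidate fields and strategy names are illustrative assumptions):

```python
# Each candidate is a dict of validation signals; values are illustrative.
candidates = [
    {"sql": "SELECT a FROM t1", "parse_ok": True,  "grounding_errors": 1, "repair_count": 0, "planner_overlap": 0.4},
    {"sql": "SELECT a FROM t2", "parse_ok": True,  "grounding_errors": 0, "repair_count": 2, "planner_overlap": 0.9},
    {"sql": "SELEC a FRM t",    "parse_ok": False, "grounding_errors": 3, "repair_count": 0, "planner_overlap": 0.0},
]

def first_valid_parse(cands):
    """Pick the first candidate that parsed, preserving generation order."""
    return next((c for c in cands if c["parse_ok"]), None)

def fewest_grounding_errors(cands):
    """min() is stable, so ties fall back to generation order."""
    return min(cands, key=lambda c: c["grounding_errors"])

def least_repaired(cands):
    """Prefer the least-repaired candidate among those that parsed."""
    valid = [c for c in cands if c["parse_ok"]]
    return min(valid, key=lambda c: c["repair_count"]) if valid else None

def best_planner_overlap(cands):
    return max(cands, key=lambda c: c["planner_overlap"])

STRATEGIES = {
    "first_valid_parse": first_valid_parse,
    "fewest_grounding_errors": fewest_grounding_errors,
    "least_repaired": least_repaired,
    "best_planner_overlap": best_planner_overlap,
}
```

Because each strategy is a pure function, running all of them over the same candidate set costs almost nothing, which is what makes the baseline comparison cheap.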

Plan

  1. Define a small baseline set of deterministic selection strategies.
  2. Expose selection mode as an eval backend knob.
  3. Record which strategy selected the winner in artifacts.
  4. Run the same benchmark slice across strategies and compare top-line and failure-taxonomy outcomes.
  5. Keep the simplest strategy unless a more complex one shows clear measurable improvement.
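Steps 2 and 3 of the plan — exposing selection mode as a knob and recording the winning strategy in artifacts — could be sketched as follows. The function and artifact keys are hypothetical, not an existing API:

```python
import json

def run_selection(candidates, mode, strategies):
    """Select a winner via the configured mode and record which strategy chose it."""
    if mode not in strategies:
        raise ValueError(f"unknown selection mode: {mode}")
    winner = strategies[mode](candidates)
    artifact = {
        "selection_mode": mode,  # the eval backend knob
        "winner_index": candidates.index(winner) if winner is not None else None,
        "winner_sql": winner["sql"] if winner is not None else None,
    }
    return winner, artifact

# Usage with one illustrative strategy and two candidates:
strategies = {
    "fewest_grounding_errors": lambda cs: min(cs, key=lambda c: c["grounding_errors"]),
}
cands = [
    {"sql": "SELECT a FROM t", "grounding_errors": 2},
    {"sql": "SELECT b FROM t", "grounding_errors": 0},
]
winner, artifact = run_selection(cands, "fewest_grounding_errors", strategies)
print(json.dumps(artifact))  # artifact records both the mode and the winner
```

Persisting the artifact per run is what lets the same benchmark slice be compared across selection modes in step 4.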

Success Criteria

  • bounded eval runs can compare multiple deterministic selection modes
  • the team has evidence for the default selector
  • later judge-based selection work is justified by baseline results instead of intuition

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - eval backend task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared