Add deterministic candidate selection baselines for bounded text-to-SQL
Problem
The bounded backend only becomes trustworthy if we understand how winners are selected. Jumping straight to more complex selection logic risks adding complexity without proving that simple heuristics were insufficient.
We need explicit deterministic baselines for candidate selection before adding heavier judge-based or model-based selectors.
Context
The bounded stack already plans to generate multiple candidates, validate them, optionally repair them, and pick a winner. The existing validation signals provide useful selection features:
- parse success
- grounding failures
- hallucinated tables/columns
- repair count
- planner overlap (once planner evals exist)
That means we can benchmark selection strategies without inventing new infrastructure first.
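As a concrete sketch, the signals above can be captured in a small per-candidate record that selection strategies consume as features. The field names here are illustrative assumptions, not the real schema:

```python
from dataclasses import dataclass

@dataclass
class CandidateSignals:
    """Validation signals for one generated SQL candidate.

    Field names are hypothetical, not the actual eval schema.
    """
    sql: str
    parsed_ok: bool          # parse success
    grounding_errors: int    # count of grounding failures
    hallucinated_refs: int   # hallucinated tables/columns referenced
    repair_count: int        # number of repair passes applied
    planner_overlap: float   # overlap with planner output (once planner evals exist)
```

Everything a deterministic selector needs is already produced by validation, which is why no new infrastructure is required to start benchmarking.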
Possible Solutions
- Recommended: benchmark a small set of deterministic selection heuristics first. Compare strategies such as first valid parse, fewest grounding errors, least repaired, and best planner overlap before considering a heavier selector.
Why this is recommended:
- cheap to implement
- easy to explain
- tells us whether more complex selection is even necessary
- Add an LLM judge to select winners immediately.
Trade-off: potentially stronger, but more expensive, harder to debug, and premature until the simple baselines have been exhausted.
- Hard-code one heuristic and never compare alternatives.
Trade-off: simplest, but risks locking in a weak selector by accident.
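The recommended deterministic baselines can each be written as a few lines over per-candidate signal dicts. This is a minimal sketch; the dict keys are assumptions matching the signal names above, not the real schema:

```python
# Each candidate is a dict of validation signals (keys are illustrative).
# All strategies skip candidates that failed to parse, and ties break
# toward earlier generation order (min/max are stable in Python).

def first_valid_parse(cands):
    # Earliest candidate, in generation order, that parsed successfully.
    return next((c for c in cands if c["parsed_ok"]), None)

def fewest_grounding_errors(cands):
    valid = [c for c in cands if c["parsed_ok"]]
    return min(valid, key=lambda c: c["grounding_errors"], default=None)

def least_repaired(cands):
    valid = [c for c in cands if c["parsed_ok"]]
    return min(valid, key=lambda c: c["repair_count"], default=None)

def best_planner_overlap(cands):
    valid = [c for c in cands if c["parsed_ok"]]
    return max(valid, key=lambda c: c["planner_overlap"], default=None)
```

Because each heuristic is a pure function of the signals, the selection decision is fully reproducible and easy to explain in failure analysis.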
Plan
- Define a small baseline set of deterministic selection strategies.
- Expose selection mode as an eval backend knob.
- Record which strategy selected the winner in artifacts.
- Run the same benchmark slice across strategies and compare top-line and failure-taxonomy outcomes.
- Keep the simplest strategy unless a more complex one shows clear measurable improvement.
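The comparison step of the plan could look like the loop below: run every selection mode over the same benchmark slice, tally accuracy per mode, and record which strategy picked which winner for the run artifacts. The example/strategy shapes here are assumed, not the real eval interfaces:

```python
def compare_selection_modes(examples, strategies):
    """Run each selection strategy over the same benchmark slice.

    `examples` is a list of dicts with a "candidates" list and an
    "is_correct" predicate over the chosen candidate; `strategies` maps
    a mode name to a selector function. Both shapes are hypothetical.
    Returns per-strategy accuracy plus a winner record per
    (example, strategy) pair for the artifacts.
    """
    accuracy = {}
    winners = []
    for name, select in strategies.items():
        correct = 0
        for i, ex in enumerate(examples):
            winner = select(ex["candidates"])
            winners.append({"example": i, "strategy": name, "winner": winner})
            if winner is not None and ex["is_correct"](winner):
                correct += 1
        accuracy[name] = correct / len(examples) if examples else 0.0
    return accuracy, winners
```

Holding the benchmark slice fixed across modes keeps the comparison apples-to-apples, so any top-line or failure-taxonomy difference is attributable to the selector alone.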
Success criteria
- bounded eval runs can compare multiple deterministic selection modes
- the team has evidence for the default selector
- later judge-based selection work is justified by baseline results instead of intuition
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - eval backend task.
Review Feedback
No review feedback yet.
- [ ] Review cleared