Add deterministic candidate selection baselines for bounded text-to-SQL
Problem
The bounded backend only becomes trustworthy if we understand how winners are selected. Jumping straight to more complex selection logic risks adding complexity without proving that simple heuristics were insufficient.
We need explicit deterministic baselines for candidate selection before adding heavier judge-based or model-based selectors.
Context
The bounded stack already plans to generate multiple candidates, validate them, optionally repair them, and pick a winner. The existing validation signals provide useful selection features:
- parse success
- grounding failures
- hallucinated tables/columns
- repair count
- planner overlap (once planner evals exist)
That means we can benchmark selection strategies without inventing new infrastructure first.
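As a concrete sketch, the signals above can be captured in a small per-candidate record that selection strategies consume as features. The field names here are illustrative assumptions, not the real schema:

```python
from dataclasses import dataclass

@dataclass
class CandidateSignals:
    """Validation signals for one generated SQL candidate.

    Field names are hypothetical, not the actual eval schema.
    """
    sql: str
    parsed_ok: bool          # parse success
    grounding_errors: int    # count of grounding failures
    hallucinated_refs: int   # hallucinated tables/columns referenced
    repair_count: int        # number of repair passes applied
    planner_overlap: float   # overlap with planner output (once planner evals exist)
```

Everything a deterministic selector needs is already produced by validation, which is why no new infrastructure is required to start benchmarking.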
Possible Solutions
- Recommended: benchmark a small set of deterministic selection heuristics first. Compare strategies such as first valid parse, fewest grounding errors, least repaired, and best planner overlap before considering a heavier selector.
Why this is recommended:
- cheap to implement
- easy to explain
- tells us whether more complex selection is even necessary
- Add an LLM judge to select winners immediately.
Trade-off: potentially stronger, but more expensive, harder to debug, and premature until the simple baselines have been exhausted.
- Hard-code one heuristic and never compare alternatives.
Trade-off: simplest, but risks locking in a weak selector by accident.
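The recommended deterministic baselines can each be written as a few lines over per-candidate signal dicts. This is a minimal sketch; the dict keys are assumptions matching the signal names above, not the real schema:

```python
# Each candidate is a dict of validation signals (keys are illustrative).
# All strategies skip candidates that failed to parse, and ties break
# toward earlier generation order (min/max are stable in Python).

def first_valid_parse(cands):
    # Earliest candidate, in generation order, that parsed successfully.
    return next((c for c in cands if c["parsed_ok"]), None)

def fewest_grounding_errors(cands):
    valid = [c for c in cands if c["parsed_ok"]]
    return min(valid, key=lambda c: c["grounding_errors"], default=None)

def least_repaired(cands):
    valid = [c for c in cands if c["parsed_ok"]]
    return min(valid, key=lambda c: c["repair_count"], default=None)

def best_planner_overlap(cands):
    valid = [c for c in cands if c["parsed_ok"]]
    return max(valid, key=lambda c: c["planner_overlap"], default=None)
```

Because each heuristic is a pure function of the signals, the selection decision is fully reproducible and easy to explain in failure analysis.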
Plan
- Define a small baseline set of deterministic selection strategies.
- Expose selection mode as an eval backend knob.
- Record which strategy selected the winner in artifacts.
- Run the same benchmark slice across strategies and compare top-line and failure-taxonomy outcomes.
- Keep the simplest strategy unless a more complex one shows clear measurable improvement.
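The comparison step of the plan could look like the loop below: run every selection mode over the same benchmark slice, tally accuracy per mode, and record which strategy picked which winner for the run artifacts. The example/strategy shapes here are assumed, not the real eval interfaces:

```python
def compare_selection_modes(examples, strategies):
    """Run each selection strategy over the same benchmark slice.

    `examples` is a list of dicts with a "candidates" list and an
    "is_correct" predicate over the chosen candidate; `strategies` maps
    a mode name to a selector function. Both shapes are hypothetical.
    Returns per-strategy accuracy plus a winner record per
    (example, strategy) pair for the artifacts.
    """
    accuracy = {}
    winners = []
    for name, select in strategies.items():
        correct = 0
        for i, ex in enumerate(examples):
            winner = select(ex["candidates"])
            winners.append({"example": i, "strategy": name, "winner": winner})
            if winner is not None and ex["is_correct"](winner):
                correct += 1
        accuracy[name] = correct / len(examples) if examples else 0.0
    return accuracy, winners
```

Holding the benchmark slice fixed across modes keeps the comparison apples-to-apples, so any top-line or failure-taxonomy difference is attributable to the selector alone.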
Success criteria
- bounded eval runs can compare multiple deterministic selection modes
- the team has evidence for the default selector
- later judge-based selection work is justified by baseline results instead of intuition
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - eval backend task.
Review Feedback
No review feedback yet.
- [ ] Review cleared