Add bounded execution-guided smoke repair mode for text-to-SQL evals

ID	MCP_ANALYST_AGENT-ADD_BOUNDED_EXECUTION_GUIDED_SMOKE_REPAIR_MODE_FOR_TEXT_TO_SQL_EVALS
Status	not_started
Priority	p2
Milestone	m4-v1-0-launch
Owner	data-ai-engineer-architect

Problem

The bounded non-one-shot stack can still miss easy wins when candidates parse and ground correctly but fail for execution-level reasons that are cheap to detect locally. The stack intentionally avoids open-ended execution loops, but that does not require discarding execution evidence altogether.

We need a bounded execution-guided smoke repair mode that captures some of that signal without turning evals into a warehouse agent.

Context

The bounded-stack task already allows optional local execution smoke checks as a future extension. Existing primitives likely include:

read-only execution helpers
parse and grounding validation
bounded candidate and repair loops

This task should keep the same guardrails:

local-only
optional
bounded attempts
no open-ended exploration

The goal is to use execution evidence only for candidates that already look plausible.
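
As an illustration of what a cheap, local, read-only smoke check could look like, here is a minimal sketch that assumes the local target is SQLite. The names `SmokeResult`, `smoke_check`, `max_rows`, and `max_vm_steps` are hypothetical and are not taken from the existing execution helpers; the real helpers may work differently.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class SmokeResult:
    ok: bool
    error: Optional[str] = None  # concrete failure text, usable as repair feedback


def smoke_check(conn: sqlite3.Connection, sql: str,
                max_rows: int = 1, max_vm_steps: int = 100_000) -> SmokeResult:
    """Run `sql` once, read-only, under a row cap and an instruction budget.

    Hypothetical helper: assumes a local SQLite connection.
    """
    used = {"steps": 0}

    def budget() -> int:
        # Invoked roughly every 1000 SQLite VM instructions.
        # A nonzero return value aborts the running statement.
        used["steps"] += 1000
        return 1 if used["steps"] > max_vm_steps else 0

    conn.execute("PRAGMA query_only = ON")  # reject writes on this connection
    conn.set_progress_handler(budget, 1000)
    try:
        conn.execute(sql).fetchmany(max_rows)  # fetch only a few rows
        return SmokeResult(ok=True)
    except sqlite3.Error as exc:
        return SmokeResult(ok=False, error=str(exc))
    finally:
        conn.set_progress_handler(None, 1000)  # remove the budget hook
```

The check returns the database's own error message instead of raising, so a caller can pass it to a repair step unchanged. The instruction budget is one way to keep a runaway query from stalling an eval run; a wall-clock timeout would be a reasonable alternative.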

Possible Solutions

Recommended: add an optional bounded execution-guided repair mode behind an eval flag. Run a cheap execution smoke check only for candidates that already pass the parse and grounding gates, then allow a limited number of repair attempts that use the concrete execution error as feedback.

Why this is recommended:

captures likely cheap wins
stays aligned with the bounded-stack philosophy
limits operational and conceptual complexity
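
The recommended flow (gate first, smoke check second, a capped number of repairs) can be sketched as follows. This is only an outline under stated assumptions: `passes_gates`, `smoke`, and `repair` are hypothetical callables standing in for the existing parse/grounding validation, a local smoke check, and the repair step, and `RepairOutcome` is an illustrative record shape, not an existing type.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RepairOutcome:
    sql: Optional[str]       # final passing candidate, or None if none passed
    attempts: int = 0        # repair attempts actually used
    rescued: bool = False    # True if a repair turned a failure into a pass
    errors: List[str] = field(default_factory=list)


def execution_guided_repair(
    candidate: str,
    passes_gates: Callable[[str], bool],    # parse + grounding checks (assumed)
    smoke: Callable[[str], Optional[str]],  # returns error text, or None on success
    repair: Callable[[str, str], str],      # (sql, error) -> revised sql
    max_attempts: int = 1,
) -> RepairOutcome:
    outcome = RepairOutcome(sql=None)
    if not passes_gates(candidate):
        return outcome  # implausible candidates are never executed
    sql = candidate
    error = smoke(sql)
    while error is not None and outcome.attempts < max_attempts:
        outcome.errors.append(error)
        outcome.attempts += 1
        sql = repair(sql, error)
        if not passes_gates(sql):
            return outcome  # repaired SQL must clear the same gates
        error = smoke(sql)
    if error is None:
        outcome.sql = sql
        outcome.rescued = outcome.attempts > 0
    else:
        outcome.errors.append(error)  # budget exhausted, keep the last error
    return outcome
```

Because the loop condition checks `max_attempts` before every repair, the number of smoke executions per candidate is at most `max_attempts + 1`, which keeps the mode bounded and easy to explain.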

Alternative: keep execution fully out of the eval loop.

Trade-off: simplest, but it may leave obvious repair opportunities on the table.

Alternative: build a full execution-guided agent loop.

Trade-off: powerful, but much too large and risky for this stage.

Plan

Define the exact gate for candidates eligible for smoke execution.
Add a backend option for execution-guided mode that is disabled by default.
Surface execution failures in the same repair channel used for parse/grounding feedback.
Keep attempts strictly bounded and record execution-guided rescues in telemetry.
Compare bounded execution-guided runs against the non-execution bounded baseline on the same benchmark slice.
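
For the final comparison step, one possible summary is a per-example diff between the two runs on the shared benchmark slice. The sketch below assumes each run can be reduced to a mapping from example id to a correctness flag; `compare_runs` and its output keys are hypothetical, not an existing telemetry schema.

```python
from typing import Dict, Union

Number = Union[int, float]


def compare_runs(baseline: Dict[str, bool],
                 guided: Dict[str, bool]) -> Dict[str, Number]:
    """Summarize an execution-guided run against the bounded baseline.

    Only example ids present in both runs are compared.
    """
    shared = sorted(baseline.keys() & guided.keys())
    if not shared:
        raise ValueError("runs share no example ids")
    n = len(shared)
    # Rescue: wrong in the baseline, correct with execution guidance.
    rescues = sum(1 for k in shared if guided[k] and not baseline[k])
    # Regression: correct in the baseline, wrong with execution guidance.
    regressions = sum(1 for k in shared if baseline[k] and not guided[k])
    return {
        "n": n,
        "baseline_accuracy": sum(1 for k in shared if baseline[k]) / n,
        "guided_accuracy": sum(1 for k in shared if guided[k]) / n,
        "rescues": rescues,
        "regressions": regressions,
    }
```

Reporting rescues and regressions separately, instead of only the accuracy delta, makes it visible when repairs fix some examples while breaking others.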

Success criteria

optional execution-guided repair can run locally without changing the default eval path
the mode remains bounded and explainable
the team can measure whether execution evidence is worth the extra complexity

Implementation Progress

Not started.

QA Exploration

[x] QA exploration completed (or N/A for non-UI tasks)

N/A - local eval backend task.

Review Feedback

No review feedback yet.

[ ] Review cleared