Add bounded execution-guided smoke repair mode for text-to-SQL evals
Problem
The bounded non-one-shot stack can still miss easy wins: a candidate may parse and ground correctly yet fail at execution time for a reason that is cheap to detect locally. The stack intentionally avoids open-ended execution loops, but that is not a reason to discard all execution evidence.
We need a bounded execution-guided smoke repair mode that captures some of that signal without turning evals into a warehouse agent.
Context
The bounded-stack task already allows optional local execution smoke checks as a future extension. Existing primitives likely include:
- read-only execution helpers
- parse and grounding validation
- bounded candidate and repair loops
This task should keep the same guardrails:
- local-only
- optional
- bounded attempts
- no open-ended exploration
The goal is to use execution evidence only for candidates that already look plausible.
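A minimal sketch of that gate plus a cheap local smoke check, assuming a SQLite fixture stands in for the local execution backend. The names `Candidate`, `eligible_for_smoke`, `open_fixture`, and `smoke_check` are hypothetical and are not the stack's existing helpers; the real read-only execution helpers may look quite different.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical candidate record; the real stack's type may differ."""
    sql: str
    parse_ok: bool
    grounding_ok: bool


def eligible_for_smoke(candidate: Candidate) -> bool:
    # Only candidates that already pass both static gates get executed.
    return candidate.parse_ok and candidate.grounding_ok


def open_fixture(schema_sql: str) -> sqlite3.Connection:
    # Assumption: a local in-memory SQLite database is an acceptable fixture.
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_sql)
    # Reject writes from here on, so smoke checks stay read-only.
    conn.execute("PRAGMA query_only = ON")
    return conn


def smoke_check(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Return None if the query runs, otherwise the execution error text."""
    try:
        # Fetch at most one row; the point is to surface errors, not results.
        conn.execute(sql).fetchmany(1)
        return None
    except sqlite3.Error as exc:
        return f"{type(exc).__name__}: {exc}"
```

The caller is expected to check `eligible_for_smoke` first and skip execution entirely for candidates that fail either gate, so execution cost is only paid for plausible candidates.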
Possible Solutions
- Recommended: add an optional bounded execution-guided repair mode behind an eval flag. Run a cheap execution smoke check only for candidates that already pass parse and grounding gates, then allow a limited repair attempt using the concrete execution failure.
  Why this is recommended:
  - captures likely cheap wins
  - stays aligned with the bounded-stack philosophy
  - limits operational and conceptual complexity
- Keep execution fully out of the eval loop.
  Trade-off: simplest, but it may leave obvious repair opportunities on the table.
- Build a full execution-guided agent loop.
  Trade-off: powerful, but much too large and risky for this stage.
Plan
- Define the exact gate for candidates eligible for smoke execution.
- Add a backend option for execution-guided mode that is disabled by default.
- Surface execution failures in the same repair channel used for parse/grounding feedback.
- Keep attempts strictly bounded and record execution-guided rescues in telemetry.
- Compare bounded execution-guided runs against the non-execution bounded baseline on the same benchmark slice.
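The plan steps above could fit together roughly as follows. This is a sketch under stated assumptions, not the stack's actual interface: `Feedback`, `Telemetry`, `smoke_error`, and `repair_with_execution` are hypothetical names, `passes_gates` and `repair_fn` stand in for the existing parse/grounding validation and repair step, and SQLite stands in for the local execution backend.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Feedback:
    stage: str    # "parse", "grounding", or "execution"
    message: str


@dataclass
class Telemetry:
    smoke_checks: int = 0
    repair_attempts: int = 0
    execution_guided_rescues: int = 0


def smoke_error(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Return None if the query runs, otherwise the error text."""
    try:
        conn.execute(sql).fetchmany(1)
        return None
    except sqlite3.Error as exc:
        return str(exc)


def repair_with_execution(
    sql: str,
    conn: sqlite3.Connection,
    passes_gates: Callable[[str], bool],
    repair_fn: Callable[[str, Feedback], str],
    telemetry: Telemetry,
    enabled: bool = False,   # disabled by default: the default eval path is unchanged
    max_repairs: int = 1,    # hard bound on repair attempts
) -> str:
    if not enabled or not passes_gates(sql):
        return sql
    telemetry.smoke_checks += 1
    error = smoke_error(conn, sql)
    if error is None:
        return sql
    current = sql
    for _ in range(max_repairs):
        telemetry.repair_attempts += 1
        # Execution failures travel through the same feedback type as other stages.
        current = repair_fn(current, Feedback("execution", error))
        if not passes_gates(current):
            break  # the repair is no longer plausible; stop rather than explore
        telemetry.smoke_checks += 1
        error = smoke_error(conn, current)
        if error is None:
            telemetry.execution_guided_rescues += 1
            return current
    return sql  # bound exhausted: fall back to the original candidate
```

Falling back to the original candidate when the bound is exhausted keeps the mode explainable: the output is either the untouched candidate or a repair that passed the gates and the smoke check.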
Success criteria
- optional execution-guided repair can run locally without changing the default eval path
- the mode remains bounded and explainable
- the team can measure whether execution evidence is worth the extra complexity
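One way to make the last criterion measurable is a paired comparison over the same benchmark slice. The sketch below assumes each run can be reduced to a mapping from example ID to a correctness flag; `PairedComparison` and `compare_runs` are hypothetical names, not existing eval utilities.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PairedComparison:
    total: int
    rescues: int          # wrong in the baseline, correct with execution-guided mode
    regressions: int      # correct in the baseline, wrong with execution-guided mode
    accuracy_delta: float  # (rescues - regressions) / total


def compare_runs(baseline: Dict[str, bool], guided: Dict[str, bool]) -> PairedComparison:
    # Both runs must cover exactly the same examples for a paired comparison.
    if baseline.keys() != guided.keys():
        raise ValueError("runs must cover the same benchmark slice")
    total = len(baseline)
    rescues = sum(1 for k in baseline if not baseline[k] and guided[k])
    regressions = sum(1 for k in baseline if baseline[k] and not guided[k])
    delta = (rescues - regressions) / total if total else 0.0
    return PairedComparison(total, rescues, regressions, delta)
```

Reporting regressions alongside rescues matters here, since a repair step can also turn a previously correct candidate into a wrong one.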
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - local eval backend task.
Review Feedback
No review feedback yet.
- [ ] Review cleared