Dataface Tasks

Add bounded execution-guided smoke repair mode for text-to-SQL evals

ID: MCP_ANALYST_AGENT-ADD_BOUNDED_EXECUTION_GUIDED_SMOKE_REPAIR_MODE_FOR_TEXT_TO_SQL_EVALS
Status: not_started
Priority: p2
Milestone: m4-v1-0-launch
Owner: data-ai-engineer-architect

Problem

The bounded non-one-shot stack can still miss easy wins when candidates parse and ground correctly but fail for execution-level reasons that are cheap to detect locally. The stack intentionally avoids open-ended execution loops, but that does not require discarding all execution evidence.

We need a bounded execution-guided smoke repair mode that captures some of that signal without turning evals into a warehouse agent.

Context

The bounded-stack task already allows optional local execution smoke checks as a future extension. Existing primitives likely include:

  • read-only execution helpers
  • parse and grounding validation
  • bounded candidate and repair loops

This task should keep the same guardrails:

  • local-only
  • optional
  • bounded attempts
  • no open-ended exploration

The goal is to use execution evidence only for candidates that already look plausible.
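A minimal sketch of what the eligibility gate and a local smoke check could look like. Everything here is an assumption for illustration: the `Candidate` fields, the helper names, and the use of an in-memory SQLite fixture as the local execution target are hypothetical, and the real read-only execution helpers may look quite different.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    # Hypothetical record; the real candidate type and field names may differ.
    sql: str
    parse_ok: bool
    grounding_ok: bool


def eligible_for_smoke(candidate: Candidate) -> bool:
    # Only candidates that already pass both cheap gates are executed.
    return candidate.parse_ok and candidate.grounding_ok


def open_readonly_fixture(schema_sql: str) -> sqlite3.Connection:
    # Assumption: a throwaway in-memory database stands in for the local target.
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_sql)
    # SQLite pragma that rejects writes on this connection from here on.
    conn.execute("PRAGMA query_only = ON")
    return conn


def smoke_check(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Return None if the query runs, otherwise the concrete error message."""
    try:
        # Fetch at most one row; the check needs a pass/fail signal, not results.
        conn.execute(sql).fetchone()
        return None
    except sqlite3.Error as exc:
        return str(exc)


conn = open_readonly_fixture(
    "CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);"
    "CREATE TABLE customers (id INTEGER, name TEXT);"
)
candidate = Candidate(
    sql="SELECT id FROM orders JOIN customers ON orders.customer_id = customers.id",
    parse_ok=True,
    grounding_ok=True,
)
if eligible_for_smoke(candidate):
    print(smoke_check(conn, candidate.sql))
```

The example query references only columns that exist, so it could plausibly pass a grounding check, yet the unqualified `id` is ambiguous once the tables are joined. That is the kind of cheap, locally detectable failure this mode is meant to surface.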

Possible Solutions

  1. Recommended: add an optional bounded execution-guided repair mode behind an eval flag. Run a cheap execution smoke check only for candidates that already pass parse and grounding gates, then allow a limited repair attempt using the concrete execution failure.

Why this is recommended:

  • captures likely cheap wins
  • stays aligned with the bounded-stack philosophy
  • limits operational and conceptual complexity

  2. Keep execution fully out of the eval loop.

Trade-off: simplest, but it may leave obvious repair opportunities on the table.

  3. Build a full execution-guided agent loop.

Trade-off: powerful, but much too large and risky for this stage.
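To make the recommended option concrete, here is a hedged sketch of a bounded repair loop. The function name, the `RepairResult` type, and the `smoke_check` and `repair` callables are all hypothetical; in practice `repair` would wrap whatever model call the existing repair channel uses, and the attempt bound would come from eval configuration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RepairResult:
    sql: str                 # final SQL, repaired or not
    passed: bool             # True if the final SQL passed the smoke check
    attempts_used: int       # number of repair attempts actually made
    errors: List[str] = field(default_factory=list)  # execution errors seen


def execution_guided_repair(
    sql: str,
    smoke_check: Callable[[str], Optional[str]],
    repair: Callable[[str, str], str],
    max_attempts: int = 1,
) -> RepairResult:
    """Smoke-check a candidate and allow at most max_attempts repairs."""
    errors: List[str] = []
    current = sql
    for attempt in range(max_attempts + 1):
        error = smoke_check(current)
        if error is None:
            return RepairResult(current, True, attempt, errors)
        errors.append(error)
        if attempt == max_attempts:
            break  # budget exhausted; never loop open-endedly
        # Feed the concrete execution failure back into the repair step.
        current = repair(current, error)
    return RepairResult(current, False, max_attempts, errors)


# Stand-in callables so the sketch runs without a database or a model.
def fake_smoke_check(sql: str) -> Optional[str]:
    return "ambiguous column name: id" if sql.startswith("SELECT id ") else None


def fake_repair(sql: str, error: str) -> str:
    return sql.replace("SELECT id ", "SELECT orders.id ", 1)


result = execution_guided_repair(
    "SELECT id FROM orders JOIN customers ON orders.customer_id = customers.id",
    fake_smoke_check,
    fake_repair,
)
print(result.passed, result.attempts_used)
```

The loop runs at most `max_attempts + 1` smoke checks and `max_attempts` repairs, so the cost per candidate is fixed up front, which keeps the mode bounded and easy to explain.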

Plan

  1. Define the exact gate for candidates eligible for smoke execution.
  2. Add a backend option for execution-guided mode that is disabled by default.
  3. Surface execution failures in the same repair channel used for parse/grounding feedback.
  4. Keep attempts strictly bounded and record execution-guided rescues in telemetry.
  5. Compare bounded execution-guided runs against the non-execution bounded baseline on the same benchmark slice.
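Steps 2, 4, and 5 of the plan could be sketched roughly as follows. The `EvalConfig` fields, the `CaseOutcome` record, and the `summarize` helper are hypothetical names chosen for illustration; the actual flag name and telemetry schema are still to be defined.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class EvalConfig:
    # Hypothetical flag; disabled by default so the default eval path is unchanged.
    execution_guided_repair: bool = False
    # Hard upper bound on execution-guided repair attempts per candidate.
    max_execution_repairs: int = 1


@dataclass
class CaseOutcome:
    # One benchmark case scored under both modes on the same slice.
    case_id: str
    baseline_correct: bool
    guided_correct: bool


def summarize(outcomes: Iterable[CaseOutcome]) -> Dict[str, int]:
    """Count rescues and regressions of guided mode relative to the baseline."""
    summary = {"total": 0, "baseline_correct": 0, "guided_correct": 0,
               "rescued": 0, "regressed": 0}
    for outcome in outcomes:
        summary["total"] += 1
        summary["baseline_correct"] += outcome.baseline_correct
        summary["guided_correct"] += outcome.guided_correct
        # Rescue: wrong under the baseline, correct with execution guidance.
        summary["rescued"] += (not outcome.baseline_correct) and outcome.guided_correct
        # Regression: correct under the baseline, wrong with execution guidance.
        summary["regressed"] += outcome.baseline_correct and (not outcome.guided_correct)
    return summary


config = EvalConfig()
outcomes = [
    CaseOutcome("case-1", baseline_correct=True, guided_correct=True),
    CaseOutcome("case-2", baseline_correct=False, guided_correct=True),
    CaseOutcome("case-3", baseline_correct=False, guided_correct=False),
]
print(config.execution_guided_repair, summarize(outcomes))
```

Tracking regressions alongside rescues matters because a repair attempt can also break a candidate that would otherwise have been scored correct; the net difference is what indicates whether the mode earns its complexity.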

Success criteria

  • optional execution-guided repair can run locally without changing the default eval path
  • the mode remains bounded and explainable
  • the team can measure whether execution evidence is worth the extra complexity

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - local eval backend task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared