Dataface Tasks

Build bounded non-one-shot text-to-SQL stack for local evals

ID: MCP_ANALYST_AGENT-BUILD_BOUNDED_NON_ONE_SHOT_TEXT_TO_SQL_STACK_FOR_LOCAL_EVALS
Status: not_started
Priority: p1
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: benchmark-driven-text-to-sql-and-discovery-evals

Problem

Build a new experimental text-to-SQL stack for local evals that improves on the current one-shot baseline while staying cheap enough to assemble from pieces we already have. The goal is not a full agentic system yet. The goal is a bounded non-one-shot backend that can be run inside apps/evals sql and compared honestly against the existing dataface backend on the local benchmark.

The stack should target the low-hanging fruit from the SOTA gap analysis:

  1. explicit planning before final SQL
  2. multiple SQL candidates instead of one
  3. local validation and bounded repair before returning the final answer
  4. simple evidence-based candidate selection

It should not require new retrieval infrastructure, a new semantic layer, or an open-ended warehouse exploration agent.

Context

What we already have

The repo already contains most of the primitives needed for a first non-one-shot stack:

  • dataface/ai/generate_sql.py - shared one-shot SQL generation used by product paths and evals
  • apps/evals/sql/backends.py - pluggable backend factory layer for benchmark runs
  • apps/evals/sql/context.py - existing context providers (dbt_package, inspect_cache)
  • apps/evals/sql/grounding.py and apps/evals/sql/scorer.py - local parse and grounding analysis
  • apps/evals/sql/runner.py - async benchmark runner with JSONL + summary output
  • dataface/ai/mcp/tools.py:execute_query - existing read-only execution primitive if we want optional local execution smoke checks

That means we do not need to invent a new eval framework to try a better stack. We already have the seam where a new backend can plug in.

Why this should be bounded and eval-first

The SOTA notes correctly point toward retrieval, exploration, repair, and selection, but most of that is still expensive or underspecified for us right now. We do not yet have:

  • a true search_context runtime
  • governed business-definition retrieval
  • a robust exploration-query policy
  • production-safe multi-step orchestration around warehouse execution

So the first version should stay intentionally smaller:

  • local eval backend first
  • bounded number of candidates
  • bounded number of repair attempts
  • validation driven by parse and grounding signals we already compute
  • optional execution only as a local developer mode, not as the required benchmark path

Proposed first stack

The first useful stack is:

  1. plan - ask the model for a compact structured query plan or schema-link hypothesis
  2. generate - generate 3-5 SQL candidates from that plan
  3. validate - run local parse and grounding checks on each candidate
  4. repair - give failing candidates one repair round using the concrete validation errors
  5. select - choose the best surviving candidate with deterministic heuristics first

This gets us out of pure one-shot generation without forcing a full agent rewrite.
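The five stages above can be sketched as a single bounded orchestration function. This is an illustration only: the stage callables (`make_plan`, `generate`, `validate`, `repair`) are injected stand-ins, not existing repo APIs, and the real backend would wire in the prompts and grounding checks described below.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Candidate:
    sql: str
    errors: List[str] = field(default_factory=list)
    repaired: bool = False

def run_bounded_stack(
    question: str,
    make_plan: Callable[[str], dict],
    generate: Callable[[dict], List[str]],
    validate: Callable[[str], List[str]],
    repair: Callable[[str, List[str]], str],
    repair_rounds: int = 1,
) -> Candidate:
    plan = make_plan(question)                          # 1. plan before SQL
    cands = [Candidate(sql) for sql in generate(plan)]  # 2. multiple candidates
    for c in cands:
        c.errors = validate(c.sql)                      # 3. local validation
    for _ in range(repair_rounds):                      # 4. bounded repair only
        for c in cands:
            if c.errors:
                c.sql = repair(c.sql, c.errors)
                c.errors = validate(c.sql)
                c.repaired = True
    # 5. deterministic-first selection: fewest errors, then unrepaired drafts
    return min(cands, key=lambda c: (len(c.errors), c.repaired))
```

Because every stage is an argument, the orchestration stays trivially unit-testable without model calls, which matches the eval-first framing.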

Explicit non-goals

Do not turn this task into:

  • a full Cloud/chat integration task
  • a live retrieval/indexing project
  • a value-exploration or distinct-values tool buildout
  • a broad benchmark-gating system
  • an attempt to solve semantic/business grounding in one pass

If the bounded backend works, richer retrieval and exploration can be follow-up tasks.

Possible Solutions

  1. Recommended: build a bounded eval-only stack around the existing generator. Add a small staged pipeline that produces a structured plan, generates multiple SQL candidates, validates them locally with existing parse/grounding logic, repairs obviously fixable failures once, and selects the best candidate. Wire it in as a new apps/evals sql backend so it can be compared directly against the current dataface backend.

Why this is the right first step:

  • It directly addresses the biggest cheap gap: we currently generate one answer and stop.
  • It reuses code we already have instead of demanding new infrastructure.
  • It produces benchmark evidence quickly.
  • It keeps production risk low because the first integration point is eval-only.

Trade-off: it will still be weaker than true retrieval-plus-exploration systems because it does not inspect live values or search a richer business context domain.

  2. Improve the existing one-shot prompt and stop there.

Trade-off: easiest to ship, but it does not test the structural change the gap analysis says we need. Better prompting alone is unlikely to produce a meaningful step change.

  3. Jump straight to a full retrieval-plus-exploration agent.

Trade-off: directionally correct, but too much new surface area for a first iteration. It would mix retrieval design, tool design, execution safety, and candidate selection into one large task and make it hard to tell which part helped.

  4. Build the new stack directly into Cloud/product flows before running it in evals.

Trade-off: higher operational risk and worse measurement discipline. We should prove it on the benchmark first, then decide whether it deserves product integration.

Plan

Build a new local-eval backend, likely something like dataface_bounded or dataface_non_one_shot, using a small shared module under dataface/ai/ for the staged flow and a thin factory hook in apps/evals/sql/backends.py.

Files likely involved

  • dataface/ai/generate_sql.py
  • a new module under dataface/ai/ for the staged plan/candidate/repair/select flow
  • apps/evals/sql/backends.py
  • apps/evals/sql/grounding.py
  • apps/evals/sql/types.py
  • apps/evals/sql/cli.py
  • tests/evals/sql/test_backends.py
  • tests/evals/sql/test_runner.py
  • focused tests for the new staged stack under tests/ai/ or tests/evals/sql/

Step 1: extract a reusable local validation primitive

Do not duplicate parse and grounding logic inside the new backend. First extract or reuse a shared validation helper from the existing eval scoring path so the backend can ask:

  • does this SQL parse
  • which tables look hallucinated
  • which columns look hallucinated
  • what concrete error string should be fed back into repair

The same logic should stay the source of truth for both candidate repair and final eval scoring.
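As a sketch of that contract, the helper below answers the four questions with a small typed result. The real source of truth is apps/evals/sql/grounding.py and scorer.py; here an in-memory SQLite catalog and a crude FROM/JOIN regex stand in purely to illustrate the shape, and `validate_sql`/`ValidationResult` are hypothetical names.

```python
import re
import sqlite3
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ValidationResult:
    parses: bool
    hallucinated_tables: List[str] = field(default_factory=list)
    error: str = ""  # concrete string to feed back into repair

def validate_sql(sql: str, schema: Dict[str, List[str]]) -> ValidationResult:
    # grounding: every FROM/JOIN target must exist in the known schema
    referenced = re.findall(r"(?:from|join)\s+([a-z_][a-z0-9_]*)", sql, re.I)
    bad_tables = [t for t in referenced if t not in schema]

    # parse check: build the schema in SQLite and let EXPLAIN do the work
    conn = sqlite3.connect(":memory:")
    for table, cols in schema.items():
        conn.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
    try:
        conn.execute(f"EXPLAIN {sql}")
        parses, error = True, ""
    except sqlite3.Error as exc:
        parses, error = False, str(exc)
    if bad_tables and not error:
        error = f"unknown tables: {', '.join(bad_tables)}"
    return ValidationResult(parses, bad_tables, error)
```

The important property is the shared `error` string: the same concrete message scores the candidate in the eval and seeds the repair prompt, so the two paths cannot drift.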

Step 2: add a structured planning stage

Add a small planning prompt that returns a typed object rather than final SQL. Keep it narrow. It only needs enough structure to improve later generation, for example:

  • candidate tables
  • candidate columns
  • requested metrics/dimensions
  • likely grain/grouping
  • filter hypotheses
  • ordering intent

This is not meant to be a universal semantic planner. It is just a way to force one intermediate reasoning step before SQL generation.
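A minimal typed shape for that intermediate step might look like the following. Field names mirror the bullets above; `QueryPlan` and `parse_plan` are proposals, not schemas the repo already defines.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class QueryPlan:
    candidate_tables: List[str] = field(default_factory=list)
    candidate_columns: List[str] = field(default_factory=list)
    metrics: List[str] = field(default_factory=list)
    dimensions: List[str] = field(default_factory=list)
    grain: Optional[str] = None           # likely grouping granularity
    filters: List[str] = field(default_factory=list)
    order_by: Optional[str] = None        # ordering intent

def parse_plan(raw: str) -> QueryPlan:
    # tolerate extra or missing keys so a sloppy model reply degrades
    # gracefully instead of crashing the pipeline
    data = json.loads(raw)
    known = QueryPlan.__dataclass_fields__
    return QueryPlan(**{k: v for k, v in data.items() if k in known})
```

Keeping the plan a plain dataclass (rather than free text) is what lets later stages and the eval artifacts inspect it deterministically.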

Step 3: generate multiple candidates from the plan

Add candidate generation that produces 3-5 SQL options instead of one. The candidates can come from:

  • one model call returning multiple candidates
  • or several small calls with slight variation instructions

Keep this bounded and deterministic enough for local eval use. The backend metadata should record:

  • candidate count requested
  • candidate count produced
  • planning mode
  • selection mode
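One way to keep generation bounded while still recording the metadata above is the several-small-calls variant sketched here. `call_model` and the variation hints are assumptions for illustration; the real generator lives in dataface/ai/generate_sql.py.

```python
from typing import Callable, Dict, List, Tuple

def generate_candidates(
    plan: dict,
    call_model: Callable[..., str],
    n: int = 3,
) -> Tuple[List[str], Dict[str, object]]:
    meta: Dict[str, object] = {
        "candidate_count_requested": n,
        "planning_mode": "structured",
        "selection_mode": "deterministic_first",
    }
    # fixed variation hints keep runs bounded and reproducible
    variations = ["direct", "cte_style", "explicit_joins", "subquery", "minimal"]
    candidates: List[str] = []
    for hint in variations[:n]:
        sql = call_model(plan, variation=hint)  # one small call per variant
        if sql and sql not in candidates:       # dedupe identical drafts
            candidates.append(sql)
    meta["candidate_count_produced"] = len(candidates)
    return candidates, meta
```

Recording both requested and produced counts makes it visible in the artifacts when the model collapses several variants into the same SQL.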

Step 4: repair only the obviously fixable failures

For candidates that fail local validation, allow one bounded repair round that feeds the model:

  • the original question
  • the structured plan
  • the failing SQL
  • parse error and/or hallucinated table/column feedback

Repair should focus on cheap wins:

  • syntax errors
  • wrong table names
  • wrong column names
  • obvious alias/reference mistakes

Do not build an unbounded retry loop.
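The single-round shape can be sketched as below. Prompt wording, `repair_once`, and the dict-based candidate shape are illustrative stand-ins; the point is that the round count is structural, not a loop condition the model can extend.

```python
from typing import Callable, Dict, List

def build_repair_prompt(question: str, plan: dict,
                        failing_sql: str, errors: List[str]) -> str:
    # feed only concrete validation output, not open-ended instructions
    return "\n".join([
        "Fix this SQL without changing the intent of the question.",
        f"Question: {question}",
        f"Plan: {plan}",
        f"Failing SQL: {failing_sql}",
        "Validation errors:",
        *[f"- {e}" for e in errors],
        "Return only the corrected SQL.",
    ])

def repair_once(candidates: List[Dict], validate: Callable[[str], List[str]],
                call_model: Callable[[str], str],
                question: str, plan: dict) -> List[Dict]:
    # exactly one bounded round: anything still failing stays failed
    for cand in candidates:
        if cand["errors"]:
            prompt = build_repair_prompt(question, plan,
                                         cand["sql"], cand["errors"])
            cand["sql"] = call_model(prompt)
            cand["errors"] = validate(cand["sql"])
            cand["repaired"] = True
    return candidates
```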

Step 5: select the best candidate with simple evidence

Selection should be deterministic-first, not judge-heavy. Prefer:

  1. repaired or original candidates that parse successfully
  2. candidates with no grounding failures
  3. candidates with the fewest repairs / least damage
  4. optional tie-break by a lightweight LLM judge only if deterministic signals cannot separate them

Persist enough metadata so leaderboard analysis can answer:

  • did the winner come from the first draft or repair
  • how many candidates were thrown out by parse failure
  • how often repair rescued a case
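The preference order lends itself to a plain sort key, with the judge demoted to an optional hook that only fires on genuine ties. This is a sketch; field names and `select_best` are assumptions, not repo APIs.

```python
from typing import Callable, Dict, List, Optional, Tuple

def selection_key(cand: Dict) -> Tuple[bool, int, int]:
    return (
        not cand["parses"],             # parsing candidates sort first
        len(cand["grounding_errors"]),  # then fewest grounding failures
        cand["repair_count"],           # then least-repaired drafts
    )

def select_best(candidates: List[Dict],
                judge: Optional[Callable[[List[Dict]], Dict]] = None) -> Dict:
    ranked = sorted(candidates, key=selection_key)
    best = ranked[0]
    tied = [c for c in ranked if selection_key(c) == selection_key(best)]
    if judge is not None and len(tied) > 1:
        return judge(tied)  # lightweight LLM judge, only on real ties
    return best
```

Because the key is a plain tuple, the same ordering can be logged per run, which is exactly the metadata leaderboard analysis needs to answer the questions above.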

Step 6: expose it through the eval CLI

Add a new backend name to apps/evals sql, with a small number of explicit knobs such as:

  • candidate_count
  • repair_rounds
  • selection_mode
  • optional execution_source for local smoke mode only

This should remain a backend choice, not a new eval runner.
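The knob surface could be pinned down with a small frozen config object that enforces the bounds at construction time. The class name and limits here are proposals, not existing flags in apps/evals/sql/cli.py.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BoundedBackendConfig:
    candidate_count: int = 3
    repair_rounds: int = 1
    selection_mode: str = "deterministic"
    execution_source: Optional[str] = None  # local smoke mode only

    def __post_init__(self):
        # enforce the "bounded" contract at the config boundary so no
        # caller can quietly turn this into an unbounded search
        if not 1 <= self.candidate_count <= 5:
            raise ValueError("candidate_count must stay bounded (1-5)")
        if self.repair_rounds > 1:
            raise ValueError("only one repair round is allowed in v1")
```

Validating in the config rather than in the pipeline keeps the runner and backend code free of defensive checks.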

Step 7: validate on canary first

Before any broader rollout:

  1. run the new backend on the canary benchmark
  2. compare against the current dataface one-shot backend
  3. inspect both pass-rate changes and failure-shape changes
  4. only expand to a larger run if the canary shows real signal

Success criteria

  • New backend runs end-to-end via apps/evals sql
  • One-shot backend remains intact as the baseline
  • Result artifacts include enough metadata to analyze candidate and repair behavior
  • The canary run shows whether bounded non-one-shot search is promising enough to keep investing in
  • No production/chat path is forced to adopt the new stack yet

Implementation Progress

Not started. This task is intentionally implementation-ready rather than research-only.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - this is a local eval/backend task with no browser-facing UI requirement.

Review Feedback

  • [ ] Review cleared