Build bounded non-one-shot text-to-SQL stack for local evals
Problem
Build a new experimental text-to-SQL stack for local evals that is better than the current one-shot baseline, but still cheap enough to build with the pieces we already have. The goal is not a full agentic system yet; it is a bounded non-one-shot backend that can run inside `apps/evals sql` and be compared honestly against the existing `dataface` backend on the local benchmark.
The stack should target the low-hanging fruit from the SOTA gap analysis:
- explicit planning before final SQL
- multiple SQL candidates instead of one
- local validation and bounded repair before returning the final answer
- simple evidence-based candidate selection
It should not require new retrieval infrastructure, a new semantic layer, or an open-ended warehouse exploration agent.
Context
What we already have
The repo already contains most of the primitives needed for a first non-one-shot stack:
- `dataface/ai/generate_sql.py` - shared one-shot SQL generation used by product paths and evals
- `apps/evals/sql/backends.py` - pluggable backend factory layer for benchmark runs
- `apps/evals/sql/context.py` - existing context providers (`dbt_package`, `inspect_cache`)
- `apps/evals/sql/grounding.py` and `apps/evals/sql/scorer.py` - local parse and grounding analysis
- `apps/evals/sql/runner.py` - async benchmark runner with JSONL + summary output
- `dataface/ai/mcp/tools.py:execute_query` - existing read-only execution primitive if we want optional local execution smoke checks
That means we do not need to invent a new eval framework to try a better stack. We already have the seam where a new backend can plug in.
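The seam itself can be pictured as a small name-to-factory registry. This is an illustrative sketch, not the real `apps/evals/sql/backends.py` API; `register_backend`, `create_backend`, and the lambda backends are all assumptions:

```python
from typing import Callable, Dict

# A backend maps a natural-language question to SQL text.
Backend = Callable[[str], str]

_REGISTRY: Dict[str, Callable[[], Backend]] = {}

def register_backend(name: str, factory: Callable[[], Backend]) -> None:
    """Register a named backend factory for benchmark runs."""
    _REGISTRY[name] = factory

def create_backend(name: str) -> Backend:
    """Instantiate a backend by name, as the benchmark runner would."""
    return _REGISTRY[name]()

# The existing one-shot baseline stays registered unchanged...
register_backend("dataface", lambda: (lambda q: f"-- one-shot SQL for: {q}"))
# ...and the new bounded stack plugs in beside it under its own name.
register_backend("dataface_bounded", lambda: (lambda q: f"-- staged SQL for: {q}"))
```

The point is only that the runner selects backends by name, so a new stack can be compared against the baseline without touching the runner.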
Why this should be bounded and eval-first
The SOTA notes correctly point toward retrieval, exploration, repair, and selection, but most of that is still expensive or underspecified for us right now. We do not yet have:
- a true `search_context` runtime
- governed business-definition retrieval
- a robust exploration-query policy
- production-safe multi-step orchestration around warehouse execution
So the first version should stay intentionally smaller:
- local eval backend first
- bounded number of candidates
- bounded number of repair attempts
- validation driven by parse and grounding signals we already compute
- optional execution only as a local developer mode, not as the required benchmark path
Proposed first stack
The first useful stack is:
- `plan` - ask the model for a compact structured query plan or schema-link hypothesis
- `generate` - generate 3-5 SQL candidates from that plan
- `validate` - run local parse and grounding checks on each candidate
- `repair` - give failing candidates one repair round using the concrete validation errors
- `select` - choose the best surviving candidate with deterministic heuristics first
This gets us out of pure one-shot generation without forcing a full agent rewrite.
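The five stages can be sketched as one bounded loop. Every stage function below is a hypothetical stand-in for the real model calls and validation helpers, and `Candidate` is an invented type:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    sql: str
    parses: bool = True
    grounding_failures: int = 0
    repaired: bool = False

def run_bounded_stack(question, plan_fn, generate_fn, validate_fn, repair_fn,
                      n_candidates=3, repair_rounds=1):
    """Bounded plan -> generate -> validate -> repair -> select loop."""
    plan = plan_fn(question)                     # one structured planning step
    candidates = generate_fn(question, plan, n_candidates)
    for cand in candidates:
        validate_fn(cand)                        # local parse/grounding checks
        rounds = 0
        while not cand.parses and rounds < repair_rounds:
            repair_fn(question, plan, cand)      # bounded repair, no open loop
            validate_fn(cand)
            rounds += 1
    survivors = [c for c in candidates if c.parses]
    # Deterministic-first pick: fewest grounding failures, then unrepaired.
    survivors.sort(key=lambda c: (c.grounding_failures, c.repaired))
    return survivors[0] if survivors else None
```

Both the candidate count and the repair budget are explicit parameters, which is what keeps the stack bounded rather than agentic.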
Explicit non-goals
Do not turn this task into:
- a full Cloud/chat integration task
- a live retrieval/indexing project
- a value-exploration or distinct-values tool buildout
- a broad benchmark-gating system
- an attempt to solve semantic/business grounding in one pass
If the bounded backend works, richer retrieval and exploration can be follow-up tasks.
Possible Solutions
- Recommended: build a bounded eval-only stack around the existing generator.
Add a small staged pipeline that produces a structured plan, generates multiple SQL candidates, validates them locally with existing parse/grounding logic, repairs obviously fixable failures once, and selects the best candidate. Wire it in as a new
`apps/evals sql` backend so it can be compared directly against the current `dataface` backend.
Why this is the right first step:
- It directly addresses the biggest cheap gap: we currently generate one answer and stop.
- It reuses code we already have instead of demanding new infrastructure.
- It produces benchmark evidence quickly.
- It keeps production risk low because the first integration point is eval-only.
Trade-off: it will still be weaker than true retrieval-plus-exploration systems because it does not inspect live values or search a richer business context domain.
- Improve the existing one-shot prompt and stop there.
Trade-off: easiest to ship, but it does not test the structural change the gap analysis says we need. Better prompting alone is unlikely to produce a meaningful step change.
- Jump straight to a full retrieval-plus-exploration agent.
Trade-off: directionally correct, but too much new surface area for a first iteration. It would mix retrieval design, tool design, execution safety, and candidate selection into one large task and make it hard to tell which part helped.
- Build the new stack directly into Cloud/product flows before running it in evals.
Trade-off: higher operational risk and worse measurement discipline. We should prove it on the benchmark first, then decide whether it deserves product integration.
Plan
Recommended implementation
Build a new local-eval backend, likely something like `dataface_bounded` or `dataface_non_one_shot`, using a small shared module under `dataface/ai/` for the staged flow and a thin factory hook in `apps/evals/sql/backends.py`.
Files likely involved
- `dataface/ai/generate_sql.py`
- `dataface/ai/` - new staged module for plan/candidate/repair/select
- `apps/evals/sql/backends.py`
- `apps/evals/sql/grounding.py`
- `apps/evals/sql/types.py`
- `apps/evals/sql/cli.py`
- `tests/evals/sql/test_backends.py`
- `tests/evals/sql/test_runner.py`
- focused tests for the new staged stack under `tests/ai/` or `tests/evals/sql/`
Step 1: extract a reusable local validation primitive
Do not duplicate parse and grounding logic inside the new backend. First extract or reuse a shared validation helper from the existing eval scoring path so the backend can ask:
- does this SQL parse
- which tables look hallucinated
- which columns look hallucinated
- what concrete error string should be fed back into repair
The same logic should stay the source of truth for both candidate repair and final eval scoring.
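The shape of that helper might look like the following. This is a deliberately naive stdlib sketch: the regex table check and toy `KNOWN_SCHEMA` stand in for the real parse and grounding logic under `apps/evals/sql`, and `ValidationResult` is an invented name:

```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ValidationResult:
    parses: bool
    hallucinated_tables: List[str]
    hallucinated_columns: List[str]
    error: Optional[str] = None  # concrete string to feed back into repair

# Toy schema; the real helper would use the benchmark's schema context.
KNOWN_SCHEMA = {"orders": {"id", "total", "created_at"}}

def validate_sql(sql: str) -> ValidationResult:
    """Naive regex-based stand-in for real parse/grounding checks."""
    tables = re.findall(r"\bFROM\s+(\w+)", sql, re.IGNORECASE)
    bad_tables = [t for t in tables if t.lower() not in KNOWN_SCHEMA]
    error = f"unknown tables: {bad_tables}" if bad_tables else None
    return ValidationResult(parses=bool(tables),
                            hallucinated_tables=bad_tables,
                            hallucinated_columns=[],  # elided in this sketch
                            error=error)
```

Whatever the real implementation looks like, the important property is the return shape: one structured answer to all four questions above, usable by both repair and scoring.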
Step 2: add a structured planning stage
Add a small planning prompt that returns a typed object rather than final SQL. Keep it narrow. It only needs enough structure to improve later generation, for example:
- candidate tables
- candidate columns
- requested metrics/dimensions
- likely grain/grouping
- filter hypotheses
- ordering intent
This is not meant to be a universal semantic planner. It is just a way to force one intermediate reasoning step before SQL generation.
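One possible shape for the typed plan object, with a small helper that tolerates extra keys in the model's JSON reply; the field names mirror the list above and are illustrative, not a fixed schema:

```python
import json
from dataclasses import dataclass, field, fields
from typing import List, Optional

@dataclass
class QueryPlan:
    candidate_tables: List[str] = field(default_factory=list)
    candidate_columns: List[str] = field(default_factory=list)
    metrics: List[str] = field(default_factory=list)
    dimensions: List[str] = field(default_factory=list)
    grain: Optional[str] = None          # likely grain/grouping
    filters: List[str] = field(default_factory=list)  # filter hypotheses
    order_by: Optional[str] = None       # ordering intent

    @classmethod
    def from_model_json(cls, raw: str) -> "QueryPlan":
        """Parse the model's JSON reply, dropping any unexpected keys."""
        data = json.loads(raw)
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})
```

Dropping unknown keys keeps the plan stage robust to model drift without adding any schema-validation machinery.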
Step 3: generate multiple candidates from the plan
Add candidate generation that produces 3-5 SQL options instead of one. The candidates can come from:
- one model call returning multiple candidates
- or several small calls with slight variation instructions
Keep this bounded and deterministic enough for local eval use. The backend metadata should record:
- candidate count requested
- candidate count produced
- planning mode
- selection mode
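Sketched under the several-small-calls variant, with the metadata fields listed above; `call_model` and the variation hints are assumptions:

```python
def generate_candidates(question, plan, call_model, n=3):
    """Produce up to n SQL candidates via small calls with variation hints."""
    variation_hints = [
        "default style",
        "prefer explicit joins",
        "prefer CTEs",
        "minimize subqueries",
        "prefer window functions",
    ]
    candidates = [call_model(question, plan, hint) for hint in variation_hints[:n]]
    metadata = {
        "candidate_count_requested": n,
        "candidate_count_produced": len(candidates),
        "planning_mode": "structured_plan",
        "selection_mode": "deterministic_first",
    }
    return candidates, metadata
```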
Step 4: repair only the obviously fixable failures
For candidates that fail local validation, allow one bounded repair round that feeds the model:
- the original question
- the structured plan
- the failing SQL
- parse error and/or hallucinated table/column feedback
Repair should focus on cheap wins:
- syntax errors
- wrong table names
- wrong column names
- obvious alias/reference mistakes
Do not build an unbounded retry loop.
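A single bounded round might just assemble that feedback into one prompt; `repair_once` and `call_model` are hypothetical names:

```python
def repair_once(question, plan, failing_sql, error, call_model):
    """Single bounded repair round fed with the concrete validation error."""
    prompt = (
        f"Question: {question}\n"
        f"Plan: {plan}\n"
        f"Failing SQL:\n{failing_sql}\n"
        f"Validation error: {error}\n"
        "Return corrected SQL that fixes only the reported problem."
    )
    return call_model(prompt)  # exactly one attempt; no retry loop
```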
Step 5: select the best candidate with simple evidence
Selection should be deterministic-first, not judge-heavy. Prefer:
- repaired or original candidates that parse successfully
- candidates with no grounding failures
- candidates with the fewest repairs / least damage
- optional tie-break by a lightweight LLM judge only if deterministic signals cannot separate them
Persist enough metadata so leaderboard analysis can answer:
- did the winner come from the first draft or repair
- how many candidates were thrown out by parse failure
- how often repair rescued a case
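Sketched as a pure function over per-candidate records (the dict keys here are invented), selection can emit the per-run metadata those questions need:

```python
def select_candidate(candidates):
    """Deterministic-first selection plus leaderboard metadata."""
    survivors = [c for c in candidates if c["parses"]]
    # Prefer no grounding failures, then the least-repaired candidate.
    survivors.sort(key=lambda c: (c["grounding_failures"], c["repair_count"]))
    winner = survivors[0] if survivors else None
    metadata = {
        "winner_from_repair": bool(winner and winner["repair_count"] > 0),
        "discarded_parse_failures": sum(1 for c in candidates if not c["parses"]),
    }
    return winner, metadata
```

Aggregating `winner_from_repair` across a benchmark run then answers how often repair rescued a case.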
Step 6: expose it through the eval CLI
Add a new backend name to `apps/evals sql`, with a small number of explicit knobs such as:
- `candidate_count`
- `repair_rounds`
- `selection_mode`
- optional `execution_source` for local smoke mode only
This should remain a backend choice, not a new eval runner.
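A hypothetical `argparse` sketch of those knobs; the program name and flag spellings are assumptions, not the real CLI in `apps/evals/sql/cli.py`:

```python
import argparse

def build_parser():
    """Hypothetical CLI surface; flag names mirror the knobs listed above."""
    p = argparse.ArgumentParser(prog="evals-sql")  # stand-in program name
    p.add_argument("--backend", default="dataface")
    p.add_argument("--candidate-count", type=int, default=3)
    p.add_argument("--repair-rounds", type=int, default=1)
    p.add_argument("--selection-mode", default="deterministic")
    p.add_argument("--execution-source", default=None,
                   help="optional; local smoke-check mode only")
    return p
```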
Step 7: validate on canary first
Before any broader rollout:
- run the new backend on the canary benchmark
- compare against the current `dataface` one-shot backend
- inspect both pass-rate changes and failure-shape changes
- only expand to a larger run if the canary shows real signal
Success criteria
- New backend runs end-to-end via `apps/evals sql`
- One-shot backend remains intact as the baseline
- Result artifacts include enough metadata to analyze candidate and repair behavior
- The canary run shows whether bounded non-one-shot search is promising enough to keep investing in
- No production/chat path is forced to adopt the new stack yet
Implementation Progress
Not started. This task is intentionally implementation-ready rather than research-only.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - this is a local eval/backend task with no browser-facing UI requirement.
Review Feedback
- [ ] Review cleared