Dataface Tasks

Build bounded non-one-shot text-to-SQL stack for local evals

ID: MCP_ANALYST_AGENT-BUILD_BOUNDED_NON_ONE_SHOT_TEXT_TO_SQL_STACK_FOR_LOCAL_EVALS
Status: not_started
Priority: p1
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: benchmark-driven-text-to-sql-and-discovery-evals

Problem

Build a new experimental text-to-SQL stack for local evals that improves on the current one-shot baseline while staying cheap enough to assemble from pieces we already have. The goal is not a full agentic system yet. The goal is a bounded non-one-shot backend that can be run inside apps/evals sql and compared honestly against the existing dataface backend on the local benchmark.

The stack should target the low-hanging fruit from the SOTA gap analysis:

  1. explicit planning before final SQL
  2. multiple SQL candidates instead of one
  3. local validation and bounded repair before returning the final answer
  4. simple evidence-based candidate selection

It should not require new retrieval infrastructure, a new semantic layer, or an open-ended warehouse exploration agent.

Context

What we already have

The repo already contains most of the primitives needed for a first non-one-shot stack:

  • dataface/ai/generate_sql.py - shared one-shot SQL generation used by product paths and evals
  • apps/evals/sql/backends.py - pluggable backend factory layer for benchmark runs
  • apps/evals/sql/context.py - existing context providers (dbt_package, inspect_cache)
  • apps/evals/sql/grounding.py and apps/evals/sql/scorer.py - local parse and grounding analysis
  • apps/evals/sql/runner.py - async benchmark runner with JSONL + summary output
  • dataface/ai/mcp/tools.py:execute_query - existing read-only execution primitive if we want optional local execution smoke checks

That means we do not need to invent a new eval framework to try a better stack. We already have the seam where a new backend can plug in.

Why this should be bounded and eval-first

The SOTA notes correctly point toward retrieval, exploration, repair, and selection, but most of that is still expensive or underspecified for us right now. We do not yet have:

  • a true search_context runtime
  • governed business-definition retrieval
  • a robust exploration-query policy
  • production-safe multi-step orchestration around warehouse execution

So the first version should stay intentionally smaller:

  • local eval backend first
  • bounded number of candidates
  • bounded number of repair attempts
  • validation driven by parse and grounding signals we already compute
  • optional execution only as a local developer mode, not as the required benchmark path

Proposed first stack

The first useful stack is:

  1. plan - ask the model for a compact structured query plan or schema-link hypothesis
  2. generate - generate 3-5 SQL candidates from that plan
  3. validate - run local parse and grounding checks on each candidate
  4. repair - give failing candidates one repair round using the concrete validation errors
  5. select - choose the best surviving candidate with deterministic heuristics first

This gets us out of pure one-shot generation without forcing a full agent rewrite.
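The five stages above can be sketched as a single bounded orchestration function. This is an illustration only: the stage callables (`make_plan`, `generate`, `validate`, `repair`) are injected stand-ins, not existing repo APIs, and the real backend would wire in the prompts and grounding checks described below.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Candidate:
    sql: str
    errors: List[str] = field(default_factory=list)
    repaired: bool = False

def run_bounded_stack(
    question: str,
    make_plan: Callable[[str], dict],
    generate: Callable[[dict], List[str]],
    validate: Callable[[str], List[str]],
    repair: Callable[[str, List[str]], str],
    repair_rounds: int = 1,
) -> Candidate:
    plan = make_plan(question)                          # 1. plan before SQL
    cands = [Candidate(sql) for sql in generate(plan)]  # 2. multiple candidates
    for c in cands:
        c.errors = validate(c.sql)                      # 3. local validation
    for _ in range(repair_rounds):                      # 4. bounded repair only
        for c in cands:
            if c.errors:
                c.sql = repair(c.sql, c.errors)
                c.errors = validate(c.sql)
                c.repaired = True
    # 5. deterministic-first selection: fewest errors, then unrepaired drafts
    return min(cands, key=lambda c: (len(c.errors), c.repaired))
```

Because every stage is an argument, the orchestration stays trivially unit-testable without model calls, which matches the eval-first framing.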

Explicit non-goals

Do not turn this task into:

  • a full Cloud/chat integration task
  • a live retrieval/indexing project
  • a value-exploration or distinct-values tool buildout
  • a broad benchmark-gating system
  • an attempt to solve semantic/business grounding in one pass

If the bounded backend works, richer retrieval and exploration can be follow-up tasks.

Possible Solutions

  1. Recommended: build a bounded eval-only stack around the existing generator. Add a small staged pipeline that produces a structured plan, generates multiple SQL candidates, validates them locally with existing parse/grounding logic, repairs obviously fixable failures once, and selects the best candidate. Wire it in as a new apps/evals sql backend so it can be compared directly against the current dataface backend.

Why this is the right first step:

  • It directly addresses the biggest cheap gap: we currently generate one answer and stop.
  • It reuses code we already have instead of demanding new infrastructure.
  • It produces benchmark evidence quickly.
  • It keeps production risk low because the first integration point is eval-only.

Trade-off: it will still be weaker than true retrieval-plus-exploration systems because it does not inspect live values or search a richer business context domain.

  2. Improve the existing one-shot prompt and stop there.

Trade-off: easiest to ship, but it does not test the structural change the gap analysis says we need. Better prompting alone is unlikely to produce a meaningful step change.

  3. Jump straight to a full retrieval-plus-exploration agent.

Trade-off: directionally correct, but too much new surface area for a first iteration. It would mix retrieval design, tool design, execution safety, and candidate selection into one large task and make it hard to tell which part helped.

  4. Build the new stack directly into Cloud/product flows before running it in evals.

Trade-off: higher operational risk and worse measurement discipline. We should prove it on the benchmark first, then decide whether it deserves product integration.

Plan

Build a new local-eval backend, likely something like dataface_bounded or dataface_non_one_shot, using a small shared module under dataface/ai/ for the staged flow and a thin factory hook in apps/evals/sql/backends.py.

Files likely involved

  • dataface/ai/generate_sql.py
  • a new module under dataface/ai/ for the staged plan/candidate/repair/select flow
  • apps/evals/sql/backends.py
  • apps/evals/sql/grounding.py
  • apps/evals/sql/types.py
  • apps/evals/sql/cli.py
  • tests/evals/sql/test_backends.py
  • tests/evals/sql/test_runner.py
  • focused tests for the new staged stack under tests/ai/ or tests/evals/sql/

Step 1: extract a reusable local validation primitive

Do not duplicate parse and grounding logic inside the new backend. First extract or reuse a shared validation helper from the existing eval scoring path so the backend can ask:

  • does this SQL parse
  • which tables look hallucinated
  • which columns look hallucinated
  • what concrete error string should be fed back into repair

The same logic should stay the source of truth for both candidate repair and final eval scoring.
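As a sketch of that contract, the helper below answers the four questions with a small typed result. The real source of truth is apps/evals/sql/grounding.py and scorer.py; here an in-memory SQLite catalog and a crude FROM/JOIN regex stand in purely to illustrate the shape, and `validate_sql`/`ValidationResult` are hypothetical names.

```python
import re
import sqlite3
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ValidationResult:
    parses: bool
    hallucinated_tables: List[str] = field(default_factory=list)
    error: str = ""  # concrete string to feed back into repair

def validate_sql(sql: str, schema: Dict[str, List[str]]) -> ValidationResult:
    # grounding: every FROM/JOIN target must exist in the known schema
    referenced = re.findall(r"(?:from|join)\s+([a-z_][a-z0-9_]*)", sql, re.I)
    bad_tables = [t for t in referenced if t not in schema]

    # parse check: build the schema in SQLite and let EXPLAIN do the work
    conn = sqlite3.connect(":memory:")
    for table, cols in schema.items():
        conn.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
    try:
        conn.execute(f"EXPLAIN {sql}")
        parses, error = True, ""
    except sqlite3.Error as exc:
        parses, error = False, str(exc)
    if bad_tables and not error:
        error = f"unknown tables: {', '.join(bad_tables)}"
    return ValidationResult(parses, bad_tables, error)
```

The important property is the shared `error` string: the same concrete message scores the candidate in the eval and seeds the repair prompt, so the two paths cannot drift.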

Step 2: add a structured planning stage

Add a small planning prompt that returns a typed object rather than final SQL. Keep it narrow. It only needs enough structure to improve later generation, for example:

  • candidate tables
  • candidate columns
  • requested metrics/dimensions
  • likely grain/grouping
  • filter hypotheses
  • ordering intent

This is not meant to be a universal semantic planner. It is just a way to force one intermediate reasoning step before SQL generation.
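A minimal typed shape for that intermediate step might look like the following. Field names mirror the bullets above; `QueryPlan` and `parse_plan` are proposals, not schemas the repo already defines.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class QueryPlan:
    candidate_tables: List[str] = field(default_factory=list)
    candidate_columns: List[str] = field(default_factory=list)
    metrics: List[str] = field(default_factory=list)
    dimensions: List[str] = field(default_factory=list)
    grain: Optional[str] = None           # likely grouping granularity
    filters: List[str] = field(default_factory=list)
    order_by: Optional[str] = None        # ordering intent

def parse_plan(raw: str) -> QueryPlan:
    # tolerate extra or missing keys so a sloppy model reply degrades
    # gracefully instead of crashing the pipeline
    data = json.loads(raw)
    known = QueryPlan.__dataclass_fields__
    return QueryPlan(**{k: v for k, v in data.items() if k in known})
```

Keeping the plan a plain dataclass (rather than free text) is what lets later stages and the eval artifacts inspect it deterministically.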

Step 3: generate multiple candidates from the plan

Add candidate generation that produces 3-5 SQL options instead of one. The candidates can come from:

  • one model call returning multiple candidates
  • or several small calls with slight variation instructions

Keep this bounded and deterministic enough for local eval use. The backend metadata should record:

  • candidate count requested
  • candidate count produced
  • planning mode
  • selection mode
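One way to keep generation bounded while still recording the metadata above is the several-small-calls variant sketched here. `call_model` and the variation hints are assumptions for illustration; the real generator lives in dataface/ai/generate_sql.py.

```python
from typing import Callable, Dict, List, Tuple

def generate_candidates(
    plan: dict,
    call_model: Callable[..., str],
    n: int = 3,
) -> Tuple[List[str], Dict[str, object]]:
    meta: Dict[str, object] = {
        "candidate_count_requested": n,
        "planning_mode": "structured",
        "selection_mode": "deterministic_first",
    }
    # fixed variation hints keep runs bounded and reproducible
    variations = ["direct", "cte_style", "explicit_joins", "subquery", "minimal"]
    candidates: List[str] = []
    for hint in variations[:n]:
        sql = call_model(plan, variation=hint)  # one small call per variant
        if sql and sql not in candidates:       # dedupe identical drafts
            candidates.append(sql)
    meta["candidate_count_produced"] = len(candidates)
    return candidates, meta
```

Recording both requested and produced counts makes it visible in the artifacts when the model collapses several variants into the same SQL.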

Step 4: repair only the obviously fixable failures

For candidates that fail local validation, allow one bounded repair round that feeds the model:

  • the original question
  • the structured plan
  • the failing SQL
  • parse error and/or hallucinated table/column feedback

Repair should focus on cheap wins:

  • syntax errors
  • wrong table names
  • wrong column names
  • obvious alias/reference mistakes

Do not build an unbounded retry loop.
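The single-round shape can be sketched as below. Prompt wording, `repair_once`, and the dict-based candidate shape are illustrative stand-ins; the point is that the round count is structural, not a loop condition the model can extend.

```python
from typing import Callable, Dict, List

def build_repair_prompt(question: str, plan: dict,
                        failing_sql: str, errors: List[str]) -> str:
    # feed only concrete validation output, not open-ended instructions
    return "\n".join([
        "Fix this SQL without changing the intent of the question.",
        f"Question: {question}",
        f"Plan: {plan}",
        f"Failing SQL: {failing_sql}",
        "Validation errors:",
        *[f"- {e}" for e in errors],
        "Return only the corrected SQL.",
    ])

def repair_once(candidates: List[Dict], validate: Callable[[str], List[str]],
                call_model: Callable[[str], str],
                question: str, plan: dict) -> List[Dict]:
    # exactly one bounded round: anything still failing stays failed
    for cand in candidates:
        if cand["errors"]:
            prompt = build_repair_prompt(question, plan,
                                         cand["sql"], cand["errors"])
            cand["sql"] = call_model(prompt)
            cand["errors"] = validate(cand["sql"])
            cand["repaired"] = True
    return candidates
```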

Step 5: select the best candidate with simple evidence

Selection should be deterministic-first, not judge-heavy. Prefer:

  1. repaired or original candidates that parse successfully
  2. candidates with no grounding failures
  3. candidates with the fewest repairs / least damage
  4. optional tie-break by a lightweight LLM judge only if deterministic signals cannot separate them

Persist enough metadata so leaderboard analysis can answer:

  • did the winner come from the first draft or repair
  • how many candidates were thrown out by parse failure
  • how often repair rescued a case
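The preference order lends itself to a plain sort key, with the judge demoted to an optional hook that only fires on genuine ties. This is a sketch; field names and `select_best` are assumptions, not repo APIs.

```python
from typing import Callable, Dict, List, Optional, Tuple

def selection_key(cand: Dict) -> Tuple[bool, int, int]:
    return (
        not cand["parses"],             # parsing candidates sort first
        len(cand["grounding_errors"]),  # then fewest grounding failures
        cand["repair_count"],           # then least-repaired drafts
    )

def select_best(candidates: List[Dict],
                judge: Optional[Callable[[List[Dict]], Dict]] = None) -> Dict:
    ranked = sorted(candidates, key=selection_key)
    best = ranked[0]
    tied = [c for c in ranked if selection_key(c) == selection_key(best)]
    if judge is not None and len(tied) > 1:
        return judge(tied)  # lightweight LLM judge, only on real ties
    return best
```

Because the key is a plain tuple, the same ordering can be logged per run, which is exactly the metadata leaderboard analysis needs to answer the questions above.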

Step 6: expose it through the eval CLI

Add a new backend name to apps/evals sql, with a small number of explicit knobs such as:

  • candidate_count
  • repair_rounds
  • selection_mode
  • optional execution_source for local smoke mode only

This should remain a backend choice, not a new eval runner.
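The knob surface could be pinned down with a small frozen config object that enforces the bounds at construction time. The class name and limits here are proposals, not existing flags in apps/evals/sql/cli.py.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BoundedBackendConfig:
    candidate_count: int = 3
    repair_rounds: int = 1
    selection_mode: str = "deterministic"
    execution_source: Optional[str] = None  # local smoke mode only

    def __post_init__(self):
        # enforce the "bounded" contract at the config boundary so no
        # caller can quietly turn this into an unbounded search
        if not 1 <= self.candidate_count <= 5:
            raise ValueError("candidate_count must stay bounded (1-5)")
        if self.repair_rounds > 1:
            raise ValueError("only one repair round is allowed in v1")
```

Validating in the config rather than in the pipeline keeps the runner and backend code free of defensive checks.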

Step 7: validate on canary first

Before any broader rollout:

  1. run the new backend on the canary benchmark
  2. compare against the current dataface one-shot backend
  3. inspect both pass-rate changes and failure-shape changes
  4. only expand to a larger run if the canary shows real signal

Success criteria

  • New backend runs end-to-end via apps/evals sql
  • One-shot backend remains intact as the baseline
  • Result artifacts include enough metadata to analyze candidate and repair behavior
  • The canary run shows whether bounded non-one-shot search is promising enough to keep investing in
  • No production/chat path is forced to adopt the new stack yet

Implementation Progress

Not started. This task is intentionally implementation-ready rather than research-only.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - this is a local eval/backend task with no browser-facing UI requirement.

Review Feedback

  • [ ] Review cleared