Dataface Tasks

Add structured planning evals separate from final SQL quality

ID: MCP_ANALYST_AGENT-ADD_STRUCTURED_PLANNING_EVALS_SEPARATE_FROM_FINAL_SQL_QUALITY
Status: not_started
Priority: p1
Milestone: m3-public-launch
Owner: data-ai-engineer-architect

Problem

If we add a planning stage but only evaluate the final SQL, we will still not know whether planning is helping. A planner can pick the right tables, filters, and grain while generation fails later; conversely, the planner can be wrong while candidate selection and repair happen to rescue the final query.

We need a way to score planning quality directly so the plan stage can improve as its own component.

Context

The bounded non-one-shot backend already proposes structured planning before candidate generation. Other planned benchmark improvements such as retrieval-side gold labels and schema-link supervision can provide useful supervision targets for planning evals.

Likely planner dimensions worth scoring include:

  • selected tables
  • selected columns
  • metric intent
  • grouping or grain intent
  • filter hypotheses
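As a concrete illustration of what "narrow and typed" could mean for these dimensions, here is a minimal sketch of a planner-output contract. All names (`PlannerOutput` and its fields) are hypothetical, not the actual types in apps/evals/sql/types.py:

```python
from dataclasses import dataclass

# Hypothetical contract covering the five planner dimensions above.
# Sets/tuples keep each dimension independently comparable against gold labels.
@dataclass(frozen=True)
class PlannerOutput:
    tables: frozenset[str]        # selected tables
    columns: frozenset[str]       # selected columns, "table.column" form
    metric_intent: str            # e.g. "count_distinct_users"
    grain: tuple[str, ...] = ()   # grouping / grain intent
    filters: tuple[str, ...] = () # filter hypotheses as normalized predicates

plan = PlannerOutput(
    tables=frozenset({"orders", "users"}),
    columns=frozenset({"orders.user_id", "orders.created_at"}),
    metric_intent="count_distinct_users",
    grain=("orders.created_at::day",),
    filters=("orders.status = 'complete'",),
)
```

Keeping the structured fields as sets and tuples (rather than freeform text) is what makes overlap scoring against benchmark annotations mechanical.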

This should be a local-eval task first. No production planner rollout is required.

Possible Solutions

  1. Recommended: define a small planner-output contract and score it directly against benchmark annotations. Keep the planner narrow and typed enough that its output can be evaluated independently from the generated SQL.

Why this is recommended:

  • makes planning measurable
  • supports better debugging of the bounded stack
  • creates a clean separation between planner quality and SQL generation quality
  2. Evaluate planning only through downstream SQL improvement.

Trade-off: simpler, but it hides whether the planner is actually doing useful work.

  3. Let the planner stay freeform text.

Trade-off: flexible, but much harder to score consistently.

Plan

  1. Define the planner output fields that are worth standardizing.
  2. Align those fields with benchmark supervision where possible.
  3. Add a planner-eval path that scores overlap between planned targets and gold targets.
  4. Persist planner scores in eval artifacts.
  5. Compare planner quality separately from final SQL quality in dashboards or run summaries.
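Step 3 above can be sketched as a per-dimension set-overlap score. The helper name and the score keys are illustrative, not the actual eval-runner API; F1 is one reasonable overlap metric, assuming gold table/column labels exist in the benchmark:

```python
def set_f1(pred: set[str], gold: set[str]) -> float:
    """F1 overlap between a planned target set and the gold target set."""
    if not pred and not gold:
        return 1.0  # vacuously perfect: nothing to select, nothing selected
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Each planner dimension is scored independently, so regressions can be
# localized to (say) table selection vs filter hypotheses.
scores = {
    "tables_f1": set_f1({"orders", "users"}, {"orders", "users", "regions"}),
    "columns_f1": set_f1({"orders.user_id"},
                         {"orders.user_id", "orders.created_at"}),
}
```

A flat dict like `scores` is the kind of artifact step 4 would persist alongside the existing SQL-quality metrics, so dashboards can plot planner quality and final-SQL quality as separate series.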

Files likely involved

  • bounded backend code under dataface/ai/ or apps/evals/sql/
  • apps/evals/sql/types.py
  • apps/evals/sql/runner.py
  • benchmark label/schema files once supervision fields exist

Success criteria

  • planning quality is measurable independently from final SQL
  • bounded-stack regressions can be localized to plan vs generation vs repair
  • future planner improvements can be evaluated without waiting for full SQL parity

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - planner eval task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared