Add structured planning evals separate from final SQL quality
Problem
If we add a planning stage but only evaluate the final SQL, we still will not know whether planning helps: a planner can pick the right tables, filters, and grain while generation fails later, or the planner can be wrong while candidate selection and repair happen to rescue the final query.
We need a way to score planning quality directly so the plan stage can improve as its own component.
Context
The bounded non-one-shot backend already proposes structured planning before candidate generation. Other planned benchmark improvements, such as retrieval-side gold labels and schema-link supervision, can provide useful gold targets for planning evals.
Likely planner dimensions worth scoring include:
- selected tables
- selected columns
- metric intent
- grouping or grain intent
- filter hypotheses
This should be a local-eval task first. No production planner rollout is required.
Possible Solutions
- Recommended: define a small planner-output contract and score it directly against benchmark annotations. Keep the planner narrow and typed enough that its output can be evaluated independently from the generated SQL.
Why this is recommended:
- makes planning measurable
- supports better debugging of the bounded stack
- creates a clean separation between planner quality and SQL generation quality
- Evaluate planning only through downstream SQL improvement.
Trade-off: simpler, but it hides whether the planner is actually doing useful work.
- Let the planner stay freeform text.
Trade-off: flexible, but much harder to score consistently.
Plan
- Define the planner output fields that are worth standardizing.
- Align those fields with benchmark supervision where possible.
- Add a planner-eval path that scores overlap between planned targets and gold targets.
- Persist planner scores in eval artifacts.
- Compare planner quality separately from final SQL quality in dashboards or run summaries.
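The overlap-scoring step above could look roughly like the following. This is a sketch under the assumption that gold tables and columns exist per benchmark question; `set_f1` and `score_plan` are hypothetical names, not existing functions in `apps/evals/sql/runner.py`.

```python
# Score one planner dimension as set overlap: precision, recall, and
# F1 between the planned set and the gold set for a single question.
def set_f1(planned: set[str], gold: set[str]) -> dict[str, float]:
    if not planned and not gold:
        # Both empty counts as a perfect match (e.g. no filters planned,
        # none required by the gold annotation).
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0}
    hits = len(planned & gold)
    precision = hits / len(planned) if planned else 0.0
    recall = hits / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# One score dict per dimension, so per-question results can be
# persisted in eval artifacts and aggregated separately from final
# SQL quality.
def score_plan(plan_tables, gold_tables, plan_cols, gold_cols):
    return {
        "tables": set_f1(set(plan_tables), set(gold_tables)),
        "columns": set_f1(set(plan_cols), set(gold_cols)),
    }
```

Reporting each dimension separately, rather than a single blended score, is what makes it possible to localize a bounded-stack regression to table selection versus column selection versus downstream generation.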
Files likely involved
- bounded backend code under dataface/ai/or
- apps/evals/sql/
- apps/evals/sql/types.py
- apps/evals/sql/runner.py
- benchmark label/schema files once supervision fields exist
Success criteria
- planning quality is measurable independently from final SQL
- bounded-stack regressions can be localized to plan vs generation vs repair
- future planner improvements can be evaluated without waiting for full SQL parity
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - planner eval task.
Review Feedback
No review feedback yet.
- [ ] Review cleared