Add structured planning evals separate from final SQL quality
Problem
If we add a planning stage but only evaluate the final SQL, we still will not know whether planning helps: a planner can pick the right tables, filters, and grain while generation fails later, or the planner can be wrong while candidate selection and repair happen to rescue the final query.
We need a way to score planning quality directly so the plan stage can improve as its own component.
Context
The bounded non-one-shot backend already proposes structured planning before candidate generation. Other planned benchmark improvements, such as retrieval-side gold labels and schema-link supervision, can provide useful gold targets for planning evals.
Likely planner dimensions worth scoring include:
- selected tables
- selected columns
- metric intent
- grouping or grain intent
- filter hypotheses
This should be a local-eval task first. No production planner rollout is required.
Possible Solutions
- Recommended: define a small planner-output contract and score it directly against benchmark annotations. Keep the planner narrow and typed enough that its output can be evaluated independently from the generated SQL.
Why this is recommended:
- makes planning measurable
- supports better debugging of the bounded stack
- creates a clean separation between planner quality and SQL generation quality
- Evaluate planning only through downstream SQL improvement.
Trade-off: simpler, but it hides whether the planner is actually doing useful work.
- Let the planner stay freeform text.
Trade-off: flexible, but much harder to score consistently.
Plan
- Define the planner output fields that are worth standardizing.
- Align those fields with benchmark supervision where possible.
- Add a planner-eval path that scores overlap between planned targets and gold targets.
- Persist planner scores in eval artifacts.
- Compare planner quality separately from final SQL quality in dashboards or run summaries.
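The overlap-scoring step above could look roughly like the following. This is a sketch under the assumption that gold tables and columns exist per benchmark question; `set_f1` and `score_plan` are hypothetical names, not existing functions in `apps/evals/sql/runner.py`.

```python
# Score one planner dimension as set overlap: precision, recall, and
# F1 between the planned set and the gold set for a single question.
def set_f1(planned: set[str], gold: set[str]) -> dict[str, float]:
    if not planned and not gold:
        # Both empty counts as a perfect match (e.g. no filters planned,
        # none required by the gold annotation).
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0}
    hits = len(planned & gold)
    precision = hits / len(planned) if planned else 0.0
    recall = hits / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# One score dict per dimension, so per-question results can be
# persisted in eval artifacts and aggregated separately from final
# SQL quality.
def score_plan(plan_tables, gold_tables, plan_cols, gold_cols):
    return {
        "tables": set_f1(set(plan_tables), set(gold_tables)),
        "columns": set_f1(set(plan_cols), set(gold_cols)),
    }
```

Reporting each dimension separately, rather than a single blended score, is what makes it possible to localize a bounded-stack regression to table selection versus column selection versus downstream generation.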
Files likely involved
- bounded backend code under dataface/ai/or
- apps/evals/sql/
- apps/evals/sql/types.py
- apps/evals/sql/runner.py
- benchmark label/schema files once supervision fields exist
Success criteria
- planning quality is measurable independently from final SQL
- bounded-stack regressions can be localized to plan vs generation vs repair
- future planner improvements can be evaluated without waiting for full SQL parity
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - planner eval task.
Review Feedback
No review feedback yet.
- [ ] Review cleared