Capture candidate plan and repair telemetry in eval artifacts
Problem
The bounded non-one-shot backend is only worth keeping if we can explain how it behaves. If a run improves, we need to know whether planning helped, candidate diversity helped, repair rescued bad drafts, or selection chose better winners. If a run regresses, we need the same visibility.
Without telemetry, bounded non-one-shot evals become a black box that is harder to tune than the one-shot baseline.
Context
The bounded stack task already proposes a sequence of plan -> generate -> validate -> repair -> select. The eval runner already emits structured JSONL and summary artifacts. That gives us a natural place to add more metadata.
Useful telemetry likely includes:
- planner output or planner summary
- requested vs produced candidate count
- candidate validation failures
- repair attempts and repair outcomes
- winner source such as original draft vs repaired draft
- deterministic selection reason
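The fields above could be collected into one compact per-case record. A minimal sketch, assuming a Python eval runner; every field name here is a hypothetical illustration, not an existing schema:

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical per-case telemetry record for a bounded non-one-shot run.
# All field names are assumptions chosen to mirror the list above.
@dataclass
class CaseTelemetry:
    planning_mode: str          # planner output summary or mode label
    candidates_requested: int   # requested candidate count
    candidates_produced: int    # produced candidate count
    validation_failures: int    # candidates rejected by validation
    repair_attempts: int        # how many repair passes ran
    repair_successes: int       # how many repairs yielded a usable draft
    winner_source: str          # "draft" or "repaired"
    selection_reason: str       # deterministic selection rationale code

record = CaseTelemetry(
    planning_mode="outline",
    candidates_requested=4,
    candidates_produced=4,
    validation_failures=1,
    repair_attempts=1,
    repair_successes=1,
    winner_source="repaired",
    selection_reason="highest_validation_score",
)
print(json.dumps(asdict(record)))  # one compact JSONL line per case
```

A flat record like this stays cheap to emit and easy to diff across runs.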
This should stay local-eval friendly. We do not need massive traces or token-level logs. We need enough evidence to understand behavior at run-analysis time.
Possible Solutions
- Recommended: extend eval artifacts with compact candidate/repair telemetry. Add structured metadata fields to case-level results and run summaries so dashboards and scripts can analyze what the bounded backend actually did.
Why this is recommended:
- keeps observability inside the existing eval artifact format
- makes the bounded backend explainable
- supports later dashboards and regression analysis without replaying runs
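One low-risk way to extend the artifact format is to nest the new telemetry under a single key in each existing case record, so older readers that ignore unknown keys keep working. A sketch under that assumption; the writer function, surrounding fields, and key names are hypothetical, not the project's actual result schema:

```python
import json
from typing import TextIO

def write_case_result(fh: TextIO, case_id: str, passed: bool,
                      telemetry: dict) -> None:
    """Append one case-level JSONL record with nested bounded-run telemetry.

    All field names here are illustrative assumptions.
    """
    record = {
        "case_id": case_id,
        "passed": passed,
        # Nested under one key so existing consumers of the JSONL
        # artifacts are unaffected by the new fields.
        "bounded_telemetry": telemetry,
    }
    # sort_keys keeps output byte-stable across repeated runs.
    fh.write(json.dumps(record, sort_keys=True) + "\n")
```

Usage would be one call per case from the result writer, e.g. `write_case_result(out, "case-17", True, {"repair_attempts": 1})`.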
- Keep the backend opaque and inspect behavior only through logs.
Trade-off: simpler to implement, but logs are hard to compare across runs and not suitable for leaderboard analysis.
- Store full prompts and all intermediate completions.
Trade-off: rich but heavy. That adds a lot of noise and storage cost before we know which fields are truly useful.
Plan
- Define a minimal telemetry contract for bounded runs.
- Add case-level metadata fields such as:
  - planning mode
  - candidate count requested/produced
  - parse and grounding rejection counts
  - repair attempts
  - winning candidate source
  - selection rationale code
- Add summary-level rollups such as:
  - repair rescue rate
  - average surviving candidate count
  - fraction of winners coming from repaired drafts
- Wire those fields into result writers and any dashboard queries that should visualize them.
- Keep the telemetry compact and deterministic so it remains stable across repeated runs.
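The summary-level rollups in the plan can be derived purely from the per-case records, so no extra state is needed at run time. A minimal sketch; the input field names (`repair_attempts`, `repair_successes`, `candidates_produced`, `parse_failures`, `winner_source`) are illustrative assumptions, not an existing schema:

```python
from statistics import mean

def summarize(cases: list[dict]) -> dict:
    """Compute run-summary rollups from per-case telemetry dicts."""
    repaired = [c for c in cases if c["repair_attempts"] > 0]
    rescued = [c for c in repaired if c["repair_successes"] > 0]
    return {
        # fraction of repair-eligible cases where repair produced a usable draft
        "repair_rescue_rate": len(rescued) / len(repaired) if repaired else 0.0,
        # candidates left after validation, averaged across cases
        "avg_surviving_candidates": mean(
            c["candidates_produced"] - c["parse_failures"] for c in cases
        ),
        # fraction of winners that came from repaired drafts
        "repaired_winner_fraction": sum(
            c["winner_source"] == "repaired" for c in cases
        ) / len(cases),
    }

cases = [
    {"repair_attempts": 1, "repair_successes": 1,
     "candidates_produced": 4, "parse_failures": 1, "winner_source": "repaired"},
    {"repair_attempts": 0, "repair_successes": 0,
     "candidates_produced": 4, "parse_failures": 0, "winner_source": "draft"},
]
print(summarize(cases))
# {'repair_rescue_rate': 1.0, 'avg_surviving_candidates': 3.5, 'repaired_winner_fraction': 0.5}
```

Because the rollups are pure functions of the JSONL records, later experiments can recompute them from stored artifacts without replaying runs.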
Files likely involved
- apps/evals/sql/types.py
- apps/evals/sql/runner.py
- bounded backend code under apps/evals/sql/ and/or dataface/ai/
- eval dashboard queries and faces for telemetry slices
Success criteria
- bounded eval artifacts explain how each case was solved or rejected
- run summaries expose repair and selection behavior
- later experiments can compare backend internals without re-running with debug logging
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - eval telemetry task.
Review Feedback
No review feedback yet.
- [ ] Review cleared