Dataface Tasks

Capture candidate plan and repair telemetry in eval artifacts

ID: MCP_ANALYST_AGENT-CAPTURE_CANDIDATE_PLAN_AND_REPAIR_TELEMETRY_IN_EVAL_ARTIFACTS
Status: not_started
Priority: p1
Milestone: m3-public-launch
Owner: data-ai-engineer-architect

Problem

The bounded non-one-shot backend is only worth keeping if we can explain how it behaves. If a run improves, we need to know whether planning helped, candidate diversity helped, repair rescued bad drafts, or selection chose better winners. If a run regresses, we need the same visibility.

Without telemetry, bounded non-one-shot evals become a black box that is harder to tune than the one-shot baseline.

Context

The bounded stack task already proposes a sequence of plan -> generate -> validate -> repair -> select. The eval runner already emits structured JSONL and summary artifacts. That gives us a natural place to add more metadata.

Useful telemetry likely includes:

  • planner output or planner summary
  • requested vs produced candidate count
  • candidate validation failures
  • repair attempts and repair outcomes
  • winner source such as original draft vs repaired draft
  • deterministic selection reason

This should stay local-eval friendly. We do not need massive traces or token-level logs. We need enough evidence to understand behavior at run-analysis time.
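As a sketch only, a compact case-level record covering these fields might look like the following. All field names here are hypothetical and would be settled by the telemetry contract, not taken from existing code:

```python
# Hypothetical case-level telemetry record for one bounded eval case.
# Every field name below is illustrative, not the actual contract.
case_telemetry = {
    "planning_mode": "plan_first",            # or "no_plan"
    "plan_summary": "join orders to customers, filter by region",
    "candidates_requested": 4,
    "candidates_produced": 4,
    "candidates_rejected_parse": 1,           # failed SQL parsing
    "candidates_rejected_grounding": 1,       # referenced unknown tables/columns
    "repair_attempts": 2,
    "repairs_succeeded": 1,
    "winner_source": "repaired",              # "original" or "repaired"
    "selection_reason": "lowest_cost_valid",  # deterministic rationale code
}

# Derived at analysis time: candidates surviving validation.
surviving = (
    case_telemetry["candidates_produced"]
    - case_telemetry["candidates_rejected_parse"]
    - case_telemetry["candidates_rejected_grounding"]
)
```

A record this size stays local-eval friendly: one small dict per case in the existing JSONL stream, cheap to diff across runs.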

Possible Solutions

  1. Recommended: extend eval artifacts with compact candidate/repair telemetry. Add structured metadata fields to case-level results and run summaries so dashboards and scripts can analyze what the bounded backend actually did.

Why this is recommended:

  • keeps observability inside the existing eval artifact format
  • makes the bounded backend explainable
  • supports later dashboards and regression analysis without replaying runs
  2. Keep the backend opaque and inspect behavior only through logs.

Trade-off: simpler to implement, but logs are hard to compare across runs and not suitable for leaderboard analysis.

  3. Store full prompts and all intermediate completions.

Trade-off: rich but heavy. That adds a lot of noise and storage cost before we know which fields are truly useful.

Plan

  1. Define a minimal telemetry contract for bounded runs.
  2. Add case-level metadata fields such as:
     • planning mode
     • candidate count requested/produced
     • parse and grounding rejection counts
     • repair attempts
     • winning candidate source
     • selection rationale code
  3. Add summary-level rollups such as:
     • repair rescue rate
     • average surviving candidate count
     • fraction of winners coming from repaired drafts
  4. Wire those fields into result writers and any dashboard queries that should visualize them.
  5. Keep the telemetry compact and deterministic so it remains stable across repeated runs.
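Steps 1–3 above could be sketched as follows. This is a minimal illustration under assumed names (`CaseTelemetry`, `summarize`); the real contract would be defined in apps/evals/sql/types.py and populated by the runner:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CaseTelemetry:
    # Hypothetical minimal per-case contract; all names are illustrative.
    planning_mode: str        # e.g. "plan_first" / "no_plan"
    candidates_requested: int
    candidates_produced: int
    parse_rejections: int
    grounding_rejections: int
    repair_attempts: int
    winner_source: str        # "original" / "repaired" / "none"
    selection_reason: str     # deterministic rationale code

def summarize(cases: list[CaseTelemetry]) -> dict:
    """Sketch of the summary-level rollups named in step 3."""
    solved = [c for c in cases if c.winner_source != "none"]
    repaired_wins = [c for c in solved if c.winner_source == "repaired"]
    rescue_attempted = [c for c in cases if c.repair_attempts > 0]
    surviving = [
        c.candidates_produced - c.parse_rejections - c.grounding_rejections
        for c in cases
    ]
    return {
        # fraction of repair-attempting cases that ended with a repaired winner
        "repair_rescue_rate": (
            len(repaired_wins) / len(rescue_attempted) if rescue_attempted else 0.0
        ),
        "avg_surviving_candidates": sum(surviving) / len(cases) if cases else 0.0,
        "repaired_winner_fraction": (
            len(repaired_wins) / len(solved) if solved else 0.0
        ),
    }
```

Because the rollups are pure functions of the case records, they stay deterministic across repeated runs (step 5) and can be recomputed from stored artifacts without replaying anything.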

Files likely involved

  • apps/evals/sql/types.py
  • apps/evals/sql/runner.py
  • bounded backend code under apps/evals/sql/ and/or dataface/ai/
  • eval dashboard queries and faces for telemetry slices

Success criteria

  • bounded eval artifacts explain how each case was solved or rejected
  • run summaries expose repair and selection behavior
  • later experiments can compare backend internals without re-running with debug logging

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - eval telemetry task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared