Capture candidate plan and repair telemetry in eval artifacts
Problem
The bounded non-one-shot backend is only worth keeping if we can explain how it behaves. If a run improves, we need to know whether planning helped, candidate diversity helped, repair rescued bad drafts, or selection chose better winners. If a run regresses, we need the same visibility.
Without telemetry, bounded non-one-shot evals become a black box that is harder to tune than the one-shot baseline.
Context
The bounded stack task already proposes a sequence of plan -> generate -> validate -> repair -> select. The eval runner already emits structured JSONL and summary artifacts. That gives us a natural place to add more metadata.
Useful telemetry likely includes:
- planner output or planner summary
- requested vs produced candidate count
- candidate validation failures
- repair attempts and repair outcomes
- winner source such as original draft vs repaired draft
- deterministic selection reason
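The fields above could be collected into one compact per-case record. A minimal sketch, assuming a Python eval runner; every field name here is a hypothetical illustration, not an existing schema:

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical per-case telemetry record for a bounded non-one-shot run.
# All field names are assumptions chosen to mirror the list above.
@dataclass
class CaseTelemetry:
    planning_mode: str          # planner output summary or mode label
    candidates_requested: int   # requested candidate count
    candidates_produced: int    # produced candidate count
    validation_failures: int    # candidates rejected by validation
    repair_attempts: int        # how many repair passes ran
    repair_successes: int       # how many repairs yielded a usable draft
    winner_source: str          # "draft" or "repaired"
    selection_reason: str       # deterministic selection rationale code

record = CaseTelemetry(
    planning_mode="outline",
    candidates_requested=4,
    candidates_produced=4,
    validation_failures=1,
    repair_attempts=1,
    repair_successes=1,
    winner_source="repaired",
    selection_reason="highest_validation_score",
)
print(json.dumps(asdict(record)))  # one compact JSONL line per case
```

A flat record like this stays cheap to emit and easy to diff across runs.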
This should stay local-eval friendly. We do not need massive traces or token-level logs. We need enough evidence to understand behavior at run-analysis time.
Possible Solutions
- Recommended: extend eval artifacts with compact candidate/repair telemetry. Add structured metadata fields to case-level results and run summaries so dashboards and scripts can analyze what the bounded backend actually did.
Why this is recommended:
- keeps observability inside the existing eval artifact format
- makes the bounded backend explainable
- supports later dashboards and regression analysis without replaying runs
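One low-risk way to extend the artifact format is to nest the new telemetry under a single key in each existing case record, so older readers that ignore unknown keys keep working. A sketch under that assumption; the writer function, surrounding fields, and key names are hypothetical, not the project's actual result schema:

```python
import json
from typing import TextIO

def write_case_result(fh: TextIO, case_id: str, passed: bool,
                      telemetry: dict) -> None:
    """Append one case-level JSONL record with nested bounded-run telemetry.

    All field names here are illustrative assumptions.
    """
    record = {
        "case_id": case_id,
        "passed": passed,
        # Nested under one key so existing consumers of the JSONL
        # artifacts are unaffected by the new fields.
        "bounded_telemetry": telemetry,
    }
    # sort_keys keeps output byte-stable across repeated runs.
    fh.write(json.dumps(record, sort_keys=True) + "\n")
```

Usage would be one call per case from the result writer, e.g. `write_case_result(out, "case-17", True, {"repair_attempts": 1})`.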
- Keep the backend opaque and inspect behavior only through logs.
Trade-off: simpler to implement, but logs are hard to compare across runs and not suitable for leaderboard analysis.
- Store full prompts and all intermediate completions.
Trade-off: rich but heavy. That adds a lot of noise and storage cost before we know which fields are truly useful.
Plan
- Define a minimal telemetry contract for bounded runs.
- Add case-level metadata fields such as:
  - planning mode
  - candidate count requested/produced
  - parse and grounding rejection counts
  - repair attempts
  - winning candidate source
  - selection rationale code
- Add summary-level rollups such as:
  - repair rescue rate
  - average surviving candidate count
  - fraction of winners coming from repaired drafts
- Wire those fields into result writers and any dashboard queries that should visualize them.
- Keep the telemetry compact and deterministic so it remains stable across repeated runs.
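The summary-level rollups in the plan can be derived purely from the per-case records, so no extra state is needed at run time. A minimal sketch; the input field names (`repair_attempts`, `repair_successes`, `candidates_produced`, `parse_failures`, `winner_source`) are illustrative assumptions, not an existing schema:

```python
from statistics import mean

def summarize(cases: list[dict]) -> dict:
    """Compute run-summary rollups from per-case telemetry dicts."""
    repaired = [c for c in cases if c["repair_attempts"] > 0]
    rescued = [c for c in repaired if c["repair_successes"] > 0]
    return {
        # fraction of repair-eligible cases where repair produced a usable draft
        "repair_rescue_rate": len(rescued) / len(repaired) if repaired else 0.0,
        # candidates left after validation, averaged across cases
        "avg_surviving_candidates": mean(
            c["candidates_produced"] - c["parse_failures"] for c in cases
        ),
        # fraction of winners that came from repaired drafts
        "repaired_winner_fraction": sum(
            c["winner_source"] == "repaired" for c in cases
        ) / len(cases),
    }

cases = [
    {"repair_attempts": 1, "repair_successes": 1,
     "candidates_produced": 4, "parse_failures": 1, "winner_source": "repaired"},
    {"repair_attempts": 0, "repair_successes": 0,
     "candidates_produced": 4, "parse_failures": 0, "winner_source": "draft"},
]
print(summarize(cases))
# {'repair_rescue_rate': 1.0, 'avg_surviving_candidates': 3.5, 'repaired_winner_fraction': 0.5}
```

Because the rollups are pure functions of the JSONL records, later experiments can recompute them from stored artifacts without replaying runs.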
Files likely involved
- apps/evals/sql/types.py
- apps/evals/sql/runner.py
- bounded backend code under apps/evals/sql/ and/or dataface/ai/
- eval dashboard queries and faces for telemetry slices
Success criteria
- bounded eval artifacts explain how each case was solved or rejected
- run summaries expose repair and selection behavior
- later experiments can compare backend internals without re-running with debug logging
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - eval telemetry task.
Review Feedback
No review feedback yet.
- [ ] Review cleared