Design escalation-manager agent for stuck task triage and restart decisions

ID	INFRA_TOOLING-DESIGN_ESCALATION_MANAGER_AGENT_FOR_STUCK_TASK_TRIAGE_AND_RESTART_DECISIONS
Status	completed
Priority	p2
Milestone	m1-ft-analytics-analyst-pilot
Owner	sr-engineer-architect
Completed by	dave
Completed	2026-03-24

Problem

Define a future operator/agent layer above heartbeat that can inspect stale tasks, summarize likely failure modes, recommend restart vs takeover vs cleanup, and optionally perform bounded reconciliation when heartbeat alone is insufficient. Start with CLI/spec-first interfaces so heartbeat and other agents can call the same triage tools consistently.

Context

The task manager heartbeat (scripts/task-system/heartbeat) already detects stuck tasks and emits escalation signals: dispatch_failed, dispatch_stalled, stuck_in_progress, worker_idle, pickup_overdue. The in-progress reconciliation task adds metadata_drift, register_orphan, and dispatch_completed_unreconciled signals. These signals surface in the heartbeat snapshot JSON and tasks server UI under "Needs attention" and "Escalation signals."

Today, acting on these signals is fully manual — the operator reads the heartbeat, inspects dispatch logs, classifies state, and decides whether to resume, nuke+retry, or escalate. The friction log (tasks/logs/task-manager-friction-log-2026-03-24.md) documents 6 concrete stuck-task incidents. The cbox execution issues log (tasks/logs/cbox-execution-issues.md) documents 40+ sandbox failure modes including interactive stalls, hung tool calls, analysis loops, and missing-runtime dead ends.

Key files: - scripts/task_manager_lib.py — heartbeat classification, reconciliation functions, register management - scripts/task-system/heartbeat — main heartbeat loop emitting snapshots - scripts/task-system/dispatch-watch — worker polling and state inspection - .codex/skills/task-manager/SKILL.md — blocker recovery playbook (manual) - tasks/logs/task-manager-friction-log-2026-03-24.md — concrete stuck-task incidents - tasks/logs/cbox-execution-issues.md — sandbox failure taxonomy

Constraints: - Heartbeat must never mutate task frontmatter (advisory signals only) - Escalation manager must be CLI-first so heartbeat and other agents can call the same triage tools - Start with read-only triage (classify + recommend); bounded write actions (restart, cleanup) are opt-in - Must work with local worktree dispatch (current) and be forward-compatible with sandbox dispatch

Possible Solutions

A. Inline escalation logic in heartbeat

Add triage decision trees directly into heartbeat's classification loop. Simple but: - Bloats heartbeat with action-taking responsibility - Heartbeat is read-only by design; mixing in restarts violates that - Hard to test triage logic independently

B. Escalation manager as a CLI tool + skill — Recommended

A new scripts/task-system/escalate CLI entry point plus a .codex/skills/escalation-manager/SKILL.md skill definition. The escalation manager: 1. Reads the latest heartbeat snapshot to find tasks with escalation signals 2. For each escalated task, runs a triage pipeline: inspect dispatch log → classify failure mode → recommend action 3. Outputs a structured triage report (JSON + human-readable) 4. Optionally executes bounded recovery actions when --act is passed

This keeps heartbeat read-only, makes triage independently testable, and lets both humans and agents invoke the same interface.

C. Autonomous escalation daemon

A long-running process that watches for escalation signals and auto-acts. Overkill for current scale (2 operators, <20 concurrent tasks). The heartbeat loop already runs every 3 minutes; a CLI invoked by heartbeat achieves the same cadence without a separate daemon.

Plan

Deliverable: spec + CLI interface + skill definition + triage library. No autonomous action-taking in v1.

Files to create

.codex/skills/escalation-manager/SKILL.md — skill definition with triage playbook
scripts/task-system/escalate — CLI entry point for triage
scripts/escalation_lib.py — triage classification and recommendation logic
tests/scripts/test_escalation_lib.py — TDD tests for triage functions

Implementation steps

Define the failure mode taxonomy (from friction log + cbox issues log)
Write the skill definition (.codex/skills/escalation-manager/SKILL.md)
Write failing tests for triage classification functions
Implement scripts/escalation_lib.py with triage functions
Wire up scripts/task-system/escalate CLI
Validate with just task validate and targeted tests
Mark QA N/A (non-UI), update task, create PR

Implementation Progress

Failure Mode Taxonomy

Derived from friction log incidents and cbox execution issues, grouped by triage action:

Failure Mode	Signal Source	Recommended Action
`worker_gone` — dispatch process disappeared	heartbeat watchdog	Inspect log tail → restart if no progress, resume if partial
`stuck_in_progress` — task active too long with no commits	heartbeat watchdog	Check worktree for uncommitted changes → resume or restart
`dispatch_stalled` — worker running but no output	heartbeat watchdog	Check if interactive prompt blocking → send interrupt or restart
`dispatch_completed_unreconciled` — exit 0 but not completed	reconciliation	Check worktree for PR/commits → reconcile metadata or flag
`metadata_drift` — worktree task newer than root	reconciliation	Copy frontmatter from worktree to root
`register_orphan` — register entry with no worktree	reconciliation	Prune register entry
`interactive_stall` — worker blocked on prompt	dispatch log inspection	Send interrupt, restart with stricter flags
`analysis_loop` — worker exploring without edits	dispatch log inspection	Restart with narrower prompt
`review_hang` — review process stalled	dispatch log inspection	Kill review, restart or skip with justification
`pickup_overdue` — ready task not dispatched	heartbeat watchdog	Check dependencies → dispatch or flag blocker

Triage Pipeline

Each escalated task goes through:

1. read_escalation_signals(snapshot)    → list of (slug, signals)
2. inspect_dispatch_log(slug)           → log_tail, exit_code, last_activity_age
3. inspect_worktree_state(slug)         → has_uncommitted, has_pr, commit_count, branch_exists
4. classify_failure_mode(signals, log_state, worktree_state) → FailureMode enum
5. recommend_action(failure_mode)       → Action enum + rationale string

Action Enum (v1 — advisory only)

class TriageAction(str, Enum):
    RESTART = "restart"           # Nuke worktree + branch, re-dispatch
    RESUME = "resume"             # Send follow-up command to existing worker
    RECONCILE = "reconcile"       # Sync metadata from worktree to root
    PRUNE = "prune"               # Remove stale register/worktree entry
    ESCALATE_HUMAN = "escalate"   # Cannot auto-resolve, needs human
    SKIP = "skip"                 # Signal is transient, no action needed

Skill Definition

Created .codex/skills/escalation-manager/SKILL.md — defines the triage playbook, CLI interface, and integration points with heartbeat.

CLI Interface

scripts/task-system/escalate [--owner OWNER] [--format text|json] [--act] [--slug SLUG]

  --owner   Filter to tasks owned by this manager (default: current user)
  --format  Output format (default: text)
  --act     Execute recommended actions (v1: reconcile + prune only)
  --slug    Triage a single task instead of all escalated tasks

Library Functions

scripts/escalation_lib.py provides: - read_escalation_signals(snapshot_path) -> list[EscalatedTask] - inspect_dispatch_log(slug) -> DispatchLogState - inspect_worktree_state(slug) -> WorktreeState - classify_failure_mode(signals, log_state, wt_state) -> FailureMode - recommend_action(mode) -> TriageRecommendation - triage_all(owner) -> list[TriageReport] (orchestrates the pipeline) - execute_action(report) -> ActionResult (for --act mode, v1: reconcile + prune only)

Tests

See tests/scripts/test_escalation_lib.py for TDD tests covering classification and recommendation logic.

QA Exploration

[x] QA exploration completed (or N/A for non-UI tasks)
N/A: This is a design/spec task with no UI changes. Deliverables are the skill definition, library, CLI, and tests.

Review Feedback

[ ] Review cleared