Design escalation-manager agent for stuck task triage and restart decisions
Problem
Define a future operator/agent layer above heartbeat that can inspect stale tasks, summarize likely failure modes, recommend restart vs takeover vs cleanup, and optionally perform bounded reconciliation when heartbeat alone is insufficient. Start with CLI/spec-first interfaces so heartbeat and other agents can call the same triage tools consistently.
Context
The task manager heartbeat (scripts/task-system/heartbeat) already detects stuck tasks and emits escalation signals: dispatch_failed, dispatch_stalled, stuck_in_progress, worker_idle, pickup_overdue. The in-progress reconciliation task adds metadata_drift, register_orphan, and dispatch_completed_unreconciled signals. These signals surface in the heartbeat snapshot JSON and tasks server UI under "Needs attention" and "Escalation signals."
Today, acting on these signals is fully manual — the operator reads the heartbeat, inspects dispatch logs, classifies state, and decides whether to resume, nuke+retry, or escalate. The friction log (tasks/logs/task-manager-friction-log-2026-03-24.md) documents 6 concrete stuck-task incidents. The cbox execution issues log (tasks/logs/cbox-execution-issues.md) documents 40+ sandbox failure modes including interactive stalls, hung tool calls, analysis loops, and missing-runtime dead ends.
Key files:
- scripts/task_manager_lib.py — heartbeat classification, reconciliation functions, register management
- scripts/task-system/heartbeat — main heartbeat loop emitting snapshots
- scripts/task-system/dispatch-watch — worker polling and state inspection
- .codex/skills/task-manager/SKILL.md — blocker recovery playbook (manual)
- tasks/logs/task-manager-friction-log-2026-03-24.md — concrete stuck-task incidents
- tasks/logs/cbox-execution-issues.md — sandbox failure taxonomy
Constraints: - Heartbeat must never mutate task frontmatter (advisory signals only) - Escalation manager must be CLI-first so heartbeat and other agents can call the same triage tools - Start with read-only triage (classify + recommend); bounded write actions (restart, cleanup) are opt-in - Must work with local worktree dispatch (current) and be forward-compatible with sandbox dispatch
Possible Solutions
A. Inline escalation logic in heartbeat
Add triage decision trees directly into heartbeat's classification loop. Simple but: - Bloats heartbeat with action-taking responsibility - Heartbeat is read-only by design; mixing in restarts violates that - Hard to test triage logic independently
B. Escalation manager as a CLI tool + skill — Recommended
A new scripts/task-system/escalate CLI entry point plus a .codex/skills/escalation-manager/SKILL.md skill definition. The escalation manager:
1. Reads the latest heartbeat snapshot to find tasks with escalation signals
2. For each escalated task, runs a triage pipeline: inspect dispatch log → classify failure mode → recommend action
3. Outputs a structured triage report (JSON + human-readable)
4. Optionally executes bounded recovery actions when --act is passed
This keeps heartbeat read-only, makes triage independently testable, and lets both humans and agents invoke the same interface.
C. Autonomous escalation daemon
A long-running process that watches for escalation signals and auto-acts. Overkill for current scale (2 operators, <20 concurrent tasks). The heartbeat loop already runs every 3 minutes; a CLI invoked by heartbeat achieves the same cadence without a separate daemon.
Plan
Deliverable: spec + CLI interface + skill definition + triage library. No autonomous action-taking in v1.
Files to create
.codex/skills/escalation-manager/SKILL.md— skill definition with triage playbookscripts/task-system/escalate— CLI entry point for triagescripts/escalation_lib.py— triage classification and recommendation logictests/scripts/test_escalation_lib.py— TDD tests for triage functions
Implementation steps
- Define the failure mode taxonomy (from friction log + cbox issues log)
- Write the skill definition (
.codex/skills/escalation-manager/SKILL.md) - Write failing tests for triage classification functions
- Implement
scripts/escalation_lib.pywith triage functions - Wire up
scripts/task-system/escalateCLI - Validate with
just task validateand targeted tests - Mark QA N/A (non-UI), update task, create PR
Implementation Progress
Failure Mode Taxonomy
Derived from friction log incidents and cbox execution issues, grouped by triage action:
| Failure Mode | Signal Source | Recommended Action |
|---|---|---|
worker_gone — dispatch process disappeared |
heartbeat watchdog | Inspect log tail → restart if no progress, resume if partial |
stuck_in_progress — task active too long with no commits |
heartbeat watchdog | Check worktree for uncommitted changes → resume or restart |
dispatch_stalled — worker running but no output |
heartbeat watchdog | Check if interactive prompt blocking → send interrupt or restart |
dispatch_completed_unreconciled — exit 0 but not completed |
reconciliation | Check worktree for PR/commits → reconcile metadata or flag |
metadata_drift — worktree task newer than root |
reconciliation | Copy frontmatter from worktree to root |
register_orphan — register entry with no worktree |
reconciliation | Prune register entry |
interactive_stall — worker blocked on prompt |
dispatch log inspection | Send interrupt, restart with stricter flags |
analysis_loop — worker exploring without edits |
dispatch log inspection | Restart with narrower prompt |
review_hang — review process stalled |
dispatch log inspection | Kill review, restart or skip with justification |
pickup_overdue — ready task not dispatched |
heartbeat watchdog | Check dependencies → dispatch or flag blocker |
Triage Pipeline
Each escalated task goes through:
1. read_escalation_signals(snapshot) → list of (slug, signals)
2. inspect_dispatch_log(slug) → log_tail, exit_code, last_activity_age
3. inspect_worktree_state(slug) → has_uncommitted, has_pr, commit_count, branch_exists
4. classify_failure_mode(signals, log_state, worktree_state) → FailureMode enum
5. recommend_action(failure_mode) → Action enum + rationale string
Action Enum (v1 — advisory only)
class TriageAction(str, Enum):
RESTART = "restart" # Nuke worktree + branch, re-dispatch
RESUME = "resume" # Send follow-up command to existing worker
RECONCILE = "reconcile" # Sync metadata from worktree to root
PRUNE = "prune" # Remove stale register/worktree entry
ESCALATE_HUMAN = "escalate" # Cannot auto-resolve, needs human
SKIP = "skip" # Signal is transient, no action needed
Skill Definition
Created .codex/skills/escalation-manager/SKILL.md — defines the triage playbook, CLI interface, and integration points with heartbeat.
CLI Interface
scripts/task-system/escalate [--owner OWNER] [--format text|json] [--act] [--slug SLUG]
--owner Filter to tasks owned by this manager (default: current user)
--format Output format (default: text)
--act Execute recommended actions (v1: reconcile + prune only)
--slug Triage a single task instead of all escalated tasks
Library Functions
scripts/escalation_lib.py provides:
- read_escalation_signals(snapshot_path) -> list[EscalatedTask]
- inspect_dispatch_log(slug) -> DispatchLogState
- inspect_worktree_state(slug) -> WorktreeState
- classify_failure_mode(signals, log_state, wt_state) -> FailureMode
- recommend_action(mode) -> TriageRecommendation
- triage_all(owner) -> list[TriageReport] (orchestrates the pipeline)
- execute_action(report) -> ActionResult (for --act mode, v1: reconcile + prune only)
Tests
See tests/scripts/test_escalation_lib.py for TDD tests covering classification and recommendation logic.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
- N/A: This is a design/spec task with no UI changes. Deliverables are the skill definition, library, CLI, and tests.
Review Feedback
- [ ] Review cleared