Dataface Tasks

Design escalation-manager agent for stuck task triage and restart decisions

IDINFRA_TOOLING-DESIGN_ESCALATION_MANAGER_AGENT_FOR_STUCK_TASK_TRIAGE_AND_RESTART_DECISIONS
Statuscompleted
Priorityp2
Milestonem1-ft-analytics-analyst-pilot
Ownersr-engineer-architect
Completed bydave
Completed2026-03-24

Problem

Define a future operator/agent layer above heartbeat that can inspect stale tasks, summarize likely failure modes, recommend restart vs takeover vs cleanup, and optionally perform bounded reconciliation when heartbeat alone is insufficient. Start with CLI/spec-first interfaces so heartbeat and other agents can call the same triage tools consistently.

Context

The task manager heartbeat (scripts/task-system/heartbeat) already detects stuck tasks and emits escalation signals: dispatch_failed, dispatch_stalled, stuck_in_progress, worker_idle, pickup_overdue. The in-progress reconciliation task adds metadata_drift, register_orphan, and dispatch_completed_unreconciled signals. These signals surface in the heartbeat snapshot JSON and tasks server UI under "Needs attention" and "Escalation signals."

Today, acting on these signals is fully manual — the operator reads the heartbeat, inspects dispatch logs, classifies state, and decides whether to resume, nuke+retry, or escalate. The friction log (tasks/logs/task-manager-friction-log-2026-03-24.md) documents 6 concrete stuck-task incidents. The cbox execution issues log (tasks/logs/cbox-execution-issues.md) documents 40+ sandbox failure modes including interactive stalls, hung tool calls, analysis loops, and missing-runtime dead ends.

Key files: - scripts/task_manager_lib.py — heartbeat classification, reconciliation functions, register management - scripts/task-system/heartbeat — main heartbeat loop emitting snapshots - scripts/task-system/dispatch-watch — worker polling and state inspection - .codex/skills/task-manager/SKILL.md — blocker recovery playbook (manual) - tasks/logs/task-manager-friction-log-2026-03-24.md — concrete stuck-task incidents - tasks/logs/cbox-execution-issues.md — sandbox failure taxonomy

Constraints: - Heartbeat must never mutate task frontmatter (advisory signals only) - Escalation manager must be CLI-first so heartbeat and other agents can call the same triage tools - Start with read-only triage (classify + recommend); bounded write actions (restart, cleanup) are opt-in - Must work with local worktree dispatch (current) and be forward-compatible with sandbox dispatch

Possible Solutions

A. Inline escalation logic in heartbeat

Add triage decision trees directly into heartbeat's classification loop. Simple but: - Bloats heartbeat with action-taking responsibility - Heartbeat is read-only by design; mixing in restarts violates that - Hard to test triage logic independently

A new scripts/task-system/escalate CLI entry point plus a .codex/skills/escalation-manager/SKILL.md skill definition. The escalation manager: 1. Reads the latest heartbeat snapshot to find tasks with escalation signals 2. For each escalated task, runs a triage pipeline: inspect dispatch log → classify failure mode → recommend action 3. Outputs a structured triage report (JSON + human-readable) 4. Optionally executes bounded recovery actions when --act is passed

This keeps heartbeat read-only, makes triage independently testable, and lets both humans and agents invoke the same interface.

C. Autonomous escalation daemon

A long-running process that watches for escalation signals and auto-acts. Overkill for current scale (2 operators, <20 concurrent tasks). The heartbeat loop already runs every 3 minutes; a CLI invoked by heartbeat achieves the same cadence without a separate daemon.

Plan

Deliverable: spec + CLI interface + skill definition + triage library. No autonomous action-taking in v1.

Files to create

  • .codex/skills/escalation-manager/SKILL.md — skill definition with triage playbook
  • scripts/task-system/escalate — CLI entry point for triage
  • scripts/escalation_lib.py — triage classification and recommendation logic
  • tests/scripts/test_escalation_lib.py — TDD tests for triage functions

Implementation steps

  1. Define the failure mode taxonomy (from friction log + cbox issues log)
  2. Write the skill definition (.codex/skills/escalation-manager/SKILL.md)
  3. Write failing tests for triage classification functions
  4. Implement scripts/escalation_lib.py with triage functions
  5. Wire up scripts/task-system/escalate CLI
  6. Validate with just task validate and targeted tests
  7. Mark QA N/A (non-UI), update task, create PR

Implementation Progress

Failure Mode Taxonomy

Derived from friction log incidents and cbox execution issues, grouped by triage action:

Failure Mode Signal Source Recommended Action
worker_gone — dispatch process disappeared heartbeat watchdog Inspect log tail → restart if no progress, resume if partial
stuck_in_progress — task active too long with no commits heartbeat watchdog Check worktree for uncommitted changes → resume or restart
dispatch_stalled — worker running but no output heartbeat watchdog Check if interactive prompt blocking → send interrupt or restart
dispatch_completed_unreconciled — exit 0 but not completed reconciliation Check worktree for PR/commits → reconcile metadata or flag
metadata_drift — worktree task newer than root reconciliation Copy frontmatter from worktree to root
register_orphan — register entry with no worktree reconciliation Prune register entry
interactive_stall — worker blocked on prompt dispatch log inspection Send interrupt, restart with stricter flags
analysis_loop — worker exploring without edits dispatch log inspection Restart with narrower prompt
review_hang — review process stalled dispatch log inspection Kill review, restart or skip with justification
pickup_overdue — ready task not dispatched heartbeat watchdog Check dependencies → dispatch or flag blocker

Triage Pipeline

Each escalated task goes through:

1. read_escalation_signals(snapshot)    → list of (slug, signals)
2. inspect_dispatch_log(slug)           → log_tail, exit_code, last_activity_age
3. inspect_worktree_state(slug)         → has_uncommitted, has_pr, commit_count, branch_exists
4. classify_failure_mode(signals, log_state, worktree_state) → FailureMode enum
5. recommend_action(failure_mode)       → Action enum + rationale string

Action Enum (v1 — advisory only)

class TriageAction(str, Enum):
    RESTART = "restart"           # Nuke worktree + branch, re-dispatch
    RESUME = "resume"             # Send follow-up command to existing worker
    RECONCILE = "reconcile"       # Sync metadata from worktree to root
    PRUNE = "prune"               # Remove stale register/worktree entry
    ESCALATE_HUMAN = "escalate"   # Cannot auto-resolve, needs human
    SKIP = "skip"                 # Signal is transient, no action needed

Skill Definition

Created .codex/skills/escalation-manager/SKILL.md — defines the triage playbook, CLI interface, and integration points with heartbeat.

CLI Interface

scripts/task-system/escalate [--owner OWNER] [--format text|json] [--act] [--slug SLUG]

  --owner   Filter to tasks owned by this manager (default: current user)
  --format  Output format (default: text)
  --act     Execute recommended actions (v1: reconcile + prune only)
  --slug    Triage a single task instead of all escalated tasks

Library Functions

scripts/escalation_lib.py provides: - read_escalation_signals(snapshot_path) -> list[EscalatedTask] - inspect_dispatch_log(slug) -> DispatchLogState - inspect_worktree_state(slug) -> WorktreeState - classify_failure_mode(signals, log_state, wt_state) -> FailureMode - recommend_action(mode) -> TriageRecommendation - triage_all(owner) -> list[TriageReport] (orchestrates the pipeline) - execute_action(report) -> ActionResult (for --act mode, v1: reconcile + prune only)

Tests

See tests/scripts/test_escalation_lib.py for TDD tests covering classification and recommendation logic.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)
  • N/A: This is a design/spec task with no UI changes. Deliverables are the skill definition, library, CLI, and tests.

Review Feedback

  • [ ] Review cleared