Dataface Tasks

Implement two-tier task manager escalation watchdog and optional manager

ID: INFRA_TOOLING-IMPLEMENT_TWO_TIER_TASK_MANAGER_ESCALATION
Status: completed
Priority: p1
Milestone: m1-ft-analytics-analyst-pilot
Owner: sr-engineer-architect
Completed by: dave
Completed: 2026-03-22

Problem

Running a permanent Claude “manager” in tmux is optional for most work: the heartbeat already dispatches ready tasks and writes snapshots. Operators still want clarity on when a human or high-context agent should intervene. Today there is no first-class escalation path from the scripted automation to a manager.

Context

  • Tier 1 (default): scripts/task-manager-heartbeat, task-manager-run, scripts/dispatch, tasks/logs/task_manager/*.snapshot.json, register JSON, dispatch logs under tasks/logs/dispatch-*.log.
  • Tier 2 (escalation): Optional tmux manager (just manage / Jared) or a one-shot claude session with a bundled context pack, only when signals fire.
  • Related exploration: upgrade-jared-task-manager-visibility-orchestration-and-operator-ux.
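As a rough illustration of how a Tier 2 session might pick up Tier 1 output, the sketch below loads the most recently modified snapshot from a log directory. The function name latest_snapshot is hypothetical and not part of the existing scripts; only the *.snapshot.json naming comes from the paths listed above.

```python
import json
from pathlib import Path
from typing import Optional


def latest_snapshot(log_dir: Path) -> Optional[dict]:
    """Return the newest *.snapshot.json in log_dir as a dict, or None if absent.

    Hypothetical helper: "newest" is judged by file mtime, which is an
    assumption about how snapshots are rotated.
    """
    candidates = sorted(
        log_dir.glob("*.snapshot.json"),
        key=lambda path: path.stat().st_mtime,
    )
    if not candidates:
        return None
    return json.loads(candidates[-1].read_text(encoding="utf-8"))
```

A caller would pass Path("tasks/logs/task_manager") and hand the resulting dict to whatever builds the manager's context pack.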

Possible Solutions

  • A — Signal-only (docs + snapshot fields): Define escalation signals and surface them in snapshot.json / heartbeat text; operator spawns manager manually. Lowest code, less automation.
  • B — Watchdog flags + notification: Tier 1 sets needs_escalation (or extends needs_attention) with reasons; optional desktop notification or log line; still manual manager.
  • C — Automated manager spawn (opt-in): On critical signals, run a configured command (e.g. open tmux window or claude with prompt file). Recommended direction only if guarded (rate limit, dry-run mode) to avoid spawn storms.

Tag B as the pragmatic first pass unless product explicitly wants C.
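If option C is ever pursued, the guard it calls for could look roughly like the sketch below: a cooldown between spawns plus a dry-run default. SpawnGuard, its field names, and the 15-minute cooldown are all assumptions made for illustration; none of them exist in the current scripts.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SpawnGuard:
    """Hypothetical guard for option C: limits how often a manager is spawned."""

    min_interval_s: float = 900.0  # assumed cooldown between spawns
    dry_run: bool = True           # report only, unless explicitly disabled
    clock: Callable[[], float] = time.monotonic
    _last_spawn: Optional[float] = field(default=None, init=False)

    def maybe_spawn(
        self, command: List[str], runner: Callable[[List[str]], None]
    ) -> str:
        """Run `command` via `runner` unless rate-limited or in dry-run mode."""
        now = self.clock()
        if (
            self._last_spawn is not None
            and now - self._last_spawn < self.min_interval_s
        ):
            return "rate_limited"
        # Record the attempt even in dry-run so a dry run reports the same
        # decisions a real run would make.
        self._last_spawn = now
        if self.dry_run:
            return "dry_run"
        runner(command)
        return "spawned"
```

Passing the runner in (for example, a thin wrapper around subprocess.run) keeps the guard testable without opening a real tmux window.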

Plan

  • [x] Document canonical escalation signals (e.g. repeated dispatch non-zero for same slug, in_progress stale by task mtime / started_at, dispatch log idle, register/worktree mismatch).
  • [x] Implement detection in task_manager_lib.py / task-manager-heartbeat (or small helper module); append structured reasons to snapshot payload and text summary.
  • [x] Make optional tmux manager start path clearly non-default or triggered only when TASK_MANAGER_ESCALATION=1 / config flag (exact mechanism TBD).
  • [x] Add tests under tests/scripts/test_task_manager_scripts.py for at least one escalation signal.
  • [x] Update .codex/skills/task-manager/SKILL.md (or AGENTS) to describe two-tier operator model.
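To make the detection step concrete, here is a minimal sketch of how two of the signals could be computed from per-task state. It is not the code in task_manager_lib.py: TaskState, its fields, the one-hour idle threshold, and the two-failure cutoff are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskState:
    """Hypothetical view of one task as the watchdog might see it."""

    slug: str
    status: str
    last_activity_ts: float         # e.g. task mtime or started_at, in seconds
    dispatch_exit_codes: List[int]  # oldest first, most recent last


def escalation_reasons(
    task: TaskState,
    now: float,
    idle_threshold_s: float = 3600.0,  # assumed idle threshold
    max_failures: int = 2,             # assumed repeat-failure cutoff (>= 1)
) -> List[str]:
    """Return structured reason codes for a single task (empty if healthy)."""
    reasons: List[str] = []
    # Stale in_progress: no activity for longer than the idle threshold.
    if task.status == "in_progress" and now - task.last_activity_ts > idle_threshold_s:
        reasons.append("stuck_in_progress")
    # Repeated non-zero dispatch exits for the same slug.
    recent = task.dispatch_exit_codes[-max_failures:]
    if len(recent) == max_failures and all(code != 0 for code in recent):
        reasons.append("dispatch_failed")
    return reasons
```

Returning plain string codes keeps the result easy to append to both the snapshot payload and the text summary.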

Implementation Progress

  • Added structured escalation signaling in scripts/task_manager_lib.py:
      • per-task escalation_reasons[] and needs_attention[] in snapshot tasks
      • top-level escalation.required, escalation.signals[], counts.escalation_signals
      • text heartbeat now includes an Escalation signals section when active
  • Implemented minimal watchdog detections for first pass:
      • dispatch_failed / dispatch_interrupted / dispatch_stalled
      • stuck_in_progress (task is in_progress and stale by idle threshold)
      • retained worker_idle / pickup_overdue as structured reason codes
  • Added merge-conflict prevention for started_at churn:
      • scripts/task-manager-heartbeat --dispatch-ready no longer edits task frontmatter by default
      • optional opt-in remains via --mark-started-on-dispatch or TASK_MANAGER_MARK_STARTED_ON_DISPATCH=1
  • Added/updated tests in tests/scripts/test_task_manager_scripts.py:
      • structured escalation snapshot + text assertions
      • default dispatch no longer mutates task status/frontmatter
      • explicit opt-in path still marks task started
  • Follow-on merge context:
      • PR #705 conflict resolution kept this task's escalation schema and --mark-started-on-dispatch opt-in semantics while adding PR-CI probe + .worktrees defaults.
      • Task-manager heartbeat now reflects both layers on main.
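The snapshot fields named above could be assembled along the lines of the sketch below. Only the key names escalation_reasons, escalation.required, escalation.signals, and counts.escalation_signals come from this task; the function names, the shape of each signal entry, and the text layout are assumptions, and needs_attention is omitted for brevity.

```python
from typing import Dict, List


def build_snapshot(task_reasons: Dict[str, List[str]]) -> dict:
    """Assemble an escalation-aware snapshot from per-task reason codes.

    Hypothetical helper: the {"slug", "reason"} entry shape is an assumption.
    """
    signals = [
        {"slug": slug, "reason": reason}
        for slug, reasons in sorted(task_reasons.items())
        for reason in reasons
    ]
    return {
        "tasks": [
            {"slug": slug, "escalation_reasons": list(reasons)}
            for slug, reasons in sorted(task_reasons.items())
        ],
        "escalation": {"required": bool(signals), "signals": signals},
        "counts": {"escalation_signals": len(signals)},
    }


def render_text(snapshot: dict) -> str:
    """Render a text summary; the signals section appears only when active."""
    lines = [f"tasks: {len(snapshot['tasks'])}"]
    if snapshot["escalation"]["required"]:
        lines.append("Escalation signals:")
        lines.extend(
            f"  - {signal['slug']}: {signal['reason']}"
            for signal in snapshot["escalation"]["signals"]
        )
    return "\n".join(lines)
```

Deriving escalation.required and the count from the same signals list keeps the three top-level fields from drifting apart.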

QA Exploration

  • N/A (non-UI infra/scripts task)
  • [x] QA exploration completed (or N/A for non-UI tasks)

Review Feedback

  • Review rounds completed via just review; follow-up fixes applied before merge.
  • [x] Review cleared