Harden worker continuity and remove tmux manager surface
Problem
Task execution currently has two operator pain points:
- Worker continuity is weak. A dispatched worker is effectively one-shot (`claude -p`), while operators often need to send follow-up instructions mid-task. This leads to accidental parallel workers in one worktree or unclear restart semantics.
- The tmux/Jared surface (`just jared`, `task-manager-start/ensure/stop`) adds a second control plane with poor visibility and low operator value. Most real usage is heartbeat + status snapshots + tasks server.
We need a single-worker, single-worktree model with explicit update/continue mechanics and clear collision protection.
Context
- Current dispatch flow:
  - `scripts/dispatch` starts a one-shot worker process and writes `dispatch-*.log` + `dispatch-*.status.json`.
  - `scripts/dispatch-watch` is the main status/polling path.
  - Heartbeat loop (`scripts/task-manager-run` + `scripts/task-manager-heartbeat`) handles queue pickup and snapshots.
- Legacy tmux manager flow:
  - `just jared` + `scripts/task-manager-start/ensure/stop` maintain a tmux-backed operator shell.
  - This overlaps with heartbeat and creates two operational modes.
- Existing rule from task-manager skill:
  - `1 task = 1 worktree = 1 branch = 1 PR`.
  - We should preserve this rule and make collisions impossible by default.
Possible Solutions
- Recommended: Keep one-shot worker execution, but add explicit continuation/update commands and collision guards:
  - Add worker lock/state for a worktree/slug (single active worker).
  - Add `dispatch-update/dispatch-continue` semantics that either:
    - queue a follow-up prompt after current run exits, or
    - interrupt and restart safely in the same worktree.
  - Persist prompt/run history per task (`tasks/logs/dispatch-*.history.jsonl`) for context handoff.
  - Remove tmux/Jared commands and docs from operator path.
- Alternative: switch to a persistent interactive worker session per task.
  - Pro: direct “message same agent” behavior.
  - Con: higher complexity, harder crash recovery, more state coupling.
- Alternative: keep tmux path as optional.
  - Con: keeps two control planes and operator confusion.
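The single-active-worker guard in the recommended option could be sketched roughly as follows. This is only an illustration under stated assumptions, not the repository's implementation: the lock directory layout, the `.lock` file name, the JSON fields, and the names `acquire_worker_lock`, `release_worker_lock`, and `WorkerAlreadyActive` are all hypothetical.

```python
import json
import os
import time
from pathlib import Path


class WorkerAlreadyActive(RuntimeError):
    """Raised when a lock file for the slug already exists (hypothetical name)."""


def acquire_worker_lock(lock_dir: Path, slug: str, pid: int) -> Path:
    """Atomically create <lock_dir>/<slug>.lock, or raise if it is already held."""
    lock_dir.mkdir(parents=True, exist_ok=True)
    lock_path = lock_dir / f"{slug}.lock"
    try:
        # O_CREAT | O_EXCL fails when the file already exists, so two
        # racing dispatchers cannot both acquire the lock.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        raise WorkerAlreadyActive(f"worker already active for {slug}") from None
    with os.fdopen(fd, "w") as handle:
        # Record who holds the lock; the field names are assumptions.
        json.dump({"slug": slug, "worker_pid": pid, "started_at": time.time()}, handle)
    return lock_path


def release_worker_lock(lock_path: Path) -> None:
    """Remove the lock file; a missing file is not treated as an error."""
    lock_path.unlink(missing_ok=True)
```

A dispatcher would call `acquire_worker_lock` before starting the worker and `release_worker_lock` once it exits. A lock left behind by a crashed worker would still need a separate stale-lock check, for example against the recorded `worker_pid`.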
Plan
- Remove tmux/Jared operator entrypoints from `justfile`, docs, and skill guidance.
- Keep heartbeat as the only default manager loop (`task-manager-run` + heartbeat snapshots).
- Add script namespace for task-manager ops under `scripts/task-manager/` (wrapper or migration path).
- Introduce explicit single-worker collision guard and continuation/update workflow in dispatch tooling.
- Add tests for:
  - reject second concurrent worker for same task/worktree,
  - continue/update path correctness,
  - history persistence.
- Update `AGENTS.md` + `.codex/skills/task-manager/SKILL.md` with the new operator flow.
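As a rough sketch of the history-persistence and continue path named in the plan, one option is an append-only JSONL file that a fresh one-shot worker reads back as context. The record fields (`ts`, `kind`, `prompt`) and the helper names `append_history`, `read_history`, and `build_continue_prompt` are assumptions for illustration; the actual schema is not specified in this document.

```python
import json
import time
from pathlib import Path


def append_history(history_path: Path, kind: str, prompt: str) -> dict:
    """Append one JSON object per line to the task's history file."""
    record = {"ts": time.time(), "kind": kind, "prompt": prompt}
    history_path.parent.mkdir(parents=True, exist_ok=True)
    with history_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def read_history(history_path: Path) -> list:
    """Return all recorded entries in order, or an empty list if none exist."""
    if not history_path.exists():
        return []
    with history_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def build_continue_prompt(history_path: Path, follow_up: str) -> str:
    """Prepend earlier prompts so a new one-shot run sees prior instructions."""
    earlier = [entry["prompt"] for entry in read_history(history_path)]
    sections = [f"Previous instruction {i}: {text}" for i, text in enumerate(earlier, 1)]
    sections.append(f"Follow-up: {follow_up}")
    return "\n".join(sections)
```

Since the helpers only take a path, a history-persistence test can point them at a temporary directory and assert on the entries read back.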
Implementation Progress
- 2026-03-23 hardening pass (heartbeat/dispatch anti-wedge):
  - `scripts/task-manager-heartbeat`:
    - added `--max-dispatches-per-cycle` (env: `TASK_MANAGER_MAX_DISPATCHES_PER_CYCLE`, default `1`)
    - reduced per-dispatch timeout default from `180s` to `45s` (`TASK_MANAGER_DISPATCH_TIMEOUT_SECONDS`)
    - dispatch loop now caps attempts per cycle to keep heartbeat snapshots fresh
  - `scripts/task_manager_lib.py`:
    - added dead-worker detection: status `running` + dead `worker_pid` => `dispatch_state=worker_gone`
    - `inspect_dispatch` now honors terminal states in status json (`failed/interrupted/exited/stalled/worker_gone`)
    - added retry cooldown (`TASK_MANAGER_DISPATCH_RETRY_COOLDOWN_SECONDS`, default `300`)
    - `classify_tasks` now allows ready tasks to become dispatchable again after cooldown instead of permanent suppression
  - tests (`tests/scripts/test_task_manager_scripts.py`):
    - dead running worker becomes `worker_gone` and is retryable after cooldown
    - retry cooldown blocks immediate redispatch to prevent tight loops
    - heartbeat `max-dispatches-per-cycle` respected
    - full script suite passes (`28 passed`)
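The dead-worker detection and retry cooldown described above can be illustrated with a small sketch. It is not the code in `scripts/task_manager_lib.py`: the status keys `state` and `updated_at` and the helper names `pid_alive`, `effective_state`, and `is_retryable` are assumptions, while `worker_pid`, the terminal state names, and the 300-second default come from the notes above. The liveness probe relies on POSIX signal semantics.

```python
import os
import time

# Terminal states listed in the progress notes.
TERMINAL_STATES = {"failed", "interrupted", "exited", "stalled", "worker_gone"}


def pid_alive(pid: int) -> bool:
    """Probe a PID with signal 0, which checks existence without signaling (POSIX)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def effective_state(status: dict) -> str:
    """Downgrade a recorded 'running' state to 'worker_gone' when the PID is dead."""
    state = status.get("state", "unknown")
    pid = status.get("worker_pid")
    if state == "running" and isinstance(pid, int) and not pid_alive(pid):
        return "worker_gone"
    return state


def is_retryable(status: dict, now: float, cooldown_seconds: float = 300.0) -> bool:
    """Allow redispatch only for terminal states whose cooldown has elapsed."""
    if effective_state(status) not in TERMINAL_STATES:
        return False
    return now - status.get("updated_at", 0.0) >= cooldown_seconds


if __name__ == "__main__":
    # The current process is alive, so this status stays 'running'.
    print(effective_state({"state": "running", "worker_pid": os.getpid()}))
    print(is_retryable({"state": "failed", "updated_at": time.time()}, now=time.time()))
```

Passing `now` explicitly keeps the cooldown check deterministic, so a test can cover both the blocked-redispatch and retry-after-cooldown cases without sleeping.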
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
  N/A (scripting/orchestration task; no browser UX change)
Review Feedback
- [ ] Review cleared