Harden worker continuity and remove tmux manager surface

ID	INFRA_TOOLING-HARDEN_WORKER_CONTINUITY_AND_REMOVE_TMUX_MANAGER_SURFACE
Status	completed
Priority	p1
Milestone	m1-ft-analytics-analyst-pilot
Owner	sr-engineer-architect
Completed by	dave
Completed	2026-03-24

Problem

Task execution currently has two operator pain points:

- Worker continuity is weak. A dispatched worker is effectively one-shot (claude -p), but operators often need to send follow-up instructions mid-task. The result is accidental parallel workers in one worktree, or unclear restart semantics.
- The tmux/Jared surface (just jared, task-manager-start/ensure/stop) adds a second control plane with poor visibility and low operator value. In practice, most usage is heartbeat + status snapshots + tasks server.

We need a single-worker, single-worktree model with explicit update/continue mechanics and clear collision protection.

Current dispatch flow:
- scripts/dispatch starts a one-shot worker process and writes dispatch-*.log + dispatch-*.status.json.
- scripts/dispatch-watch is the main status/polling path.
- The heartbeat loop (scripts/task-manager-run + scripts/task-manager-heartbeat) handles queue pickup and snapshots.

Legacy tmux manager flow:
- just jared + scripts/task-manager-start/ensure/stop maintain a tmux-backed operator shell.
- This overlaps with the heartbeat and creates two operational modes.

Existing rule from the task-manager skill: 1 task = 1 worktree = 1 branch = 1 PR. We should preserve this rule and make collisions impossible by default.

Recommended: keep one-shot worker execution, but add explicit continuation/update commands and collision guards:
- Add worker lock/state for a worktree/slug (single active worker).
- Add dispatch-update / dispatch-continue semantics that either:
  - queue a follow-up prompt after the current run exits, or
  - interrupt and restart safely in the same worktree.
- Persist prompt/run history per task (tasks/logs/dispatch-*.history.jsonl) for context handoff.
- Remove tmux/Jared commands and docs from the operator path.

Alternative: switch to a persistent interactive worker session per task.
- Pro: direct “message same agent” behavior.
- Con: higher complexity, harder crash recovery, more state coupling.

Alternative: keep the tmux path as optional.
- Con: keeps two control planes and operator confusion.
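
The "single active worker" guard in the recommended option could be sketched roughly as below. This is an illustration only, not the code that landed in scripts/dispatch: the lock file name, its JSON layout, and the helper names (acquire_worker_lock, release_worker_lock, WorkerAlreadyActive) are all assumptions made for the example. It relies on os.open with O_CREAT | O_EXCL, which fails if the file already exists, and on os.kill(pid, 0) as a POSIX liveness probe.

```python
import json
import os
import tempfile
from pathlib import Path


class WorkerAlreadyActive(RuntimeError):
    """Raised when a live worker already holds the lock for a slug."""


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks existence without delivering a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def read_holder_pid(lock_path: Path) -> int:
    """Return the PID recorded in the lock file, or -1 if it is unreadable."""
    try:
        return int(json.loads(lock_path.read_text())["pid"])
    except (OSError, ValueError, KeyError, TypeError):
        return -1


def acquire_worker_lock(lock_dir: Path, slug: str, pid: int) -> Path:
    """Create the lock for `slug`, or raise if a live worker already holds it.

    Hypothetical layout: one `dispatch-<slug>.lock` file per task slug.
    Known gap: a competing process that reads the file between creation and
    the JSON write sees it as stale; a real implementation must close that.
    """
    lock_dir.mkdir(parents=True, exist_ok=True)
    lock_path = lock_dir / f"dispatch-{slug}.lock"
    for _attempt in range(2):  # second pass only after clearing a stale lock
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            holder = read_holder_pid(lock_path)
            if pid_alive(holder):
                raise WorkerAlreadyActive(f"{slug}: worker pid {holder} is still running")
            lock_path.unlink(missing_ok=True)  # holder is dead or unreadable
            continue
        with os.fdopen(fd, "w") as fh:
            json.dump({"pid": pid, "slug": slug}, fh)
        return lock_path
    raise WorkerAlreadyActive(f"{slug}: lost the race for the lock")


def release_worker_lock(lock_path: Path) -> None:
    """Remove the lock; safe to call if it is already gone."""
    lock_path.unlink(missing_ok=True)


if __name__ == "__main__":
    lock = acquire_worker_lock(Path(tempfile.mkdtemp()), "demo-task", os.getpid())
    print("acquired", lock.name)
    release_worker_lock(lock)
```

Keeping the guard in a lock file next to the status JSON means a second dispatch for the same slug fails fast, while a lock left behind by a crashed worker is reclaimed on the next attempt.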

- Remove tmux/Jared operator entrypoints from justfile, docs, and skill guidance.
- Keep heartbeat as the only default manager loop (task-manager-run + heartbeat snapshots).
- Add a script namespace for task-manager ops under scripts/task-manager/ (wrapper or migration path).
- Introduce an explicit single-worker collision guard and continuation/update workflow in dispatch tooling.
- Add tests for:
  - rejecting a second concurrent worker for the same task/worktree,
  - continue/update path correctness,
  - history persistence.
- Update AGENTS.md + .codex/skills/task-manager/SKILL.md with the new operator flow.
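
For the history persistence item, a minimal sketch of an append-only JSONL log follows. The record fields (ts, kind, prompt) and the function names are assumptions for illustration; the ticket only specifies the file pattern tasks/logs/dispatch-*.history.jsonl, so the caller supplies the concrete path.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_history(path: Path, kind: str, prompt: str) -> dict:
    """Append one prompt/run record as a single JSON line.

    `kind` is a hypothetical label such as "dispatch", "update" or "continue".
    """
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "kind": kind,
        "prompt": prompt,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def read_history(path: Path) -> list:
    """Return all records in file order; a missing file means no history."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def build_handoff_context(path: Path, limit: int = 5) -> str:
    """Render the last `limit` records as text for the next worker run."""
    if limit <= 0:
        return ""
    recent = read_history(path)[-limit:]
    return "\n".join(f"[{r['ts']}] {r['kind']}: {r['prompt']}" for r in recent)


if __name__ == "__main__":
    demo = Path(tempfile.mkdtemp()) / "dispatch-demo.history.jsonl"
    append_history(demo, "dispatch", "initial task prompt")
    append_history(demo, "continue", "also update the docs")
    print(build_handoff_context(demo))
```

One line per record keeps appends cheap and lets a continue/update command rebuild context by reading only the tail of the file.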

2026-03-23 hardening pass (heartbeat/dispatch anti-wedge):
scripts/task-manager-heartbeat:
- added --max-dispatches-per-cycle (env: TASK_MANAGER_MAX_DISPATCHES_PER_CYCLE, default 1)
- reduced per-dispatch timeout default from 180s to 45s (TASK_MANAGER_DISPATCH_TIMEOUT_SECONDS)
- dispatch loop now caps attempts per cycle to keep heartbeat snapshots fresh
scripts/task_manager_lib.py:
- added dead-worker detection: status running + dead worker_pid => dispatch_state=worker_gone
- inspect_dispatch now honors terminal states in status json (failed/interrupted/exited/stalled/worker_gone)
- added retry cooldown (TASK_MANAGER_DISPATCH_RETRY_COOLDOWN_SECONDS, default 300)
- classify_tasks now allows ready tasks to become dispatchable again after cooldown instead of permanent suppression
tests (tests/scripts/test_task_manager_scripts.py):
- dead running worker becomes worker_gone and is retryable after cooldown
- retry cooldown blocks immediate redispatch to prevent tight loops
- heartbeat max-dispatches-per-cycle respected
- full script suite passes (28 passed)
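
The anti-wedge rules above (dead-worker detection, terminal states, retry cooldown, per-cycle cap) can be sketched roughly as follows. This is not the code in scripts/task_manager_lib.py: the function names, the "status" key, and the timestamp handling are assumptions for the example. Only worker_pid, the state names, the environment variable, and the defaults come from the notes above.

```python
import os

# Terminal states listed in the hardening notes.
TERMINAL_STATES = {"failed", "interrupted", "exited", "stalled", "worker_gone"}


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks existence without delivering a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True


def resolve_dispatch_state(status: dict, is_alive=pid_alive) -> str:
    """Map a status JSON dict to a dispatch state.

    Assumes hypothetical keys "status" and "worker_pid". A "running" status
    whose recorded PID is dead is reported as "worker_gone".
    """
    state = status.get("status", "unknown")
    if state in TERMINAL_STATES:
        return state  # terminal states recorded in the JSON are honored as-is
    if state == "running":
        pid = status.get("worker_pid")
        if isinstance(pid, int) and not is_alive(pid):
            return "worker_gone"
    return state


def retry_cooldown_seconds(env=os.environ) -> float:
    """Read the cooldown from the environment, defaulting to 300 seconds."""
    return float(env.get("TASK_MANAGER_DISPATCH_RETRY_COOLDOWN_SECONDS", "300"))


def can_redispatch(state: str, last_attempt_ts: float, now: float, cooldown: float) -> bool:
    """Allow a retry only for terminal states, and only after the cooldown."""
    if state not in TERMINAL_STATES:
        return False
    return now - last_attempt_ts >= cooldown


def select_for_cycle(dispatchable: list, max_dispatches: int = 1) -> list:
    """Cap the dispatch attempts made in one heartbeat cycle."""
    return dispatchable[: max(0, max_dispatches)]
```

Passing is_alive as a parameter lets tests simulate a dead worker without spawning processes. Whether every terminal state should be retryable is a policy choice this sketch does not settle.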

[x] QA exploration completed (or N/A for non-UI tasks): N/A (scripting/orchestration task; no browser UX change)