Dataface Tasks

Harden worker continuity and remove tmux manager surface

IDINFRA_TOOLING-HARDEN_WORKER_CONTINUITY_AND_REMOVE_TMUX_MANAGER_SURFACE
Statuscompleted
Priorityp1
Milestonem1-ft-analytics-analyst-pilot
Ownersr-engineer-architect
Completed bydave
Completed2026-03-24

Problem

Task execution currently has two operator pain points:

  1. Worker continuity is weak. A dispatched worker is effectively one-shot (claude -p), while operators often need to send follow-up instructions mid-task. This leads to accidental parallel workers in one worktree or unclear restart semantics.
  2. The tmux/Jared surface (just jared, task-manager-start/ensure/stop) adds a second control plane with poor visibility and low operator value. Most real usage is heartbeat + status snapshots + tasks server.

We need a single-worker, single-worktree model with explicit update/continue mechanics and clear collision protection.

Context

  • Current dispatch flow:
  • scripts/dispatch starts a one-shot worker process and writes dispatch-*.log + dispatch-*.status.json.
  • scripts/dispatch-watch is the main status/polling path.
  • Heartbeat loop (scripts/task-manager-run + scripts/task-manager-heartbeat) handles queue pickup and snapshots.
  • Legacy tmux manager flow:
  • just jared + scripts/task-manager-start/ensure/stop maintain a tmux-backed operator shell.
  • This overlaps with heartbeat and creates two operational modes.
  • Existing rule from task-manager skill: 1 task = 1 worktree = 1 branch = 1 PR.
  • We should preserve this rule and make collisions impossible by default.

Possible Solutions

  • Recommended: Keep one-shot worker execution, but add explicit continuation/update commands and collision guards:
  • Add worker lock/state for a worktree/slug (single active worker).
  • Add dispatch-update / dispatch-continue semantics that either:
    • queue a follow-up prompt after current run exits, or
    • interrupt and restart safely in the same worktree.
  • Persist prompt/run history per task (tasks/logs/dispatch-*.history.jsonl) for context handoff.
  • Remove tmux/Jared commands and docs from operator path.
  • Alternative: switch to a persistent interactive worker session per task.
  • Pro: direct “message same agent” behavior.
  • Con: higher complexity, harder crash recovery, more state coupling.
  • Alternative: keep tmux path as optional.
  • Con: keeps two control planes and operator confusion.

Plan

  1. Remove tmux/Jared operator entrypoints from justfile, docs, and skill guidance.
  2. Keep heartbeat as the only default manager loop (task-manager-run + heartbeat snapshots).
  3. Add script namespace for task-manager ops under scripts/task-manager/ (wrapper or migration path).
  4. Introduce explicit single-worker collision guard and continuation/update workflow in dispatch tooling.
  5. Add tests for: - reject second concurrent worker for same task/worktree, - continue/update path correctness, - history persistence.
  6. Update AGENTS.md + .codex/skills/task-manager/SKILL.md with the new operator flow.

Implementation Progress

  • 2026-03-23 hardening pass (heartbeat/dispatch anti-wedge):
  • scripts/task-manager-heartbeat:
    • added --max-dispatches-per-cycle (env: TASK_MANAGER_MAX_DISPATCHES_PER_CYCLE, default 1)
    • reduced per-dispatch timeout default from 180s to 45s (TASK_MANAGER_DISPATCH_TIMEOUT_SECONDS)
    • dispatch loop now caps attempts per cycle to keep heartbeat snapshots fresh
  • scripts/task_manager_lib.py:
    • added dead-worker detection: status running + dead worker_pid => dispatch_state=worker_gone
    • inspect_dispatch now honors terminal states in status json (failed/interrupted/exited/stalled/worker_gone)
    • added retry cooldown (TASK_MANAGER_DISPATCH_RETRY_COOLDOWN_SECONDS, default 300)
    • classify_tasks now allows ready tasks to become dispatchable again after cooldown instead of permanent suppression
  • tests (tests/scripts/test_task_manager_scripts.py):
    • dead running worker becomes worker_gone and is retryable after cooldown
    • retry cooldown blocks immediate redispatch to prevent tight loops
    • heartbeat max-dispatches-per-cycle respected
    • full script suite passes (28 passed)

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks) N/A (scripting/orchestration task; no browser UX change)

Review Feedback

  • [ ] Review cleared