tasks/logs/cbox-review-hang-analysis-2026-03-05.md

CBox Review "Hang" Analysis (March 5, 2026)

Executive summary

cbox review has two behaviors that were conflated:

  1. Slow-but-healthy review: can legitimately run 6-15 minutes (sometimes longer).
  2. True hang/stall: no meaningful progress and/or no completion beyond expected timeout.

The system currently has heartbeats and timeout guards, but in sandbox-driven flows the operator signal is still ambiguous enough that "silent" can be mistaken for "hung".

What code does today

Primary paths: - Inside container/sandbox: cbox review ... -> _run_review_in_tmux(...) - Host: cbox review ... -> ephemeral container + _run_subprocess_with_heartbeat(...)

Current guardrails already present: - Default review timeout: 1200s (DEFAULT_REVIEW_TIMEOUT) - Heartbeats every 15s (REVIEW_PROGRESS_HEARTBEAT_SECONDS) - Startup timeout waiting for Claude prompt in tmux: 60s - Stall warnings if output file size does not change while waiting - cbox output --check-stall for interactive prompt blockers

Precise definition: slow vs hung

Not hang: - Review duration <= configured timeout (--review-timeout, default 1200s) - Any of these progress signals observed: - heartbeat lines - tmux pane activity - .cbox/reviews/*.md created or growing

Suspected hang: - No pane activity and no review artifact growth for >= 120s

Confirmed hang: - Exceeds configured timeout without verdict OR - session dead/invalid with no deterministic error surfaced

So no, "> 10 minutes" alone is not a hang. 10-15 minutes can be normal.

Most likely friction patterns (from code + incidents)

  1. Visibility gap in delegated/sandbox orchestration - Review may be running, but the caller sees sparse output and assumes stall. - cbox output snapshots can miss in-between progress.

  2. Review output contract is file-based (## Verdict sentinel) - In-container path waits for review file + verdict text. - If model doesn't reach/write final verdict section, flow waits until timeout.

  3. Interactive prompt blockers in tmux - Workspace trust/effort prompts are auto-dismissed; other prompts may still block and need intervention. - These can look like "hung" unless stall check is run.

  4. Environment drift in sandbox worktrees - bootstrap warnings / dependency issues can degrade command reliability and confuse diagnosis.

Deterministic recovery + retry runbook

For sandbox &lt;name&gt;:

  1. Check interactive stall:
just cbox output --check-stall &lt;name&gt;
  1. Check pane activity:
tmux capture-pane -pt cbox-sandbox-&lt;name&gt; | tail -n 60
  1. Check review artifact freshness:
ls -lt .worktrees/&lt;name&gt;/.cbox/reviews
  1. If active progress exists: wait until timeout budget.

  2. If no progress for >=120s:

uv run cbox send --interrupt &lt;name&gt; &quot;retry: run cbox review --watch changes and report REVIEW_EXIT&quot;
  1. If retry still stalls: stop session and recreate sandbox from clean state.

  2. Escalate only after: - one clean retry attempt failed - timeout exceeded OR repeated no-progress window

Why this still happened operationally

I treated sparse/no stdout as hang too early in several runs, instead of strictly applying timeout + freshness criteria. That caused unnecessary manager takeover and noisy diagnosis.

Fix plan (implementation)

A) Improve runtime observability (P1)

Success criteria: - Operator can distinguish active vs stuck from one file without tmux attach.

B) Add explicit no-progress deadman + auto-retry (P1)

Success criteria: - Fewer manual interruptions/restarts for transient stalls.

C) Tighten stall classification and messaging (P1)

D) Add deterministic diagnostics bundle on timeout (P1)

E) Harden tests (P1)

Test matrix to close issue

Axes: - Execution path: in-container tmux vs host ephemeral container - Output mode: --watch on/off - Runner mode: interactive terminal vs delegated tool call - Failure injection: - no verdict ever - artifact grows then freezes - interactive prompt blocker - session drop

Pass bar: - 0 ambiguous "silent" outcomes without status classification - deterministic terminal state within timeout budget - auto-retry succeeds or exits with actionable diagnostics

Proposed near-term policy while patch lands