CBox Review "Hang" Analysis (March 5, 2026)

Executive summary

cbox review has two behaviors that were conflated:

Slow-but-healthy review: can legitimately run 6-15 minutes (sometimes longer).
True hang/stall: no meaningful progress, or no completion once the expected timeout has passed.

The system currently has heartbeats and timeout guards, but in sandbox-driven flows the operator signal is still ambiguous enough that "silent" can be mistaken for "hung".

What code does today

Primary paths:
- Inside container/sandbox: cbox review ... -> _run_review_in_tmux(...)
- Host: cbox review ... -> ephemeral container + _run_subprocess_with_heartbeat(...)

Current guardrails already present:
- Default review timeout: 1200s (DEFAULT_REVIEW_TIMEOUT)
- Heartbeats every 15s (REVIEW_PROGRESS_HEARTBEAT_SECONDS)
- Startup timeout waiting for Claude prompt in tmux: 60s
- Stall warnings if output file size does not change while waiting
- cbox output --check-stall for interactive prompt blockers

Precise definition: slow vs hung

Not hang:
- Review duration <= configured timeout (--review-timeout, default 1200s)
- Any of these progress signals observed:
  - heartbeat lines
  - tmux pane activity
  - .cbox/reviews/*.md created or growing

Suspected hang:
- No pane activity and no review artifact growth for >= 120s

Confirmed hang:
- Exceeds configured timeout without verdict, OR
- session dead/invalid with no deterministic error surfaced

So no, "> 10 minutes" alone is not a hang. 10-15 minutes can be normal.
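The three classes above can be written down as a small decision function. This is a minimal sketch, not code from cbox: the `ReviewObservation` type, its field names, and the returned labels are hypothetical, and "session dead/invalid with no deterministic error surfaced" is simplified to a single `session_alive` flag.

```python
from dataclasses import dataclass

# Thresholds taken from the definitions above; the constant names are illustrative.
NO_PROGRESS_WINDOW_S = 120
DEFAULT_REVIEW_TIMEOUT_S = 1200


@dataclass
class ReviewObservation:
    elapsed_s: float            # seconds since the review started
    idle_s: float               # seconds since last pane activity or artifact growth
    verdict_seen: bool          # review file already contains a verdict
    session_alive: bool = True  # simplification of "session dead/invalid"


def classify(obs: ReviewObservation, timeout_s: float = DEFAULT_REVIEW_TIMEOUT_S) -> str:
    """Return one of: complete, confirmed_hang, suspected_hang, not_hang."""
    if obs.verdict_seen:
        return "complete"
    if not obs.session_alive or obs.elapsed_s > timeout_s:
        return "confirmed_hang"
    if obs.idle_s >= NO_PROGRESS_WINDOW_S:
        return "suspected_hang"
    # Within the timeout budget and recently active: slow but healthy.
    return "not_hang"
```

Under these assumptions, a review that has run for 12 minutes with activity 10 seconds ago classifies as `not_hang`, which matches the point that duration alone is not evidence of a hang.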

Most likely friction patterns (from code + incidents)

Visibility gap in delegated/sandbox orchestration
- Review may be running, but the caller sees sparse output and assumes stall.
- cbox output snapshots can miss in-between progress.
Review output contract is file-based (## Verdict sentinel)
- In-container path waits for review file + verdict text.
- If model doesn't reach/write final verdict section, flow waits until timeout.
Interactive prompt blockers in tmux
- Workspace trust/effort prompts are auto-dismissed; other prompts may still block and need intervention.
- These can look like "hung" unless stall check is run.
Environment drift in sandbox worktrees
- bootstrap warnings / dependency issues can degrade command reliability and confuse diagnosis.
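To make the file-based contract concrete, here is a hedged sketch of a verdict check. Only the `## Verdict` sentinel comes from this analysis; the function name and the rule "a line that starts with the sentinel" are assumptions, and the real in-container check may match differently.

```python
from pathlib import Path

VERDICT_SENTINEL = "## Verdict"  # sentinel named above; the matching rule is an assumption


def has_verdict(review_file: Path) -> bool:
    """Return True if any line of the review file starts with the sentinel."""
    try:
        text = review_file.read_text(encoding="utf-8", errors="replace")
    except FileNotFoundError:
        # No artifact yet: the review has not reached the writing phase.
        return False
    return any(line.startswith(VERDICT_SENTINEL) for line in text.splitlines())
```

A file that keeps growing but never satisfies this check is exactly the case where the flow waits until the timeout.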

Deterministic recovery + retry runbook

For sandbox <name>:

Check interactive stall:

just cbox output --check-stall <name>

Check pane activity:

tmux capture-pane -pt cbox-sandbox-<name> | tail -n 60

Check review artifact freshness:

ls -lt .worktrees/<name>/.cbox/reviews

If active progress exists: keep waiting until the timeout budget is exhausted.
If no progress for >=120s:

uv run cbox send --interrupt <name> "retry: run cbox review --watch changes and report REVIEW_EXIT"

If retry still stalls: stop session and recreate sandbox from clean state.
Escalate only after:
- one clean retry attempt failed
- timeout exceeded OR repeated no-progress window
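The artifact freshness step can also be turned into a number to compare against the 120s window, rather than eyeballing `ls -lt`. A minimal sketch, assuming review artifacts are `*.md` files directly under the reviews directory; the function name is hypothetical.

```python
import time
from pathlib import Path
from typing import Optional


def newest_artifact_age_s(reviews_dir: Path, now: Optional[float] = None) -> Optional[float]:
    """Seconds since the most recently modified *.md in reviews_dir, or None if none exist."""
    mtimes = [p.stat().st_mtime for p in reviews_dir.glob("*.md")]
    if not mtimes:
        return None
    now = time.time() if now is None else now
    # Clamp at zero in case of small clock skew between writer and reader.
    return max(0.0, now - max(mtimes))
```

A result of `None` means no artifact has been created yet, so pane activity is the only remaining progress signal; a value of 120 or more, combined with a quiet pane, meets the no-progress condition in the runbook.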

Why this still happened operationally

I treated sparse/no stdout as hang too early in several runs, instead of strictly applying timeout + freshness criteria. That caused unnecessary manager takeover and noisy diagnosis.

Fix plan (implementation)

A) Improve runtime observability (P1)

Add a structured review state file, e.g. .cbox/reviews/.state-<run_id>.json
Update phase transitions: starting, prompt_ready, prompt_sent, artifact_seen, verdict_seen, timeout, error
Emit monotonic heartbeat counters and last-activity timestamps.

Success criteria:
- Operator can distinguish active vs stuck from one file without tmux attach.
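One possible shape for the state file writer is sketched below. The phase names and the `.state-<run_id>.json` file name come from the plan above; the JSON field names, the function name, and the write-then-rename approach are assumptions for illustration.

```python
import json
import os
import time
from pathlib import Path

PHASES = ("starting", "prompt_ready", "prompt_sent", "artifact_seen",
          "verdict_seen", "timeout", "error")


def write_state(reviews_dir: Path, run_id: str, phase: str, heartbeat: int) -> Path:
    """Atomically write the state file for one run and return its path."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    reviews_dir.mkdir(parents=True, exist_ok=True)
    path = reviews_dir / f".state-{run_id}.json"
    payload = {
        "run_id": run_id,
        "phase": phase,
        "heartbeat": heartbeat,           # monotonic counter, incremented by the caller
        "last_activity_ts": time.time(),  # wall-clock time of this update
    }
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(payload), encoding="utf-8")
    os.replace(tmp, path)  # rename so readers never observe a partially written file
    return path
```

Writing to a temporary file and renaming it keeps a polling operator from reading half-written JSON, which would otherwise add another ambiguous signal.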

B) Add explicit no-progress deadman + auto-retry (P1)

If no pane change AND no artifact growth for configurable window (default 120s), abort current run with diagnostic + optional single auto-retry.
Record both attempts with unique run ids.

Success criteria:
- Fewer manual interruptions/restarts for transient stalls.
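The deadman and single auto-retry could be structured roughly as follows. This is a sketch under stated assumptions: `Deadman`, `NoProgress`, and `run_with_single_retry` are hypothetical names, the caller supplies pane text and artifact size on each poll, and any change in either (including an artifact shrinking) counts as progress.

```python
import uuid
from typing import Callable, List, Optional, Tuple


class NoProgress(Exception):
    """Raised by a run when the no-progress window elapses."""


class Deadman:
    """Tracks the last time either the pane or the artifact changed."""

    def __init__(self, window_s: float = 120.0) -> None:
        self.window_s = window_s
        self._last_seen: Optional[Tuple[str, int]] = None
        self._last_change: float = 0.0

    def expired(self, pane_text: str, artifact_size: int, now: float) -> bool:
        seen = (pane_text, artifact_size)
        if seen != self._last_seen:
            # Either signal changed, so restart the no-progress window.
            self._last_seen = seen
            self._last_change = now
        return now - self._last_change >= self.window_s


def run_with_single_retry(run_once: Callable[[str], str],
                          auto_retry: bool = True) -> Tuple[str, List[str]]:
    """Run once, retry at most once on NoProgress; every attempt gets its own run id."""
    run_ids: List[str] = []
    for _ in range(2 if auto_retry else 1):
        run_id = uuid.uuid4().hex
        run_ids.append(run_id)
        try:
            return run_once(run_id), run_ids
        except NoProgress:
            continue
    return "FAILED", run_ids
```

Passing `now` in explicitly keeps the deadman testable without sleeping, and returning the list of run ids makes it straightforward to record both attempts.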

C) Tighten stall classification and messaging (P1)

Surface status classes clearly in CLI output: RUNNING, SUSPECT_STALL, TIMEOUT, FAILED, COMPLETE.
Make cbox output show latest status class and elapsed/timeout ratio.
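As an illustration of the status classes and the elapsed/timeout ratio, here is a small sketch. The five class names come from the plan above; the `ReviewStatus` enum, the `status_line` function, and the exact line format are assumptions, not existing cbox output.

```python
from enum import Enum


class ReviewStatus(str, Enum):
    RUNNING = "RUNNING"
    SUSPECT_STALL = "SUSPECT_STALL"
    TIMEOUT = "TIMEOUT"
    FAILED = "FAILED"
    COMPLETE = "COMPLETE"


def status_line(status: ReviewStatus, elapsed_s: float, timeout_s: float) -> str:
    """Format one status line with the share of the timeout budget already used."""
    pct = 100.0 * elapsed_s / timeout_s
    return (f"review status={status.value} "
            f"elapsed={elapsed_s:.0f}s/{timeout_s:.0f}s ({pct:.0f}%)")
```

For example, `status_line(ReviewStatus.RUNNING, 600, 1200)` yields `review status=RUNNING elapsed=600s/1200s (50%)`, which tells a delegated caller that silence at the 10-minute mark is still inside the budget.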

D) Add deterministic diagnostics bundle on timeout (P1)

On timeout: save pane tail, session metadata, container inspect/log snippets, and artifact state snapshot.

E) Harden tests (P1)

Unit tests:
- no-progress classification
- deadman triggers
- retry path
- status file lifecycle
Integration tests:
- fake review writer that delays/stalls/verdicts
- sandbox path + host path parity
Reliability test harness:
- N-run loop measuring p50/p95 runtime and failure modes
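The summary step of the reliability harness might look like the sketch below. It assumes each run is recorded as a `(duration_seconds, outcome)` pair and uses the nearest-rank percentile definition; both choices, and the function names, are illustrative.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(runs: List[Tuple[float, str]]) -> Dict[str, object]:
    """Summarize durations and outcome counts over N recorded runs."""
    durations = [d for d, _ in runs]
    return {
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
        "outcomes": dict(Counter(outcome for _, outcome in runs)),
    }
```

Counting outcomes alongside the percentiles keeps timeouts visible as a failure mode instead of letting them hide inside the tail of the runtime distribution.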

Test matrix to close issue

Axes:
- Execution path: in-container tmux vs host ephemeral container
- Output mode: --watch on/off
- Runner mode: interactive terminal vs delegated tool call
- Failure injection:
  - no verdict ever
  - artifact grows then freezes
  - interactive prompt blocker
  - session drop

Pass bar:
- 0 ambiguous "silent" outcomes without status classification
- deterministic terminal state within timeout budget
- auto-retry succeeds or exits with actionable diagnostics
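The axes above can be expanded into concrete cases mechanically, which makes it harder to skip a combination. A minimal sketch; the axis keys and value identifiers are hypothetical labels for the options listed in this section.

```python
from itertools import product
from typing import Dict, List

AXES = {
    "execution_path": ["in_container_tmux", "host_ephemeral_container"],
    "watch": [True, False],
    "runner_mode": ["interactive_terminal", "delegated_tool_call"],
    "failure": ["no_verdict", "artifact_freezes", "prompt_blocker", "session_drop"],
}


def matrix_cases() -> List[Dict[str, object]]:
    """Return every combination of the axes as a dict keyed by axis name."""
    keys = list(AXES)
    return [dict(zip(keys, combo)) for combo in product(*AXES.values())]
```

With two execution paths, two output modes, two runner modes, and four failure injections, this produces 2 x 2 x 2 x 4 = 32 cases, each of which should end in a classified terminal state to meet the pass bar.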

Proposed near-term policy while patch lands

Do not classify as hang before either:
- no-progress window >=120s, or
- timeout threshold hit.
Require cbox output --check-stall + artifact freshness check before interrupt/restart.

tasks/logs/cbox-review-hang-analysis-2026-03-05.md