cbox sandbox sessions can exit unexpectedly during long task handoff
Problem
Named cbox sandboxes can do real work for minutes or hours and then disappear without a recorded shutdown reason. In the observed `worktree2` runs on Friday, March 13, 2026, the sandbox successfully started, accepted prompts, modified files, and ran tests, but the tmux session and its `docker run --rm` container later vanished mid-task. The worktree preserved partial progress, but the manager had no durable exit code, no teardown reason, and no automated recovery beyond manually restarting the sandbox.
Context
- Reproduced multiple times on sandbox `worktree2` while it worked on the `dft inspect` catalog builder task.
- Runtime logs under `.worktrees/worktree2/.cbox/logs/runtime.log` record startup success (`sandbox_starting`, `sandbox_prompt_ready`, `sandbox_bootstrap_health`) but do not record the sandbox exit cause.
- Because containers are launched with `docker run --rm`, the container is removed immediately on exit, and postmortem logs/exit codes are lost unless captured before teardown.
- The surviving worktree showed partial in-progress changes after each crash, confirming the sandbox was actively executing task work rather than sitting idle.
- Related current work:
  - `tasks/workstreams/infra-tooling/tasks/cbox-send-false-positive-delivery-when-sandbox-tui-ignores-input.md`
  - recent TERM fix for interactive containers on branch `codex/fix-cbox-startup-timeouts`
- Relevant implementation paths:
  - `libs/cbox/cbox/cli.py`: sandbox lifecycle and runtime logging
  - `libs/cbox/cbox/container.py`: container launch configuration
  - `libs/cbox/cbox/tmux.py`: session creation/attach/send behavior
Possible Solutions
A. Add durable sandbox exit diagnostics and manager-facing recovery hooks (Recommended)
Record sandbox container exit details and final pane state before tmux/container teardown, persist them in runtime logs, and surface a clear reason when a session disappears. Pair that with a manager-friendly recovery loop so repeated crashes create actionable evidence instead of mystery.
- Pros: Directly addresses the black-box failure mode, helps root-cause both container exits and TUI/input issues, and improves repeated recovery.
- Cons: Requires careful handling of `--rm` containers and race conditions around short-lived exits.
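Concretely, the capture-before-teardown idea in option A can be sketched as follows. This is a minimal, hypothetical helper (the function name and return shape are illustrative, not the actual cbox implementation); it degrades to an "unknown" exit code when the container record is already gone:

```python
import subprocess


def collect_exit_diagnostics(session_name: str, container_name: str) -> dict:
    """Capture final pane content and container exit code before teardown.

    Must run before `tmux kill-session` and before a `--rm` container is
    reaped, otherwise both sources of evidence are lost.
    """
    diag = {"pane": "", "exit_code": "unknown"}
    # Final pane snapshot: whatever the agent last printed.
    try:
        pane = subprocess.run(
            ["tmux", "capture-pane", "-p", "-t", session_name],
            capture_output=True, text=True, timeout=5,
        )
        if pane.returncode == 0:
            diag["pane"] = pane.stdout
    except (OSError, subprocess.TimeoutExpired):
        pass  # tmux missing or hung: keep the empty default
    # Container exit code: only available while the container record exists.
    try:
        inspect = subprocess.run(
            ["docker", "inspect", "--format", "{{.State.ExitCode}}",
             container_name],
            capture_output=True, text=True, timeout=5,
        )
        if inspect.returncode == 0:
            diag["exit_code"] = inspect.stdout.strip()
    except (OSError, subprocess.TimeoutExpired):
        pass  # docker missing, or container already --rm'd
    return diag
```

Everything is best-effort: each probe failing independently still yields a usable record, which is the property that matters when the sandbox has just died.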
B. Remove `--rm` for interactive sandbox containers
Keep exited containers around until explicit cleanup so operators can inspect docker logs and exit codes manually.
- Pros: Easier postmortem debugging.
- Cons: Leaves more container garbage behind and may complicate normal cleanup UX.
C. Focus only on automatic restart behavior
Detect a dropped sandbox and immediately recreate it without trying to capture why the first one died.
- Pros: Smaller user-facing interruption.
- Cons: Treats symptoms only and preserves the current observability gap.
Plan
Use approach A.
- Reproduce the repeated `worktree2` sandbox-drop behavior using the saved logs and current launch path.
- Identify where sandbox sessions can disappear without emitting a final runtime-log event.
- Add durable exit diagnostics:
  - final pane capture,
  - container exit/teardown details when available,
  - explicit runtime-log events for sandbox stop/crash/drop.
- Surface that evidence in manager-facing commands so `cbox list`, `cbox output`, or startup recovery can explain what happened.
- Add focused tests around the new diagnostics/recovery path.
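The explicit runtime-log events the plan calls for could be appended as JSON lines, one record per lifecycle event. A minimal sketch (the helper name and field layout are assumptions, not the actual cbox logging API):

```python
import json
import time
from pathlib import Path


def log_runtime_event(log_path: Path, event: str, **fields) -> None:
    """Append one JSON line per lifecycle event.

    Event names follow the pattern already in runtime.log
    (sandbox_starting, sandbox_prompt_ready, ...); this adds the
    missing exit-side events such as sandbox_stopped.
    """
    record = {"ts": time.time(), "event": event, **fields}
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a") as fh:
        fh.write(json.dumps(record) + "\n")
```

JSON lines keep the log greppable and append-only, so a crash mid-write corrupts at most the final line.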
Implementation Progress
- Task upgraded from backlog placeholder to active incident work after repeated `worktree2` sandbox drops during an M1 task execution loop.
- Confirmed the current blind spot: runtime logs persist startup success but not the sandbox death reason, and `docker run --rm` removes the container before later inspection.
Exit diagnostics and runtime logging (done)
Added durable exit diagnostics so sandbox exits are observable in runtime logs:
- `_collect_exit_diagnostics()` in `cbox/diagnostics.py`: captures final tmux pane content and container exit code (via `docker inspect --format '{{.State.ExitCode}}'`) before teardown. Gracefully degrades when the container was already `--rm`'d (exit code = "unknown").
- `_log_session_exit()` in `cbox/cli.py`: reusable helper that resolves the workdir from the tmux session path, finds the container by prefix, collects exit diagnostics, and logs a `sandbox_stopped` event with exit code and pane line count. Called before `_cleanup_session_containers`/`tmux.kill_session` so the pane and container are still inspectable.
- Wired into all stop paths:
  - `cbox stop <name>`: named stop
  - `cbox stop --all` (via `_stop_sessions`)
  - `cbox new <name> --kill`
- `StartupTimeoutDiagnostics.session_exists` field in `cbox/health.py`: distinguishes "session vanished during startup (container exited early)" from "prompt did not appear before timeout". The summary now shows an explicit failure mode label.
- `_collect_startup_timeout_diagnostics()` enhanced: checks `tmux.session_exists` first and skips pane capture when the session is already gone, avoiding misleading empty-pane output.
- Tests: 18 new/updated tests covering:
  - Exit diagnostics: pane capture, exit code capture, container-already-removed, timeout, no-runtime
  - Stop/kill logging: `sandbox_stopped` event emitted to runtime.log
  - Startup timeout: `session_exists` flag, "vanished" vs "timeout" summary text
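The vanished-vs-timeout distinction above can be illustrated with a simplified sketch. The names mirror the ones in the list, but the code is a hypothetical reduction, not the actual `cbox/health.py`:

```python
import subprocess
from dataclasses import dataclass


@dataclass
class StartupTimeoutDiagnostics:
    session_exists: bool
    pane: str = ""


def tmux_session_exists(name: str) -> bool:
    """True iff tmux still has a session with this name."""
    try:
        result = subprocess.run(
            ["tmux", "has-session", "-t", name],
            capture_output=True, timeout=5,
        )
        return result.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False  # tmux missing or unresponsive: treat as gone


def summarize(diag: StartupTimeoutDiagnostics) -> str:
    """Explicit failure-mode label instead of a generic timeout message."""
    if not diag.session_exists:
        # Pane capture is skipped here: an empty pane from a dead
        # session would be misleading evidence.
        return "session vanished during startup (container exited early)"
    return "prompt did not appear before timeout"
```

Checking session existence before pane capture is the key ordering: it converts "empty pane, no idea why" into an unambiguous early-exit signal.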
Concurrent start / port collision investigation (done)
Investigated the fresh startup-timeout pattern:
- Container names use `secrets.token_hex(3)`: no name-collision risk between concurrent starts.
- Port allocation uses a file-based lock (`_port_registry_lock()` via `fcntl.flock`): no race condition between concurrent `cbox new` invocations.
- `_cleanup_session_containers()` is called before every new container start, removing stale containers with the same session prefix.
- Port collision is not a contributing factor; containers would fail with a docker name conflict (which is caught) rather than silently exiting.
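For reference, a file-based `fcntl.flock` lock of the kind the investigation describes looks roughly like this. This is a generic sketch, not the actual `_port_registry_lock()` implementation:

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def port_registry_lock(lock_path: str):
    """Serialize port allocation across concurrent processes.

    flock gives an advisory exclusive lock tied to the open file
    descriptor, so the kernel releases it automatically even if the
    holding process dies mid-allocation.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other holder
        yield fd
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

The automatic release on process death is what rules this mechanism out as a crash cause: a dying `cbox new` cannot leave the registry permanently locked.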
Named-session drop diagnostics surfaced in send/output (done)
Fixed the last blind spot: `cbox send <name>` and `cbox output <name>` on a dead sandbox previously showed a generic "No session found" message with no diagnostic context. Now both route through `_show_session_drop_and_exit()`, producing:
- Full `SessionDropDiagnostics` panel (cause, worktree state, container state, port allocation)
- `session_drop` event logged to `runtime.log` in the worktree
- Actionable recovery hint (`cbox new <name>` to resume)
This closes the manager's observability gap during long-running task handoffs: when `cbox output worker1` discovers the session died, it now reports why and how to recover instead of a generic error.
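The cause-plus-hint classification this section describes can be reduced to a small sketch. The dataclass fields and helper names here are illustrative, not the actual `SessionDropDiagnostics` shape:

```python
from dataclasses import dataclass


@dataclass
class SessionDrop:
    session_name: str
    container_exists: bool      # is there still a docker record?
    worktree_has_changes: bool  # did the crash leave partial work?


def drop_cause(d: SessionDrop) -> str:
    """Best-effort explanation of why the session is gone."""
    if not d.container_exists:
        # A --rm container that exits is reaped immediately, taking
        # its logs and exit code with it.
        return "container exited and was removed (--rm)"
    return "tmux session lost while container still present"


def recovery_hint(d: SessionDrop) -> str:
    """Actionable next step for the manager, not just an error."""
    note = " (worktree has in-progress changes)" if d.worktree_has_changes else ""
    return f"run `cbox new {d.session_name}` to resume{note}"
```

Splitting cause from hint keeps the diagnostic honest: the cause is evidence-based, while the hint is always safe to follow because `cbox new` reuses the surviving worktree.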
Tests: 4 updated + 2 new tests:
- `test_send_named_session_not_found_shows_diagnostics`: verifies diagnostics are collected and the cause is displayed
- `test_output_named_session_not_found_shows_diagnostics`: same for the output path
- `test_output_named_sandbox_shows_recovery_hint`: verifies worktree state and the `cbox new` recovery command in the output
- `test_send_named_sandbox_shows_recovery_hint`: same for the send path
Review Feedback
- [ ] Review cleared