Dataface Tasks

cbox sandbox sessions can exit unexpectedly during long task handoff

IDINFRA_TOOLING-CBOX_SANDBOX_SESSIONS_CAN_EXIT_UNEXPECTEDLY_DURING_LONG_TASK_HANDOFF
Status: completed
Priority: p1
Milestone: m1-ft-analytics-analyst-pilot
Owner: sr-engineer-architect
Completed by: CBox Agent
Completed: 2026-03-16

Problem

Named cbox sandboxes can do real work for minutes or hours and then disappear without a recorded shutdown reason. In the observed worktree2 runs on Friday, March 13, 2026, the sandbox successfully started, accepted prompts, modified files, and ran tests, but the tmux session and the docker run --rm container later vanished mid-task. The worktree preserved partial progress, but the manager had no durable exit code, no teardown reason, and no automated recovery beyond manually restarting the sandbox.

Context

  • Reproduced multiple times on sandbox worktree2 while it worked on the dft inspect catalog builder task.
  • Runtime logs under .worktrees/worktree2/.cbox/logs/runtime.log record startup success (sandbox_starting, sandbox_prompt_ready, sandbox_bootstrap_health) but do not record sandbox exit cause.
  • Because containers are launched with docker run --rm, the container is removed immediately on exit and postmortem logs/exit codes are lost unless captured before teardown.
  • The surviving worktree showed partial in-progress changes after each crash, confirming the sandbox was actively executing task work rather than sitting idle.
  • Related current work:
      • tasks/workstreams/infra-tooling/tasks/cbox-send-false-positive-delivery-when-sandbox-tui-ignores-input.md
      • recent TERM fix for interactive containers on branch codex/fix-cbox-startup-timeouts
  • Relevant implementation paths:
      • libs/cbox/cbox/cli.py (sandbox lifecycle and runtime logging)
      • libs/cbox/cbox/container.py (container launch configuration)
      • libs/cbox/cbox/tmux.py (session creation/attach/send behavior)

Possible Solutions

A. Capture exit diagnostics before teardown

Record sandbox container exit details and final pane state before tmux/container teardown, persist them in runtime logs, and surface a clear reason when a session disappears. Pair that with a manager-friendly recovery loop so repeated crashes create actionable evidence instead of mystery.

  • Pros: Directly addresses the black-box failure mode, helps root-cause both container exits and TUI/input issues, and improves repeated recovery.
  • Cons: Requires careful handling of --rm containers and race conditions around short-lived exits.

B. Remove --rm for interactive sandbox containers

Keep exited containers around until explicit cleanup so operators can inspect docker logs and exit codes manually.

  • Pros: Easier postmortem debugging.
  • Cons: Leaves more container garbage behind and may complicate normal cleanup UX.

C. Focus only on automatic restart behavior

Detect a dropped sandbox and immediately recreate it without trying to capture why the first one died.

  • Pros: Smaller user-facing interruption.
  • Cons: Treats symptoms only and preserves the current observability gap.

Plan

Use approach A.

  1. Reproduce the repeated worktree2 sandbox-drop behavior using the saved logs and current launch path.
  2. Identify where sandbox sessions can disappear without emitting a final runtime-log event.
  3. Add durable exit diagnostics:
      • final pane capture,
      • container exit/teardown details when available,
      • explicit runtime-log events for sandbox stop/crash/drop.
  4. Surface that evidence in manager-facing commands so cbox list, cbox output, or startup recovery can explain what happened.
  5. Add focused tests around the new diagnostics/recovery path.
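The durable runtime-log events from step 3 could take the shape of append-only JSON lines in the worktree's runtime.log. A minimal sketch under that assumption (the helper name and field set are hypothetical; only the sandbox_stopped event name comes from this task):

```python
import json
import time
from pathlib import Path

def log_runtime_event(log_path: Path, event: str, **fields) -> None:
    """Append one JSON line per lifecycle event so an exit reason
    survives even after the tmux session and container are gone."""
    record = {"ts": time.time(), "event": event, **fields}
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

# Record a sandbox exit before teardown removes all other evidence.
log = Path("/tmp/cbox-demo/runtime.log")
log_runtime_event(log, "sandbox_stopped", session="worktree2",
                  exit_code=137, pane_lines=42)
print(log.read_text().splitlines()[-1])
```

Append-only JSON lines keep the log greppable and make each event self-describing, which matters when the only surviving artifact of a crash is this file.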

Implementation Progress

  • Task upgraded from backlog placeholder to active incident work after repeated worktree2 sandbox drops during an M1 task execution loop.
  • Confirmed current blind spot: runtime logs persist startup success but not sandbox death reason, and docker --rm removes the container before later inspection.

Exit diagnostics and runtime logging (done)

Added durable exit diagnostics so sandbox exits are observable in runtime logs:

  1. _collect_exit_diagnostics() in cbox/diagnostics.py — captures final tmux pane content and container exit code (via docker inspect --format '{{.State.ExitCode}}') before teardown. Gracefully degrades when the container was already --rm'd (exit code = "unknown").

  2. _log_session_exit() in cbox/cli.py — reusable helper that resolves workdir from the tmux session path, finds the container by prefix, collects exit diagnostics, and logs a sandbox_stopped event with exit code and pane line count. Called before _cleanup_session_containers / tmux.kill_session so the pane and container are still inspectable.

  3. Wired into all stop paths:
      • cbox stop <name> (named stop)
      • cbox stop --all (via _stop_sessions)
      • cbox new <name> --kill

  4. StartupTimeoutDiagnostics.session_exists field in cbox/health.py — distinguishes "session vanished during startup (container exited early)" from "prompt did not appear before timeout". The summary now shows an explicit failure mode label.

  5. _collect_startup_timeout_diagnostics() enhanced — checks tmux.session_exists first and skips pane capture when the session is already gone, avoiding misleading empty-pane output.

  6. Tests: 18 new/updated tests covering:
      • exit diagnostics: pane capture, exit code capture, container-already-removed, timeout, no-runtime
      • stop/kill logging: sandbox_stopped event emitted to runtime.log
      • startup timeout: session_exists flag, "vanished" vs "timeout" summary text

Concurrent start / port collision investigation (done)

Investigated the fresh startup-timeout pattern:

  • Container names use secrets.token_hex(3): no name-collision risk between concurrent starts.
  • Port allocation uses a file-based lock (_port_registry_lock() via fcntl.flock): no race condition between concurrent cbox new invocations.
  • _cleanup_session_containers() is called before every new container start, removing stale containers with the same session prefix.
  • Port collision is not a contributing factor; a colliding container would fail with a docker name conflict (which is caught) rather than silently exiting.
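The file-based lock ruled out above can be illustrated with a minimal fcntl.flock context manager. This is a sketch of the general technique only; the helper name, lock path, and port logic here are hypothetical, not the real _port_registry_lock():

```python
import fcntl
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def port_registry_lock(lock_path: Path):
    """Serialize port allocation across concurrent `cbox new` processes.

    fcntl.flock takes an exclusive advisory lock on the file; a second
    process blocks here until the first releases it, so two sandboxes
    can never read-modify-write the port registry at the same time.
    """
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    with lock_path.open("w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)

# Usage: read/update the registry only while holding the lock.
with port_registry_lock(Path("/tmp/cbox-demo/ports.lock")):
    port = 8000  # in the real flow, the next free port comes from the registry
print("allocated", port)
```

An advisory flock is released automatically if the holding process dies, which is why it suits crash-prone sandbox launches better than a lockfile-exists convention.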

Named-session drop diagnostics surfaced in send/output (done)

Fixed the last blind spot: cbox send <name> and cbox output <name> on a dead sandbox previously showed a generic "No session found" message with no diagnostic context. Now both route through _show_session_drop_and_exit(), producing:

  • a full SessionDropDiagnostics panel (cause, worktree state, container state, port allocation)
  • a session_drop event logged to runtime.log in the worktree
  • an actionable recovery hint (cbox new <name> to resume)

This closes the manager's observability gap during long-running task handoffs — when cbox output worker1 discovers the session died, it now reports why and how to recover instead of a generic error.

Tests: 4 updated + 2 new:

  • test_send_named_session_not_found_shows_diagnostics: verifies diagnostics are collected and the cause is displayed
  • test_output_named_session_not_found_shows_diagnostics: same for the output path
  • test_output_named_sandbox_shows_recovery_hint: verifies worktree state and the cbox new recovery command in output
  • test_send_named_sandbox_shows_recovery_hint: same for the send path
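For a concrete picture of the panel contents, here is one way the payload could be modeled. The field names and recovery_hint() helper are assumptions inferred from the panel items listed above, not the actual SessionDropDiagnostics class:

```python
from dataclasses import dataclass, field

@dataclass
class SessionDropDiagnostics:
    """Hypothetical shape of what cbox send/output report for a dead sandbox."""
    session: str
    cause: str                      # e.g. "container exited; --rm removed it"
    worktree_dirty: bool            # uncommitted task progress survives the drop
    container_state: str            # "removed" when launched with --rm
    ports_allocated: list[int] = field(default_factory=list)

    def recovery_hint(self) -> str:
        # Mirrors the documented recovery path: recreate the named sandbox.
        return f"cbox new {self.session}  # recreate the sandbox and resume"

diag = SessionDropDiagnostics(
    session="worker1",
    cause="container exited; --rm removed it before inspection",
    worktree_dirty=True,
    container_state="removed",
)
print(diag.recovery_hint())
```

Carrying the state as structured fields rather than a preformatted string is what lets the same diagnostics feed both the panel and the session_drop runtime-log event.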

Review Feedback

  • [ ] Review cleared