Dataface Tasks

cbox sandbox sessions can exit unexpectedly during long task handoff

IDINFRA_TOOLING-CBOX_SANDBOX_SESSIONS_CAN_EXIT_UNEXPECTEDLY_DURING_LONG_TASK_HANDOFF
Status: completed
Priority: p1
Milestone: m1-ft-analytics-analyst-pilot
Owner: sr-engineer-architect
Completed by: CBox Agent
Completed: 2026-03-16

Problem

Named cbox sandboxes can do real work for minutes or hours and then disappear without a recorded shutdown reason. In the observed worktree2 runs on Friday, March 13, 2026, the sandbox successfully started, accepted prompts, modified files, and ran tests, but the tmux session and the docker run --rm container later vanished mid-task. The worktree preserved partial progress, but the manager had no durable exit code, no teardown reason, and no automated recovery beyond manually restarting the sandbox.

Context

  • Reproduced multiple times on sandbox worktree2 while it worked on the dft inspect catalog builder task.
  • Runtime logs under .worktrees/worktree2/.cbox/logs/runtime.log record startup success (sandbox_starting, sandbox_prompt_ready, sandbox_bootstrap_health) but do not record sandbox exit cause.
  • Because containers are launched with docker run --rm, the container is removed immediately on exit and postmortem logs/exit codes are lost unless captured before teardown.
  • The surviving worktree showed partial in-progress changes after each crash, confirming the sandbox was actively executing task work rather than sitting idle.
  • Related current work:
      • tasks/workstreams/infra-tooling/tasks/cbox-send-false-positive-delivery-when-sandbox-tui-ignores-input.md
      • recent TERM fix for interactive containers on branch codex/fix-cbox-startup-timeouts
  • Relevant implementation paths:
      • libs/cbox/cbox/cli.py (sandbox lifecycle and runtime logging)
      • libs/cbox/cbox/container.py (container launch configuration)
      • libs/cbox/cbox/tmux.py (session creation/attach/send behavior)

Possible Solutions

A. Capture exit diagnostics before teardown

Record sandbox container exit details and final pane state before tmux/container teardown, persist them in runtime logs, and surface a clear reason when a session disappears. Pair that with a manager-friendly recovery loop so repeated crashes create actionable evidence instead of mystery.

  • Pros: Directly addresses the black-box failure mode, helps root-cause both container exits and TUI/input issues, and improves repeated recovery.
  • Cons: Requires careful handling of --rm containers and race conditions around short-lived exits.

B. Remove --rm for interactive sandbox containers

Keep exited containers around until explicit cleanup so operators can inspect docker logs and exit codes manually.

  • Pros: Easier postmortem debugging.
  • Cons: Leaves more container garbage behind and may complicate normal cleanup UX.

C. Focus only on automatic restart behavior

Detect a dropped sandbox and immediately recreate it without trying to capture why the first one died.

  • Pros: Smaller user-facing interruption.
  • Cons: Treats symptoms only and preserves the current observability gap.

Plan

Use approach A.

  1. Reproduce the repeated worktree2 sandbox-drop behavior using the saved logs and current launch path.
  2. Identify where sandbox sessions can disappear without emitting a final runtime-log event.
  3. Add durable exit diagnostics:
      • final pane capture,
      • container exit/teardown details when available,
      • explicit runtime-log events for sandbox stop/crash/drop.
  4. Surface that evidence in manager-facing commands so cbox list, cbox output, or startup recovery can explain what happened.
  5. Add focused tests around the new diagnostics/recovery path.
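The durable runtime-log events from step 3 could take the shape of append-only JSON lines in the worktree's runtime.log. A minimal sketch under that assumption (the helper name and field set are hypothetical; only the sandbox_stopped event name comes from this task):

```python
import json
import time
from pathlib import Path

def log_runtime_event(log_path: Path, event: str, **fields) -> None:
    """Append one JSON line per lifecycle event so an exit reason
    survives even after the tmux session and container are gone."""
    record = {"ts": time.time(), "event": event, **fields}
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

# Record a sandbox exit before teardown removes all other evidence.
log = Path("/tmp/cbox-demo/runtime.log")
log_runtime_event(log, "sandbox_stopped", session="worktree2",
                  exit_code=137, pane_lines=42)
print(log.read_text().splitlines()[-1])
```

Append-only JSON lines keep the log greppable and make each event self-describing, which matters when the only surviving artifact of a crash is this file.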

Implementation Progress

  • Task upgraded from backlog placeholder to active incident work after repeated worktree2 sandbox drops during an M1 task execution loop.
  • Confirmed current blind spot: runtime logs persist startup success but not sandbox death reason, and docker --rm removes the container before later inspection.

Exit diagnostics and runtime logging (done)

Added durable exit diagnostics so sandbox exits are observable in runtime logs:

  1. _collect_exit_diagnostics() in cbox/diagnostics.py — captures final tmux pane content and container exit code (via docker inspect --format '{{.State.ExitCode}}') before teardown. Gracefully degrades when the container was already --rm'd (exit code = "unknown").

  2. _log_session_exit() in cbox/cli.py — reusable helper that resolves workdir from the tmux session path, finds the container by prefix, collects exit diagnostics, and logs a sandbox_stopped event with exit code and pane line count. Called before _cleanup_session_containers / tmux.kill_session so the pane and container are still inspectable.

  3. Wired into all stop paths:
      • cbox stop <name> (named stop)
      • cbox stop --all (via _stop_sessions)
      • cbox new <name> --kill

  4. StartupTimeoutDiagnostics.session_exists field in cbox/health.py — distinguishes "session vanished during startup (container exited early)" from "prompt did not appear before timeout". The summary now shows an explicit failure mode label.

  5. _collect_startup_timeout_diagnostics() enhanced — checks tmux.session_exists first and skips pane capture when the session is already gone, avoiding misleading empty-pane output.

  6. Tests: 18 new/updated tests covering:
      • exit diagnostics: pane capture, exit code capture, container-already-removed, timeout, no-runtime
      • stop/kill logging: sandbox_stopped event emitted to runtime.log
      • startup timeout: session_exists flag, "vanished" vs "timeout" summary text

Concurrent start / port collision investigation (done)

Investigated the fresh startup-timeout pattern:

  • Container names use secrets.token_hex(3): no name-collision risk between concurrent starts.
  • Port allocation uses a file-based lock (_port_registry_lock() via fcntl.flock): no race condition between concurrent cbox new invocations.
  • _cleanup_session_containers() is called before every new container start, removing stale containers with the same session prefix.
  • Port collision is not a contributing factor; a colliding container would fail with a docker name conflict (which is caught) rather than silently exiting.
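The file-based lock ruled out above can be illustrated with a minimal fcntl.flock context manager. This is a sketch of the general technique only; the helper name, lock path, and port logic here are hypothetical, not the real _port_registry_lock():

```python
import fcntl
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def port_registry_lock(lock_path: Path):
    """Serialize port allocation across concurrent `cbox new` processes.

    fcntl.flock takes an exclusive advisory lock on the file; a second
    process blocks here until the first releases it, so two sandboxes
    can never read-modify-write the port registry at the same time.
    """
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    with lock_path.open("w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)

# Usage: read/update the registry only while holding the lock.
with port_registry_lock(Path("/tmp/cbox-demo/ports.lock")):
    port = 8000  # in the real flow, the next free port comes from the registry
print("allocated", port)
```

An advisory flock is released automatically if the holding process dies, which is why it suits crash-prone sandbox launches better than a lockfile-exists convention.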

Named-session drop diagnostics surfaced in send/output (done)

Fixed the last blind spot: cbox send <name> and cbox output <name> on a dead sandbox previously showed a generic "No session found" message with no diagnostic context. Now both route through _show_session_drop_and_exit(), producing:

  • a full SessionDropDiagnostics panel (cause, worktree state, container state, port allocation)
  • a session_drop event logged to runtime.log in the worktree
  • an actionable recovery hint (cbox new <name> to resume)

This closes the manager's observability gap during long-running task handoffs — when cbox output worker1 discovers the session died, it now reports why and how to recover instead of a generic error.

Tests: 4 updated + 2 new:

  • test_send_named_session_not_found_shows_diagnostics: verifies diagnostics are collected and the cause is displayed
  • test_output_named_session_not_found_shows_diagnostics: same for the output path
  • test_output_named_sandbox_shows_recovery_hint: verifies worktree state and the cbox new recovery command in output
  • test_send_named_sandbox_shows_recovery_hint: same for the send path
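For a concrete picture of the panel contents, here is one way the payload could be modeled. The field names and recovery_hint() helper are assumptions inferred from the panel items listed above, not the actual SessionDropDiagnostics class:

```python
from dataclasses import dataclass, field

@dataclass
class SessionDropDiagnostics:
    """Hypothetical shape of what cbox send/output report for a dead sandbox."""
    session: str
    cause: str                      # e.g. "container exited; --rm removed it"
    worktree_dirty: bool            # uncommitted task progress survives the drop
    container_state: str            # "removed" when launched with --rm
    ports_allocated: list[int] = field(default_factory=list)

    def recovery_hint(self) -> str:
        # Mirrors the documented recovery path: recreate the named sandbox.
        return f"cbox new {self.session}  # recreate the sandbox and resume"

diag = SessionDropDiagnostics(
    session="worker1",
    cause="container exited; --rm removed it before inspection",
    worktree_dirty=True,
    container_state="removed",
)
print(diag.recovery_hint())
```

Carrying the state as structured fields rather than a preformatted string is what lets the same diagnostics feed both the panel and the session_drop runtime-log event.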

Review Feedback

  • [ ] Review cleared