Dataface Tasks

CBox sandbox session liveness drop detection and recovery

ID: INFRA_TOOLING-CBOX_SANDBOX_SESSION_LIVENESS_DROP_DETECTION_AND_RECOVERY
Status: done
Priority: p1
Milestone: m1-ft-analytics-analyst-pilot
Owner: head-of-engineering

Problem

CBox sandbox sessions can silently disappear ("No session found") while the underlying worktree and branch remain intact. When this happens, cbox send and cbox output return a generic error with no diagnostic information, so the operator cannot tell whether the container exited, the tmux session was killed, or the entire sandbox was cleaned up. Without structured cause reporting or recovery guidance, the manager agent falls back on manual inspection (attaching to tmux, checking Docker state, listing worktrees) to work out what happened and how to resume. This missing session-drop diagnosis is a major gap in cbox observability for long-running sandbox workflows.

Context

Possible Solutions

Plan

Implementation Progress

Implementation summary

Problem

When a sandbox session disappeared ("No session found"), cbox printed only a generic error with no diagnostics. Operators had to attach and inspect manually to understand why the session dropped and how to recover.

Solution

Added a SessionDropDiagnostics dataclass and a _diagnose_session_drop() function that collects structured diagnostics when a session is expected but missing:

  • Worktree state: checks if .worktrees/<name>/ still exists
  • Port allocation: checks registry for stale port entries
  • Container state: inspects Docker container running/exited status
  • Cause classification: container_exited, session_killed, fully_cleaned
  • Recovery guidance: context-aware hints (resume vs fresh start vs restart)
  • Runtime logging: session_drop events logged to .cbox/logs/runtime.log
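
As a rough illustration of how the checks above could feed a cause classification, here is a minimal Python sketch. Only the class name SessionDropDiagnostics and the three cause labels come from this task. The field names, the classification rules, the "unknown" fallback, the diagnose() helper, and the hint text are assumptions made for illustration, and the real code in health.py and cli.py may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionDropDiagnostics:
    """Illustrative field layout; the real dataclass in health.py may differ."""

    name: str
    worktree_exists: bool
    stale_port_entry: bool
    container_state: Optional[str]  # e.g. "running" or "exited"; None if no container was found
    cause: str
    recovery_hint: str


def classify_cause(worktree_exists: bool, container_state: Optional[str]) -> str:
    """Map observed state to a cause label (assumed rules, not confirmed by the source)."""
    if container_state == "exited":
        return "container_exited"
    if container_state == "running":
        # The container is up, so only the session inside it went away.
        return "session_killed"
    if not worktree_exists:
        return "fully_cleaned"
    return "unknown"  # hypothetical fallback for states the three labels do not cover


# Hypothetical hint text; the real guidance names concrete recovery commands.
_HINTS = {
    "container_exited": "Restart the container, then resume in the existing worktree.",
    "session_killed": "Resume: start a new session against the existing worktree.",
    "fully_cleaned": "Fresh start: recreate the sandbox from scratch.",
    "unknown": "Inspect the sandbox manually.",
}


def diagnose(
    name: str,
    worktree_exists: bool,
    stale_port_entry: bool,
    container_state: Optional[str],
) -> SessionDropDiagnostics:
    """Bundle the raw observations with a cause label and a recovery hint."""
    cause = classify_cause(worktree_exists, container_state)
    return SessionDropDiagnostics(
        name, worktree_exists, stale_port_entry, container_state, cause, _HINTS[cause]
    )
```

Keeping classification as a pure function of already-collected observations makes it easy to test without Docker or a real worktree; the I/O (filesystem, registry, container inspection) would live in the caller.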

Files changed

  • libs/cbox/cbox/health.py — added SessionDropDiagnostics dataclass
  • libs/cbox/cbox/cli.py — added _diagnose_session_drop(), enhanced send, output commands with diagnostic panels
  • libs/cbox/test_session_drop_diagnostics.py — 18 tests covering dataclass, diagnostics function, CLI integration, and logging
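
The logging behavior covered by these tests could look roughly like the sketch below. The log path .cbox/logs/runtime.log and the session_drop event name come from this task; the JSON-lines record format, the field names, and the log_session_drop() helper are assumptions for illustration only.

```python
import json
import time
from pathlib import Path


def log_session_drop(root: Path, name: str, cause: str) -> Path:
    """Append one session_drop event as a JSON line (record format is an assumption)."""
    log_path = root / ".cbox" / "logs" / "runtime.log"
    # Create the log directory on first use so logging never fails on a fresh checkout.
    log_path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "event": "session_drop",
        "sandbox": name,
        "cause": cause,
        "ts": time.time(),
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return log_path
```

Taking the root directory as a parameter lets a test point the function at a temporary directory instead of the real repository.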

User-visible behavior

cbox send <name> and cbox output <name> now show a "Session Drop Diagnostics" panel when the session is missing, including cause, worktree status, container state, and recovery commands.
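
To give a feel for the panel, here is a plain-text rendering sketch. Only the panel title comes from this task; the box layout, the row labels, and the render_panel() helper are assumptions, and the actual CLI may render the panel differently.

```python
from typing import Mapping


def render_panel(diag: Mapping[str, str]) -> str:
    """Render diagnostics as a boxed plain-text panel (layout is an assumption)."""
    title = "Session Drop Diagnostics"
    rows = [f"{key}: {value}" for key, value in diag.items()]
    # Size the box to the widest line so every row aligns.
    width = max([len(title)] + [len(row) for row in rows])
    border = "+" + "-" * (width + 2) + "+"
    lines = [border, f"| {title.ljust(width)} |", border]
    lines += [f"| {row.ljust(width)} |" for row in rows]
    lines.append(border)
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        render_panel(
            {
                "cause": "container_exited",
                "worktree": "present",
                "container": "exited",
            }
        )
    )
```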

  • No cross-workstream dependencies; builds on existing diagnostics infra from cbox-sandbox-bootstrap-and-auth-health-checks and cbox-session-registry-stale-after-sandbox-kill.
  • Follow-up: consider periodic liveness probes for long-running sandboxes. Not needed for M1, where on-demand diagnostics are sufficient.
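
If the follow-up is ever picked up, a periodic probe could be as small as the polling loop below. This is a sketch of one possible shape, not part of the delivered work: wait_for_drop() and its parameters are hypothetical, and the is_alive callable stands in for whatever session check cbox would actually use.

```python
import time
from typing import Callable, Optional


def wait_for_drop(
    is_alive: Callable[[], bool],
    interval: float = 30.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll is_alive until it returns False and return the number of checks made.

    Returns -1 if max_checks is exhausted while the session is still alive.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if not is_alive():
            return checks
        sleep(interval)
    return -1
```

Injecting both the liveness check and the sleep function keeps the loop testable without waiting on a real sandbox.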

Review Feedback

  • [ ] Review cleared