Dataface Tasks

Improve cbox recovery from hung in-session tool calls

ID: M1-INFRA-020
Status: completed
Priority: p1
Milestone: m1-ft-analytics-analyst-pilot
Owner: head-of-engineering
Completed by: CBox Agent
Completed: 2026-03-17

Problem

CBox sandboxes frequently get stuck in long-running shell tool calls (e.g., hung commit hooks, stalled uv sync, unresponsive git operations), leaving the Claude session unresponsive to normal interaction. The manager's cbox send --interrupt does not reliably break these hung processes because a single Ctrl+C is often insufficient to terminate a deeply nested shell command. When this happens, the only recovery path is manual worktree takeover, which is slow, error-prone, and defeats the purpose of automated sandbox orchestration. This is one of the most common causes of sandbox workflow stalls.

Context

Key files: libs/cbox/cbox/tmux.py (session primitives), libs/cbox/cbox/cli.py (CLI surface), tests in libs/cbox/test_tmux.py, libs/cbox/test_interactive_stall.py, libs/cbox/test_send.py.

Prior state: send_keys(interrupt=True) sent 2× Ctrl+C. Interactive stall detection handled UI prompts (effort level, workspace trust, permission dialogs) but had no concept of "tool call in progress" vs "idle at prompt" — both returned CLEAR.

Possible Solutions

  1. Increase Ctrl+C count only — simple but doesn't help when subprocesses ignore SIGINT entirely.
  2. Add Escape key escalation + tool-call detection (Recommended) — Ctrl+C targets the subprocess; Escape targets Claude Code's own tool-cancellation mechanism. Combined with detecting "tool call active" state, the manager gets actionable status and a two-phase interrupt. Low risk (Escape is a no-op when not in a tool call).
  3. Kill tmux pane and restart — nuclear option, loses session context. Reserved for manual last resort.
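The two-phase interrupt in option 2 can be sketched as follows. This is a minimal illustration, not the actual send_keys implementation: the counts, the build_interrupt_keys and send_interrupt helpers, and the injectable runner are all assumptions made for the example. Only the tmux send-keys key names (C-c, Escape) and the INTERRUPT_ESCAPE_COUNT identifier come from tmux and the plan respectively.

```python
import subprocess
from typing import Callable, List, Sequence

# Assumed values: the task names INTERRUPT_ESCAPE_COUNT but does not state
# its value, and INTERRUPT_CTRL_C_COUNT is a hypothetical name.
INTERRUPT_CTRL_C_COUNT = 2
INTERRUPT_ESCAPE_COUNT = 2


def build_interrupt_keys(
    ctrl_c_count: int = INTERRUPT_CTRL_C_COUNT,
    escape_count: int = INTERRUPT_ESCAPE_COUNT,
) -> List[str]:
    """Return tmux key names: Ctrl+C presses first, then Escape presses."""
    return ["C-c"] * ctrl_c_count + ["Escape"] * escape_count


def run_tmux(args: Sequence[str]) -> None:
    """Run a tmux subcommand; raises CalledProcessError on failure."""
    subprocess.run(["tmux", *args], check=True)


def send_interrupt(
    target: str,
    runner: Callable[[Sequence[str]], None] = run_tmux,
) -> None:
    """Phase 1 targets the subprocess (SIGINT); phase 2 targets tool cancel."""
    for key in build_interrupt_keys():
        runner(["send-keys", "-t", target, key])
```

Passing a fake runner (for example a list's append method) lets the key sequence be checked in tests without a live tmux server.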

Plan

Files modified:

  • libs/cbox/cbox/tmux.py: add INTERRUPT_ESCAPE_COUNT, an Escape key phase in send_keys(interrupt=True), and detect_tool_call_active().
  • libs/cbox/cbox/cli.py: add TOOL_ACTIVE to InteractiveStallStatus, integrate it into check_interactive_stall(), and update both cbox output --check-stall and the cbox send undelivered handler with worktree-takeover guidance.
  • Test files: TDD, with tests written before implementation.
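One way the tool-call detection and the new status could fit together is sketched below. The task does not describe how detect_tool_call_active() recognises an active tool call, so the marker string, the tail_lines parameter, the function signature, and the classify_pane helper are all hypothetical. Only the names detect_tool_call_active, InteractiveStallStatus, CLEAR, and TOOL_ACTIVE come from the task.

```python
from enum import Enum


class InteractiveStallStatus(Enum):
    # Only CLEAR and TOOL_ACTIVE are named in the task; the real enum
    # presumably has further members for the UI-prompt stalls.
    CLEAR = "clear"
    TOOL_ACTIVE = "tool_active"


# Hypothetical marker: text assumed to appear in the pane while a tool runs.
TOOL_ACTIVE_MARKERS = ("esc to interrupt",)


def detect_tool_call_active(pane_text: str, tail_lines: int = 15) -> bool:
    """Return True if a tool-active marker appears in the last lines of the pane.

    Only the tail is inspected so stale markers in scrollback are ignored.
    """
    tail = pane_text.splitlines()[-tail_lines:]
    return any(
        marker in line.lower()
        for line in tail
        for marker in TOOL_ACTIVE_MARKERS
    )


def classify_pane(pane_text: str) -> InteractiveStallStatus:
    """Map captured pane text to a status (prompt checks omitted here)."""
    if detect_tool_call_active(pane_text):
        return InteractiveStallStatus.TOOL_ACTIVE
    return InteractiveStallStatus.CLEAR
```

Restricting the search to the tail of the captured output is a design choice for this sketch: it avoids reporting TOOL_ACTIVE for a call that finished long ago but is still visible in scrollback.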

Implementation Progress

  • [x] Detect non-progressing in-session tool calls in sandbox output.
  • [x] Make cbox send --interrupt reliably break active hung calls.
  • [x] Provide manager guidance when manual worktree takeover is the only fallback.

  • [x] Hung commit/hook/install phases are recoverable through cbox controls.

  • [x] Recovery path is documented and test-covered where feasible.

  • Source issue: m1-infra-016-docs-sync stalled during commit and ignored interrupt.

  • 2026-03-05: Implemented multi-Ctrl+C interrupt in tmux.send_keys(..., interrupt=True) and added tests in libs/cbox/test_tmux.py and libs/cbox/test_send.py.
  • 2026-03-17: Added Escape-key escalation after Ctrl+C in send_keys(interrupt=True). Added detect_tool_call_active() to distinguish "tool running" from "at prompt". Added TOOL_ACTIVE status to InteractiveStallStatus. cbox output --check-stall now reports the tool-active state (exit code 3) with interrupt/takeover guidance. The cbox send --interrupt undelivered path now suggests worktree takeover. 12 new tests across 3 files, all passing.
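The recovery ladder implied by these notes (check for a stall, interrupt, re-check, fall back to takeover) might look like the sketch below from the manager's side. Only exit code 3 for the tool-active state is taken from the notes. The recover function, its return strings, the retry count, and the callables standing in for cbox output --check-stall and cbox send --interrupt are assumptions for illustration.

```python
from typing import Callable

# From the progress notes: --check-stall exits with 3 when a tool call is active.
TOOL_ACTIVE_EXIT = 3


def recover(
    check_stall: Callable[[], int],
    interrupt: Callable[[], None],
    max_attempts: int = 2,
) -> str:
    """Interrupt while the sandbox reports tool-active, then re-check.

    check_stall stands in for running the stall check and returning its
    exit code; interrupt stands in for sending the interrupt. Returns
    "takeover" if the tool call is still active after max_attempts
    interrupts, otherwise "not-tool-active".
    """
    for _ in range(max_attempts):
        if check_stall() != TOOL_ACTIVE_EXIT:
            return "not-tool-active"
        interrupt()
    # Final check after the last interrupt decides whether to escalate.
    if check_stall() != TOOL_ACTIVE_EXIT:
        return "not-tool-active"
    return "takeover"
```

A "takeover" result corresponds to the manual worktree takeover that the guidance recommends as the last fallback. Other exit codes are not interpreted here because the notes do not define them.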

Review Feedback

  • [ ] Review cleared