Dataface Tasks

Reduce cbox sandbox startup latency by parallelizing health checks

ID: M1-INFRA-015
Status: completed
Priority: p1
Milestone: m1-ft-analytics-analyst-pilot
Owner: head-of-engineering
Completed by: CBox Agent
Completed: 2026-03-16

Problem

Sandbox bootstrap health checks (python, pre-commit, uv, git_auth) ran sequentially in a list comprehension, so the worst-case wall time was 4 × 15s = 60s. They now run concurrently in a ThreadPoolExecutor, which brings the worst case down to about 15s.

_run_container_health_checks() in cbox/cli.py executed 4 independent docker exec probes one after another. Each has a 15s timeout. When multiple probes are slow (common in cold-start sandboxes), startup delay compounds.
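
The worst-case arithmetic can be written out as a small sketch. The probe names and the 15s timeout come from the description above; the helper names are hypothetical and only model the timing. They do not run any probes.

```python
# Per-probe timeout in seconds, as stated in the task description.
PROBE_TIMEOUTS = {"python": 15, "pre-commit": 15, "uv": 15, "git_auth": 15}


def sequential_worst_case(timeouts: dict[str, int]) -> int:
    # Probes run one after another, so every timeout can be paid in full.
    return sum(timeouts.values())


def concurrent_worst_case(timeouts: dict[str, int]) -> int:
    # Probes run side by side, so only the slowest one bounds the wall time.
    return max(timeouts.values())


print(sequential_worst_case(PROBE_TIMEOUTS))  # 60
print(concurrent_worst_case(PROBE_TIMEOUTS))  # 15
```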

Context

  • Primary file: libs/cbox/cbox/cli.py, function _run_container_health_checks() (lines 1255-1339)
  • Types: libs/cbox/cbox/health.py, HealthCheckResult dataclass
  • Tests: libs/cbox/test_sandbox_start.py — 7 health-check unit tests
  • Constraint: _run_probe touches no shared mutable state (each call only runs its own docker exec probe), so it is safe to call from multiple threads.
  • Import: concurrent.futures.ThreadPoolExecutor at module level (line 16 of cli.py).

Possible Solutions

  1. ThreadPoolExecutor: standard library, minimal change. Reading results from the futures list in submission order preserves the original check order. ✓
  2. asyncio.gather: would require converting the entire call chain to async, which is overkill for this change.
  3. multiprocessing.Pool: adds unnecessary process overhead for I/O-bound subprocess calls.

Plan

Option 1: concurrent.futures.ThreadPoolExecutor with max_workers=len(checks). Submit all probes, collect results via f.result() in definition order. Three-line change in _run_container_health_checks().
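
A minimal sketch of this plan, not the actual cbox code. The names run_probe and run_health_checks, the fields on HealthCheckResult, and the shape of the checks list are assumptions for illustration, and the probe sleeps instead of calling docker exec.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheckResult:
    # Hypothetical fields; the real dataclass lives in cbox/health.py.
    name: str
    ok: bool


def run_probe(name: str, delay: float) -> HealthCheckResult:
    # Stand-in for a docker exec probe: sleeps instead of running a subprocess.
    time.sleep(delay)
    return HealthCheckResult(name=name, ok=True)


def run_health_checks(checks: list[tuple[str, float]]) -> list[HealthCheckResult]:
    if not checks:
        return []  # ThreadPoolExecutor raises ValueError for max_workers=0
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        # Submit every probe up front so they all run at the same time.
        futures = [pool.submit(run_probe, name, delay) for name, delay in checks]
        # Iterate the futures list, not completion order, to keep definition order.
        return [f.result() for f in futures]
```

Sizing the pool to len(checks) means no probe waits for a free worker, which is what keeps the worst case near a single probe's timeout.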

Implementation Progress

  • Run health checks (python, pre-commit, uv, git_auth) concurrently.
  • Preserve deterministic output ordering in CLI rendering.
  • Keep per-check timeout and remediation behavior unchanged.
  • Add/adjust tests to validate concurrency-safe behavior.

  • Worst-case health check wall-clock time is near single-check timeout, not sum of all check timeouts.

  • Existing warning panel output remains stable and actionable.
  • cbox tests pass.

  • Source issue: repeated 6-12 minute reviews plus additional startup delay from sequential checks.

  • Changed the return line from [_run_probe(...) for ... in checks] to ThreadPoolExecutor.submit() + [f.result() for f in futures].

  • Order preserved because futures list matches checks list order.
  • Per-check 15s timeout unchanged (each thread still calls subprocess.run(timeout=15)).
  • _run_probe already touched no shared mutable state, so it is thread-safe without changes.
  • Added two tests: test_container_health_checks_run_concurrently (timing) and test_container_health_checks_preserve_order (deterministic output).
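
The two tests could look roughly like the sketch below. Only the test names come from this task; the bodies, fake_probe, and run_checks are illustrative stand-ins that simulate probes with time.sleep rather than the real cbox fixtures.

```python
import time
from concurrent.futures import ThreadPoolExecutor

NAMES = ["python", "pre-commit", "uv", "git_auth"]


def fake_probe(name: str, delay: float, finished: list[str]) -> str:
    # Simulated probe: sleep, then record when this probe actually completed.
    time.sleep(delay)
    finished.append(name)  # list.append is atomic in CPython
    return name


def run_checks(checks: list[tuple[str, float]], finished: list[str]) -> list[str]:
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = [pool.submit(fake_probe, n, d, finished) for n, d in checks]
        return [f.result() for f in futures]


def test_container_health_checks_run_concurrently():
    checks = [(name, 0.2) for name in NAMES]
    start = time.monotonic()
    run_checks(checks, finished=[])
    elapsed = time.monotonic() - start
    # Sequential execution would need about 0.8s; concurrent is close to 0.2s.
    assert elapsed < 0.6


def test_container_health_checks_preserve_order():
    # Staggered delays force the first check to finish last.
    checks = list(zip(NAMES, [0.3, 0.2, 0.1, 0.0]))
    finished: list[str] = []
    results = run_checks(checks, finished)
    assert finished == list(reversed(NAMES))  # completion was out of order
    assert results == NAMES                   # output still in definition order
```

The staggered delays matter: with equal delays the probes may happen to finish in submission order, and the order test would pass even if results were collected in completion order.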

Review Feedback

Round 1 — cbox review (opus), 2026-03-08

  • Verdict: APPROVED
  • MEDIUM-1: Move concurrent.futures import to module level → Fixed.
  • MEDIUM-2: Order test doesn't exercise out-of-order completion → Fixed (added staggered delays).

  • [x] Review cleared