Dataface Tasks

Configurable review timeouts and stall detection

ID: M1-INFRA-007
Status: done
Priority: p1
Milestone: m1-ft-analytics-analyst-pilot
Owner: head-of-engineering

Problem

CBox reviews have a fixed 10-minute timeout. It is too short for complex reviews (large PRs routinely take 6-12 minutes), and it gives no visibility into whether a review is making progress or is truly stuck. When the timeout fires, the operator cannot tell a review that was actively writing output from one that stalled at startup. The result is premature termination of valid reviews and, conversely, long waits for reviews that silently failed early. Without configurable timeouts and stall detection, operators cannot tune review behavior per environment or diagnose hung reviews without manually attaching to tmux.

Context

Possible Solutions

Plan

Implementation Progress

  • DEFAULT_REVIEW_TIMEOUT constant (1200s / 20 minutes).
  • --review-timeout CLI flag on cbox review.
  • CBOX_REVIEW_TIMEOUT env var (CLI flag takes priority).
  • _resolve_review_timeout() — resolves CLI > env > default, validates > 0.
  • StallStatus enum and _check_review_stall() — file-growth based stall detection.
  • Tmux polling loop uses stall detection and prints diagnostic warnings.
  • Ephemeral container timeout message includes remediation hint.
  • Tests for all timeout resolution paths and stall detection logic.

  • Long reviews can run past the old 10-minute boundary via --review-timeout or env var.

  • Clearly stalled reviews surface warnings before the hard timeout fires.
  • CLI output shows whether review is progressing or blocked.
  • All new code covered by tests (10 new tests, 28 total in test_review_cli.py).

Implementation notes

Timeout resolution priority

  1. --review-timeout <seconds> CLI flag (highest)
  2. CBOX_REVIEW_TIMEOUT env var
  3. DEFAULT_REVIEW_TIMEOUT (1200s)

Invalid env var values trigger a warning and are ignored, so resolution falls through to the default. Zero or negative values raise click.BadParameter.
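The resolution order above can be sketched roughly as follows. This is a minimal illustration, not the code in cli.py: the function name and signature are hypothetical, ValueError stands in for click.BadParameter so the sketch needs only the standard library, and it assumes that "invalid" means a non-integer env value while zero or negative values are rejected from either source.

```python
import os
import warnings
from typing import Mapping, Optional

DEFAULT_REVIEW_TIMEOUT = 1200  # seconds (20 minutes), as stated in the notes


def resolve_review_timeout(
    cli_value: Optional[int] = None,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Resolve the timeout in the order CLI flag > env var > default.

    Hypothetical helper: ValueError stands in for click.BadParameter.
    """
    env = os.environ if env is None else env

    # 1. The CLI flag wins whenever it is supplied.
    if cli_value is not None:
        if cli_value <= 0:
            raise ValueError("--review-timeout must be greater than 0")
        return cli_value

    # 2. The env var is used next; unparseable values are warned and ignored.
    raw = env.get("CBOX_REVIEW_TIMEOUT")
    if raw is not None:
        try:
            value = int(raw)
        except ValueError:
            warnings.warn(f"Ignoring invalid CBOX_REVIEW_TIMEOUT={raw!r}")
        else:
            # Assumption: zero/negative env values are rejected, not ignored.
            if value <= 0:
                raise ValueError("CBOX_REVIEW_TIMEOUT must be greater than 0")
            return value

    # 3. Fall back to the built-in default.
    return DEFAULT_REVIEW_TIMEOUT
```

Passing the environment as a mapping keeps each resolution path easy to test in isolation, for example `resolve_review_timeout(None, {"CBOX_REVIEW_TIMEOUT": "1800"})`.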

Stall detection

  • StallStatus.ALIVE — output file grew since last check
  • StallStatus.STALLED — output file unchanged
  • StallStatus.NO_OUTPUT — no output file yet (too early to judge)

Tmux polling checks every 3s and prints a warning roughly every 15s while the stall continues. The ephemeral container path instead prints heartbeat messages every 30s showing elapsed time.
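A file-growth stall check and the tmux-style polling loop might look like the sketch below. The StallStatus members and the 3s / 15s intervals come from the notes above; everything else is an assumption made for illustration, including the function names, the signatures, the choice to treat the first sighting of the output file as ALIVE, and the use of a simple elapsed counter instead of a wall clock.

```python
import enum
import os
import time
from typing import Callable, Optional, Tuple

POLL_INTERVAL = 3         # seconds between checks
STALL_WARN_INTERVAL = 15  # warn after each ~15s without growth


class StallStatus(enum.Enum):
    ALIVE = "alive"          # output file grew since last check
    STALLED = "stalled"      # output file unchanged
    NO_OUTPUT = "no_output"  # no output file yet


def check_review_stall(
    output_path: str, last_size: Optional[int]
) -> Tuple[StallStatus, Optional[int]]:
    """Classify progress by comparing the file size with the previous size."""
    try:
        size = os.path.getsize(output_path)
    except OSError:
        return StallStatus.NO_OUTPUT, last_size
    # Assumption: the first time the file is seen counts as progress.
    if last_size is None or size > last_size:
        return StallStatus.ALIVE, size
    return StallStatus.STALLED, size


def watch_review(
    output_path: str,
    timeout: int,
    is_finished: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> bool:
    """Poll until the review finishes (True) or the timeout elapses (False)."""
    elapsed = 0
    stalled_for = 0
    last_size: Optional[int] = None
    while elapsed < timeout:
        if is_finished():
            return True
        status, last_size = check_review_stall(output_path, last_size)
        if status is StallStatus.STALLED:
            stalled_for += POLL_INTERVAL
            if stalled_for % STALL_WARN_INTERVAL == 0:
                log(f"warning: no new review output for {stalled_for}s")
        else:
            stalled_for = 0  # growth (or no file yet) resets the stall clock
        sleep(POLL_INTERVAL)
        elapsed += POLL_INTERVAL
    return False
```

Injecting `sleep`, `log`, and `is_finished` is a design choice for this sketch only: it lets the loop run instantly in tests while keeping the warning cadence observable.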

Files changed

  • libs/cbox/cbox/cli.py — core implementation
  • libs/cbox/test_review_cli.py — 10 new tests

  • None (self-contained change).

Review Feedback

  • [ ] Review cleared