infra tooling
Purpose
Own the developer operating layer: local/CI dev ergonomics, command surface cleanup, and deployment pathways that make Dataface reliably build, review, and ship. This workstream focuses on operational quality of the toolchain itself rather than product features.
Owner
- Sr Engineer Architect
Tasks by Milestone
A runnable prototype path exists for developer tooling, local workflow reliability, and deployment execution safety, with concrete artifacts that prove the flow works end-to-end in the current codebase. Core assumptions are documented, known constraints are explicit, and the team can explain what is real versus mocked without ambiguity.
- Prototype milestone closure and proof of readiness — Own the prototype milestone closure artifact: confirm end-to-end proof, capture remaining gaps, assign owners, and reco…
Internal analysts and engineers can run at least one real workflow every week that depends on developer tooling, local workflow reliability, and deployment execution safety in the 5T Analytics environment, without bespoke engineering intervention for every run. Instrumentation and feedback capture are in place so failures, friction points, and adoption gaps are visible and triaged with owners.
- Harden cbox sandbox bootstrap with PATH, pre-commit, and git auth health checks Done — Eliminate recurring sandbox runtime failures by standardizing environment bootstrap and adding startup health checks fo…
- Host Dataface on Fivetran GCP Completed (PR #733, PR at 2026-03-23T11:08:04-07:00) — Stand up and harden the canonical GCP runtime path for internal pilot usage, including deploy auth hardening and intern…
- M1 pilot closure, operations, and go-no-go — Own the M1 closure artifact: verify exit-criteria evidence, establish pilot ops cadence and runbook/checklist, track op…
- Move task ready queue intent out of frontmatter and add kickoff flow — Eliminate merge-conflict churn caused by using task frontmatter status=ready as the queue signal. Design and implement…
- Add full-name output mode for cbox list Done — Prevent manager confusion from truncated session names by supporting a full-width/raw list mode.
- Add lightweight qa-explorer verification artifacts and trace capture Completed — Upgrade qa-explorer local visual verification with lightweight, ephemeral, gitignored per-run artifacts instead of repo…
- Add per-worktree local port bundles for dispatch QA Completed — Allocate unique local serve ports for each dispatched worktree, write a worktree-local ports file similar to .cbox-port…
- Add startup and post-pull task-manager reconcile hooks Completed (PR #847, PR at 2026-03-27T08:09:29-07:00) — Add a second-stage reconciliation command/hook that runs at manager startup and after git pull to prune stale register…
- Add strict team owner validation and team listing to plans CLI Completed (PR #769, PR at 2026-03-24T20:20:23-07:00) — Require task and initiative owners to resolve to a known person or role under tasks/team, with clear errors for unknown…
- Add task-manager reconciliation and cleanup passes for stale register and metadata drift Completed — Add periodic reconciliation so the task manager checks for stale register entries, completed-but-unreconciled worktrees…
- Add tasks serve CLI and local server with heartbeat or task-manager status UI Completed (PR #712, PR at 2026-03-22T14:05:13-07:00) — Extend the plans or tasks CLI with a serve subcommand that runs a lightweight local server: render master plan markdown…
- Align cbox bootstrap/health docs with actual commands Done — Fix canonical and wrapper docs where bootstrap and git auth health-check commands diverge from implementation.
- Auto-reconcile merged PRs and orphaned in-progress tasks in task manager Completed (PR #835, PR at 2026-03-26T23:48:26-07:00) — Extend task-manager cleanup so heartbeat/reconcile can automatically mark tasks completed when their PR is merged, demo…
- Build tmux task manager orchestration loop and task metadata Completed (PR #693, PR at 2026-03-18T21:51:30-07:00) — Add task readiness/dependencies/timing metadata, a tmux-hosted Claude manager, and a heartbeat/register flow for host-s…
- cbox manager default parent-branch policy Done — Make manager-launched sandboxes default to the manager's active branch (not main) unless explicitly overridden with --p…
- CBox manager interactive stall detection and recovery Done — Detect and recover manager flows stuck at interactive prompts (e.g. /pr-lite menus, blocker prompts) with deterministic…
- CBox Process: diagnose and hard-fail silent cbox review failures in sandboxes Done — Observed in M1-INFRA-027 forensic run: sandbox had CBOX_CONTAINER=1 so cbox review should use _run_review_in_tmux, but…
- CBox Process: hard-block PR when cbox review runtime is missing Done — Observed during M1-INFRA-027: sandbox /pr flow offered 'skip review, open PR' when Docker/Podman missing, leading to PR…
- CBox review prompt context isolation on sandbox restart Completed — Observed restart path where sandbox opened with stale review prompt context (.cbox/.review-prompt.md flow). Ensure sand…
- CBox sandbox bootstrap health parity for python and pre-commit Completed — Repeatedly observed on fresh sandbox start: bootstrap health checks fail for python and pre-commit immediately after se…
- CBox sandbox git metadata path isolation Done — Fix sandbox git commands failing because worktree metadata points to host paths (e.g., '/Users/.../.git/packed-refs').…
- CBox sandbox session liveness drop detection and recovery Done — Observed manager incident: sandbox session disappeared ('No session found') while worktree/branch remained intact. Add…
- cbox sandbox sessions can exit unexpectedly during long task handoff Completed — Track issue 423 in tasks after retiring GitHub Issues as the active backlog.
- CBox sandbox startup-timeout diagnostics Done — Surface actionable diagnostics when wait_for_prompt times out during sandbox or review startup, replacing opaque "Timeo…
- cbox send false-positive delivery when sandbox TUI ignores input Completed — Investigate and fix cases where cbox send reports success after tmux send-keys, but the target sandbox Claude TUI does…
- CBox session registry stale after sandbox kill Done — Observed during manager cleanup: 'cbox new --list' continued showing a killed sandbox while 'cbox list' showed no sessi…
- CBox setup-worktree ROOT_WORKTREE_PATH fallback Completed — Harden worktree setup so cp from root .env succeeds when ROOT_WORKTREE_PATH is unset by deriving the root path from git…
- Configurable review timeouts and stall detection Done — Add configurable review timeout (CLI flag + env var) with 20m default, and stall detection that distinguishes slow-but-…
- Converge plans and plan commands to a single task CLI surface Completed (PR #770, PR at 2026-03-24T11:46:22-07:00) — Rename just plan task/initiative to just task/initiative, rename .claude/commands plan-task and plan-initiative to task…
- Decouple task workflow from cbox CLI — add /cbox-task command surface Done — Keep core cbox generic and make task workflow optional via a composable skill/command layer.
- Detect open PR conflicts after reconcile and auto-rebase safe task branches Completed — Extend task-manager reconcile so it discovers open PR-backed tasks even when root frontmatter is stale, probes mergeabi…
- Finish visual workflow migration to local review and remove GitHub approval gate Completed (PR #788, PR at 2026-03-24T22:54:12-07:00) — Complete the unfinished visual-regression workflow migration. Make local/mac review the source of truth, remove the sep…
- Fix concurrent worktree port allocation collisions Completed — Investigate and fix the per-worktree port allocator so concurrently created fresh worktrees never reuse the same cloud_…
- Fix qa-explore delegated run stall before summary handoff Completed — Reproduce why scripts/qa-explore launches the inner delegated browser worker and captures artifacts but often stalls be…
- Generalize same-agent repair for conflicted, CI-red, and unreconciled task PRs Completed (PR #849, PR at 2026-03-27T09:08:25-07:00) — Extend the agent-before-human escalation pattern from conflicted PRs to also cover CI-red and dispatch-completed-unreco…
- Harden worker continuity and remove tmux manager surface Completed — Prevent multiple workers from clashing on a single task worktree, add first-class continuation/update workflow for the…
- Implement two-tier task manager escalation watchdog and optional manager Completed (PR #711, PR at 2026-03-22T13:02:07-07:00) — Tier 1: heartbeat and scripts handle dispatch, snapshots, and mechanical recovery without a standing Claude manager. Ti…
- Improve cbox recovery from hung in-session tool calls Completed — Ensure manager interrupt/send can reliably recover sandboxes stuck in long-running shell tool calls.
- Improve tasks serve status UI for operator triage Completed — Replace raw JSON blocks on /status with a simple, readable operator UI: top-level health cards, escalation table, ready…
- Make cbox entrypoint bootstrap timeout configurable Done — Replace hard-coded bootstrap timeout values with environment-configurable settings and sane defaults.
- Make qa-explore default to dangerous mode for automation Completed — Default qa-explore to dangerous mode, keep a safe opt-in, add tests, and update docs so worker QA does not stall on Pla…
- Make qa-explorer use local browser subagent without cbox fallback Completed — Make qa-explorer run through the local subagent/browser path instead of any cbox fallback, ensure the browser automatio…
- Make task manager route tasks by role-owner mapping Completed (PR #766, PR at 2026-03-24T10:26:52-07:00) — When a task's owner is a role slug (e.g. data-viz-designer-engineer), resolve it through the role definition files in ta…
- Make tasks server status page use fresh heartbeat data and auto-refresh Completed (PR #764, PR at 2026-03-24T10:19:32-07:00) — Fix stale /status behavior so operators see current task-manager state without manual restarts or owner confusion. Ensu…
- Master Plans CLI ergonomics and command wrappers Done — Make tasks task tooling easier to run than raw python invocation by adding a user-friendly command entrypoint, concise…
- Move task-manager claim and start state out of task frontmatter into manager registry Completed (PR #779, PR at 2026-03-24T20:40:12-07:00) — Stop relying on task frontmatter mutations like started_at and in_progress claims for manager-driven dispatch. Keep tas…
- Normalize task manager and team identities to canonical dave and rj slugs Completed — Rename team member files from davefowler.md/rj-andrews.md to dave.md/rj.md, update _team_identity_slug() to return nick…
- Plan activity page performance improvements with incremental GitHub sync and caching Completed (PR #765, PR at 2026-03-24T20:20:22-07:00) — Create an implementation plan to reduce tasks activity page load time, currently around 10s. Investigate bottlenecks in…
- Prevent cbox sandboxes from mutating host git common-dir Completed — Sandbox containers currently mount the host repo common .git directory writable at /workspace/.repo-git. Diagnose and f…
- Reduce cbox sandbox startup latency by parallelizing health checks Completed — Run post-boot sandbox health checks concurrently instead of sequentially to reduce worst-case startup delay.
- Resume conflicted task PRs in the same worktree — agent before human escalation Completed (PR #842, PR at 2026-03-27T05:25:49-07:00) — When auto-rebase fails for a conflicted task PR, dispatch an agent to the existing worktree to resolve conflicts before…
- Route manager updates for active tasks into the running worktree or a follow-up task Completed (PR #789) — Prevent manager-side edits to already-running root task files on main. Detect when a task already has an active worktre…
- Scope just server bindings by execution context Completed — Make localhost the default host binding for local recipes and keep 0.0.0.0 where container access requires it.
- Simplify PR checklist enforcement and reduce brittle PR body sync Completed (PR #731, PR at 2026-03-23T02:04:58-07:00) — Reduce false-negative PR checklist failures by removing redundant gates, consolidating label definitions, and making en…
- Task-new authoring flow with full worksheet and strict validate Completed (PR #777, PR at 2026-03-24T20:33:23-07:00) — Improve task creation so new files include Context, Possible Solutions (with Recommended), and Plan, not only Problem. Renam…
- Tasks server Jinja macro expansion for plans markdown Completed — Serve tasks markdown with MkDocs-style macro calls expanded via Jinja and tasks/macros.py. Load extra from mkdocs.yml.…
- Tasks server universal doc-context chat sidebar Completed (PR #778, PR at 2026-03-24T20:36:26-07:00) — FastAPI tasks-server chat on every page with doc context, bounded tools (read, search, edit repo), subprocess Claude sub-…
- Add cbox test command for running visual tests locally in Linux container Done — Add a repeatable `cbox` command that runs visual tests locally in the Linux container used by CI.
- Add just tasks serve composing heartbeat loop and plans server Completed — Add a justfile recipe tasks serve (or equivalent) that starts the background task-manager heartbeat for the default own…
- Consolidate local dispatch and review scripts behind shared implementation Completed — Refactor the new local worktree dispatch and review tooling so scripts/dispatch, scripts/dispatch-kill, scripts/review,…
- Design escalation-manager agent for stuck task triage and restart decisions Completed — Define a future operator/agent layer above heartbeat that can inspect stale tasks, summarize likely failure modes, reco…
- Fix tasks server /status browse links for legacy master_plans paths Completed — QA on http://127.0.0.1:8005/status: task links pointed at /browse/master_plans/workstreams/... and returned 404 after r…
- Fix tasks server /status task lists to match heartbeat groups Completed (PR #768, PR at 2026-03-24T20:20:23-07:00) — Status cards use snapshot counts from classify_tasks but _classify_task_sections buckets tasks incorrectly (registered/…
- Group task manager attention and escalation UI by task instead of raw signals Completed — Escalation signals table and heartbeat text show one row per raw signal. Group by task so each task appears once with a…
- Improve merge flow guidance for worktree-bound local branches Completed — Document and handle expected local branch deletion warnings after `gh pr merge --delete-branch` when branch is checked…
- Jared — package task-manager orchestration (heartbeat, skills, layout) Completed — Plan and name the local task-manager orchestration surface ("Jared"): where code and docs live, which skills belong in…
- Local visual regression artifacts, PR/task links, and task-serve Viz changes timeline Completed (PR #772, PR at 2026-03-24T20:20:23-07:00) — Rework visual regression approval: move away from Linux-only CI as the sole source of truth and toward local runs that…
- Make tasks server the primary master plans doc surface deprecating MkDocs Completed (PR #719, PR at 2026-03-22T22:12:35-07:00) — After the tasks server is stable, update contributor and operator docs to prefer tasks serve, reduce or remove duplicat…
- Redesign manager status overview around active tasks table and stale-active cleanup Completed (PR #790) — Replace split Needs Attention/Escalation sections with an operator-first Active tasks table for runtime-managed tasks,…
- Smarter cbox cleanup: detect squash-merged PRs, ignore sandbox artifacts, scan all worktrees Done — Make cbox cleanup catch squash-merged branches, ignore .claude-sessions-sandbox/ as dirty state, check commits-ahead, a…
- Tasks server docs UI — MkDocs parity plus interactive task actions Completed (PR #758, PR at 2026-03-24T01:30:35-07:00) — Ship a /docs shell on the tasks server that mirrors tasks/mkdocs.yml left nav (milestones, workstreams, team, guidebook…
- Upgrade Jared task-manager visibility orchestration and operator UX Cancelled — Cancelled as obsolete after the operator surface consolidated around tasks server, heartbeat-first workflows, and remov…
- Implement Jared layout — scripts grouping and sync-skills reconciliation Cancelled — Cancelled as obsolete after the repo converged on the tasks-named task system surface. The remaining useful consolidati…
- Rename master_plans directory to tasks and align CLI paths Completed — Repo-wide rename of master_plans to tasks, update PYTHONPATH, justfile, tasks server and CLI roots, CI, skills, and age…
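Two ideas from the list above, an environment-configurable bootstrap timeout with a sane default and post-boot health checks that run concurrently instead of sequentially, can be sketched together. This is a minimal illustration, not the cbox implementation: the variable name `CBOX_BOOTSTRAP_TIMEOUT_S`, the 30-second default, and the helper names `env_timeout` and `run_health_checks` are all assumptions made for the example.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor, wait


def env_timeout(name, default):
    """Read a positive timeout (seconds) from the environment, else use the default."""
    raw = os.environ.get(name, "").strip()
    try:
        value = float(raw)
    except ValueError:
        # Unset, empty, or non-numeric values fall back to the default.
        return default
    return value if value > 0 else default


def run_health_checks(checks, timeout_s):
    """Run all checks concurrently under one shared deadline.

    `checks` maps a check name to a zero-argument callable returning truthy on
    success. A check that raises, returns falsy, or misses the deadline is
    reported as False.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    futures = {pool.submit(fn): name for name, fn in checks.items()}
    done, not_done = wait(futures, timeout=timeout_s)
    results = {}
    for fut in done:
        results[futures[fut]] = fut.exception() is None and bool(fut.result())
    for fut in not_done:
        results[futures[fut]] = False
    # Do not block on stragglers; already-running checks finish in the background.
    pool.shutdown(wait=False, cancel_futures=True)
    return results


if __name__ == "__main__":
    # Hypothetical variable name and default, chosen only for this sketch.
    timeout_s = env_timeout("CBOX_BOOTSTRAP_TIMEOUT_S", 30.0)
    checks = {
        tool: (lambda t=tool: shutil.which(t) is not None)
        for tool in ("python3", "pre-commit", "git")
    }
    for name, ok in sorted(run_health_checks(checks, timeout_s).items()):
        print(f"{name}: {'ok' if ok else 'FAILED'}")
```

With a shared deadline, the worst-case startup delay is bounded by the timeout rather than by the sum of the individual check durations, which is the point of running the checks in parallel.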
Developer tooling, local workflow reliability, and deployment execution safety are hardened enough for regular use by multiple internal teams and initial design partners, with a predictable response loop for issues and requests. Quality expectations are documented, and prioritized improvements from real usage are actively incorporated into delivery.
- M2 design-partner closure and readiness decision — Own the M2 closure artifact: verify internal-adoption and design-partner readiness, maintain the operating checklist, t…
- Master Plans CLI next-stage guidance command Completed (PR #781, PR at 2026-03-24T20:50:47-07:00) — Add an advisory `plans task check` command that inspects a task's narrative sections, reports which are incomplete, ide…
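The advisory check described above (inspect a task's narrative sections and report which are incomplete) could look roughly like the sketch below. It is not the actual `plans task check` code. It assumes that sections are level-2 markdown headings and that the required set is Problem, Context, Possible Solutions, and Plan; the function name `incomplete_sections` is hypothetical.

```python
import re

# Assumed required narrative sections; the real command may use a different set.
REQUIRED_SECTIONS = ("Problem", "Context", "Possible Solutions", "Plan")


def incomplete_sections(markdown, required=REQUIRED_SECTIONS):
    """Return the required sections that are missing or have an empty body."""
    bodies = {}
    current = None
    for line in markdown.splitlines():
        heading = re.match(r"^##\s+(.+?)\s*$", line)
        if heading:
            # A new level-2 heading starts a new section.
            current = heading.group(1)
            bodies[current] = []
        elif current is not None:
            bodies[current].append(line)
    return [
        name for name in required
        if not "".join(bodies.get(name, [])).strip()
    ]


if __name__ == "__main__":
    task = "## Problem\nStatus page is stale.\n\n## Context\n\n## Plan\n- Refresh on load\n"
    print(incomplete_sections(task))
```

Because the check is advisory, a caller would print the returned names as guidance instead of failing the command.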
Launch scope for developer tooling, local workflow reliability, and deployment execution safety is complete, externally explainable, and supportable: user-facing behavior is stable, documentation is publishable, and operational ownership is explicit. Remaining gaps are non-blocking, risk-assessed, and tracked as post-launch follow-up rather than unresolved launch debt.
- M3 public launch closure, operations, and go-no-go — Own the M3 closure artifact: verify public-launch scope, operating readiness, launch checklists, and owner-based closur…
- Pilot Claude Code auto mode instead of skip-permissions Completed (PR #784, PR at 2026-03-24T22:08:59-07:00) — Anthropic shipped Claude Code auto mode: a classifier gates destructive or high-risk tool use while allowing routine ed…
Post-launch stabilization is complete for developer tooling, local workflow reliability, and deployment execution safety: recurring incidents are reduced, support burden is lower, and quality gates are enforced consistently before release. The team has a repeatable operating model for maintenance, regression prevention, and measured reliability improvements.
- M4 v1.0 operating model closure and sign-off — Own the M4 closure artifact: verify the stabilization bar, operating model, regression-prevention posture, and owner-ba…
- Add review path filtering for markdown-heavy mixed diffs — Teach scripts/review and the /pr workflow how to scope review input so large mixed code+markdown branches can be review…
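One way to scope review input for a mixed code and markdown branch is to partition the changed paths by file type before building the review prompt. The sketch below only illustrates that idea and is not how `scripts/review` works: the suffix set and the function name `split_review_paths` are assumptions.

```python
from pathlib import PurePosixPath

# Assumed set of suffixes treated as markdown for review scoping.
MARKDOWN_SUFFIXES = {".md", ".mdx", ".markdown"}


def split_review_paths(changed_paths):
    """Partition changed paths into (code_paths, markdown_paths), preserving order."""
    code, docs = [], []
    for path in changed_paths:
        suffix = PurePosixPath(path).suffix.lower()
        (docs if suffix in MARKDOWN_SUFFIXES else code).append(path)
    return code, docs


if __name__ == "__main__":
    changed = ["scripts/review", "tasks/plan.md", "src/app.py", "README.MD"]
    code, docs = split_review_paths(changed)
    print("review in full:", code)
    print("review separately or summarize:", docs)
```

A reviewer step could then pass only the code paths to the full review and handle the markdown paths in a lighter pass.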
v1.2 delivers meaningful depth improvements in developer tooling, local workflow reliability, and deployment execution safety based on observed usage and retention signals, not just roadmap intent. Enhancements improve real customer outcomes, and release readiness is demonstrated through metrics, regression coverage, and clear migration guidance where relevant.
- M5 v1.2 release closure and migration sign-off — Own the M5 closure artifact: verify release readiness, migration guidance, regression coverage, and owner-based gap clo…
Long-horizon opportunities for developer tooling, local workflow reliability, and deployment execution safety are captured as concrete hypotheses with user impact, prerequisites, and evaluation criteria. Ideas are ranked by strategic value and feasibility so future investment decisions can be made quickly with less rediscovery.
- MX far-future portfolio review and disposition — Own the far-future milestone review artifact: confirm the portfolio is current, rank ideas with prerequisites and owner…