Sustainable operating model

ID	M4_V1_0_LAUNCH-MCP_ANALYST_AGENT-03
Status	not_started
Priority	p1
Milestone	m4-v1-0-launch
Owner	data-ai-engineer-architect

Problem

The MCP server and agent evaluation framework are maintained through ad-hoc heroics rather than a defined operating model. There is no documented on-call rotation for MCP tool failures, no release cadence for tool schema or prompt updates, no triage process for incoming eval regressions, and no runbook for common operational tasks (deploying a new tool, deprecating an old one, rotating cached profiles). As the system grows beyond a single maintainer, this lack of operational structure will lead to dropped incidents, inconsistent releases, and contributor confusion about ownership.

Context

A launch can succeed briefly even with fuzzy ownership, but AI agent tool interfaces, execution workflows, and eval-driven behavior tuning will drift quickly without a clear model for maintenance, triage, and decision-making.
This task is about defining who owns backlog hygiene, review standards, incidents, documentation, and the cadence for future improvements.
Expected touchpoints include dataface/ai/, MCP/tool contracts, cloud chat surfaces, eval runners, prompt artifacts, runbooks, planning docs, and team processes that currently rely too heavily on shared memory.

Possible Solutions

A - Let the current contributors coordinate informally: low overhead, but it becomes brittle as scope and contributors grow.
B - Recommended: define a lightweight operating model with named owners and cadences, so that maintenance, incident response, prioritization, and release decisions are explicit.
C - Centralize all ownership in one person or team indefinitely: clearer in the short term, but usually unsustainable and likely to become a bottleneck.

Plan

Map the recurring operational decisions around AI agent tool interfaces, execution workflows, and eval-driven behavior tuning, and identify where ownership, handoff, or cadence is currently unclear.
Document the operating model: owners, review loops, incident or support handling, documentation upkeep, and backlog-management expectations.
Align the model with the actual command/docs/test surfaces that people use day to day so it is operational rather than aspirational.
Publish the model in the relevant planning/runbook surfaces and refine it after one real cycle of use.
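
One way to keep the model operational rather than aspirational is to record it as data and check it for gaps. The sketch below is illustrative only: the `OperatingArea` type, the `find_gaps` helper, the area names, the cadence values, and the runbook path are all hypothetical and not part of the existing codebase. The only value taken from this task is the owner name.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class OperatingArea:
    """One recurring operational responsibility (hypothetical shape)."""
    name: str
    owner: Optional[str]         # named owner, or None if unassigned
    cadence_days: Optional[int]  # review/release cadence, or None if undefined
    runbook: Optional[str]       # path to a runbook, or None if missing


def find_gaps(areas: List[OperatingArea]) -> List[str]:
    """Return a human-readable line for each missing owner, cadence, or runbook."""
    gaps = []
    for area in areas:
        if not area.owner:
            gaps.append(f"{area.name}: no named owner")
        if area.cadence_days is None:
            gaps.append(f"{area.name}: no review cadence")
        if not area.runbook:
            gaps.append(f"{area.name}: no runbook")
    return gaps


# Example entries; every value except the owner name is an assumption.
EXAMPLE_AREAS = [
    OperatingArea("mcp-tool-failures", "data-ai-engineer-architect", 7, None),
    OperatingArea("eval-regression-triage", None, None, None),
    OperatingArea(
        "tool-schema-and-prompt-releases",
        "data-ai-engineer-architect",
        14,
        "docs/runbooks/release.md",  # hypothetical path
    ),
]

if __name__ == "__main__":
    for gap in find_gaps(EXAMPLE_AREAS):
        print(gap)
```

A check like this could run alongside the existing tests, so that adding a new tool or eval surface without an owner, cadence, or runbook shows up as a visible gap instead of relying on shared memory.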

Implementation Progress

Review Feedback

[ ] Review cleared