Dataface Tasks

Add contamination-resistant benchmark splits and holdout guardrails

IDMCP_ANALYST_AGENT-ADD_CONTAMINATION_RESISTANT_BENCHMARK_SPLITS_AND_HOLDOUT_GUARDRAILS
Status: not_started
Priority: p2
Milestone: m4-v1-0-launch
Owner: data-ai-engineer-architect
Initiative: benchmark-driven-text-to-sql-and-discovery-evals

Problem

As the team iterates on prompts, retrieval, planning, and repair loops, it becomes easier to accidentally optimize against the visible benchmark itself. Without proper split discipline, the leaderboard can start measuring how well the team tuned to known cases rather than how well the system generalizes.

We need contamination-resistant benchmark splits and lightweight workflow guardrails so eval results remain credible.

Context

The current benchmark work already distinguishes canary-style runs from broader evaluations, but the process still needs stronger separation between:

  • cases used for day-to-day iteration
  • cases used for milestone decisions
  • cases held back to catch quiet overfitting

This task is about measurement hygiene, not model architecture. It should affect benchmark layout, task workflow, and reporting conventions.

Possible Solutions

  1. Recommended: define explicit iteration, validation, and holdout splits with workflow guardrails. Keep a small iteration slice for rapid feedback, a broader validation set for normal comparisons, and a held-out slice that is not used for routine tuning.

Why this is recommended:

  • preserves fast iteration
  • reduces accidental overfitting
  • makes benchmark claims more credible

  2. Keep one shared benchmark for everything.

Trade-off: easiest operationally, but least trustworthy as iteration pressure grows.

  3. Maintain a holdout set informally, without tooling or documentation.

Trade-off: better than nothing, but too easy to violate accidentally.
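One way to make the recommended option concrete is deterministic, hash-based split assignment, so a case's split is a stable function of its identifier and never silently migrates as the benchmark grows. This is a minimal sketch, not an existing convention: the split ratios, the `salt` value, and the `case_id` naming are all assumptions to be decided by the team.

```python
import hashlib

# Illustrative split roles and ratios; the actual proportions are a team decision.
SPLITS = [("iteration", 0.15), ("validation", 0.60), ("holdout", 0.25)]

def assign_split(case_id: str, salt: str = "bench-v1") -> str:
    """Deterministically map a benchmark case to a split.

    Hashing the case id (plus a fixed per-version salt) keeps assignments
    stable as new cases are added, so a holdout case stays held out.
    """
    digest = hashlib.sha256(f"{salt}:{case_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    cumulative = 0.0
    for name, fraction in SPLITS:
        cumulative += fraction
        if bucket <= cumulative:
            return name
    return SPLITS[-1][0]  # guard against floating-point edge at 1.0
```

Note that changing the salt reshuffles every assignment, so it should be fixed per benchmark version and recorded alongside the split metadata.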

Plan

  1. Define benchmark split roles and naming conventions.
  2. Update benchmark metadata and/or CLI conventions so runs clearly record which split they used.
  3. Document which splits are allowed for fast iteration and which are reserved for milestone-quality evaluation.
  4. Add light reporting checks so dashboards distinguish iteration results from holdout results.
  5. Use the holdout slice sparingly when evaluating larger retrieval or generation changes.
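Steps 2–4 above could be enforced with a small run-time guardrail: every eval run records which split it used, and holdout runs require an explicit override rather than being reachable by default. The function and flag names here (`start_run`, `allow_holdout`) are hypothetical, intended only to sketch the shape of the check.

```python
import datetime

# Splits that routine tuning is allowed to consume without ceremony.
ROUTINE_SPLITS = {"iteration", "validation"}

def start_run(split: str, allow_holdout: bool = False) -> dict:
    """Record the split for an eval run and block casual holdout use.

    The returned metadata lets reporting later separate iteration results
    from holdout results instead of mixing them on one dashboard.
    """
    if split == "holdout" and not allow_holdout:
        raise PermissionError(
            "Holdout runs must be requested explicitly (allow_holdout=True) "
            "and reserved for milestone-quality evaluation."
        )
    if split not in ROUTINE_SPLITS | {"holdout"}:
        raise ValueError(f"Unknown split: {split!r}")
    return {
        "split": split,
        "is_holdout": split == "holdout",
        "started_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
```

A dashboard or report generator could then group on the recorded `split` field, which directly satisfies the "distinguish iteration results from holdout results" check in step 4.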

Success criteria

  • benchmark usage is split into iteration vs holdout paths
  • routine tuning does not silently consume the holdout slice
  • milestone-quality claims have a cleaner measurement basis

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - benchmark governance task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared