Dataface Tasks

Add contamination-resistant benchmark splits and holdout guardrails

IDMCP_ANALYST_AGENT-ADD_CONTAMINATION_RESISTANT_BENCHMARK_SPLITS_AND_HOLDOUT_GUARDRAILS
Status: not_started
Priority: p2
Milestone: m4-v1-0-launch
Owner: data-ai-engineer-architect
Initiative: benchmark-driven-text-to-sql-and-discovery-evals

Problem

As the team iterates on prompts, retrieval, planning, and repair loops, it becomes easier to accidentally optimize against the visible benchmark itself. Without proper split discipline, the leaderboard can start measuring how well the team tuned to known cases rather than how well the system generalizes.

We need contamination-resistant benchmark splits and lightweight workflow guardrails so eval results remain credible.

Context

The current benchmark work already distinguishes canary-style runs from broader evaluations, but the process still needs stronger separation between:

  • cases used for day-to-day iteration
  • cases used for milestone decisions
  • cases held back to catch quiet overfitting

This task is about measurement hygiene, not model architecture. It should affect benchmark layout, task workflow, and reporting conventions.

Possible Solutions

  1. Recommended: define explicit iteration, validation, and holdout splits with workflow guardrails. Keep a small iteration slice for rapid feedback, a broader validation set for normal comparisons, and a held-out slice that is not used for routine tuning.

Why this is recommended:

  • preserves fast iteration
  • reduces accidental overfitting
  • makes benchmark claims more credible

  2. Keep one shared benchmark for everything.

Trade-off: easiest operationally, but least trustworthy as iteration pressure grows.

  3. Maintain a holdout set informally, without tooling or documentation.

Trade-off: better than nothing, but too easy to violate accidentally.
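One way to make the recommended option concrete is deterministic, hash-based split assignment, so a case's split is a stable function of its identifier and never silently migrates as the benchmark grows. This is a minimal sketch, not an existing convention: the split ratios, the `salt` value, and the `case_id` naming are all assumptions to be decided by the team.

```python
import hashlib

# Illustrative split roles and ratios; the actual proportions are a team decision.
SPLITS = [("iteration", 0.15), ("validation", 0.60), ("holdout", 0.25)]

def assign_split(case_id: str, salt: str = "bench-v1") -> str:
    """Deterministically map a benchmark case to a split.

    Hashing the case id (plus a fixed per-version salt) keeps assignments
    stable as new cases are added, so a holdout case stays held out.
    """
    digest = hashlib.sha256(f"{salt}:{case_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    cumulative = 0.0
    for name, fraction in SPLITS:
        cumulative += fraction
        if bucket <= cumulative:
            return name
    return SPLITS[-1][0]  # guard against floating-point edge at 1.0
```

Note that changing the salt reshuffles every assignment, so it should be fixed per benchmark version and recorded alongside the split metadata.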

Plan

  1. Define benchmark split roles and naming conventions.
  2. Update benchmark metadata and/or CLI conventions so runs clearly record which split they used.
  3. Document which splits are allowed for fast iteration and which are reserved for milestone-quality evaluation.
  4. Add light reporting checks so dashboards distinguish iteration results from holdout results.
  5. Use the holdout slice sparingly when evaluating larger retrieval or generation changes.
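Steps 2–4 above could be enforced with a small run-time guardrail: every eval run records which split it used, and holdout runs require an explicit override rather than being reachable by default. The function and flag names here (`start_run`, `allow_holdout`) are hypothetical, intended only to sketch the shape of the check.

```python
import datetime

# Splits that routine tuning is allowed to consume without ceremony.
ROUTINE_SPLITS = {"iteration", "validation"}

def start_run(split: str, allow_holdout: bool = False) -> dict:
    """Record the split for an eval run and block casual holdout use.

    The returned metadata lets reporting later separate iteration results
    from holdout results instead of mixing them on one dashboard.
    """
    if split == "holdout" and not allow_holdout:
        raise PermissionError(
            "Holdout runs must be requested explicitly (allow_holdout=True) "
            "and reserved for milestone-quality evaluation."
        )
    if split not in ROUTINE_SPLITS | {"holdout"}:
        raise ValueError(f"Unknown split: {split!r}")
    return {
        "split": split,
        "is_holdout": split == "holdout",
        "started_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
```

A dashboard or report generator could then group on the recorded `split` field, which directly satisfies the "distinguish iteration results from holdout results" check in step 4.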

Success criteria

  • benchmark usage is split into iteration vs holdout paths
  • routine tuning does not silently consume the holdout slice
  • milestone-quality claims have a cleaner measurement basis

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - benchmark governance task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared