tasks/workstreams/mcp-analyst-agent/tasks/define-normalized-external-benchmark-case-and-result-contracts-for-apps-evals.md


---
type: task
id: MCP_ANALYST_AGENT-DEFINE_NORMALIZED_EXTERNAL_BENCHMARK_CASE_AND_RESULT_CONTRACTS_FOR_APPS_EVALS
title: Define normalized external benchmark case and result contracts for apps/evals
description: Define the internal contracts that map external benchmark tasks, metadata, and outputs into Dataface eval types without hiding benchmark-specific semantics.
milestone: m2-internal-adoption-design-partners
owner: data-ai-engineer-architect
status: not_started
priority: p1
initiative: external-text-to-sql-benchmarks-and-sota-calibration
dependencies:
  - plan-external-text-to-sql-benchmark-adoption-order-and-constraints
---


# Define normalized external benchmark case and result contracts for apps/evals

## Problem

Define the internal contracts that map external benchmark tasks, metadata, and outputs into Dataface eval types without hiding benchmark-specific semantics.

## Context

## Possible Solutions

1. Recommended: define a shared external benchmark contract with explicit provenance fields plus benchmark-specific metadata. Add a small common contract for case metadata and run provenance, then keep benchmark-specific details in namespaced metadata instead of flattening them away.

   Why this is recommended: provenance stays required rather than optional, the shared apps/evals architecture is preserved, and benchmark-specific semantics survive in namespaced metadata instead of being hidden behind opaque labels.

2. Reuse the current internal case/result shape with only a free-form metadata blob.

   Trade-off: minimal changes, but provenance becomes optional and dashboards will drift toward opaque labels.

3. Create one full contract per benchmark family.

   Trade-off: truest to each benchmark, but it throws away the shared apps/evals architecture.
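As a sketch of the recommended option, a case record could keep the shared normalized fields at the top level and tuck benchmark-specific details under a namespaced metadata key. Everything here is illustrative: the "bird" namespace, every field name, and every value are assumptions, not the final contract.

```python
# Hypothetical case record: shared fields stay top-level; benchmark-specific
# details live under a namespaced metadata key instead of being flattened away.
case = {
    # Shared, normalized fields (required for every external benchmark)
    "benchmark": "bird",
    "benchmark_version": "example-release",
    "split": "dev",
    "dialect": "sqlite",
    "task_type": "text_to_sql",
    "case_id": "bird-dev-0001",
    "question": "How many users signed up?",
    "gold_sql": "SELECT COUNT(*) FROM users",
    # Benchmark-specific semantics, namespaced per benchmark
    "metadata": {
        "bird": {
            "evidence": "signups are rows in the users table",
            "difficulty": "moderate",
        }
    },
}

# Dashboards and loaders can rely on the shared keys without knowing
# anything about the "bird" namespace.
shared_keys = {k for k in case if k != "metadata"}
```

The point of the namespacing is that a second benchmark adds its own key under `metadata` without touching the shared shape.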

## Plan

1. Define the required normalized fields for a benchmark case:
   - benchmark
   - benchmark_version
   - split
   - dialect
   - task_type
   - case_id
   - question
   - gold_sql or a benchmark-native gold representation
2. Define the required run-level provenance fields:
   - scorer type
   - environment type
   - local-vs-official comparability
   - dataset root or release identifier
3. Decide where the contract lives in apps/evals/sql/ so loaders and dashboards can share it.
4. Update artifact expectations so summary.json and results.jsonl have stable provenance columns for downstream dashboards.
5. Write the contract down in code-facing docs before adapter implementation starts.
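The first two Plan steps could be sketched as dataclasses. This is a sketch only, assuming Python as the implementation language; the class names, exact field spellings, and types are placeholders to make the field lists concrete, not the final contract.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class ExternalBenchmarkCase:
    # Required normalized fields from Plan step 1 (names illustrative)
    benchmark: str           # benchmark family identifier
    benchmark_version: str   # release or version label
    split: str               # e.g. train / dev / test
    dialect: str             # SQL dialect the gold query targets
    task_type: str
    case_id: str
    question: str
    gold_sql: Optional[str] = None  # or a benchmark-native gold representation:
    gold_native: Any = None
    # Benchmark-specific details, namespaced per benchmark, never flattened
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class RunProvenance:
    # Required run-level provenance fields from Plan step 2
    scorer_type: str
    environment_type: str
    official_comparable: bool  # local-vs-official comparability flag
    dataset_root: str          # dataset root or release identifier
```

A loader would emit one case record per benchmark task and one provenance record per run; serialized, these become the stable columns that summary.json and results.jsonl carry for downstream dashboards.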

## Implementation Progress

<!-- Technical details, key decisions, code changes. Append as work progresses. -->

## QA Exploration

<!-- For UI/browser tasks: use Playwright MCP to explore the running app. Record bugs found, fixes applied, and suggestions for future work. Skip for non-UI tasks (mark N/A). -->

N/A - planning/task-definition work.

## Review Feedback

<!-- Reviewer comments, what was changed in response, and sign-off. -->