Dataface Tasks

Define normalized external benchmark case and result contracts for apps/evals

ID: MCP_ANALYST_AGENT-DEFINE_NORMALIZED_EXTERNAL_BENCHMARK_CASE_AND_RESULT_CONTRACTS_FOR_APPS_EVALS
Status: not_started
Priority: p1
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: external-text-to-sql-benchmarks-and-sota-calibration

Problem

Define the internal contracts that map external benchmark tasks, metadata, and outputs into Dataface eval types without hiding benchmark-specific semantics.

Context

  • apps/evals/sql/runner.py, apps/evals/sql/backends.py, and apps/evals/sql/scorer.py already define the internal eval loop and artifact shape.
  • The current internal benchmark is comparatively uniform: one task family, one local artifact shape, one main dashboard view.
  • External benchmarks add benchmark name, split, version, dialect, scorer, and environment differences that should survive normalization.
  • If the contract is too generic, dashboards will lie. If the contract is too benchmark-specific, adapters will fork the whole system.

Possible Solutions

  1. Recommended: define a shared external benchmark contract with explicit provenance fields plus benchmark-specific metadata. Add a small common contract for case metadata and run provenance, then keep benchmark-specific details in namespaced metadata instead of flattening them away.

Why this is recommended:

  • preserves reuse in the runner and dashboard layers
  • keeps comparisons honest
  • avoids creating separate eval stacks for each benchmark
  2. Reuse the current internal case/result shape with only a free-form metadata blob.

Trade-off: minimal changes, but provenance becomes optional and dashboards will drift toward opaque labels.

  3. Create one full contract per benchmark family.

Trade-off: truest to each benchmark, but it discards the shared apps/evals architecture and forces each adapter to rebuild the runner and dashboard layers.
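The recommended option can be sketched as a small dataclass: shared provenance fields are explicit and required, while benchmark-specific details live under a metadata key namespaced by benchmark name instead of being flattened into the shared schema. All names and values below are illustrative, not the final contract.

```python
# Hypothetical sketch of option 1: explicit shared fields plus
# namespaced, benchmark-specific metadata. Field names are illustrative.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ExternalBenchmarkCase:
    # Shared, required provenance fields every adapter must populate.
    benchmark: str
    benchmark_version: str
    split: str
    case_id: str
    question: str
    # Benchmark-specific details keyed by benchmark name, so they
    # survive normalization without polluting the shared schema.
    metadata: dict[str, dict[str, Any]] = field(default_factory=dict)


case = ExternalBenchmarkCase(
    benchmark="spider",
    benchmark_version="1.0",
    split="dev",
    case_id="spider-dev-0001",
    question="How many singers are there?",
    metadata={"spider": {"db_id": "concert_singer", "hardness": "easy"}},
)

# Shared runner/dashboard code reads only top-level fields;
# each adapter reads its own namespace.
assert case.metadata["spider"]["db_id"] == "concert_singer"
```

The namespaced `metadata` dict is the piece that keeps adapters from forking the system: shared code never has to know what is inside a benchmark's namespace, and no benchmark's quirks leak into the common fields.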

Plan

  1. Define the required normalized fields for a benchmark case:
       • benchmark
       • benchmark_version
       • split
       • dialect
       • task_type
       • case_id
       • question
       • gold_sql or benchmark-native gold representation
  2. Define the required run-level provenance fields:
       • scorer type
       • environment type
       • local-vs-official comparability
       • dataset root or release identifier
  3. Decide where the contract lives in apps/evals/sql/ so loaders and dashboards can share it.
  4. Update artifact expectations so summary.json and results.jsonl have stable provenance columns for downstream dashboards.
  5. Write the contract down in code-facing docs before adapter implementation starts.
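Steps 2 and 4 can be sketched together: a run-level provenance record whose fields are stamped onto every results.jsonl row, so dashboards get stable columns instead of free-form labels. The field names and example values here are assumptions for illustration only.

```python
# Hypothetical sketch of run-level provenance (plan step 2) and how it
# becomes stable columns in results.jsonl rows (plan step 4).
from dataclasses import asdict, dataclass
import json


@dataclass
class RunProvenance:
    scorer: str                    # e.g. "execution_accuracy"
    environment: str               # e.g. "local_sqlite" vs "official_harness"
    comparable_to_official: bool   # can this number sit next to a leaderboard?
    dataset_release: str           # dataset root or release identifier


prov = RunProvenance(
    scorer="execution_accuracy",
    environment="local_sqlite",
    comparable_to_official=False,
    dataset_release="example-release-id",
)

# Each results.jsonl row carries the same provenance columns, so
# downstream dashboards can group and filter without parsing labels.
row = {"case_id": "example-dev-0001", "passed": True, **asdict(prov)}
line = json.dumps(row)
assert json.loads(line)["environment"] == "local_sqlite"
```

Making `comparable_to_official` an explicit boolean, rather than something inferred from a label, is what keeps dashboard comparisons honest when local and official scoring environments diverge.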

Implementation Progress

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - planning/task-definition work.

Review Feedback

  • [ ] Review cleared