Define normalized external benchmark case and result contracts for apps/evals
Problem
Define the internal contracts that map external benchmark tasks, metadata, and outputs into Dataface eval types without hiding benchmark-specific semantics.
Context
- apps/evals/sql/runner.py, apps/evals/sql/backends.py, and apps/evals/sql/scorer.py already define the internal eval loop and artifact shape.
- The current internal benchmark is comparatively uniform: one task family, one local artifact shape, one main dashboard view.
- External benchmarks add benchmark name, split, version, dialect, scorer, and environment differences that should survive normalization.
- If the contract is too generic, dashboards will lie. If the contract is too benchmark-specific, adapters will fork the whole system.
Possible Solutions
- Recommended: define a shared external benchmark contract with explicit provenance fields. Add a small common contract for case metadata and run provenance, and keep benchmark-specific details in namespaced metadata instead of flattening them away.
Why this is recommended:
- preserves reuse in the runner and dashboard layers
- keeps comparisons honest
- avoids creating separate eval stacks for each benchmark
- Reuse the current internal case/result shape with only a free-form metadata blob.
Trade-off: minimal changes, but provenance becomes optional and dashboards will drift toward opaque labels.
- Create one full contract per benchmark family.
Trade-off: truest to each benchmark, but it throws away the shared apps/evals architecture.
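The recommended option can be illustrated concretely. This is a minimal sketch, not the existing apps/evals types: field names, the benchmark, and the metadata namespace key are all illustrative assumptions.

```python
# Hypothetical sketch of the recommended shape: shared provenance fields live at
# the top level, while benchmark-specific semantics survive under a namespaced
# metadata key instead of being flattened into shared columns.
case = {
    "benchmark": "spider",           # shared provenance field (assumed name)
    "benchmark_version": "1.0",
    "split": "dev",
    "case_id": "spider-dev-0042",
    "question": "How many singers do we have?",
    "gold_sql": "SELECT count(*) FROM singer",
    "metadata": {
        # one namespace per benchmark family; dashboards never need to parse it
        "spider": {"db_id": "concert_singer", "hardness": "easy"},
    },
}

# Dashboards and loaders rely only on the shared keys, keeping comparisons
# honest without forking the contract per benchmark.
shared_keys = {"benchmark", "benchmark_version", "split", "case_id"}
assert shared_keys <= case.keys()
```

The key design point is that adapters may put anything under their own namespace, but nothing outside the shared keys is ever treated as comparable across benchmarks.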
Plan
- Define the required normalized fields for a benchmark case:
  - benchmark
  - benchmark_version
  - split
  - dialect
  - task_type
  - case_id
  - question
  - gold_sql or benchmark-native gold representation
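The case-level fields above could map onto a small frozen dataclass. This is a sketch under the assumption that the contract lives in Python alongside the runner; the class and field names are illustrative, not existing apps/evals types.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass(frozen=True)
class ExternalBenchmarkCase:
    """Hypothetical normalized case contract mirroring the required fields."""
    benchmark: str                 # e.g. "spider" (illustrative value)
    benchmark_version: str         # dataset release identifier
    split: str                     # e.g. "dev", "test"
    dialect: str                   # SQL dialect the gold targets
    task_type: str                 # task family within the benchmark
    case_id: str                   # stable per-case identifier
    question: str                  # natural-language input
    gold_sql: Optional[str] = None       # None when there is no SQL gold
    gold_native: Optional[Any] = None    # benchmark-native gold representation
    # benchmark-specific details, namespaced by benchmark name
    metadata: dict[str, Any] = field(default_factory=dict)
```

Freezing the dataclass keeps normalized cases immutable once loaded, so adapters cannot silently rewrite provenance downstream.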
- Define the required run-level provenance fields:
  - scorer type
  - environment type
  - local-vs-official comparability
  - dataset root or release identifier
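The run-level provenance fields could be a companion record. Again a sketch with assumed names; the comparability flag is the one that keeps dashboards from implying leaderboard-equivalent scores.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunProvenance:
    """Hypothetical run-level provenance record; all names are illustrative."""
    scorer_type: str               # e.g. "exact_match", "execution"
    environment_type: str          # e.g. "local_sqlite", "official_harness"
    comparable_to_official: bool   # whether scores can be compared to official results
    dataset_release: str           # dataset root path or release identifier
```

Attaching one such record per run, rather than per case, keeps results.jsonl rows small while still letting dashboards filter by comparability.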
- Decide where the contract lives in apps/evals/sql/ so loaders and dashboards can share it.
- Update artifact expectations so summary.json and results.jsonl have stable provenance columns for downstream dashboards.
- Write the contract down in code-facing docs before adapter implementation starts.
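The artifact step can be sketched as a small writer that stamps the same provenance columns into every results.jsonl row and into summary.json. This is an assumed helper, not the existing runner code; the column set and function name are illustrative.

```python
import json
from pathlib import Path

# Assumed stable provenance columns shared by results.jsonl and summary.json.
PROVENANCE_COLUMNS = (
    "benchmark", "benchmark_version", "split",
    "scorer_type", "environment_type",
)

def write_artifacts(out_dir: Path, provenance: dict, rows: list[dict]) -> None:
    """Write results.jsonl and summary.json with identical provenance columns."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with (out_dir / "results.jsonl").open("w") as f:
        for row in rows:
            # provenance first, so per-case keys cannot shadow it accidentally
            f.write(json.dumps({**provenance, **row}) + "\n")
    summary = {
        **provenance,
        "n_cases": len(rows),
        "n_correct": sum(bool(r.get("correct")) for r in rows),
    }
    (out_dir / "summary.json").write_text(json.dumps(summary, indent=2))
```

Because both artifacts carry the same columns, a dashboard can join summary-level and case-level views without inferring provenance from file paths.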
Implementation Progress
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - planning/task-definition work.
Review Feedback
- [ ] Review cleared