Dataface Tasks

Define normalized external benchmark case and result contracts for apps/evals

ID: MCP_ANALYST_AGENT-DEFINE_NORMALIZED_EXTERNAL_BENCHMARK_CASE_AND_RESULT_CONTRACTS_FOR_APPS_EVALS
Status: not_started
Priority: p1
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: external-text-to-sql-benchmarks-and-sota-calibration

Problem

Define the internal contracts that map external benchmark tasks, metadata, and outputs into Dataface eval types without hiding benchmark-specific semantics.

Context

  • apps/evals/sql/runner.py, apps/evals/sql/backends.py, and apps/evals/sql/scorer.py already define the internal eval loop and artifact shape.
  • The current internal benchmark is comparatively uniform: one task family, one local artifact shape, one main dashboard view.
  • External benchmarks add benchmark name, split, version, dialect, scorer, and environment differences that should survive normalization.
  • If the contract is too generic, dashboards will lie. If the contract is too benchmark-specific, adapters will fork the whole system.

Possible Solutions

  1. Recommended: define a shared external benchmark contract with explicit provenance fields plus benchmark-specific metadata. Add a small common contract for case metadata and run provenance, then keep benchmark-specific details in namespaced metadata instead of flattening them away.

Why this is recommended:

  • preserves reuse in the runner and dashboard layers
  • keeps comparisons honest
  • avoids creating separate eval stacks for each benchmark
  2. Reuse the current internal case/result shape with only a free-form metadata blob.

Trade-off: minimal changes, but provenance becomes optional and dashboards will drift toward opaque labels.

  3. Create one full contract per benchmark family.

Trade-off: truest to each benchmark, but it discards the shared apps/evals architecture and forces each adapter to rebuild the runner and dashboard layers.
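The recommended option can be sketched as a small dataclass: shared provenance fields are explicit and required, while benchmark-specific details live under a metadata key namespaced by benchmark name instead of being flattened into the shared schema. All names and values below are illustrative, not the final contract.

```python
# Hypothetical sketch of option 1: explicit shared fields plus
# namespaced, benchmark-specific metadata. Field names are illustrative.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ExternalBenchmarkCase:
    # Shared, required provenance fields every adapter must populate.
    benchmark: str
    benchmark_version: str
    split: str
    case_id: str
    question: str
    # Benchmark-specific details keyed by benchmark name, so they
    # survive normalization without polluting the shared schema.
    metadata: dict[str, dict[str, Any]] = field(default_factory=dict)


case = ExternalBenchmarkCase(
    benchmark="spider",
    benchmark_version="1.0",
    split="dev",
    case_id="spider-dev-0001",
    question="How many singers are there?",
    metadata={"spider": {"db_id": "concert_singer", "hardness": "easy"}},
)

# Shared runner/dashboard code reads only top-level fields;
# each adapter reads its own namespace.
assert case.metadata["spider"]["db_id"] == "concert_singer"
```

The namespaced `metadata` dict is the piece that keeps adapters from forking the system: shared code never has to know what is inside a benchmark's namespace, and no benchmark's quirks leak into the common fields.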

Plan

  1. Define the required normalized fields for a benchmark case:
       • benchmark
       • benchmark_version
       • split
       • dialect
       • task_type
       • case_id
       • question
       • gold_sql or benchmark-native gold representation
  2. Define the required run-level provenance fields:
       • scorer type
       • environment type
       • local-vs-official comparability
       • dataset root or release identifier
  3. Decide where the contract lives in apps/evals/sql/ so loaders and dashboards can share it.
  4. Update artifact expectations so summary.json and results.jsonl have stable provenance columns for downstream dashboards.
  5. Write the contract down in code-facing docs before adapter implementation starts.
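Steps 2 and 4 can be sketched together: a run-level provenance record whose fields are stamped onto every results.jsonl row, so dashboards get stable columns instead of free-form labels. The field names and example values here are assumptions for illustration only.

```python
# Hypothetical sketch of run-level provenance (plan step 2) and how it
# becomes stable columns in results.jsonl rows (plan step 4).
from dataclasses import asdict, dataclass
import json


@dataclass
class RunProvenance:
    scorer: str                    # e.g. "execution_accuracy"
    environment: str               # e.g. "local_sqlite" vs "official_harness"
    comparable_to_official: bool   # can this number sit next to a leaderboard?
    dataset_release: str           # dataset root or release identifier


prov = RunProvenance(
    scorer="execution_accuracy",
    environment="local_sqlite",
    comparable_to_official=False,
    dataset_release="example-release-id",
)

# Each results.jsonl row carries the same provenance columns, so
# downstream dashboards can group and filter without parsing labels.
row = {"case_id": "example-dev-0001", "passed": True, **asdict(prov)}
line = json.dumps(row)
assert json.loads(line)["environment"] == "local_sqlite"
```

Making `comparable_to_official` an explicit boolean, rather than something inferred from a label, is what keeps dashboard comparisons honest when local and official scoring environments diverge.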

Implementation Progress

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - planning/task-definition work.

Review Feedback

  • [ ] Review cleared