---
type: task
id: MCP_ANALYST_AGENT-DEFINE_NORMALIZED_EXTERNAL_BENCHMARK_CASE_AND_RESULT_CONTRACTS_FOR_APPS_EVALS
title: Define normalized external benchmark case and result contracts for apps/evals
description: Define the internal contracts that map external benchmark tasks, metadata, and outputs into Dataface eval types without hiding benchmark-specific semantics.
milestone: m2-internal-adoption-design-partners
owner: data-ai-engineer-architect
status: not_started
priority: p1
initiative: external-text-to-sql-benchmarks-and-sota-calibration
dependencies:
  - plan-external-text-to-sql-benchmark-adoption-order-and-constraints
---
Define the internal contracts that map external benchmark tasks, metadata, and outputs into Dataface eval types without hiding benchmark-specific semantics.
apps/evals/sql/runner.py, apps/evals/sql/backends.py, and apps/evals/sql/scorer.py already define the internal eval loop and artifact shape.

Why this is recommended:
Trade-off: minimal changes, but provenance becomes optional and dashboards will drift toward opaque labels.
Trade-off: truest to each benchmark, but it throws away the shared apps/evals architecture.
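As a hedged sketch of what such a normalized contract might look like — every type and field name below is illustrative, not an existing Dataface type — the key idea is to carry each benchmark's own identifiers and raw metadata through the internal types rather than erasing them during normalization:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class ExternalBenchmarkCase:
    """One external benchmark task, normalized without hiding its origin."""
    benchmark: str                  # suite name, e.g. "spider" or "bird"
    benchmark_case_id: str          # the suite's own task id, kept verbatim
    question: str                   # natural-language question as published
    database: str                   # schema/database the query targets
    gold_sql: Optional[str] = None  # reference SQL, if the suite ships one
    # Suite-specific fields (difficulty tags, evidence strings, ...)
    # are passed through untouched instead of being flattened away.
    raw_metadata: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class ExternalBenchmarkResult:
    """One model output, scored in the suite's own metric."""
    case: ExternalBenchmarkCase
    predicted_sql: str
    metric: str                     # e.g. "execution_accuracy", named explicitly
    score: float
```

Keeping `benchmark` and `benchmark_case_id` as required fields is what prevents the "provenance becomes optional" drift described in the first trade-off above.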
The contract lives in apps/evals/sql/ so loaders and dashboards can share it.

summary.json and results.jsonl have stable provenance columns for downstream dashboards.

<!-- Technical details, key decisions, code changes. Append as work progresses. -->
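A minimal sketch of what stable provenance columns in results.jsonl could mean in practice — the column names below are assumptions for illustration, not the actual artifact schema:

```python
import json


def provenance_row(benchmark: str, case_id: str, revision: str, score: float) -> str:
    """Serialize one results.jsonl line with explicit provenance columns.

    Every row names its source suite, the suite's own case id, and the
    dataset revision, so downstream dashboards can group and filter by
    origin instead of collapsing everything into opaque labels.
    (Column names here are hypothetical.)
    """
    return json.dumps(
        {
            "benchmark": benchmark,
            "benchmark_case_id": case_id,
            "benchmark_revision": revision,
            "score": score,
        },
        sort_keys=True,
    )


# Example: one line destined for results.jsonl
line = provenance_row("spider", "dev_0421", "v1.0", 1.0)
```

"Stable" here means the provenance columns appear in every row with the same names and types across runs, which is what makes them safe to join on downstream.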
<!-- For UI/browser tasks: use Playwright MCP to explore the running app. Record bugs found, fixes applied, and suggestions for future work. Skip for non-UI tasks (mark N/A). -->
N/A - planning/task-definition work.
<!-- Reviewer comments, what was changed in response, and sign-off. -->