Dataface Tasks

Add Spider 2.0-Lite adapter and runner support

ID: MCP_ANALYST_AGENT-ADD_SPIDER_2_LITE_ADAPTER_AND_RUNNER_SUPPORT
Status: not_started
Priority: p1
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: external-text-to-sql-benchmarks-and-sota-calibration

Problem

Integrate Spider 2.0-Lite into apps/evals with benchmark-aware loading, environment assumptions, and reproducible baseline runs.

Context

  • Spider 2.0-Lite is a good enterprise-style calibration target because it stresses long context, larger schemas, and more realistic operational complexity than classic Spider.
  • It is also meaningfully harder to integrate than BIRD because benchmark settings and environment assumptions vary more.
  • We should treat this as the first "harder" external adapter after the shared contract is defined, not as the first benchmark integration.

Possible Solutions

  1. Recommended: integrate Spider 2.0-Lite as a benchmark-aware adapter that fits the shared external contract but preserves its benchmark-specific settings in metadata. This keeps the runner shared while making explicit which local assumptions differ from the official or hosted evaluation settings.

Why this is recommended:

  • gives us a stronger public realism test
  • forces the contract to handle a less toy-like benchmark
  • avoids prematurely taking on hosted or interactive Spider variants
  2. Skip Lite and wait for a fuller hosted Spider 2.0 path.

Trade-off: more ambitious, but it blocks useful progress on environment and provenance work.

  3. Treat Spider 2.0-Lite as just another internal JSON dataset.

Trade-off: easy, but it hides exactly the benchmark semantics this task is meant to preserve.
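
To make the recommended option concrete, here is a minimal sketch of a case record that keeps shared fields separate from benchmark-specific settings. Everything in it is an assumption for illustration: `ExternalCase`, `to_artifact_row`, and all field names are hypothetical, since the shared external contract is not defined in this task yet.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass(frozen=True)
class ExternalCase:
    """Hypothetical shared contract for one external benchmark case."""

    case_id: str
    question: str
    database: str
    benchmark: str
    # Benchmark-specific settings live here instead of widening the
    # shared fields, so the runner never needs to know about them.
    metadata: dict[str, Any] = field(default_factory=dict)


def to_artifact_row(
    case: ExternalCase, local_assumptions: dict[str, Any]
) -> dict[str, Any]:
    """Flatten a case into an output row and record local assumptions.

    `local_assumptions` is a hypothetical place to note where a local run
    differs from the official or hosted evaluation settings.
    """
    row = asdict(case)
    row["local_assumptions"] = dict(local_assumptions)
    return row
```

Keeping `metadata` and `local_assumptions` as separate keys means a reader of the artifact can tell what the benchmark declared apart from what the local run assumed.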

Plan

  1. Add a Spider 2.0-Lite loader that emits the shared external case contract plus Spider-specific metadata.
  2. Define any benchmark-specific execution/scoring caveats that affect comparability.
  3. Add a reproducible smoke baseline and durable artifacts for the benchmark.
  4. Add tests that verify:
     • benchmark cases load cleanly
     • provenance survives into output artifacts
     • dashboard queries can slice by Spider benchmark metadata
  5. Document why this adapter covers Lite first and defers Snow, DBT, and more hosted variants.
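
Plan steps 1 and 4 can be sketched as a small loader plus a slicing helper. This is an illustration under stated assumptions, not the adapter itself: the input is assumed to be JSON Lines, the keys `instance_id`, `question`, and `db` are assumed field names, and `load_cases` and `count_by_metadata` are hypothetical helpers.

```python
import json
from collections import Counter
from typing import Any, Iterable, Iterator

BENCHMARK = "spider2-lite"
# Assumed input keys that map onto the shared contract; all other keys
# are treated as benchmark-specific metadata.
SHARED_KEYS = ("instance_id", "question", "db")


def load_cases(lines: Iterable[str], source: str) -> Iterator[dict[str, Any]]:
    """Yield one case per non-blank JSONL line, with provenance attached."""
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines in the file
        raw = json.loads(line)
        missing = [key for key in SHARED_KEYS if key not in raw]
        if missing:
            raise ValueError(f"{source}:{line_no} is missing {missing}")
        yield {
            "case_id": raw["instance_id"],
            "question": raw["question"],
            "database": raw["db"],
            "benchmark": BENCHMARK,
            "metadata": {k: v for k, v in raw.items() if k not in SHARED_KEYS},
            # Provenance records where the case came from so it can be
            # carried through to output artifacts.
            "provenance": {"source": source, "line": line_no},
        }


def count_by_metadata(cases: Iterable[dict[str, Any]], key: str) -> Counter:
    """Count cases per value of one metadata key, as a dashboard slice would."""
    return Counter(case["metadata"].get(key, "unknown") for case in cases)
```

A test for step 4 could load a small fixture through `load_cases`, assert that each case carries a `provenance` entry, and check that `count_by_metadata` groups cases by a chosen metadata key.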

Implementation Progress

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A: this is an implementation task with no browser flow to explore.

Review Feedback

  • [ ] Review cleared