Add Spider 2.0-Lite adapter and runner support
Problem
Integrate Spider 2.0-Lite into apps/evals with benchmark-aware loading, explicit environment assumptions, and reproducible baseline runs.
Context
- Spider 2.0-Lite is a good enterprise-style calibration target because it stresses long context, larger schemas, and more realistic operational complexity than classic Spider.
- It is also meaningfully harder to integrate than BIRD because benchmark settings and environment assumptions vary more.
- We should treat this as the first "harder" external adapter after the shared contract is defined, not as the first benchmark integration.
Possible Solutions
- Recommended: integrate Spider 2.0-Lite as a benchmark-aware adapter that fits the shared external contract but preserves its benchmark-specific settings in metadata. This keeps the runner shared while making it obvious which local assumptions differ from more official or hosted evaluation settings.
Why this is recommended:
- gives us a stronger public realism test
- forces the contract to handle a less toy-like benchmark
- avoids prematurely taking on hosted or interactive Spider variants
- Skip Lite and wait for a fuller hosted Spider 2.0 path.
Trade-off: more ambitious, but it blocks useful progress on environment and provenance work.
- Treat Spider 2.0-Lite as just another internal JSON dataset.
Trade-off: easy, but it hides exactly the benchmark semantics this task is meant to preserve.
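A minimal sketch of the recommended option, assuming the shared external contract can be modeled as a small record with a free-form metadata field. Everything here is hypothetical and for illustration only: the `ExternalCase` type, the `to_external_case` helper, the `spider2-lite` identifier, the `local-smoke` setting label, and the raw field names (`instance_id`, `db`, `question`, `external_knowledge`), which should be checked against the benchmark release actually used.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

# Hypothetical benchmark identifier; not taken from the real contract.
BENCHMARK = "spider2-lite"


@dataclass(frozen=True)
class ExternalCase:
    """Hypothetical stand-in for the shared external case contract."""

    case_id: str
    question: str
    database: str
    benchmark: str
    # Benchmark-specific settings live here instead of being flattened away.
    metadata: dict[str, Any] = field(default_factory=dict)


def to_external_case(raw: dict[str, Any], setting: str = "local-smoke") -> ExternalCase:
    """Map one raw benchmark record onto the shared contract.

    The keys read from `raw` are assumptions about the release format.
    """
    return ExternalCase(
        case_id=f"{BENCHMARK}:{raw['instance_id']}",
        question=raw["question"],
        database=raw["db"],
        benchmark=BENCHMARK,
        metadata={
            "source_instance_id": raw["instance_id"],
            "external_knowledge": raw.get("external_knowledge"),
            # Records which evaluation setting this run assumed.
            "evaluation_setting": setting,
        },
    )


# Made-up record, used only to exercise the mapping.
example = to_external_case(
    {"instance_id": "demo-001", "db": "demo_db", "question": "How many orders shipped late?"}
)
```

Keeping the evaluation setting in `metadata` rather than in the runner is what lets the runner stay shared: a local smoke run and a more official setting produce the same case shape and differ only in labeled metadata.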
Plan
- Add a Spider 2.0-Lite loader that emits the shared external case contract plus Spider-specific metadata.
- Define any benchmark-specific execution/scoring caveats that affect comparability.
- Add a reproducible smoke baseline and durable artifacts for the benchmark.
- Add tests that verify:
  - benchmark cases load cleanly
  - provenance survives into output artifacts
  - dashboard queries can slice by Spider benchmark metadata
- Document why this adapter covers Lite first and defers Snow, DBT, and more hosted variants.
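The verification steps in this plan could look roughly like the sketch below, assuming output artifacts are JSON Lines records that carry a `metadata` object. The artifact layout, the helper names (`write_artifact`, `read_artifact`, `slice_by`), and the sample records are all hypothetical; the real artifact format in apps/evals may differ.

```python
import io
import json
from collections import Counter


def write_artifact(results, stream):
    """Write one JSON object per line (assumed artifact format)."""
    for record in results:
        stream.write(json.dumps(record, sort_keys=True) + "\n")


def read_artifact(stream):
    """Read JSON Lines back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


def slice_by(records, key):
    """Count records per value of one benchmark metadata key."""
    return Counter(r["metadata"].get(key, "unknown") for r in records)


# Made-up results, used only to exercise the round trip.
results = [
    {"case_id": "spider2-lite:demo-001", "benchmark": "spider2-lite", "passed": True,
     "metadata": {"evaluation_setting": "local-smoke", "database": "demo_a"}},
    {"case_id": "spider2-lite:demo-002", "benchmark": "spider2-lite", "passed": False,
     "metadata": {"evaluation_setting": "local-smoke", "database": "demo_a"}},
    {"case_id": "spider2-lite:demo-003", "benchmark": "spider2-lite", "passed": True,
     "metadata": {"evaluation_setting": "local-smoke", "database": "demo_b"}},
]

buffer = io.StringIO()
write_artifact(results, buffer)
buffer.seek(0)
loaded = read_artifact(buffer)

# Provenance survives the round trip through the artifact.
assert loaded == results

# A dashboard-style slice over benchmark metadata.
by_database = slice_by(loaded, "database")
by_setting = slice_by(loaded, "evaluation_setting")
```

The round-trip assertion stands in for the "provenance survives into output artifacts" check, and `slice_by` is a placeholder for whatever query the dashboard actually runs.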
Implementation Progress
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - this is an implementation task with no browser flow to explore.
Review Feedback
- [ ] Review cleared