Dataface Tasks

Add LiveSQLBench adapter and release-tracking workflow

ID: MCP_ANALYST_AGENT-ADD_LIVESQLBENCH_ADAPTER_AND_RELEASE_TRACKING_WORKFLOW
Status: not_started
Priority: p2
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: external-text-to-sql-benchmarks-and-sota-calibration

Problem

Add a second-wave integration for LiveSQLBench with explicit handling for release versions, hidden-vs-open splits, and evolving benchmark context.

Context

  • LiveSQLBench is appealing because it is newer, more dynamic, and more contamination-aware than older static benchmarks.
  • That same dynamism adds versioning and release-management complexity that makes it a poor first integration target.
  • This task exists so the second-wave work is already scoped and ordered instead of becoming a vague "maybe later" benchmark bucket.

Possible Solutions

  1. Recommended: add LiveSQLBench only after the shared contract and first adapters exist, with explicit release/version tracking. Treat it as a release-aware benchmark family where artifact provenance records exactly which release and split were used.

Why this is recommended:

  • fits the strengths of LiveSQLBench
  • avoids mixing dynamic benchmark semantics into the first adapter wave
  • gives a clean path to evolve with new releases
  2. Fold LiveSQLBench into the first implementation wave.

Trade-off: maximizes ambition, but adds too much benchmark-management complexity before the shared contract is proven.

  3. Snapshot one release and treat it as static forever.

Trade-off: simplest operationally, but discards the freshness and contamination-awareness that make LiveSQLBench valuable in the first place.
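As a rough illustration of the recommended release-aware approach, the provenance record attached to each evaluated case could carry the release and split explicitly. This is a minimal sketch, not the actual shared contract; all field names and release identifiers here are hypothetical:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class BenchmarkProvenance:
    """Hypothetical provenance record pinning exactly what was evaluated."""
    benchmark: str  # benchmark family, e.g. "livesqlbench"
    release: str    # pinned release identifier, never an implicit "latest"
    split: str      # e.g. "open" vs "hidden"
    case_id: str    # case identifier within that release

# The same case id evaluated against two releases stays distinguishable
# downstream, which is what keeps release-to-release comparisons clean.
run_a = BenchmarkProvenance("livesqlbench", "2024-06", "open", "case-001")
run_b = BenchmarkProvenance("livesqlbench", "2024-09", "open", "case-001")

assert run_a != run_b  # same case, different release => different artifact
print(asdict(run_a)["release"])
```

Making the record frozen and hashable means it can double as a cache/dedup key, so two runs against different releases can never silently collapse into one result.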

Plan

  1. Define how benchmark release/version identifiers appear in normalized case metadata and run provenance.
  2. Decide which LiveSQLBench slice is practical for local development and what should remain deferred.
  3. Add a loader and run mode that pins a specific release instead of silently drifting.
  4. Add dashboard slices that distinguish release-to-release movement from model movement.
  5. Document how new releases should be introduced without contaminating prior comparisons.
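The pinned-release loader in step 3 could be sketched roughly as follows. The release registry, identifiers, and return shape are placeholder assumptions; the point is only that loading refuses to drift to an implicit "latest":

```python
# Assumed release registry; real code would source this from benchmark metadata.
KNOWN_RELEASES = {"2024-06", "2024-09"}

def load_livesqlbench(release: str) -> dict:
    """Load a specific LiveSQLBench release, rejecting implicit drift.

    Hypothetical sketch: a real loader would fetch and normalize cases;
    here we only model the pinning behavior.
    """
    if release == "latest":
        raise ValueError("refusing to drift: pin an explicit release id")
    if release not in KNOWN_RELEASES:
        raise KeyError(f"unknown release {release!r}; register it first")
    return {"benchmark": "livesqlbench", "release": release, "cases": []}

snapshot = load_livesqlbench("2024-06")
print(snapshot["release"])
```

Forcing new releases through an explicit registry step is one way to satisfy step 5: a release can only enter runs after someone deliberately introduces it, so prior comparisons are never contaminated by silent updates.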

Implementation Progress

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A: this is an implementation task with no browser flow to exercise.

Review Feedback

  • [ ] Review cleared