Dataface Tasks

Add semantic SQL diff and partial-credit scoring

ID: MCP_ANALYST_AGENT-ADD_SEMANTIC_SQL_DIFF_AND_PARTIAL_CREDIT_SCORING
Status: not_started
Priority: p1
Milestone: m3-public-launch
Owner: data-ai-engineer-architect

Problem

Binary pass/fail scoring is useful, but it throws away too much signal for iteration. A generated query can use the right tables and filters while missing the grouping. Another can get the aggregation right but use the wrong date grain. Both show up as plain failures even though they reflect very different levels of system quality.

We need semantic partial-credit scoring so benchmark changes can show whether the model is getting closer to the target even when it does not yet produce fully equivalent SQL.

Context

The current eval stack already has deterministic scoring, grounding analysis, and some parse-level structure. That gives us a foundation for a richer scorer without replacing the whole eval system.

Likely semantic components worth separating include:

  • referenced tables
  • referenced columns
  • filters
  • joins
  • aggregation intent
  • grouping intent
  • ordering and limit

This should remain a complement to pass/fail scoring, not a replacement for it.
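The components listed above could be modeled as a small, order-insensitive record so that two queries can be compared component by component. A minimal sketch, assuming hypothetical names; nothing here matches the actual types in apps/evals/sql/types.py:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SqlComponents:
    """Illustrative component record; frozensets make comparison order-insensitive."""
    tables: frozenset = frozenset()
    columns: frozenset = frozenset()
    filters: frozenset = frozenset()
    joins: frozenset = frozenset()
    aggregations: frozenset = frozenset()
    group_by: frozenset = frozenset()
    order_limit: frozenset = frozenset()

# Example near miss: same tables, columns, aggregation, and grouping,
# but the generated query filters at the wrong date grain.
gold = SqlComponents(
    tables=frozenset({"orders"}),
    columns=frozenset({"orders.region", "orders.total"}),
    filters=frozenset({"orders.created_at >= date_trunc('month', now())"}),
    aggregations=frozenset({"sum(orders.total)"}),
    group_by=frozenset({"orders.region"}),
)
generated = SqlComponents(
    tables=frozenset({"orders"}),
    columns=frozenset({"orders.region", "orders.total"}),
    filters=frozenset({"orders.created_at >= date_trunc('day', now())"}),
    aggregations=frozenset({"sum(orders.total)"}),
    group_by=frozenset({"orders.region"}),
)
```

Under binary scoring this pair is just a failure; with a component record, only the `filters` field differs.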

Possible Solutions

  1. Recommended: add deterministic semantic-component scoring alongside existing pass/fail metrics. Keep the current pass/fail verdicts, but add component-level scores so dashboards and experiments can see near misses and directional improvement.

Why this is recommended:

  • preserves simple headline metrics
  • adds much richer debugging and comparison value
  • stays compatible with existing leaderboard infrastructure
  2. Keep binary scoring only.

Trade-off: simplest operationally, but it slows iteration because every miss looks equally bad.

  3. Move entirely to an LLM-judge rubric for partial credit.

Trade-off: more flexible, but less stable. Deterministic component scoring should come first.

Plan

  1. Define the semantic components that are stable enough to score deterministically.
  2. Extend the scorer to compute component-level matches for generated vs gold SQL.
  3. Persist those scores in result artifacts and summaries.
  4. Add leaderboard queries or tables that expose partial-credit slices.
  5. Keep the original binary verdicts so historical comparisons remain intact.
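Steps 2 and 5 could be sketched deterministically as per-component set overlap plus a weighted total, with the binary verdict kept untouched alongside. Component names and weights below are illustrative assumptions, not the existing scorer contract in apps/evals/sql/scorer.py:

```python
# Hypothetical weights per semantic component; tuning these is part of step 1.
COMPONENT_WEIGHTS = {
    "tables": 0.2, "columns": 0.2, "filters": 0.2, "joins": 0.1,
    "aggregations": 0.15, "group_by": 0.1, "order_limit": 0.05,
}

def jaccard(a: set, b: set) -> float:
    """Overlap between gold and generated elements; 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def score_components(gold: dict, generated: dict) -> dict:
    """Per-component scores plus a weighted total. The binary verdict is
    stored separately and is not derived from this number."""
    per_component = {
        name: jaccard(gold.get(name, set()), generated.get(name, set()))
        for name in COMPONENT_WEIGHTS
    }
    total = sum(COMPONENT_WEIGHTS[n] * s for n, s in per_component.items())
    return {"per_component": per_component, "partial_credit": round(total, 4)}

gold = {
    "tables": {"orders"},
    "columns": {"orders.region", "orders.total"},
    "aggregations": {"sum(orders.total)"},
    "group_by": {"orders.region"},
}
generated = {
    "tables": {"orders"},
    "columns": {"orders.region", "orders.total"},
    "aggregations": {"sum(orders.total)"},
    "group_by": set(),  # near miss: the grouping was dropped
}
result = score_components(gold, generated)
# result["partial_credit"] == 0.9, with group_by pinpointed as the 0.0 component
```

The weighted total makes directional improvement visible, while the per-component breakdown is what makes each score explainable.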

Files likely involved

  • apps/evals/sql/scorer.py
  • apps/evals/sql/types.py
  • apps/evals/sql/runner.py
  • eval dashboard files under apps/evals/faces/

Success criteria

  • semantic near misses are measurable
  • dashboard comparisons can show directional improvement before full-equivalence rates improve
  • the scoring contract remains deterministic and explainable
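One way the deterministic-and-explainable criterion could look in a persisted result artifact, with the headline verdict unchanged and the breakdown attached. All field names here are assumptions for illustration:

```python
import json

# Hypothetical shape of one persisted result row: binary verdict stays the
# headline metric; the component breakdown carries the explanation.
record = {
    "task_id": "sql-eval-0042",       # illustrative identifier
    "passed": False,                  # existing binary verdict, unchanged
    "partial_credit": 0.9,            # weighted component score
    "per_component": {"tables": 1.0, "group_by": 0.0},
    "scorer_version": "semantic-v1",  # versioned so historical runs stay comparable
}
serialized = json.dumps(record, sort_keys=True)
```

Versioning the scorer in the artifact keeps historical comparisons intact if weights or components change later.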

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - scoring and eval task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared