Dataface Tasks

Add semantic SQL diff and partial-credit scoring

ID: MCP_ANALYST_AGENT-ADD_SEMANTIC_SQL_DIFF_AND_PARTIAL_CREDIT_SCORING
Status: not_started
Priority: p1
Milestone: m3-public-launch
Owner: data-ai-engineer-architect

Problem

Binary pass/fail scoring is useful, but it throws away too much signal for iteration. A generated query can use the right tables and filters while missing the grouping. Another can get the aggregation right but use the wrong date grain. Both show up as plain failures even though they reflect very different levels of system quality.

We need semantic partial-credit scoring so benchmark changes can show whether the model is getting closer to the target even when it does not yet produce fully equivalent SQL.

Context

The current eval stack already has deterministic scoring, grounding analysis, and some parse-level structure. That gives us a foundation for a richer scorer without replacing the whole eval system.

Likely semantic components worth separating include:

  • referenced tables
  • referenced columns
  • filters
  • joins
  • aggregation intent
  • grouping intent
  • ordering and limit

This should remain a complement to pass/fail scoring, not a replacement for it.
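The components listed above could be modeled as a small, order-insensitive record so that two queries can be compared component by component. A minimal sketch, assuming hypothetical names; nothing here matches the actual types in apps/evals/sql/types.py:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SqlComponents:
    """Illustrative component record; frozensets make comparison order-insensitive."""
    tables: frozenset = frozenset()
    columns: frozenset = frozenset()
    filters: frozenset = frozenset()
    joins: frozenset = frozenset()
    aggregations: frozenset = frozenset()
    group_by: frozenset = frozenset()
    order_limit: frozenset = frozenset()

# Example near miss: same tables, columns, aggregation, and grouping,
# but the generated query filters at the wrong date grain.
gold = SqlComponents(
    tables=frozenset({"orders"}),
    columns=frozenset({"orders.region", "orders.total"}),
    filters=frozenset({"orders.created_at >= date_trunc('month', now())"}),
    aggregations=frozenset({"sum(orders.total)"}),
    group_by=frozenset({"orders.region"}),
)
generated = SqlComponents(
    tables=frozenset({"orders"}),
    columns=frozenset({"orders.region", "orders.total"}),
    filters=frozenset({"orders.created_at >= date_trunc('day', now())"}),
    aggregations=frozenset({"sum(orders.total)"}),
    group_by=frozenset({"orders.region"}),
)
```

Under binary scoring this pair is just a failure; with a component record, only the `filters` field differs.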

Possible Solutions

  1. Recommended: add deterministic semantic-component scoring alongside existing pass/fail metrics. Keep the current pass/fail verdicts, but add component-level scores so dashboards and experiments can see near misses and directional improvement.

Why this is recommended:

  • preserves simple headline metrics
  • adds much richer debugging and comparison value
  • stays compatible with existing leaderboard infrastructure
  2. Keep binary scoring only.

Trade-off: simplest operationally, but it slows iteration because every miss looks equally bad.

  3. Move entirely to an LLM-judge rubric for partial credit.

Trade-off: more flexible, but less stable. Deterministic component scoring should come first.

Plan

  1. Define the semantic components that are stable enough to score deterministically.
  2. Extend the scorer to compute component-level matches for generated vs gold SQL.
  3. Persist those scores in result artifacts and summaries.
  4. Add leaderboard queries or tables that expose partial-credit slices.
  5. Keep the original binary verdicts so historical comparisons remain intact.
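Steps 2 and 5 could be sketched deterministically as per-component set overlap plus a weighted total, with the binary verdict kept untouched alongside. Component names and weights below are illustrative assumptions, not the existing scorer contract in apps/evals/sql/scorer.py:

```python
# Hypothetical weights per semantic component; tuning these is part of step 1.
COMPONENT_WEIGHTS = {
    "tables": 0.2, "columns": 0.2, "filters": 0.2, "joins": 0.1,
    "aggregations": 0.15, "group_by": 0.1, "order_limit": 0.05,
}

def jaccard(a: set, b: set) -> float:
    """Overlap between gold and generated elements; 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def score_components(gold: dict, generated: dict) -> dict:
    """Per-component scores plus a weighted total. The binary verdict is
    stored separately and is not derived from this number."""
    per_component = {
        name: jaccard(gold.get(name, set()), generated.get(name, set()))
        for name in COMPONENT_WEIGHTS
    }
    total = sum(COMPONENT_WEIGHTS[n] * s for n, s in per_component.items())
    return {"per_component": per_component, "partial_credit": round(total, 4)}

gold = {
    "tables": {"orders"},
    "columns": {"orders.region", "orders.total"},
    "aggregations": {"sum(orders.total)"},
    "group_by": {"orders.region"},
}
generated = {
    "tables": {"orders"},
    "columns": {"orders.region", "orders.total"},
    "aggregations": {"sum(orders.total)"},
    "group_by": set(),  # near miss: the grouping was dropped
}
result = score_components(gold, generated)
# result["partial_credit"] == 0.9, with group_by pinpointed as the 0.0 component
```

The weighted total makes directional improvement visible, while the per-component breakdown is what makes each score explainable.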

Files likely involved

  • apps/evals/sql/scorer.py
  • apps/evals/sql/types.py
  • apps/evals/sql/runner.py
  • eval dashboard files under apps/evals/faces/

Success criteria

  • semantic near misses are measurable
  • dashboard comparisons can show directional improvement before full-equivalence rates improve
  • the scoring contract remains deterministic and explainable
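One way the deterministic-and-explainable criterion could look in a persisted result artifact, with the headline verdict unchanged and the breakdown attached. All field names here are assumptions for illustration:

```python
import json

# Hypothetical shape of one persisted result row: binary verdict stays the
# headline metric; the component breakdown carries the explanation.
record = {
    "task_id": "sql-eval-0042",       # illustrative identifier
    "passed": False,                  # existing binary verdict, unchanged
    "partial_credit": 0.9,            # weighted component score
    "per_component": {"tables": 1.0, "group_by": 0.0},
    "scorer_version": "semantic-v1",  # versioned so historical runs stay comparable
}
serialized = json.dumps(record, sort_keys=True)
```

Versioning the scorer in the artifact keeps historical comparisons intact if weights or components change later.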

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - scoring and eval task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared