Add judge calibration and disagreement audits for text-to-SQL evals
Problem
As the eval stack leans more on LLM judging for edge cases, it becomes harder to tell whether a leaderboard movement reflects a real model improvement or a noisy judge decision. If deterministic scoring and LLM judging disagree often, the team needs to know where and why before treating those results as reliable.
We need a calibration and disagreement audit path so the judge remains an explicit, auditable measurement tool rather than an unexamined source of truth.
Context
The current eval system already combines deterministic signals and LLM-based judgment. That hybrid design is reasonable, but it also creates an operational question: when the two disagree, which signal should the team trust, and how often does that happen?
Likely disagreement sources include:
- SQL that is structurally different but semantically equivalent
- cases where deterministic scoring misses subtle semantic alignment
- cases where the judge over-credits a query that is not actually correct
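The first source is the most common in practice: two queries can differ textually and structurally yet return identical results. A minimal sketch of that failure mode, using a hypothetical in-memory table (the schema and queries are illustrative, not from the real eval set):

```python
# Two structurally different but semantically equivalent queries.
# A deterministic exact-match comparison flags them as different,
# while executing both against the same data shows identical results.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "paid"), (2, 5.0, "open"), (3, 7.5, "paid")],
)

q_gold = "SELECT id FROM orders WHERE status = 'paid' AND amount > 6"
q_pred = "SELECT id FROM orders WHERE amount > 6 AND status IN ('paid')"

assert q_gold != q_pred                # string comparison: mismatch
rows_gold = sorted(conn.execute(q_gold))
rows_pred = sorted(conn.execute(q_pred))
assert rows_gold == rows_pred          # execution: equivalent
```

Execution-based comparison catches this case, but not queries that happen to agree only on this particular data, which is one reason judge-based scoring stays in the mix.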
This task should not aim to replace the judge. It should make judge behavior inspectable and bounded.
Possible Solutions
- Recommended: add disagreement logging, targeted audits, and judge calibration reports. Persist disagreement cases between deterministic and judge-based scoring, then review a sample regularly to calibrate prompts, policies, and confidence in the metric.
Why this is recommended:
- gives direct visibility into judge reliability
- helps prevent chasing judge noise
- supports better leaderboard interpretation
- Trust the judge whenever deterministic scoring is uncertain.
Trade-off: simple, but it hides measurement risk and can create false confidence.
- Remove judge-based scoring entirely.
Trade-off: deterministic-only scoring is cleaner, but it gives up useful nuance on genuinely hard cases.
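The recommended option can be sketched as a small logging layer. All names here (`ScoredCase`, `persist_disagreements`, the JSONL layout) are hypothetical, assuming the eval harness already produces a deterministic verdict and a judge verdict per case:

```python
# Hypothetical sketch: flag and persist disagreements between
# deterministic scoring and LLM judging, with evidence attached
# so a reviewer can audit each case later.
import json
from dataclasses import dataclass, asdict

@dataclass
class ScoredCase:
    case_id: str
    predicted_sql: str
    deterministic_pass: bool   # e.g. execution-result match
    judge_pass: bool           # LLM judge verdict
    judge_rationale: str       # evidence for the audit

def find_disagreements(cases):
    """Return only the cases where the two scorers disagree."""
    return [c for c in cases if c.deterministic_pass != c.judge_pass]

def persist_disagreements(cases, path):
    """Write disagreement cases as JSON lines for later audit sampling."""
    with open(path, "w") as f:
        for c in find_disagreements(cases):
            f.write(json.dumps(asdict(c)) + "\n")
```

Persisting the rationale alongside the verdicts is the key design choice: it lets the audit distinguish "judge saw real semantic equivalence" from "judge over-credited a wrong query" without re-running the eval.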
Plan
- Define which disagreement cases should be captured automatically.
- Persist disagreement flags and relevant evidence into eval artifacts.
- Add a review workflow or dashboard slice for disagreement audits.
- Review a sample of disagreements and refine the judge prompt or policy if repeated failure patterns appear.
- Document when the team should trust deterministic scoring, judge scoring, or manual review.
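The calibration report in the plan could start with two simple numbers: observed agreement and chance-corrected agreement (Cohen's kappa) between the two scorers. A dependency-free sketch, assuming paired boolean verdicts are available for every case (the function name and dict keys are illustrative):

```python
# Hypothetical calibration report over paired verdicts.
# Cohen's kappa is computed by hand to stay dependency-free.
def calibration_report(pairs):
    """pairs: list of (deterministic_pass, judge_pass) booleans, one per case."""
    n = len(pairs)
    p_o = sum(d == j for d, j in pairs) / n      # observed agreement
    # Chance agreement from each scorer's marginal pass rate.
    det_rate = sum(d for d, _ in pairs) / n
    judge_rate = sum(j for _, j in pairs) / n
    p_e = det_rate * judge_rate + (1 - det_rate) * (1 - judge_rate)
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return {"n": n, "agreement": p_o, "kappa": kappa}
```

Tracking kappa over time matters because raw agreement is inflated when most cases are easy passes; a high agreement rate with low kappa would signal that the judge adds little beyond chance on the contested cases.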
Success criteria
- disagreement cases are visible rather than hidden
- the team has a repeatable way to audit judge reliability
- leaderboard interpretation becomes more trustworthy over time
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - eval governance task.
Review Feedback
No review feedback yet.
- [ ] Review cleared