Dataface Tasks

Add judge calibration and disagreement audits for text-to-SQL evals

ID: MCP_ANALYST_AGENT-ADD_JUDGE_CALIBRATION_AND_DISAGREEMENT_AUDITS_FOR_TEXT_TO_SQL_EVALS
Status: not_started
Priority: p2
Milestone: m4-v1-0-launch
Owner: data-ai-engineer-architect

Problem

As the eval stack leans more on LLM judging for edge cases, it becomes harder to tell whether a leaderboard movement reflects a real model improvement or a noisy judge decision. If deterministic scoring and LLM judging disagree often, the team needs to know where and why before treating those results as reliable.

We need a calibration and disagreement audit path so the judge remains an explicit, inspectable measurement tool instead of an unexamined source of truth.

Context

The current eval system already combines deterministic signals and LLM-based judgment. That hybrid design is reasonable, but it also creates an operational question: when the two disagree, which one should the team trust, and how frequently does that happen?

Likely disagreement sources include:

  • SQL that is structurally different but semantically equivalent
  • cases where deterministic scoring misses subtle semantic alignment
  • cases where the judge over-credits a query that is not actually correct

This task should not aim to replace the judge. It should make judge behavior inspectable and bounded.
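
The disagreement directions above can be made concrete with a small record type. This is a sketch only; the `Disagreement` class and its field names are hypothetical, not part of the existing eval system:

```python
from dataclasses import dataclass


@dataclass
class Disagreement:
    """One eval case where deterministic and judge-based scoring diverge.

    Hypothetical structure for illustration; the real eval artifacts may
    carry more evidence (SQL text, judge rationale, result diffs).
    """

    example_id: str
    deterministic_pass: bool
    judge_pass: bool

    @property
    def direction(self) -> str:
        # Judge passes what deterministic scoring rejects: could be a
        # semantically equivalent query the deterministic check missed,
        # or genuine over-crediting -- exactly the cases worth auditing.
        if self.judge_pass and not self.deterministic_pass:
            return "judge_over_credit_or_equivalence"
        # Judge rejects what deterministic scoring accepts.
        if self.deterministic_pass and not self.judge_pass:
            return "judge_under_credit"
        return "agreement"
```

Tagging each case with a direction like this keeps the audit sample stratified: over-credit and under-credit cases usually call for different prompt or policy fixes.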

Possible Solutions

  1. Recommended: add disagreement logging, targeted audits, and judge calibration reports. Persist disagreement cases between deterministic and judge-based scoring, then review a sample regularly to calibrate prompts, policies, and confidence in the metric.

Why this is recommended:

  • gives direct visibility into judge reliability
  • helps prevent chasing judge noise
  • supports better leaderboard interpretation
  2. Trust the judge whenever deterministic scoring is uncertain.

Trade-off: simple, but it hides measurement risk and can create false confidence.

  3. Remove judge-based scoring entirely.

Trade-off: deterministic-only scoring is cleaner, but it gives up useful nuance on genuinely hard cases.
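
The recommended option's persistence step could look like the following minimal sketch. Assumptions: eval results are available as dicts with `deterministic_pass` and `judge_pass` booleans, and a JSONL file is an acceptable artifact format; none of these names come from the existing system:

```python
import json
from pathlib import Path


def log_disagreements(results, out_path):
    """Persist cases where deterministic and judge verdicts differ.

    `results` is assumed to be an iterable of dicts carrying at least
    'example_id', 'deterministic_pass', and 'judge_pass'; any extra keys
    (SQL text, judge rationale) ride along as audit evidence.
    """
    flagged = [r for r in results if r["deterministic_pass"] != r["judge_pass"]]
    with Path(out_path).open("w", encoding="utf-8") as f:
        for record in flagged:
            f.write(json.dumps(record) + "\n")
    return flagged
```

Writing one JSON object per line keeps the artifact easy to sample from during a periodic review, and the same file can feed a dashboard slice later.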

Plan

  1. Define which disagreement cases should be captured automatically.
  2. Persist disagreement flags and relevant evidence into eval artifacts.
  3. Add a review workflow or dashboard slice for disagreement audits.
  4. Review a sample of disagreements and refine the judge prompt or policy if repeated failure patterns appear.
  5. Document when the team should trust deterministic scoring, judge scoring, or manual review.
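
For the calibration report in the plan above, a simple starting point is overall agreement rate plus Cohen's kappa, which corrects for chance agreement between two binary raters. A sketch, assuming verdict pairs have already been extracted from the eval artifacts:

```python
def calibration_report(pairs):
    """Agreement rate and Cohen's kappa for deterministic vs judge verdicts.

    `pairs` is a non-empty list of (deterministic_pass, judge_pass) booleans,
    one per eval example.
    """
    n = len(pairs)
    p_o = sum(d == j for d, j in pairs) / n          # observed agreement
    p_det = sum(d for d, _ in pairs) / n             # deterministic pass rate
    p_judge = sum(j for _, j in pairs) / n           # judge pass rate
    # Expected chance agreement for two binary raters.
    p_e = p_det * p_judge + (1 - p_det) * (1 - p_judge)
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return {"n": n, "agreement_rate": p_o, "kappa": kappa}
```

Tracking kappa over time, rather than raw agreement alone, helps distinguish a genuinely reliable judge from one that merely agrees because most examples are easy.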

Success criteria

  • disagreement cases are visible rather than hidden
  • the team has a repeatable way to audit judge reliability
  • leaderboard interpretation becomes more trustworthy over time

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - eval governance task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared