Add judge calibration and disagreement audits for text-to-SQL evals
Problem
As the eval stack leans more on LLM judging for edge cases, it becomes harder to tell whether a leaderboard movement reflects a real model improvement or a noisy judge decision. If deterministic scoring and LLM judging disagree often, the team needs to know where and why before treating those results as reliable.
We need a calibration and disagreement audit path so the judge remains an explicit, auditable measurement tool rather than an unexamined source of truth.
Context
The current eval system already combines deterministic signals and LLM-based judgment. That hybrid design is reasonable, but it also creates an operational question: when the two disagree, which signal should the team trust, and how often does that happen?
Likely disagreement sources include:
- SQL that is structurally different but semantically equivalent
- cases where deterministic scoring misses subtle semantic alignment
- cases where the judge over-credits a query that is not actually correct
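The first source is the most common in practice: two queries can differ textually and structurally yet return identical results. A minimal sketch of that failure mode, using a hypothetical in-memory table (the schema and queries are illustrative, not from the real eval set):

```python
# Two structurally different but semantically equivalent queries.
# A deterministic exact-match comparison flags them as different,
# while executing both against the same data shows identical results.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "paid"), (2, 5.0, "open"), (3, 7.5, "paid")],
)

q_gold = "SELECT id FROM orders WHERE status = 'paid' AND amount > 6"
q_pred = "SELECT id FROM orders WHERE amount > 6 AND status IN ('paid')"

assert q_gold != q_pred                # string comparison: mismatch
rows_gold = sorted(conn.execute(q_gold))
rows_pred = sorted(conn.execute(q_pred))
assert rows_gold == rows_pred          # execution: equivalent
```

Execution-based comparison catches this case, but not queries that happen to agree only on this particular data, which is one reason judge-based scoring stays in the mix.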
This task should not aim to replace the judge. It should make judge behavior inspectable and bounded.
Possible Solutions
- Recommended: add disagreement logging, targeted audits, and judge calibration reports. Persist disagreement cases between deterministic and judge-based scoring, then review a sample regularly to calibrate prompts, policies, and confidence in the metric.
Why this is recommended:
- gives direct visibility into judge reliability
- helps prevent chasing judge noise
- supports better leaderboard interpretation
- Trust the judge whenever deterministic scoring is uncertain.
Trade-off: simple, but it hides measurement risk and can create false confidence.
- Remove judge-based scoring entirely.
Trade-off: deterministic-only scoring is cleaner, but it gives up useful nuance on genuinely hard cases.
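The recommended option can be sketched as a small logging layer. All names here (`ScoredCase`, `persist_disagreements`, the JSONL layout) are hypothetical, assuming the eval harness already produces a deterministic verdict and a judge verdict per case:

```python
# Hypothetical sketch: flag and persist disagreements between
# deterministic scoring and LLM judging, with evidence attached
# so a reviewer can audit each case later.
import json
from dataclasses import dataclass, asdict

@dataclass
class ScoredCase:
    case_id: str
    predicted_sql: str
    deterministic_pass: bool   # e.g. execution-result match
    judge_pass: bool           # LLM judge verdict
    judge_rationale: str       # evidence for the audit

def find_disagreements(cases):
    """Return only the cases where the two scorers disagree."""
    return [c for c in cases if c.deterministic_pass != c.judge_pass]

def persist_disagreements(cases, path):
    """Write disagreement cases as JSON lines for later audit sampling."""
    with open(path, "w") as f:
        for c in find_disagreements(cases):
            f.write(json.dumps(asdict(c)) + "\n")
```

Persisting the rationale alongside the verdicts is the key design choice: it lets the audit distinguish "judge saw real semantic equivalence" from "judge over-credited a wrong query" without re-running the eval.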
Plan
- Define which disagreement cases should be captured automatically.
- Persist disagreement flags and relevant evidence into eval artifacts.
- Add a review workflow or dashboard slice for disagreement audits.
- Review a sample of disagreements and refine the judge prompt or policy if repeated failure patterns appear.
- Document when the team should trust deterministic scoring, judge scoring, or manual review.
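The calibration report in the plan could start with two simple numbers: observed agreement and chance-corrected agreement (Cohen's kappa) between the two scorers. A dependency-free sketch, assuming paired boolean verdicts are available for every case (the function name and dict keys are illustrative):

```python
# Hypothetical calibration report over paired verdicts.
# Cohen's kappa is computed by hand to stay dependency-free.
def calibration_report(pairs):
    """pairs: list of (deterministic_pass, judge_pass) booleans, one per case."""
    n = len(pairs)
    p_o = sum(d == j for d, j in pairs) / n      # observed agreement
    # Chance agreement from each scorer's marginal pass rate.
    det_rate = sum(d for d, _ in pairs) / n
    judge_rate = sum(j for _, j in pairs) / n
    p_e = det_rate * judge_rate + (1 - det_rate) * (1 - judge_rate)
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return {"n": n, "agreement": p_o, "kappa": kappa}
```

Tracking kappa over time matters because raw agreement is inflated when most cases are easy passes; a high agreement rate with low kappa would signal that the judge adds little beyond chance on the contested cases.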
Success criteria
- disagreement cases are visible rather than hidden
- the team has a repeatable way to audit judge reliability
- leaderboard interpretation becomes more trustworthy over time
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - eval governance task.
Review Feedback
No review feedback yet.
- [ ] Review cleared