Expand text-to-SQL benchmark with paraphrase clusters
Problem
The current benchmark mostly measures whether the system can answer one phrasing of each question correctly. That is useful, but it leaves a gap between benchmark performance and robustness to real user phrasing. A system can quietly overfit to the visible benchmark wording while still failing on semantically equivalent paraphrases.
We need paraphrase clusters so the benchmark can measure robustness to wording variation, not just performance on one canonical prompt per intent.
Context
The cleaned text-to-SQL benchmark gives us a base set of logical query intents. That means we do not need to invent new intent categories from scratch. We need to expand selected cases into clusters of equivalent phrasings.
Paraphrase clusters should help answer questions like:
- does retrieval still find the right working set when wording changes
- does planning remain stable across phrasing changes
- does the generator depend on benchmark-specific wording quirks
This should remain benchmark-disciplined. The goal is not infinite paraphrase generation. The goal is a curated robustness slice.
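As a concrete illustration of what a curated cluster could look like, here is a minimal sketch. The field names (`intent_id`, `gold_sql`, `paraphrases`) and the example intent are hypothetical, not the actual benchmark schema; the point is that siblings share one semantic target and one gold answer expectation.

```python
# Hypothetical sketch of a paraphrase cluster record.
# Field names and the example intent are illustrative only.
from dataclasses import dataclass, field


@dataclass
class ParaphraseCluster:
    intent_id: str  # shared identifier for all sibling phrasings
    gold_sql: str   # single gold answer expectation for the whole cluster
    paraphrases: list = field(default_factory=list)  # 3-5 equivalent phrasings


cluster = ParaphraseCluster(
    intent_id="orders_total_by_region",
    gold_sql="SELECT region, SUM(total) FROM orders GROUP BY region;",
    paraphrases=[
        "What are the total order amounts per region?",
        "Break down order totals by region.",
        "Show me, for each region, the sum of order totals.",
    ],
)
```

Keeping one gold answer per cluster (rather than per phrasing) is what separates wording variation from new semantic cases.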
Possible Solutions
- Recommended: create curated paraphrase clusters for a representative benchmark slice. Expand selected benchmark cases into 3-5 equivalent phrasings while keeping the same semantic target and gold answer expectations.
Why this is recommended:
- directly measures robustness
- keeps curation manageable
- avoids confusing new semantic cases with wording variation
- Generate paraphrases automatically at large scale and trust them.
Trade-off: cheap, but quality and equivalence drift become a problem quickly.
- Keep one phrasing per intent forever.
Trade-off: simplest, but it under-measures real-world robustness.
Plan
- Choose a representative benchmark slice across schemas and difficulty levels.
- Create paraphrase clusters that preserve the same intended answer but vary wording style, specificity, and ordering.
- Extend the benchmark schema so paraphrase siblings can be grouped under a shared intent identifier.
- Update eval analysis so runs can be summarized by both case and paraphrase cluster.
- Use the new slice to compare robustness of full-context, retrieval, and bounded non-one-shot approaches.
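The eval-analysis step above could be sketched roughly as follows, assuming per-run results are keyed by a shared intent identifier. The `(intent_id, passed)` record shape and the metric names are assumptions for illustration, not the existing eval report format.

```python
# Minimal sketch of cluster-level robustness summarization.
# The (intent_id, passed) record shape is hypothetical.
from collections import defaultdict


def summarize_by_cluster(results):
    """results: iterable of (intent_id, passed) pairs, one per paraphrase run."""
    clusters = defaultdict(list)
    for intent_id, passed in results:
        clusters[intent_id].append(passed)
    summary = {}
    for intent_id, outcomes in clusters.items():
        summary[intent_id] = {
            # per-paraphrase accuracy within the cluster
            "pass_rate": sum(outcomes) / len(outcomes),
            # cluster counts as robust only if every phrasing passes
            "consistent": all(outcomes),
        }
    return summary


results = [
    ("orders_total_by_region", True),
    ("orders_total_by_region", True),
    ("orders_total_by_region", False),  # one paraphrase fails
]
summary = summarize_by_cluster(results)
```

Reporting both per-case pass rate and the stricter all-phrasings-pass consistency makes wording-only overfitting visible: a system that memorized the canonical prompt scores well on the former and poorly on the latter.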
Success criteria
- a meaningful subset of benchmark cases has curated paraphrase clusters
- eval reports can compare robustness within a shared intent cluster
- the benchmark becomes harder to overfit by wording alone
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - benchmark curation task.
Review Feedback
No review feedback yet.
- [ ] Review cleared