Expand text-to-SQL benchmark with paraphrase clusters
Problem
The current benchmark mostly measures whether the system can answer one phrasing of each question correctly. That is useful, but it leaves a gap between benchmark performance and robustness to real user phrasing. A system can quietly overfit to the visible benchmark wording while still failing on semantically equivalent paraphrases.
We need paraphrase clusters so the benchmark can measure robustness to wording variation, not just performance on one canonical prompt per intent.
Context
The cleaned text-to-SQL benchmark gives us a base set of logical query intents. That means we do not need to invent new intent categories from scratch. We need to expand selected cases into clusters of equivalent phrasings.
Paraphrase clusters should help answer questions like:
- does retrieval still find the right working set when wording changes
- does planning remain stable across phrasing changes
- does the generator depend on benchmark-specific wording quirks
This should remain benchmark-disciplined. The goal is not infinite paraphrase generation. The goal is a curated robustness slice.
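As a concrete illustration of what a curated cluster could look like, here is a minimal sketch. The field names (`intent_id`, `gold_sql`, `paraphrases`) and the example intent are hypothetical, not the actual benchmark schema; the point is that siblings share one semantic target and one gold answer expectation.

```python
# Hypothetical sketch of a paraphrase cluster record.
# Field names and the example intent are illustrative only.
from dataclasses import dataclass, field


@dataclass
class ParaphraseCluster:
    intent_id: str  # shared identifier for all sibling phrasings
    gold_sql: str   # single gold answer expectation for the whole cluster
    paraphrases: list = field(default_factory=list)  # 3-5 equivalent phrasings


cluster = ParaphraseCluster(
    intent_id="orders_total_by_region",
    gold_sql="SELECT region, SUM(total) FROM orders GROUP BY region;",
    paraphrases=[
        "What are the total order amounts per region?",
        "Break down order totals by region.",
        "Show me, for each region, the sum of order totals.",
    ],
)
```

Keeping one gold answer per cluster (rather than per phrasing) is what separates wording variation from new semantic cases.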
Possible Solutions
- Recommended: create curated paraphrase clusters for a representative benchmark slice. Expand selected benchmark cases into 3-5 equivalent phrasings while keeping the same semantic target and gold answer expectations.
Why this is recommended:
- directly measures robustness
- keeps curation manageable
- avoids confusing new semantic cases with wording variation
- Generate paraphrases automatically at large scale and trust them.
Trade-off: cheap, but quality and equivalence drift become a problem quickly.
- Keep one phrasing per intent forever.
Trade-off: simplest, but it under-measures real-world robustness.
Plan
- Choose a representative benchmark slice across schemas and difficulty levels.
- Create paraphrase clusters that preserve the same intended answer but vary wording style, specificity, and ordering.
- Extend the benchmark schema so paraphrase siblings can be grouped under a shared intent identifier.
- Update eval analysis so runs can be summarized by both case and paraphrase cluster.
- Use the new slice to compare robustness of full-context, retrieval, and bounded non-one-shot approaches.
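The eval-analysis step above could be sketched roughly as follows, assuming per-run results are keyed by a shared intent identifier. The `(intent_id, passed)` record shape and the metric names are assumptions for illustration, not the existing eval report format.

```python
# Minimal sketch of cluster-level robustness summarization.
# The (intent_id, passed) record shape is hypothetical.
from collections import defaultdict


def summarize_by_cluster(results):
    """results: iterable of (intent_id, passed) pairs, one per paraphrase run."""
    clusters = defaultdict(list)
    for intent_id, passed in results:
        clusters[intent_id].append(passed)
    summary = {}
    for intent_id, outcomes in clusters.items():
        summary[intent_id] = {
            # per-paraphrase accuracy within the cluster
            "pass_rate": sum(outcomes) / len(outcomes),
            # cluster counts as robust only if every phrasing passes
            "consistent": all(outcomes),
        }
    return summary


results = [
    ("orders_total_by_region", True),
    ("orders_total_by_region", True),
    ("orders_total_by_region", False),  # one paraphrase fails
]
summary = summarize_by_cluster(results)
```

Reporting both per-case pass rate and the stricter all-phrasings-pass consistency makes wording-only overfitting visible: a system that memorized the canonical prompt scores well on the former and poorly on the latter.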
Success criteria
- a meaningful subset of benchmark cases has curated paraphrase clusters
- eval reports can compare robustness within a shared intent cluster
- the benchmark becomes harder to overfit by wording alone
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - benchmark curation task.
Review Feedback
No review feedback yet.
- [ ] Review cleared