Add business-semantic ambiguity slices to text-to-SQL benchmark
Problem
Many of the hardest real-world text-to-SQL failures are not schema-lookup failures at all. They come from business-semantic ambiguity: the schema may be easy to inspect, but the question still depends on choosing the right business interpretation such as gross vs net revenue, booked vs recognized revenue, active vs all customers, or first vs repeat orders.
If the benchmark mostly rewards straightforward structural mapping, it will overestimate system quality on the hardest analyst-style questions.
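A minimal sketch of the failure mode, using an invented `orders` table: the schema is trivial to inspect, yet "total revenue" still has two plausible business readings that return different answers.

```python
# Hypothetical illustration of business-semantic ambiguity: the schema is
# easy to look up, but "total revenue" has two defensible interpretations.
# Table and column names are invented for this sketch.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, amount REAL, discount REAL, refunded INTEGER);
INSERT INTO orders VALUES
  (1, 100.0, 10.0, 0),
  (2, 200.0,  0.0, 1),
  (3,  50.0,  5.0, 0);
""")

# Interpretation A: gross revenue (all orders, before discounts and refunds)
gross = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Interpretation B: net revenue (discounts applied, refunded orders excluded)
net = conn.execute(
    "SELECT SUM(amount - discount) FROM orders WHERE refunded = 0"
).fetchone()[0]

print(gross, net)  # 350.0 vs 135.0 — same schema, different business meaning
```

A structural-mapping benchmark would score either query as "found the right table"; only a semantic slice can penalize picking the unintended interpretation.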
Context
The existing benchmark work focuses on cleaned SQL cases and future retrieval/planning supervision. That gives us a solid structural baseline, but it still under-represents business-definition ambiguity.
These ambiguity slices matter because they stress:
- retrieval of business definitions, not just schema names
- planner choice among plausible semantic targets
- a model's tendency to guess the closest-looking metric
This task should add a focused slice of business-semantic cases rather than trying to solve all semantic modeling problems in one benchmark expansion.
Possible Solutions
- Recommended: curate a benchmark slice specifically for business-semantic ambiguity. Add cases where multiple plausible metrics or interpretations exist and success depends on selecting the intended business meaning, not merely locating the right table.
Why this is recommended:
- fills a major realism gap in the benchmark
- creates a better testbed for future semantic retrieval work
- remains compatible with the existing benchmark structure
- Treat existing benchmark cases as sufficient.
Trade-off: simpler, but it leaves one of the most important real-world failure modes under-measured.
- Fold semantic ambiguity into every benchmark case immediately.
Trade-off: too broad. A focused slice is easier to define and maintain as a first step.
Plan
- Identify the most common business-semantic ambiguity patterns worth representing.
- Curate a small but high-signal case set spanning those patterns.
- Add any needed benchmark metadata so these cases can be grouped and analyzed separately.
- Use the slice to evaluate whether retrieval, planning, or business-definition context work is improving the right class of problems.
- Keep the slice distinct from simpler schema-only cases so the benchmark can report both structural and semantic difficulty.
Example ambiguity types
- gross vs net revenue
- booked vs recognized revenue
- active vs all customers
- first purchase vs repeat purchase
- order date vs ship date vs invoice date
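The date-field ambiguity in the last bullet can be made concrete with an invented two-row table: "January revenue" changes depending on which date column anchors the filter.

```python
# Hypothetical illustration of date-field ambiguity: the same question
# ("revenue in January") yields different totals depending on whether it is
# anchored on order_date, ship_date, or invoice_date. Data is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, amount REAL,
                     order_date TEXT, ship_date TEXT, invoice_date TEXT);
INSERT INTO orders VALUES
  (1, 100.0, '2024-01-28', '2024-02-02', '2024-02-05'),
  (2, 200.0, '2024-01-10', '2024-01-15', '2024-01-20');
""")

q = "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE {col} LIKE '2024-01%'"
totals = {col: conn.execute(q.format(col=col)).fetchone()[0]
          for col in ("order_date", "ship_date", "invoice_date")}
print(totals)
# {'order_date': 300.0, 'ship_date': 200.0, 'invoice_date': 200.0}
```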
Success criteria
- the benchmark contains an explicit business-semantic ambiguity slice
- eval reporting can isolate semantic ambiguity failures from simpler schema failures
- future semantic grounding work has a clear target slice to improve against
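The second criterion can be sketched as a per-slice accuracy rollup. The result-record shape (`slice`/`correct` keys) is an assumption for illustration, not the benchmark's actual output format.

```python
# Hypothetical sketch of slice-aware eval reporting: given per-case results
# tagged with a slice label, accuracy is computed per slice so semantic
# failures are not averaged away inside overall accuracy. Keys are invented.
from collections import defaultdict

results = [
    {"slice": "structural", "correct": True},
    {"slice": "structural", "correct": True},
    {"slice": "semantic_ambiguity", "correct": True},
    {"slice": "semantic_ambiguity", "correct": False},
]

def accuracy_by_slice(results):
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["slice"]] += 1
        hits[r["slice"]] += int(r["correct"])
    return {s: hits[s] / totals[s] for s in totals}

report = accuracy_by_slice(results)
print(report)  # {'structural': 1.0, 'semantic_ambiguity': 0.5}
```

Here overall accuracy would read 0.75, masking that the system is at chance on the semantic slice; reporting both numbers is the point of keeping the slice distinct.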
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - benchmark curation task.
Review Feedback
No review feedback yet.
- [ ] Review cleared