Add business-semantic ambiguity slices to text-to-SQL benchmark
Problem
Many of the hardest real-world text-to-SQL failures are not schema-lookup failures at all. They come from business-semantic ambiguity: the schema may be easy to inspect, but the question still depends on choosing the right business interpretation such as gross vs net revenue, booked vs recognized revenue, active vs all customers, or first vs repeat orders.
If the benchmark mostly rewards straightforward structural mapping, it will overestimate system quality on the hardest analyst-style questions.
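A minimal sketch of the failure mode, using an invented `orders` table: the schema is trivial to inspect, yet "total revenue" still has two plausible business readings that return different answers.

```python
# Hypothetical illustration of business-semantic ambiguity: the schema is
# easy to look up, but "total revenue" has two defensible interpretations.
# Table and column names are invented for this sketch.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, amount REAL, discount REAL, refunded INTEGER);
INSERT INTO orders VALUES
  (1, 100.0, 10.0, 0),
  (2, 200.0,  0.0, 1),
  (3,  50.0,  5.0, 0);
""")

# Interpretation A: gross revenue (all orders, before discounts and refunds)
gross = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Interpretation B: net revenue (discounts applied, refunded orders excluded)
net = conn.execute(
    "SELECT SUM(amount - discount) FROM orders WHERE refunded = 0"
).fetchone()[0]

print(gross, net)  # 350.0 vs 135.0 — same schema, different business meaning
```

A structural-mapping benchmark would score either query as "found the right table"; only a semantic slice can penalize picking the unintended interpretation.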
Context
The existing benchmark work focuses on cleaned SQL cases and future retrieval/planning supervision. That gives us a solid structural baseline, but it still under-represents business-definition ambiguity.
These ambiguity slices matter because they stress:
- retrieval of business definitions, not just schema names
- planner choice among plausible semantic targets
- a model's tendency to guess the closest-looking metric
This task should add a focused slice of business-semantic cases rather than trying to solve all semantic modeling problems in one benchmark expansion.
Possible Solutions
- Recommended: curate a benchmark slice specifically for business-semantic ambiguity. Add cases where multiple plausible metrics or interpretations exist and success depends on selecting the intended business meaning, not merely locating the right table.
Why this is recommended:
- fills a major realism gap in the benchmark
- creates a better testbed for future semantic retrieval work
- remains compatible with the existing benchmark structure
- Treat existing benchmark cases as sufficient.
Trade-off: simpler, but it leaves one of the most important real-world failure modes under-measured.
- Fold semantic ambiguity into every benchmark case immediately.
Trade-off: too broad. A focused slice is easier to define and maintain as a first step.
Plan
- Identify the most common business-semantic ambiguity patterns worth representing.
- Curate a small but high-signal case set spanning those patterns.
- Add any needed benchmark metadata so these cases can be grouped and analyzed separately.
- Use the slice to evaluate whether retrieval, planning, or business-definition context work is improving the right class of problems.
- Keep the slice distinct from simpler schema-only cases so the benchmark can report both structural and semantic difficulty.
Example ambiguity types
- gross vs net revenue
- booked vs recognized revenue
- active vs all customers
- first purchase vs repeat purchase
- order date vs ship date vs invoice date
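The date-field ambiguity in the last bullet can be made concrete with an invented two-row table: "January revenue" changes depending on which date column anchors the filter.

```python
# Hypothetical illustration of date-field ambiguity: the same question
# ("revenue in January") yields different totals depending on whether it is
# anchored on order_date, ship_date, or invoice_date. Data is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, amount REAL,
                     order_date TEXT, ship_date TEXT, invoice_date TEXT);
INSERT INTO orders VALUES
  (1, 100.0, '2024-01-28', '2024-02-02', '2024-02-05'),
  (2, 200.0, '2024-01-10', '2024-01-15', '2024-01-20');
""")

q = "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE {col} LIKE '2024-01%'"
totals = {col: conn.execute(q.format(col=col)).fetchone()[0]
          for col in ("order_date", "ship_date", "invoice_date")}
print(totals)
# {'order_date': 300.0, 'ship_date': 200.0, 'invoice_date': 200.0}
```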
Success criteria
- the benchmark contains an explicit business-semantic ambiguity slice
- eval reporting can isolate semantic ambiguity failures from simpler schema failures
- future semantic grounding work has a clear target slice to improve against
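The second criterion can be sketched as a per-slice accuracy rollup. The result-record shape (`slice`/`correct` keys) is an assumption for illustration, not the benchmark's actual output format.

```python
# Hypothetical sketch of slice-aware eval reporting: given per-case results
# tagged with a slice label, accuracy is computed per slice so semantic
# failures are not averaged away inside overall accuracy. Keys are invented.
from collections import defaultdict

results = [
    {"slice": "structural", "correct": True},
    {"slice": "structural", "correct": True},
    {"slice": "semantic_ambiguity", "correct": True},
    {"slice": "semantic_ambiguity", "correct": False},
]

def accuracy_by_slice(results):
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["slice"]] += 1
        hits[r["slice"]] += int(r["correct"])
    return {s: hits[s] / totals[s] for s in totals}

report = accuracy_by_slice(results)
print(report)  # {'structural': 1.0, 'semantic_ambiguity': 0.5}
```

Here overall accuracy would read 0.75, masking that the system is at chance on the semantic slice; reporting both numbers is the point of keeping the slice distinct.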
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - benchmark curation task.
Review Feedback
No review feedback yet.
- [ ] Review cleared