
Add business-semantic ambiguity slices to text-to-SQL benchmark

ID: MCP_ANALYST_AGENT-ADD_BUSINESS_SEMANTIC_AMBIGUITY_SLICES_TO_TEXT_TO_SQL_BENCHMARK
Status: not_started
Priority: p2
Milestone: m5-v1-2-launch
Owner: data-ai-engineer-architect

Problem

Many of the hardest real-world text-to-SQL failures are not schema-lookup failures at all. They come from business-semantic ambiguity: the schema may be easy to inspect, but the question still depends on choosing the right business interpretation such as gross vs net revenue, booked vs recognized revenue, active vs all customers, or first vs repeat orders.

If the benchmark mostly rewards straightforward structural mapping, it will overestimate system quality on the hardest analyst-style questions.

Context

The existing benchmark work focuses on cleaned SQL cases and future retrieval/planning supervision. That gives us a solid structural baseline, but it still under-represents business-definition ambiguity.

These ambiguity slices matter because they stress:

  • retrieval of business definitions, not just schema names
  • planner choice among plausible semantic targets
  • model tendency to guess the closest-looking metric

This task should add a focused slice of business-semantic cases rather than trying to solve all semantic modeling problems in one benchmark expansion.

Possible Solutions

  1. Recommended: curate a benchmark slice specifically for business-semantic ambiguity. Add cases where multiple plausible metrics or interpretations exist and success depends on selecting the intended business meaning, not merely locating the right table.

Why this is recommended:

  • fills a major realism gap in the benchmark
  • creates a better testbed for future semantic retrieval work
  • remains compatible with the existing benchmark structure
  2. Treat existing benchmark cases as sufficient.

Trade-off: simpler, but it leaves one of the most important real-world failure modes under-measured.

  3. Fold semantic ambiguity into every benchmark case immediately.

Trade-off: too broad. A focused slice is easier to define and maintain first.

Plan

  1. Identify the most common business-semantic ambiguity patterns worth representing.
  2. Curate a small but high-signal case set spanning those patterns.
  3. Add any needed benchmark metadata so these cases can be grouped and analyzed separately.
  4. Use the slice to evaluate whether retrieval, planning, or business-definition context work is improving the right class of problems.
  5. Keep the slice distinct from simpler schema-only cases so the benchmark can report both structural and semantic difficulty.
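The metadata mentioned in steps 2, 3, and 5 could take roughly this shape. A minimal sketch, assuming hypothetical field names (`difficulty_slice`, `ambiguity_type`, `intended_definition`); the real benchmark schema may differ.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkCase:
    case_id: str
    question: str
    gold_sql: str
    # "structural" for schema-only cases, "semantic" for the new slice
    difficulty_slice: str = "structural"
    # e.g. "gross_vs_net_revenue"; empty for structural cases
    ambiguity_type: str = ""
    # the business definition the gold SQL assumes, for error analysis
    intended_definition: str = ""

# Illustrative case; table and column names are invented.
case = BenchmarkCase(
    case_id="amb-001",
    question="What was revenue last quarter?",
    gold_sql="SELECT SUM(net_amount) FROM orders WHERE ...",
    difficulty_slice="semantic",
    ambiguity_type="gross_vs_net_revenue",
    intended_definition="net revenue: gross minus discounts and refunds",
)
```

Tagging cases this way keeps the slice distinct (step 5) without changing how existing structural cases are stored.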

Example ambiguity types

  • gross vs net revenue
  • booked vs recognized revenue
  • active vs all customers
  • first purchase vs repeat purchase
  • order date vs ship date vs invoice date
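To make the first ambiguity type concrete: one analyst question can map to two structurally valid SQL readings, and only the business definition disambiguates them. A sketch with invented table and column names:

```python
# One question, two plausible interpretations against the same schema.
question = "How much revenue did we make in March?"

candidate_sql = {
    "gross_revenue": """
        SELECT SUM(list_price * quantity)
        FROM order_items
        WHERE order_month = '2024-03'
    """,
    "net_revenue": """
        SELECT SUM(list_price * quantity - discount - refund_amount)
        FROM order_items
        WHERE order_month = '2024-03'
    """,
}

# Both queries reference real columns and parse cleanly; a case in the
# semantic slice is only correct if the system picks the intended reading.
```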

Success criteria

  • the benchmark contains an explicit business-semantic ambiguity slice
  • eval reporting can isolate semantic ambiguity failures from simpler schema failures
  • future semantic grounding work has a clear target slice to improve against
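The second criterion (isolating semantic failures in eval reporting) amounts to grouping accuracy by slice tag. A minimal sketch, assuming each eval result carries the case's slice label; the key names are hypothetical.

```python
from collections import defaultdict

def accuracy_by_slice(results):
    """results: iterable of dicts with 'slice' and 'correct' keys."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for r in results:
        totals[r["slice"]][1] += 1
        totals[r["slice"]][0] += int(r["correct"])
    return {s: correct / total for s, (correct, total) in totals.items()}

# Toy results: structural cases pass, semantic cases split.
results = [
    {"slice": "structural", "correct": True},
    {"slice": "structural", "correct": True},
    {"slice": "semantic", "correct": False},
    {"slice": "semantic", "correct": True},
]
print(accuracy_by_slice(results))  # {'structural': 1.0, 'semantic': 0.5}
```

Reporting the two numbers side by side is what lets future semantic grounding work show improvement on the right class of problems.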

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - benchmark curation task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared