Dataface Tasks

Add schema-linking supervision to text-to-SQL benchmark cases

ID: MCP_ANALYST_AGENT-ADD_SCHEMA_LINKING_SUPERVISION_TO_TEXT_TO_SQL_BENCHMARK_CASES
Status: not_started
Priority: p2
Milestone: m5-v1-2-launch
Owner: data-ai-engineer-architect

Problem

Retrieval-side gold labels tell us whether the right tables and columns were present overall, but they still do not show how the question maps onto those schema elements. For planning and retrieval diagnostics, we eventually need a finer-grained view of which parts of the question should connect to which tables, columns, metrics, and filters.

Without schema-linking supervision, it remains difficult to evaluate whether a planner or retriever understood the user's question structure or merely happened to surface the right objects.

Context

This task builds on related benchmark-enrichment work, such as retrieval-side gold labels. It is a deeper annotation layer that is most useful once the team already has:

  • benchmark cases with stable semantic targets
  • planning outputs worth comparing
  • retrieval/bundle metrics that need more diagnostic power

The supervision does not need to be perfect or exhaustive for every token. It needs to be useful enough to evaluate schema-linking quality on representative cases.

Possible Solutions

  1. Recommended: add lightweight schema-linking annotations for a representative benchmark slice. Map key question spans to their intended tables, columns, metrics, and filters so that planning and retrieval components can be scored more directly.

Why this is recommended:

  • creates a clear supervision target for schema linking
  • helps explain where planning or retrieval misunderstood the question
  • can be introduced incrementally on a high-value slice
  2. Infer schema linking only from final SQL.

Trade-off: cheaper, but too indirect for diagnosing retrieval and planning behavior.

  3. Try to annotate the entire benchmark exhaustively before using it.

Trade-off: too expensive. A representative slice is a better starting point.
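As a sketch of what a lightweight annotation for the recommended option might look like (the field names, span labels, and schema identifiers below are illustrative assumptions, not a fixed contract):

```python
# Illustrative schema-linking annotation for one benchmark case.
# All field names and schema identifiers here are hypothetical.
example_case = {
    "question": "What was total revenue by region in 2023?",
    "schema_links": [
        {"span": "total revenue", "kind": "metric", "target": "orders.revenue"},
        {"span": "by region",     "kind": "column", "target": "customers.region"},
        {"span": "in 2023",       "kind": "filter", "target": "orders.order_date"},
    ],
}

# Each link ties a question span to the schema element it should resolve to,
# which is the supervision signal planning/retrieval diagnostics would consume.
for link in example_case["schema_links"]:
    print(f'{link["span"]} -> {link["target"]} ({link["kind"]})')
```

A format like this stays cheap to produce by hand while still letting a scorer check, per question span, whether the planner or retriever found the intended table, column, metric, or filter.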

Plan

  1. Define a minimal schema-linking annotation contract for benchmark cases.
  2. Pick a representative slice where schema-linking quality matters most.
  3. Annotate key spans for:
     • tables
     • columns
     • metric concepts
     • filter values or dimensions
  4. Add helper scoring utilities that compare planner/retriever outputs to those annotations.
  5. Use the slice for deeper diagnostics on retrieval and planning experiments.
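The scoring utilities in step 4 could start as simple set comparisons between gold and predicted schema links. A minimal sketch, assuming links are flattened to `table.column`-style target strings (the helper name and signature are illustrative):

```python
# Hypothetical helper: set-based precision/recall of predicted schema links
# against gold annotations. Targets are assumed to be strings like "orders.revenue".
def link_scores(gold: set, predicted: set) -> dict:
    tp = len(gold & predicted)  # links both annotated and predicted
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return {"precision": precision, "recall": recall}

gold = {"orders.revenue", "customers.region", "orders.order_date"}
predicted = {"orders.revenue", "customers.region", "orders.ship_date"}
scores = link_scores(gold, predicted)
# 2 of 3 predictions match gold, and 2 of 3 gold links are recovered.
```

Per-kind breakdowns (tables vs. columns vs. metrics vs. filters) can be layered on later by scoring each kind's link set separately; the point is only that the annotations make this kind of direct comparison possible at all.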

Success criteria

  • at least one benchmark slice contains usable schema-linking supervision
  • retrieval and planning components can be scored more directly than final SQL alone allows
  • later semantic and planner work has a clearer supervision target

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - benchmark annotation task.

Review Feedback

No review feedback yet.

  • [ ] Review cleared