Add catalog discovery evals derived from SQL benchmark
Problem
Adapt the dbt SQL benchmark into search/catalog discovery eval cases by extracting expected tables from gold SQL and generating one or more search queries per case. Reuse the proven scoring model of recall@k, hit rate@k, and MRR, but wire it to Dataface search/catalog retrieval instead of ContextCatalog-specific runners.
Context
Repo boundary: this lives in Dataface
Like all eval tasks, this lives in the Dataface repo under apps/evals/catalog/. The eval prep step (apps/evals/catalog/prep.py — extract expected tables from gold SQL, generate search queries), the retrieval runner (apps/evals/catalog/runner.py), and the IR scorer (apps/evals/catalog/scorer.py) all live there. Results go to apps/evals/output/catalog/. The cleaned benchmark input comes from apps/evals/data/ (task 2).
The unified apps/evals/ CLI runs catalog evals via python -m apps.evals catalog ....
Scoring model: information retrieval metrics
This uses a completely different scoring model than the text-to-SQL eval (task 3):
- recall@k — what fraction of expected tables appear in the top-k results?
- hit rate@k — does at least one expected table appear in top-k?
- MRR (mean reciprocal rank) — how high does the first correct result rank?
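The three metrics above can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual scorer.py API; function names and signatures are assumptions.

```python
# Illustrative IR-metric sketch; not the actual scorer.py interface.

def recall_at_k(expected: set[str], ranked: list[str], k: int) -> float:
    """Fraction of expected tables that appear in the top-k results."""
    if not expected:
        return 0.0
    return len(expected & set(ranked[:k])) / len(expected)

def hit_rate_at_k(expected: set[str], ranked: list[str], k: int) -> float:
    """1.0 if at least one expected table appears in the top-k, else 0.0."""
    return 1.0 if expected & set(ranked[:k]) else 0.0

def mrr(expected: set[str], ranked: list[str]) -> float:
    """Reciprocal rank of the first correct result; 0.0 if none found."""
    for rank, table in enumerate(ranked, start=1):
        if table in expected:
            return 1.0 / rank
    return 0.0
```

For example, with expected tables `{"orders", "customers"}` and ranked results `["users", "orders", "events", "customers"]`: recall@3 is 0.5, hit rate@3 is 1.0, and the reciprocal rank is 0.5 (first hit at rank 2).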
Existing prior art
cto-research/context_catalog/evals/search_eval/prepare_dataset.py already does the core transformation: extract expected tables from gold SQL, generate search queries per table. Port this approach.
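The core transformation (gold SQL in, expected table set out) can be approximated with a regex over FROM/JOIN clauses. This is a hedged sketch of the idea only; the ported prep.py may well use a real SQL parser instead, and the names here are hypothetical.

```python
import re

# Hypothetical sketch: pull table references out of FROM and JOIN
# clauses of a gold query. A real implementation would likely use a
# proper SQL parser to handle subqueries, CTEs, and quoting.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)", re.IGNORECASE)

def extract_expected_tables(gold_sql: str) -> set[str]:
    # Strip schema prefixes, e.g. analytics.orders -> orders.
    return {ref.split(".")[-1] for ref in TABLE_REF.findall(gold_sql)}
```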
What this evaluates
The retrieval step before generation — can Dataface's catalog/search surface the right tables given a natural-language question? Text-to-SQL usually fails at retrieval, not generation. If catalog search can't find the right table, the SQL generator never had a chance.
The Dataface search surface is search_dashboards in dataface/ai/mcp/search.py and the catalog tool in dataface/ai/mcp/tools.py. This eval tests whether those tools (or future dedicated table-search tools) return the expected tables.
Dependencies
- Input comes from task 2 (cleaned benchmark in apps/evals/data/).
- Does not depend on task 3's runner framework — this is a separate, simpler eval loop with its own retrieval-specific scorer.

Possible Solutions
- Recommended: derive a standalone offline IR benchmark from the cleaned SQL benchmark, build a catalog corpus from the benchmark's gold SQL/table usage, and score retrieval with deterministic lexical ranking plus recall@k / hit rate@k / MRR.
- Alternative: call a live catalog/search tool during eval runs. Rejected because the task explicitly needs to remain offline and independent of warehouse execution.
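The recommended deterministic lexical ranking could be as simple as token overlap between the query and each table's searchable text, with a stable tie-break. This is a minimal sketch under assumed names; backends.py may use a more sophisticated scheme (e.g. BM25-style weighting).

```python
# Hypothetical lexical retriever: rank corpus tables by token overlap
# with the query; ties broken by table name so output is deterministic.
def rank_tables(query: str, corpus: dict[str, str], k: int = 10) -> list[str]:
    """corpus maps table name -> searchable text (name, columns, description)."""
    q_tokens = set(query.lower().split())
    scored = [
        (len(q_tokens & set(text.lower().split())), name)
        for name, text in corpus.items()
    ]
    # Sort by overlap descending, then by name ascending for stability.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [name for _, name in scored[:k]]
```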
Plan
- Add apps/evals/catalog/ with prep, retrieval backend, runner, scorer, and CLI wiring.
- Reuse the cleaned SQL benchmark under apps/evals/data/ to generate catalog benchmark and table-corpus JSONL artifacts.
- Add shared IR aggregation helpers under apps/evals/shared/.
- Add focused tests for prep, backend ranking, scorer metrics, and runner output.
Implementation Progress
- Added a catalog eval package under apps/evals/catalog/ with:
  - prep.py to extract expected tables from gold SQL and emit catalog benchmark/table-corpus JSONL artifacts
  - backends.py with a deterministic lexical retriever over the offline corpus
  - runner.py with an async offline eval loop
  - scorer.py for recall@k, hit rate@k, and MRR
  - cli.py plus unified apps/evals/cli.py wiring
- Added apps/evals/shared/ir_reporting.py so leaderboard/reporting code can aggregate IR metrics by slice.
- Added on-demand catalog artifact generation from apps/evals/data/dbt_sql_benchmark.jsonl so the derived catalog benchmark/table corpus can be regenerated locally without checking multi-megabyte duplicates into git.
- Validated the new catalog package with focused tests and a sample end-to-end run against 5 benchmark cases.
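The async offline eval loop mentioned above might look roughly like the following. This is a shape sketch only; the actual runner.py names, case schema, and backend interface are assumptions.

```python
import asyncio

# Hypothetical shape of the offline loop; real module names may differ.
async def run_case(backend, case: dict, k: int) -> dict:
    ranked = backend.search(case["query"], k=k)  # offline retrieval, no warehouse
    return {
        "case_id": case["id"],
        "expected": case["expected_tables"],
        "ranked": ranked,
    }

async def run_eval(backend, cases: list[dict], k: int = 10) -> list[dict]:
    # Offline lexical retrieval is cheap, so a plain gather suffices.
    return await asyncio.gather(*(run_case(backend, c, k) for c in cases))
```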
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A for browser QA. Validation: run against a sample of benchmark cases, verify recall@k and MRR outputs match manual spot-checks.
Review Feedback
- [ ] Review cleared