Add catalog discovery evals derived from SQL benchmark
Problem
Adapt the dbt SQL benchmark into search/catalog discovery eval cases by extracting expected tables from gold SQL and generating one or more search queries per case. Reuse the proven scoring model of recall@k, hit rate@k, and MRR, but wire it to Dataface search/catalog retrieval instead of ContextCatalog-specific runners.
Context
Repo boundary: this lives in Dataface
Like all eval tasks, this lives in the Dataface repo under apps/evals/catalog/. The eval prep step (apps/evals/catalog/prep.py — extract expected tables from gold SQL, generate search queries), the retrieval runner (apps/evals/catalog/runner.py), and the IR scorer (apps/evals/catalog/scorer.py) all live there. Results go to apps/evals/output/catalog/. The cleaned benchmark input comes from apps/evals/data/ (task 2).
The unified apps/evals/ CLI runs catalog evals via python -m apps.evals catalog ....
Scoring model: information retrieval metrics
This uses a completely different scoring model than the text-to-SQL eval (task 3):
- recall@k — what fraction of expected tables appear in the top-k results?
- hit rate@k — does at least one expected table appear in top-k?
- MRR (mean reciprocal rank) — how high does the first correct result rank?
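The three metrics above can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual scorer.py API; function names and signatures are assumptions.

```python
# Illustrative IR-metric sketch; not the actual scorer.py interface.

def recall_at_k(expected: set[str], ranked: list[str], k: int) -> float:
    """Fraction of expected tables that appear in the top-k results."""
    if not expected:
        return 0.0
    return len(expected & set(ranked[:k])) / len(expected)

def hit_rate_at_k(expected: set[str], ranked: list[str], k: int) -> float:
    """1.0 if at least one expected table appears in the top-k, else 0.0."""
    return 1.0 if expected & set(ranked[:k]) else 0.0

def mrr(expected: set[str], ranked: list[str]) -> float:
    """Reciprocal rank of the first correct result; 0.0 if none found."""
    for rank, table in enumerate(ranked, start=1):
        if table in expected:
            return 1.0 / rank
    return 0.0
```

For example, with expected tables `{"orders", "customers"}` and ranked results `["users", "orders", "events", "customers"]`: recall@3 is 0.5, hit rate@3 is 1.0, and the reciprocal rank is 0.5 (first hit at rank 2).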
Existing prior art
cto-research/context_catalog/evals/search_eval/prepare_dataset.py already does the core transformation: extract expected tables from gold SQL, generate search queries per table. Port this approach.
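The core transformation (gold SQL in, expected table set out) can be approximated with a regex over FROM/JOIN clauses. This is a hedged sketch of the idea only; the ported prep.py may well use a real SQL parser instead, and the names here are hypothetical.

```python
import re

# Hypothetical sketch: pull table references out of FROM and JOIN
# clauses of a gold query. A real implementation would likely use a
# proper SQL parser to handle subqueries, CTEs, and quoting.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)", re.IGNORECASE)

def extract_expected_tables(gold_sql: str) -> set[str]:
    # Strip schema prefixes, e.g. analytics.orders -> orders.
    return {ref.split(".")[-1] for ref in TABLE_REF.findall(gold_sql)}
```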
What this evaluates
The retrieval step before generation — can Dataface's catalog/search surface the right tables given a natural-language question? Text-to-SQL usually fails at retrieval, not generation. If catalog search can't find the right table, the SQL generator never had a chance.
The Dataface search surface is search_dashboards in dataface/ai/mcp/search.py and the catalog tool in dataface/ai/mcp/tools.py. This eval tests whether those tools (or future dedicated table-search tools) return the expected tables.
Dependencies
- Input comes from task 2 (cleaned benchmark in apps/evals/data/).
- Does not depend on task 3's runner framework — this is a separate, simpler eval loop with its own retrieval-specific scorer.

Possible Solutions
- Recommended: derive a standalone offline IR benchmark from the cleaned SQL benchmark, build a catalog corpus from the benchmark's gold SQL/table usage, and score retrieval with deterministic lexical ranking plus recall@k / hit rate@k / MRR.
- Alternative: call a live catalog/search tool during eval runs. Rejected because the task explicitly needs to remain offline and independent of warehouse execution.
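The recommended deterministic lexical ranking could be as simple as token overlap between the query and each table's searchable text, with a stable tie-break. This is a minimal sketch under assumed names; backends.py may use a more sophisticated scheme (e.g. BM25-style weighting).

```python
# Hypothetical lexical retriever: rank corpus tables by token overlap
# with the query; ties broken by table name so output is deterministic.
def rank_tables(query: str, corpus: dict[str, str], k: int = 10) -> list[str]:
    """corpus maps table name -> searchable text (name, columns, description)."""
    q_tokens = set(query.lower().split())
    scored = [
        (len(q_tokens & set(text.lower().split())), name)
        for name, text in corpus.items()
    ]
    # Sort by overlap descending, then by name ascending for stability.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [name for _, name in scored[:k]]
```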
Plan
- Add apps/evals/catalog/ with prep, retrieval backend, runner, scorer, and CLI wiring.
- Reuse the cleaned SQL benchmark under apps/evals/data/ to generate catalog benchmark and table-corpus JSONL artifacts.
- Add shared IR aggregation helpers under apps/evals/shared/.
- Add focused tests for prep, backend ranking, scorer metrics, and runner output.
Implementation Progress
- Added a catalog eval package under apps/evals/catalog/ with:
  - prep.py to extract expected tables from gold SQL and emit catalog benchmark/table-corpus JSONL artifacts
  - backends.py with a deterministic lexical retriever over the offline corpus
  - runner.py with an async offline eval loop
  - scorer.py for recall@k, hit rate@k, and MRR
  - cli.py plus unified apps/evals/cli.py wiring
- Added apps/evals/shared/ir_reporting.py so leaderboard/reporting code can aggregate IR metrics by slice.
- Added on-demand catalog artifact generation from apps/evals/data/dbt_sql_benchmark.jsonl so the derived catalog benchmark/table corpus can be regenerated locally without checking multi-megabyte duplicates into git.
- Validated the new catalog package with focused tests and a sample end-to-end run against 5 benchmark cases.
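The async offline eval loop mentioned above might look roughly like the following. This is a shape sketch only; the actual runner.py names, case schema, and backend interface are assumptions.

```python
import asyncio

# Hypothetical shape of the offline loop; real module names may differ.
async def run_case(backend, case: dict, k: int) -> dict:
    ranked = backend.search(case["query"], k=k)  # offline retrieval, no warehouse
    return {
        "case_id": case["id"],
        "expected": case["expected_tables"],
        "ranked": ranked,
    }

async def run_eval(backend, cases: list[dict], k: int = 10) -> list[dict]:
    # Offline lexical retrieval is cheap, so a plain gather suffices.
    return await asyncio.gather(*(run_case(backend, c, k) for c in cases))
```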
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A for browser QA. Validation: run against a sample of benchmark cases, verify recall@k and MRR outputs match manual spot-checks.
Review Feedback
- [ ] Review cleared