Dataface Tasks

Add catalog discovery evals derived from SQL benchmark

ID: MCP_ANALYST_AGENT-ADD_CATALOG_DISCOVERY_EVALS_DERIVED_FROM_SQL_BENCHMARK
Status: completed
Priority: p2
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Completed by: dave
Completed: 2026-03-18

Problem

Adapt the dbt SQL benchmark into search/catalog discovery eval cases by extracting expected tables from gold SQL and generating one or more search queries per case. Reuse the proven scoring model of recall@k, hit rate@k, and MRR, but wire it to Dataface search/catalog retrieval instead of ContextCatalog-specific runners.

Context

Repo boundary: this lives in Dataface

Like all eval tasks, this lives in the Dataface repo under apps/evals/catalog/. The eval prep step (apps/evals/catalog/prep.py — extract expected tables from gold SQL, generate search queries), the retrieval runner (apps/evals/catalog/runner.py), and the IR scorer (apps/evals/catalog/scorer.py) all live there. Results go to apps/evals/output/catalog/. The cleaned benchmark input comes from apps/evals/data/ (task 2).

The unified apps/evals/ CLI runs catalog evals via python -m apps.evals catalog ....

Scoring model: information retrieval metrics

This eval uses a completely different scoring model from the text-to-SQL eval (task 3):

  • recall@k — what fraction of expected tables appear in the top-k results?
  • hit rate@k — does at least one expected table appear in top-k?
  • MRR (mean reciprocal rank) — how high does the first correct result rank?
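The three metrics reduce to a few lines each. This sketch is illustrative; the function names are assumptions, not the actual scorer.py API.

```python
def recall_at_k(expected, results, k):
    """Fraction of expected tables that appear in the top-k results."""
    top = set(results[:k])
    return sum(1 for t in expected if t in top) / len(expected)

def hit_rate_at_k(expected, results, k):
    """1.0 if at least one expected table appears in the top-k, else 0.0."""
    return 1.0 if set(expected) & set(results[:k]) else 0.0

def reciprocal_rank(expected, results):
    """1/rank of the first expected table in the results (0.0 if absent).
    MRR is the mean of this value over all eval cases."""
    for rank, table in enumerate(results, start=1):
        if table in expected:
            return 1.0 / rank
    return 0.0

expected = {"orders", "customers"}
results = ["users", "orders", "payments", "customers"]
# recall@3 = 0.5 (only "orders" is in the top 3), hit rate@3 = 1.0,
# reciprocal rank = 0.5 (first hit at rank 2)
```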

Existing prior art

cto-research/context_catalog/evals/search_eval/prepare_dataset.py already does the core transformation: extract expected tables from gold SQL, generate search queries per table. Port this approach.
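The ported transformation could look roughly like the sketch below. The regex-based table extraction is a deliberate simplification (a real prep step would want a SQL parser such as sqlglot to handle CTEs, subqueries, and quoting), and the query-generation heuristic is an assumption, not the ported code.

```python
import re

# Simplified: captures identifiers after FROM/JOIN. Does not handle
# CTEs, subqueries, or quoted identifiers -- use a real SQL parser
# (e.g. sqlglot) in the actual prep step.
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([a-zA-Z_][\w.]*)", re.IGNORECASE)

def extract_expected_tables(gold_sql: str) -> list[str]:
    """Pull referenced table names out of a gold SQL query, de-duplicated,
    in order of first appearance."""
    seen, tables = set(), []
    for name in TABLE_RE.findall(gold_sql):
        if name.lower() not in seen:
            seen.add(name.lower())
            tables.append(name)
    return tables

def generate_search_queries(table: str) -> list[str]:
    """Turn a table name into simple natural-language-ish search queries
    (hypothetical heuristic)."""
    words = table.split(".")[-1].replace("_", " ")
    return [words, f"table with {words} data"]

sql = "SELECT * FROM analytics.orders o JOIN analytics.customers c ON o.cid = c.id"
# extract_expected_tables(sql) -> ["analytics.orders", "analytics.customers"]
```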

What this evaluates

The retrieval step before generation — can Dataface's catalog/search surface the right tables given a natural-language question? Text-to-SQL pipelines usually fail at retrieval, not generation: if catalog search can't find the right table, the SQL generator never had a chance.

The Dataface search surface is search_dashboards in dataface/ai/mcp/search.py and the catalog tool in dataface/ai/mcp/tools.py. This eval tests whether those tools (or future dedicated table-search tools) return the expected tables.

Dependencies

  • Input comes from task 2 (cleaned benchmark in apps/evals/data/).
  • Does not depend on task 3's runner framework — this is a separate, simpler eval loop with its own retrieval-specific scorer.

Possible Solutions

  • Recommended: derive a standalone offline IR benchmark from the cleaned SQL benchmark, build a catalog corpus from the benchmark's gold SQL/table usage, and score retrieval with deterministic lexical ranking plus recall@k / hit rate@k / MRR.
  • Alternative: call a live catalog/search tool during eval runs. Rejected because the task explicitly needs to remain offline and independent of warehouse execution.
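A deterministic lexical ranker of the recommended kind can be as simple as token-overlap scoring with alphabetical tie-breaking. This sketch stands in for backends.py; the corpus shape (table name mapped to description) is an assumption.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase, split snake_case, return the set of tokens."""
    return {t for t in text.lower().replace("_", " ").split() if t}

def rank_tables(query: str, corpus: dict[str, str], k: int = 10) -> list[str]:
    """Rank corpus entries (table name -> description) by token overlap
    with the query. Ties break alphabetically so runs are deterministic."""
    q = tokenize(query)
    scored = []
    for table, description in corpus.items():
        score = len(q & (tokenize(table) | tokenize(description)))
        scored.append((-score, table))
    scored.sort()
    return [table for _, table in scored[:k]]

corpus = {
    "orders": "one row per customer order",
    "customers": "customer master data",
    "payments": "payment transactions",
}
```

Because scoring and tie-breaking are both deterministic, the eval stays offline and reproducible with no warehouse or live-search dependency.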

Plan

  • Add apps/evals/catalog/ with prep, retrieval backend, runner, scorer, and CLI wiring.
  • Reuse the cleaned SQL benchmark under apps/evals/data/ to generate catalog benchmark and table-corpus JSONL artifacts.
  • Add shared IR aggregation helpers under apps/evals/shared/.
  • Add focused tests for prep, backend ranking, scorer metrics, and runner output.

Implementation Progress

  • Added a catalog eval package under apps/evals/catalog/ with:
      • prep.py to extract expected tables from gold SQL and emit catalog benchmark/table-corpus JSONL artifacts
      • backends.py with a deterministic lexical retriever over the offline corpus
      • runner.py with an async offline eval loop
      • scorer.py for recall@k, hit rate@k, and MRR
      • cli.py plus unified apps/evals/cli.py wiring
  • Added apps/evals/shared/ir_reporting.py so leaderboard/reporting code can aggregate IR metrics by slice.
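Aggregating IR metrics by slice reduces to grouping per-case scores and averaging each metric within the group. The input/output shapes below are assumptions about what ir_reporting.py might expose, not its real API.

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_slice(cases: list[dict]) -> dict[str, dict[str, float]]:
    """Group per-case metric dicts by their 'slice' key and average each
    metric within the group. Assumes every case in a slice reports the
    same metric keys."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for case in cases:
        groups[case["slice"]].append(case["metrics"])
    return {
        name: {m: mean(row[m] for row in rows) for m in rows[0]}
        for name, rows in groups.items()
    }

cases = [
    {"slice": "joins", "metrics": {"recall@5": 1.0, "mrr": 0.5}},
    {"slice": "joins", "metrics": {"recall@5": 0.5, "mrr": 1.0}},
    {"slice": "simple", "metrics": {"recall@5": 1.0, "mrr": 1.0}},
]
# aggregate_by_slice(cases)["joins"] -> {"recall@5": 0.75, "mrr": 0.75}
```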
  • Added on-demand catalog artifact generation from apps/evals/data/dbt_sql_benchmark.jsonl so the derived catalog benchmark/table corpus can be regenerated locally without checking multi-megabyte duplicates into git.
  • Validated the new catalog package with focused tests and a sample end-to-end run against 5 benchmark cases.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A for browser QA. Validation: run against a sample of benchmark cases and verify that the recall@k and MRR outputs match manual spot-checks.

Review Feedback

  • [ ] Review cleared