Benchmark-Driven Text-to-SQL and Discovery Evals
Objective
Set up a Dataface-native eval system centered on the Fivetran dbt SQL dataset for text-to-SQL generation and catalog/search discovery. Start with a cleaned SQL-only benchmark, add deterministic scoring and agent runners, and emit structured JSON/result tables that Dataface boards can analyze.
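The structured results could be one JSONL record per benchmark case, so boards can aggregate without a warehouse. A hypothetical record shape (field names are illustrative, not a fixed schema):

```json
{"case_id": "dbt-0042", "eval_type": "sql", "model": "one-shot", "score": 1.0, "latency_ms": 812}
```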
Milestone placement
This initiative is intentionally scoped to M2, not M1. M1 only needs a usable analyst workflow in the Fivetran analytics environment; it does not require text-to-SQL quality to be systematically benchmarked or hardened yet. The eval system becomes necessary in M2, when text-to-SQL and discovery quality must become reliable enough for repeated internal use and design-partner hardening.
Code location
All eval code lives under `apps/evals/` with subdirectories per eval type (`sql/`, `catalog/`, `agent/`), shared utilities (`shared/`), benchmark data (`data/`), results (`output/`), and leaderboard dashboards (`faces/`). One CLI entry point: `python -m apps.evals {sql,catalog,agent}`. See the leaderboard task for the full directory layout.
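The single entry point could be wired with stdlib `argparse` subcommands. A minimal sketch, assuming this file lives at `apps/evals/__main__.py` — the subcommand names come from the spec above, but the flags and defaults are hypothetical:

```python
# Hypothetical dispatcher for `python -m apps.evals {sql,catalog,agent}`.
# Only the three subcommand names are from the spec; flags are illustrative.
import argparse


def main(argv=None):
    parser = argparse.ArgumentParser(prog="python -m apps.evals")
    sub = parser.add_subparsers(dest="eval_type", required=True)
    for name in ("sql", "catalog", "agent"):
        p = sub.add_parser(name, help=f"run the {name} eval suite")
        p.add_argument("--benchmark", default="apps/evals/data/benchmark.jsonl")
        p.add_argument("--output", default="apps/evals/output/")
    args = parser.parse_args(argv)
    # A real dispatcher would import and run the matching runner here.
    return args


if __name__ == "__main__":
    main()
```

Returning the parsed namespace keeps the dispatcher testable without invoking a runner.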
Tasks
Infrastructure (build the measurement system)
- Create cleaned dbt SQL benchmark artifact — P1 — Import raw dataset from cto-research, filter to standard SQL, produce cleaned + canary JSONL
- Extract shared text-to-SQL generation function — P1 — Factor out `generate_sql()` so evals test the same code path as production
- Build text-to-SQL eval runner and deterministic scorer — P1 — Hybrid deterministic + LLM-as-judge scoring, no warehouse needed
- Build bounded non-one-shot text-to-SQL stack for local evals — P1 — Experimental plan/candidate/repair/select backend to benchmark against the one-shot generator
- Add catalog discovery evals derived from SQL benchmark — P2 — IR evals (recall@k, MRR) for Dataface search/catalog tools
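The deterministic half of the hybrid scorer might start as simple: normalize whitespace, casing, and trailing semicolons, then exact-match — with the LLM judge layered on top for near-misses. A stdlib-only sketch (normalization rules are assumptions, not the final scorer):

```python
# Illustrative deterministic SQL scorer; the real runner would combine this
# with an LLM-as-judge pass for semantically equivalent but non-identical SQL.
import re


def normalize_sql(sql: str) -> str:
    sql = sql.strip().rstrip(";")
    sql = re.sub(r"\s+", " ", sql)  # collapse runs of whitespace/newlines
    # Lowercase keywords/identifiers but leave single-quoted literals intact.
    parts = re.split(r"('(?:[^']|'')*')", sql)
    return "".join(p if p.startswith("'") else p.lower() for p in parts)


def deterministic_score(predicted: str, gold: str) -> float:
    """1.0 on normalized exact match, else 0.0 (defer to the judge)."""
    return 1.0 if normalize_sql(predicted) == normalize_sql(gold) else 0.0
```

Case-folding outside string literals matters: `WHERE name = 'Alice'` and `where name = 'alice'` must not be conflated.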
Visualization
- Set up eval leaderboard dft project and dashboards — P1 — Dataface dashboards over eval JSONL via DuckDB, served via `dft serve`
- ~~Persist eval outputs for Dataface analysis and boards~~ — CANCELLED — merged into leaderboard task above
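The leaderboard rollup the dashboards need is a per-model mean over the result records. A pure-Python stand-in for the DuckDB aggregation (record fields `model` and `score` are the hypothetical schema, not a fixed contract):

```python
# Stand-in for the DuckDB GROUP BY the dashboards would run over eval JSONL:
# per-model mean score, sorted best-first.
import json
from collections import defaultdict


def leaderboard(jsonl_lines):
    totals = defaultdict(lambda: [0.0, 0])  # model -> [score sum, case count]
    for line in jsonl_lines:
        rec = json.loads(line)
        t = totals[rec["model"]]
        t[0] += rec["score"]
        t[1] += 1
    return sorted(
        ((model, s / n) for model, (s, n) in totals.items()),
        key=lambda row: row[1],
        reverse=True,
    )
```

In production this stays in SQL (DuckDB reads the JSONL directly), but the shape of the rollup is the same.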
Dependency graph
```
cleaned benchmark (task 2)
        ↓
extract generate_sql ─────→ eval runner (task 3)
        ↓
bounded non-one-shot backend
        ↓
leaderboard dashboards
        ↓
[experimentation initiative uses these]
```
Catalog discovery evals (task 4) depend on the cleaned benchmark but use a separate scorer.
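That separate scorer reduces to the two standard IR metrics named above. A minimal sketch, where `ranked` is the ordered list of ids returned by search and `relevant` is the gold set derived from the SQL benchmark:

```python
# Standard IR metrics for the catalog/search discovery evals.


def recall_at_k(ranked, relevant, k):
    """Fraction of gold items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)


def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant result, 0.0 if none found."""
    for i, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0
```

A benchmark-level score would average each metric over all cases.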