Create cleaned dbt SQL benchmark artifact
Problem
Create a reproducible benchmark-prep step that imports the raw dbt dataset from cto-research, filters out AISQL rows, removes empty expected_sql rows, validates the remaining gold SQL, captures source metadata, and writes a stable cleaned JSONL artifact for Dataface text-to-SQL evals.
Context
Source dataset
- Source: `~/Fivetran/cto-research/sql_eval_datasets/dbt_dataset/output/dataset.jsonl` (confirmed — the `fivetran_dbt_eval_dataset` path does not exist).
- 10,087 questions across 72 schemas: 7,616 standard SQL, 2,471 AISQL.
- Record fields: `id`, `schema`, `text`, `expected_sql`, `complexity`, `category`, `tables_used`, `uses_aisql`, `notes`, `eval_criteria`.
- See `ai_notes/ai/FIVETRAN_SQL_EVAL_DATASETS_INVENTORY.md` and `ai_notes/ai/FIVETRAN_SQL_EVAL_DATASETS_FOR_DATAFACE.md` for full analysis.
Repo boundary
The only thing that comes from cto-research is the raw dataset JSONL. Everything else — the cleaning script, the cleaned artifact, the eval runner, the scorer, the dashboards — lives in Dataface.
- The prep script lives in `apps/evals/prep_benchmark.py`.
- It reads the raw JSONL from the cto-research checkout (path configurable, defaults to `~/Fivetran/cto-research/sql_eval_datasets/dbt_dataset/output/dataset.jsonl`).
- It writes the cleaned artifact to `apps/evals/data/` (checked in) so the eval runner doesn't depend on a cto-research checkout at runtime.
Unified eval directory
All eval code lives under `apps/evals/` with subdirectories per eval type (`sql/`, `catalog/`, `agent/`) plus `shared/` for cross-type utilities and `data/` for benchmark artifacts. See the leaderboard task for the full directory layout.
- If another team wants to compare against the same benchmark, they can use the same raw JSONL and the same scoring rubric (documented, not code-coupled). They don't need to import Dataface's runner.
AISQL strategy
Filter out AISQL rows (`uses_aisql=true`) entirely. Dataface doesn't use AISQL. AISQL rows can have their own cleaned artifact later if needed.
Gold SQL validity is part of the artifact contract
The cleaned benchmark should not blindly trust expected_sql. Every retained gold query should be validated as SQL before it becomes benchmark truth.
- First pass: parse every gold query with `sqlglot` using the BigQuery dialect. This is fast, deterministic, and easy to run in CI.
- Optional stronger pass: when BigQuery credentials are available, run a BigQuery dry run against the query. A dry run validates the query with BigQuery's own parser/planner without executing it.
- Record validation metadata in the cleaned artifact so downstream tasks can distinguish "kept but parse-only validated" from "BigQuery dry-run validated".
The benchmark-prep task owns this validation because bad gold SQL poisons every downstream score.
Canary subset
Produce both the full cleaned benchmark and a `canary.jsonl` subset (300–500 rows, stratified by complexity and category) for fast iteration and PR-level checks.
Dependencies
- No dependencies on other tasks. This is the first step in the eval chain.
- Downstream consumers: eval runner (task 3), catalog discovery evals (task 4), eval leaderboard dashboards.
- New dependency introduced here: `sqlglot`, for deterministic BigQuery-dialect parsing.
Possible Solutions
Recommended: Standalone Python script in `apps/evals/` — a simple script with `clean_dataset()` and `sample_canary()` functions, runnable as `python -m apps.evals.prep_benchmark`. No framework dependencies, no unnecessary abstractions. Produces deterministic output via sorted keys and a seeded RNG.
Alternative: Makefile/shell pipeline — rejected as less testable and harder to maintain.
Plan
- Create `apps/evals/prep_benchmark.py` with `clean_dataset()` and `sample_canary()` functions:
  - `clean_dataset()`: read raw JSONL, filter `uses_aisql=true` and empty `expected_sql`
  - `sample_canary()`: proportional stratified sample by (complexity, category), seeded RNG, exact target cap
- Write cleaned JSONL, canary JSONL, and metadata JSON to `apps/evals/data/`
- CLI with configurable `--raw`, `--out-dir`, `--canary-target`, `--seed`
- Tests in `tests/evals/test_prep_benchmark.py` (TDD — written first)
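The CLI surface described above could be wired up with stdlib `argparse` along these lines. The raw-path default comes from the task context; the other defaults and the `build_parser` name are assumptions for illustration.

```python
import argparse
from pathlib import Path

# Default raw path from the task context; remaining defaults are assumptions.
DEFAULT_RAW = Path.home() / "Fivetran/cto-research/sql_eval_datasets/dbt_dataset/output/dataset.jsonl"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="prep_benchmark",
        description="Clean the raw dbt dataset into benchmark artifacts.",
    )
    parser.add_argument("--raw", type=Path, default=DEFAULT_RAW,
                        help="Raw dataset JSONL in the cto-research checkout")
    parser.add_argument("--out-dir", type=Path, default=Path("apps/evals/data"),
                        help="Directory for cleaned, canary, and metadata outputs")
    parser.add_argument("--canary-target", type=int, default=400,
                        help="Target canary size (300-500 per the task)")
    parser.add_argument("--seed", type=int, default=42,
                        help="RNG seed for stratified sampling")
    return parser
```

Keeping the defaults in one parser makes `python -m apps.evals.prep_benchmark` reproducible out of the box while still letting CI override paths.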
Files created:
- apps/evals/__init__.py
- apps/evals/prep_benchmark.py
- apps/evals/data/dbt_sql_benchmark.jsonl (7,606 rows)
- apps/evals/data/dbt_sql_canary.jsonl (400 rows)
- apps/evals/data/dbt_sql_benchmark_meta.json
- tests/evals/__init__.py
- tests/evals/test_prep_benchmark.py
Implementation Progress
Cleaning logic
- Filters: `uses_aisql == true` → removed (2,471 rows); empty/whitespace `expected_sql` → removed (10 rows)
- Result: 7,606 cleaned rows across 71 schemas, 4 complexity levels, 11 categories
- No AISQL-and-empty overlap (0 rows), so total filtered = 2,481
Canary sampling
- Proportional stratified by (complexity, category) with `random.Random(seed=42)`
- Preserves broad stratum coverage while honoring the exact target size; output sorted by id for stability
- Result: 400 rows covering all 4 complexities and 11 categories across 69 schemas
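A minimal sketch of that sampling scheme, under the stated assumptions (proportional allocation with a per-stratum minimum of one, exact target cap, id-sorted output); the real allocation details may differ.

```python
import random
from collections import defaultdict


def sample_canary(rows: list[dict], target: int = 400, seed: int = 42) -> list[dict]:
    """Proportional stratified sample by (complexity, category), capped at target."""
    rng = random.Random(seed)
    strata: dict[tuple, list[dict]] = defaultdict(list)
    for row in rows:
        strata[(row["complexity"], row["category"])].append(row)
    picked: list[dict] = []
    for key in sorted(strata):  # sorted stratum order keeps sampling deterministic
        bucket = strata[key]
        share = max(1, round(target * len(bucket) / len(rows)))  # at least 1 per stratum
        picked.extend(rng.sample(bucket, min(share, len(bucket))))
    if len(picked) > target:  # rounding/minimums can overshoot; trim to the exact cap
        picked = rng.sample(picked, target)
    return sorted(picked, key=lambda r: r["id"])  # sorted by id for stable diffs
```

The final trim is what keeps many tiny strata from overshooting the configured target, the bug flagged in review.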
Metadata
`dbt_sql_benchmark_meta.json` captures source path, row counts, seed, and filter descriptions.
Tests (TDD)
- 7 tests written before implementation and review follow-up; all pass:
  - `test_clean_filters_aisql_and_empty_sql` — verifies filtering
  - `test_clean_preserves_all_fields` — verifies field preservation
  - `test_clean_raises_on_missing_file` — verifies error on bad path
  - `test_canary_size_within_bounds` — verifies 300–500 target
  - `test_canary_is_stratified` — verifies all strata represented
  - `test_canary_deterministic` — verifies seed reproducibility
  - `test_canary_respects_target_with_many_small_strata` — verifies exact target cap
Validation commands run
```shell
uv run pytest tests/evals/test_prep_benchmark.py -v   # 7 passed
just ci                                               # ✅ All CI checks passed
python -m apps.evals.prep_benchmark                   # Produced artifacts
# Verified: 7606 cleaned, 400 canary, 0 AISQL, 0 empty SQL
```
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A for browser QA. Validation: run the prep script, verify row counts, confirm no empty expected_sql or AISQL rows remain, and spot-check that validation metadata is populated from sqlglot parsing (and BigQuery dry run when available).
Review Feedback
- Fixed canary allocation so many tiny strata cannot overshoot the configured target.
- Hardened cleaning to skip `expected_sql: null` rows instead of crashing the prep step.
- Corrected the task record to point at `apps/evals/prep_benchmark.py` and the regenerated 400-row canary artifact.
- Review executed via a filtered equivalent of `just review` against the non-artifact diff after the final fixes.
- Review verdict: APPROVED, with only non-blocking notes about artifact strategy and optional test coverage.
- [x] Review cleared