Create cleaned dbt SQL benchmark artifact
Problem
Create a reproducible benchmark-prep step that imports the raw dbt dataset from cto-research, filters out AISQL rows, removes empty expected_sql rows, validates the remaining gold SQL, captures source metadata, and writes a stable cleaned JSONL artifact for Dataface text-to-SQL evals.
Context
Source dataset
- Source: `~/Fivetran/cto-research/sql_eval_datasets/dbt_dataset/output/dataset.jsonl` (confirmed — the `fivetran_dbt_eval_dataset` path does not exist).
- 10,087 questions across 72 schemas: 7,616 standard SQL, 2,471 AISQL.
- Record fields: `id`, `schema`, `text`, `expected_sql`, `complexity`, `category`, `tables_used`, `uses_aisql`, `notes`, `eval_criteria`.
- See `ai_notes/ai/FIVETRAN_SQL_EVAL_DATASETS_INVENTORY.md` and `ai_notes/ai/FIVETRAN_SQL_EVAL_DATASETS_FOR_DATAFACE.md` for full analysis.
Repo boundary
The only thing that comes from cto-research is the raw dataset JSONL. Everything else — the cleaning script, the cleaned artifact, the eval runner, the scorer, the dashboards — lives in Dataface.
- The prep script lives in `apps/evals/prep_benchmark.py`.
- It reads the raw JSONL from the cto-research checkout (path configurable, defaults to `~/Fivetran/cto-research/sql_eval_datasets/dbt_dataset/output/dataset.jsonl`).
- It writes the cleaned artifact to `apps/evals/data/` (checked in) so the eval runner doesn't depend on a cto-research checkout at runtime.
Unified eval directory
All eval code lives under `apps/evals/` with subdirectories per eval type (`sql/`, `catalog/`, `agent/`) plus `shared/` for cross-type utilities and `data/` for benchmark artifacts. See the leaderboard task for the full directory layout.
- If another team wants to compare against the same benchmark, they can use the same raw JSONL and the same scoring rubric (documented, not code-coupled). They don't need to import Dataface's runner.
AISQL strategy
Filter out AISQL rows (`uses_aisql=true`) entirely. Dataface doesn't use AISQL. AISQL rows can have their own cleaned artifact later if needed.
Gold SQL validity is part of the artifact contract
The cleaned benchmark should not blindly trust expected_sql. Every retained gold query should be validated as SQL before it becomes benchmark truth.
- First pass: parse every gold query with `sqlglot` using the BigQuery dialect. This is fast, deterministic, and easy to run in CI.
- Optional stronger pass: when BigQuery credentials are available, run a BigQuery dry run against the query. A dry run validates the query with BigQuery's own parser/planner without executing it.
- Record validation metadata in the cleaned artifact so downstream tasks can distinguish "kept but parse-only validated" from "BigQuery dry-run validated".
The benchmark-prep task owns this validation because bad gold SQL poisons every downstream score.
Canary subset
Produce both the full cleaned benchmark and a `canary.jsonl` subset (300–500 rows, stratified by complexity and category) for fast iteration and PR-level checks.
Dependencies
- No dependencies on other tasks. This is the first step in the eval chain.
- Downstream consumers: eval runner (task 3), catalog discovery evals (task 4), eval leaderboard dashboards.
- New dependency introduced here: `sqlglot`, for deterministic BigQuery-dialect parsing.
Possible Solutions
Recommended: Standalone Python script in `apps/evals/` — a simple script with `clean_dataset()` and `sample_canary()` functions, runnable as `python -m apps.evals.prep_benchmark`. No framework dependencies, no unnecessary abstractions. Produces deterministic output via sorted keys and a seeded RNG.
Alternative: Makefile/shell pipeline — rejected as less testable and harder to maintain.
Plan
- Create `apps/evals/prep_benchmark.py` with `clean_dataset()` and `sample_canary()` functions:
  - `clean_dataset()`: read raw JSONL, filter `uses_aisql=true` and empty `expected_sql`
  - `sample_canary()`: proportional stratified sample by (complexity, category), seeded RNG, exact target cap
- Write cleaned JSONL, canary JSONL, and metadata JSON to `apps/evals/data/`
- CLI with configurable `--raw`, `--out-dir`, `--canary-target`, `--seed`
- Tests in `tests/evals/test_prep_benchmark.py` (TDD — written first)
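The CLI surface described above could be wired up with stdlib `argparse` along these lines. The raw-path default comes from the task context; the other defaults and the `build_parser` name are assumptions for illustration.

```python
import argparse
from pathlib import Path

# Default raw path from the task context; remaining defaults are assumptions.
DEFAULT_RAW = Path.home() / "Fivetran/cto-research/sql_eval_datasets/dbt_dataset/output/dataset.jsonl"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="prep_benchmark",
        description="Clean the raw dbt dataset into benchmark artifacts.",
    )
    parser.add_argument("--raw", type=Path, default=DEFAULT_RAW,
                        help="Raw dataset JSONL in the cto-research checkout")
    parser.add_argument("--out-dir", type=Path, default=Path("apps/evals/data"),
                        help="Directory for cleaned, canary, and metadata outputs")
    parser.add_argument("--canary-target", type=int, default=400,
                        help="Target canary size (300-500 per the task)")
    parser.add_argument("--seed", type=int, default=42,
                        help="RNG seed for stratified sampling")
    return parser
```

Keeping the defaults in one parser makes `python -m apps.evals.prep_benchmark` reproducible out of the box while still letting CI override paths.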
Files created:
- apps/evals/__init__.py
- apps/evals/prep_benchmark.py
- apps/evals/data/dbt_sql_benchmark.jsonl (7,606 rows)
- apps/evals/data/dbt_sql_canary.jsonl (400 rows)
- apps/evals/data/dbt_sql_benchmark_meta.json
- tests/evals/__init__.py
- tests/evals/test_prep_benchmark.py
Implementation Progress
Cleaning logic
- Filters: `uses_aisql == true` → removed (2,471 rows); empty/whitespace `expected_sql` → removed (10 rows)
- Result: 7,606 cleaned rows across 71 schemas, 4 complexity levels, 11 categories
- No AISQL-and-empty overlap (0 rows), so total filtered = 2,481
Canary sampling
- Proportional stratified by (complexity, category) with `random.Random(seed=42)`
- Preserves broad stratum coverage while honoring the exact target size; output sorted by id for stability
- Result: 400 rows covering all 4 complexities and 11 categories across 69 schemas
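A minimal sketch of that sampling scheme, under the stated assumptions (proportional allocation with a per-stratum minimum of one, exact target cap, id-sorted output); the real allocation details may differ.

```python
import random
from collections import defaultdict


def sample_canary(rows: list[dict], target: int = 400, seed: int = 42) -> list[dict]:
    """Proportional stratified sample by (complexity, category), capped at target."""
    rng = random.Random(seed)
    strata: dict[tuple, list[dict]] = defaultdict(list)
    for row in rows:
        strata[(row["complexity"], row["category"])].append(row)
    picked: list[dict] = []
    for key in sorted(strata):  # sorted stratum order keeps sampling deterministic
        bucket = strata[key]
        share = max(1, round(target * len(bucket) / len(rows)))  # at least 1 per stratum
        picked.extend(rng.sample(bucket, min(share, len(bucket))))
    if len(picked) > target:  # rounding/minimums can overshoot; trim to the exact cap
        picked = rng.sample(picked, target)
    return sorted(picked, key=lambda r: r["id"])  # sorted by id for stable diffs
```

The final trim is what keeps many tiny strata from overshooting the configured target, the bug flagged in review.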
Metadata
`dbt_sql_benchmark_meta.json` captures source path, row counts, seed, and filter descriptions.
Tests (TDD)
- 7 tests written before implementation and review follow-up; all pass:
  - `test_clean_filters_aisql_and_empty_sql` — verifies filtering
  - `test_clean_preserves_all_fields` — verifies field preservation
  - `test_clean_raises_on_missing_file` — verifies error on bad path
  - `test_canary_size_within_bounds` — verifies 300–500 target
  - `test_canary_is_stratified` — verifies all strata represented
  - `test_canary_deterministic` — verifies seed reproducibility
  - `test_canary_respects_target_with_many_small_strata` — verifies exact target cap
Validation commands run
```shell
uv run pytest tests/evals/test_prep_benchmark.py -v   # 7 passed
just ci                                               # ✅ All CI checks passed
python -m apps.evals.prep_benchmark                   # Produced artifacts
# Verified: 7606 cleaned, 400 canary, 0 AISQL, 0 empty SQL
```
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A for browser QA. Validation: run the prep script, verify row counts, confirm no empty expected_sql or AISQL rows remain, and spot-check that validation metadata is populated from sqlglot parsing (and BigQuery dry run when available).
Review Feedback
- Fixed canary allocation so many tiny strata cannot overshoot the configured target.
- Hardened cleaning to skip `expected_sql: null` rows instead of crashing the prep step.
- Corrected the task record to point at `apps/evals/prep_benchmark.py` and the regenerated 400-row canary artifact.
- Review executed via a filtered equivalent of `just review` against the non-artifact diff after the final fixes.
- Review verdict: APPROVED, with only non-blocking notes about artifact strategy and optional test coverage.
- [x] Review cleared