Dataface Tasks

Create cleaned dbt SQL benchmark artifact

ID: MCP_ANALYST_AGENT-CREATE_CLEANED_DBT_SQL_BENCHMARK_ARTIFACT
Status: completed
Priority: p1
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Completed by: dave
Completed: 2026-03-18

Problem

Create a reproducible benchmark-prep step that imports the raw dbt dataset from cto-research, filters out AISQL rows, removes empty expected_sql rows, validates the remaining gold SQL, captures source metadata, and writes a stable cleaned JSONL artifact for Dataface text-to-SQL evals.

Context

Source dataset

  • Source: ~/Fivetran/cto-research/sql_eval_datasets/dbt_dataset/output/dataset.jsonl (confirmed — the fivetran_dbt_eval_dataset path does not exist).
  • 10,087 questions across 72 schemas: 7,616 standard SQL, 2,471 AISQL.
  • Record fields: id, schema, text, expected_sql, complexity, category, tables_used, uses_aisql, notes, eval_criteria.
  • See ai_notes/ai/FIVETRAN_SQL_EVAL_DATASETS_INVENTORY.md and ai_notes/ai/FIVETRAN_SQL_EVAL_DATASETS_FOR_DATAFACE.md for full analysis.

Repo boundary

The only thing that comes from cto-research is the raw dataset JSONL. Everything else — the cleaning script, the cleaned artifact, the eval runner, the scorer, the dashboards — lives in Dataface.

  • The prep script lives in apps/evals/prep_benchmark.py.
  • It reads the raw JSONL from the cto-research checkout (path configurable, defaults to ~/Fivetran/cto-research/sql_eval_datasets/dbt_dataset/output/dataset.jsonl).
  • It writes the cleaned artifact to apps/evals/data/ (checked in) so the eval runner doesn't depend on a cto-research checkout at runtime.

Unified eval directory

All eval code lives under apps/evals/ with subdirectories per eval type (sql/, catalog/, agent/) plus shared/ for cross-type utilities and data/ for benchmark artifacts. See the leaderboard task for the full directory layout.

  • If another team wants to compare against the same benchmark, they can use the same raw JSONL and the same scoring rubric (documented, not code-coupled). They don't need to import Dataface's runner.

AISQL strategy

Filter out AISQL rows (uses_aisql=true) entirely. Dataface doesn't use AISQL. AISQL rows can have their own cleaned artifact later if needed.

Gold SQL validity is part of the artifact contract

The cleaned benchmark should not blindly trust expected_sql. Every retained gold query should be validated as SQL before it becomes benchmark truth.

  • First pass: parse every gold query with sqlglot using the BigQuery dialect. This is fast, deterministic, and easy to run in CI.
  • Optional stronger pass: when BigQuery credentials are available, run a BigQuery dry run against the query. Dry run validates the query with BigQuery's own parser/planner without executing it.
  • Record validation metadata in the cleaned artifact so downstream tasks can distinguish "kept but parse-only validated" from "BigQuery dry-run validated".

The benchmark-prep task owns this validation because bad gold SQL poisons every downstream score.

Canary subset

Produce both the full cleaned benchmark and a canary.jsonl subset (300-500 rows stratified by complexity and category) for fast iteration and PR-level checks.

Dependencies

  • No dependencies on other tasks. This is the first step in the eval chain.
  • Downstream consumers: eval runner (task 3), catalog discovery evals (task 4), eval leaderboard dashboards.
  • New dependency introduced here: sqlglot for deterministic BigQuery-dialect parsing.

Possible Solutions

Recommended: Standalone Python script in apps/evals/ — Simple script with clean_dataset() and sample_canary() functions, runnable as python -m apps.evals.prep_benchmark. No framework dependencies, no unnecessary abstractions. Produces deterministic output via sorted keys and seeded RNG.
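The CLI surface implied by the plan might be sketched like this (build_parser, the help strings, and the defaults are illustrative; only the flag names come from the task):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI sketch for `python -m apps.evals.prep_benchmark`."""
    parser = argparse.ArgumentParser(prog="prep_benchmark")
    parser.add_argument(
        "--raw",
        default="~/Fivetran/cto-research/sql_eval_datasets/dbt_dataset/output/dataset.jsonl",
        help="Path to the raw dbt dataset JSONL in the cto-research checkout",
    )
    parser.add_argument(
        "--out-dir",
        default="apps/evals/data",
        help="Directory for the cleaned, canary, and metadata files",
    )
    parser.add_argument(
        "--canary-target",
        type=int,
        default=400,
        help="Exact canary size (within the 300-500 band)",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=42,
        help="RNG seed for deterministic stratified sampling",
    )
    return parser
```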

Alternative: Makefile/shell pipeline — rejected as less testable and harder to maintain.

Plan

  1. Create apps/evals/prep_benchmark.py with clean_dataset() and sample_canary() functions
  2. clean_dataset(): read raw JSONL, filter uses_aisql=true and empty expected_sql
  3. sample_canary(): proportional stratified sample by (complexity, category), seeded RNG, exact target cap
  4. Write cleaned JSONL, canary JSONL, and metadata JSON to apps/evals/data/
  5. CLI with configurable --raw, --out-dir, --canary-target, --seed
  6. Tests in tests/evals/test_prep_benchmark.py (TDD — written first)

Files created:

  • apps/evals/__init__.py
  • apps/evals/prep_benchmark.py
  • apps/evals/data/dbt_sql_benchmark.jsonl (7,606 rows)
  • apps/evals/data/dbt_sql_canary.jsonl (400 rows)
  • apps/evals/data/dbt_sql_benchmark_meta.json
  • tests/evals/__init__.py
  • tests/evals/test_prep_benchmark.py

Implementation Progress

Cleaning logic

  • Filters: uses_aisql == true → removed (2,471 rows), empty/whitespace expected_sql → removed (10 rows)
  • Result: 7,606 cleaned rows across 71 schemas, 4 complexity levels, 11 categories
  • No AISQL-and-empty overlap (0 rows), so total filtered = 2,481

Canary sampling

  • Proportional stratified by (complexity, category) with random.Random(seed=42)
  • Preserves broad stratum coverage while honoring the exact target size, sorted by id for stability
  • Result: 400 rows covering all 4 complexities and 11 categories across 69 schemas
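The allocation above can be sketched as follows. This is a sketch under stated assumptions: sample_canary is illustrative, and the real script's top-up/trim rules for hitting the exact target may differ:

```python
import random
from collections import defaultdict


def sample_canary(rows: list[dict], target: int = 400, seed: int = 42) -> list[dict]:
    """Proportional stratified sample by (complexity, category).

    Each stratum gets a share proportional to its size (at least one row,
    for broad coverage); overshoot is trimmed deterministically, and the
    result is sorted by id for stable output.
    """
    rng = random.Random(seed)
    strata: dict[tuple, list[dict]] = defaultdict(list)
    for row in rows:
        strata[(row["complexity"], row["category"])].append(row)

    total = len(rows)
    picked: list[dict] = []
    for key in sorted(strata):  # sorted keys keep allocation deterministic
        bucket = strata[key]
        take = max(1, round(target * len(bucket) / total))
        picked.extend(rng.sample(bucket, min(take, len(bucket))))

    # Honor the exact target: trim any overshoot from many small strata.
    if len(picked) > target:
        picked = rng.sample(picked, target)
    return sorted(picked, key=lambda r: r["id"])
```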

Metadata

  • dbt_sql_benchmark_meta.json captures source path, row counts, seed, and filter descriptions
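Based on the bullet above, the metadata file plausibly looks something like this (field names and layout are illustrative, not the actual schema):

```json
{
  "source_path": "~/Fivetran/cto-research/sql_eval_datasets/dbt_dataset/output/dataset.jsonl",
  "rows_raw": 10087,
  "rows_cleaned": 7606,
  "rows_canary": 400,
  "seed": 42,
  "filters": {
    "uses_aisql": 2471,
    "empty_expected_sql": 10
  }
}
```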

Tests (TDD)

  • 7 tests, written ahead of the implementation and the review follow-up; all pass:
  • test_clean_filters_aisql_and_empty_sql — verifies filtering
  • test_clean_preserves_all_fields — verifies field preservation
  • test_clean_raises_on_missing_file — verifies error on bad path
  • test_canary_size_within_bounds — verifies 300–500 target
  • test_canary_is_stratified — verifies all strata represented
  • test_canary_deterministic — verifies seed reproducibility
  • test_canary_respects_target_with_many_small_strata — verifies exact target cap
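test_canary_deterministic rests on Python's seeded RNG being fully reproducible; a self-contained illustration of that property (seeded_sample is a stand-in, not one of the tests above):

```python
import random


def seeded_sample(seed: int) -> list[int]:
    """Draw a fixed-size sample from the same population with a given seed."""
    return random.Random(seed).sample(range(1000), 25)


# The same seed always yields the same sample, so the canary artifact is
# stable across runs and machines (for the same Python version).
assert seeded_sample(42) == seeded_sample(42)
```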

Validation commands run

uv run pytest tests/evals/test_prep_benchmark.py -v  # 7 passed
just ci                                               # ✅ All CI checks passed
python -m apps.evals.prep_benchmark                   # Produced artifacts
# Verified: 7606 cleaned, 400 canary, 0 AISQL, 0 empty SQL

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A for browser QA. Validation: run the prep script, verify row counts, confirm no empty expected_sql or AISQL rows remain, and spot-check that validation metadata is populated from sqlglot parsing (and BigQuery dry run when available).

Review Feedback

  • Fixed canary allocation so many tiny strata cannot overshoot the configured target.
  • Hardened cleaning to skip expected_sql: null rows instead of crashing the prep step.
  • Corrected the task record to point at apps/evals/prep_benchmark.py and the regenerated 400-row canary artifact.
  • Review executed via a filtered equivalent of just review, run against the non-artifact diff after the final fixes.
  • Review verdict: APPROVED with only non-blocking notes about artifact strategy and optional test coverage.
  • [x] Review cleared