---
type: initiative
slug: benchmark-driven-text-to-sql-and-discovery-evals
title: Benchmark-Driven Text-to-SQL and Discovery Evals
workstream: mcp-analyst-agent
owner: data-ai-engineer-architect
status: planned
milestone: m2-internal-adoption-design-partners
---


# Benchmark-Driven Text-to-SQL and Discovery Evals

{{ initiative_progress_bar("mcp-analyst-agent", "benchmark-driven-text-to-sql-and-discovery-evals") }}

## Objective

Set up a Dataface-native eval system centered on the Fivetran dbt SQL dataset for text-to-SQL generation and catalog/search discovery. Start with a cleaned SQL-only benchmark, add deterministic scoring and agent runners, and emit structured JSON/result tables that Dataface boards can analyze.
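The "structured JSON/result tables" output might look like one JSON line per eval case. A minimal sketch, with illustrative field names (`case_id`, `eval_type`, `exact_match`, etc.) that are assumptions, not the actual result schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class EvalResult:
    """One row of structured eval output; field names are illustrative."""
    case_id: str       # benchmark case identifier
    eval_type: str     # "sql", "catalog", or "agent"
    score: float       # deterministic score in [0, 1]
    exact_match: bool  # e.g. normalized-SQL string equality
    latency_ms: int

def to_json_line(result: EvalResult) -> str:
    """Serialize one result as a JSON line for downstream board analysis."""
    return json.dumps(asdict(result), sort_keys=True)

row = EvalResult("case-0001", "sql", 1.0, True, 420)
print(to_json_line(row))
```

Emitting flat, sorted-key JSON lines keeps the output diff-friendly and trivially loadable into a result table for Dataface boards.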

## Milestone placement

This initiative is intentionally scoped to M2, not M1. M1 only needs a usable analyst workflow in the Fivetran analytics environment; it does not require text-to-SQL quality to be systematically benchmarked or hardened yet. The eval system becomes necessary in M2, when text-to-SQL and discovery quality must become reliable enough for repeated internal use and design-partner hardening.

## Code location

All eval code lives under `apps/evals/` with subdirectories per eval type (`sql/`, `catalog/`, `agent/`), shared utilities (`shared/`), benchmark data (`data/`), results (`output/`), and leaderboard dashboards (`faces/`). One CLI entry point: `python -m apps.evals {sql,catalog,agent}`. See the leaderboard task for the full directory layout.
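The single entry point with per-eval-type subcommands could be wired up with `argparse` subparsers. A sketch of what `apps/evals/__main__.py` might contain; the `--data-dir`/`--output-dir` flags and their defaults are assumptions, not the decided interface:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI for `python -m apps.evals {sql,catalog,agent}` (flags are hypothetical)."""
    parser = argparse.ArgumentParser(prog="apps.evals")
    sub = parser.add_subparsers(dest="eval_type", required=True)
    for name in ("sql", "catalog", "agent"):
        p = sub.add_parser(name, help=f"run the {name} eval suite")
        p.add_argument("--data-dir", default="apps/evals/data")
        p.add_argument("--output-dir", default="apps/evals/output")
    return parser

# Example invocation: python -m apps.evals sql --output-dir out
args = build_parser().parse_args(["sql", "--output-dir", "out"])
print(args.eval_type, args.output_dir)
```

Sharing one parser across eval types keeps the three runners discoverable from a single `--help` and makes common flags uniform.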

## Tasks

### Infrastructure (build the measurement system)

### Visualization

## Dependency graph

```
cleaned benchmark (task 2)
  ↓
extract generate_sql ─────→ eval runner (task 3)
                              ↓
              bounded non-one-shot backend
                              ↓
                   leaderboard dashboards
                              ↓
          [experimentation initiative uses these]
```

Catalog discovery evals (task 4) depend on the cleaned benchmark but use a separate scorer.
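For the separate catalog/search scorer, a standard deterministic retrieval metric such as recall@k would fit. A minimal sketch under that assumption; the actual scorer for task 4 is not specified here, and the table names in the example are made up:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant catalog objects that appear in the top-k results.

    Deterministic given fixed inputs, so repeated runs over the cleaned
    benchmark produce comparable scores. Returns 0.0 when no objects are
    marked relevant (a judgment call; NaN would also be defensible).
    """
    if not relevant:
        return 0.0
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / len(relevant)

# Hypothetical case: search for revenue tables; 1 of 2 relevant found in top 2.
print(recall_at_k(["orders", "customers", "payments"], {"orders", "refunds"}, 2))
```

Because the metric is pure set arithmetic over ranked IDs, it can share the benchmark's cleaned cases while remaining independent of the SQL scorer, as the dependency note above describes.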