Set up eval leaderboard dft project and dashboards
Problem
Create a dft project inside the eval output directory with dashboard faces that visualize eval results as a leaderboard. Compare models, prompt versions, and context configurations side-by-side. Serve via dft serve so results are browsable locally after any eval run. This is the concrete deliverable for task 5 (persist eval outputs) — not a persistence layer, but actual Dataface dashboards over eval JSONL.
Context
What this replaces
Task 5 (persist eval outputs) was originally a separate task framed as a persistence layer. That task is now cancelled and merged here. The "persistence" is trivial — JSONL files on disk. The real deliverable is dashboards.
Unified eval directory structure
All eval code lives under apps/evals/ — one home for all eval types, one CLI, one leaderboard:
apps/evals/
├── __main__.py # unified CLI: python -m apps.evals {sql,catalog,agent}
├── dataface.yml # DuckDB source over output/ JSONL
├── faces/ # leaderboard dashboards (all eval types)
│ ├── overview.yml # cross-type summary (pass rates per eval type)
│ ├── sql-leaderboard.yml
│ ├── catalog-leaderboard.yml
│ ├── agent-leaderboard.yml
│ └── failure-analysis.yml
├── data/ # benchmark artifacts (checked in)
│ ├── benchmark.jsonl # cleaned dbt SQL benchmark
│ └── canary.jsonl # stratified subset for fast runs
├── output/ # eval results (gitignored)
│ ├── sql/ # text-to-SQL results
│ ├── catalog/ # catalog discovery results
│ └── agent/ # agent eval results (screenshots, scores)
├── sql/ # text-to-SQL eval code
│ ├── runner.py
│ ├── scorer.py
│ ├── backends.py # factory functions (make_raw_llm_backend, etc.)
│ └── types.py # BenchmarkCase, GenerationResult, GenerateFn
├── catalog/ # catalog discovery eval code
│ ├── runner.py
│ ├── scorer.py # IR metrics (recall@k, MRR)
│ └── prep.py # extract expected tables from gold SQL
├── agent/ # agent/dashboard eval code (migrated from apps/a_lie/)
│ ├── runner.py # generate dashboard from prompt
│ ├── screenshotter.py # capture rendered output
│ ├── reviewer.py # vision LLM scoring
│ ├── rubric.md
│ └── prompts/ # curated eval prompts
└── shared/ # shared across eval types
├── prep.py # benchmark cleaning script
└── reporting.py # breakdown/aggregation helpers
The agent eval code (runner.py, screenshotter.py, reviewer.py, rubric.md) migrates from apps/a_lie/ — the A lIe app stays where it is as the demo app, but the eval infrastructure moves to its proper home in apps/evals/agent/.
Data flow
- Eval runners write results to apps/evals/output/{sql,catalog,agent}/ (gitignored).
- apps/evals/dataface.yml declares a DuckDB source. Dashboard queries use read_json_auto() to read JSONL from the output subdirs. This should work the same way file-backed CSV workflows already do. If a small adapter/helper gap shows up around JSONL, close it inside this task rather than treating it as a separate blocker.
- Dashboard faces in apps/evals/faces/ query across all eval types. dft serve from apps/evals/ renders the unified leaderboard.
If eval results need to be shared beyond the local machine, that's M2 scope.
Unified CLI
python -m apps.evals sql --backend raw_llm --model gpt-4o ...
python -m apps.evals catalog --limit 100 ...
python -m apps.evals agent --prompts apps/evals/agent/prompts/ ...
python -m apps.evals agent --prompt "show me revenue by region" ...
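The dispatcher behind those invocations could be a single argparse parser with one subcommand per eval type. This is a hypothetical sketch of apps/evals/__main__.py; flag names are taken from the usage lines above, defaults are invented for illustration.

```python
# Hypothetical shape of the unified CLI: one subparser per eval type.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="python -m apps.evals")
    sub = parser.add_subparsers(dest="eval_type", required=True)

    sql = sub.add_parser("sql", help="text-to-SQL eval")
    sql.add_argument("--backend")
    sql.add_argument("--model")

    catalog = sub.add_parser("catalog", help="catalog discovery eval")
    catalog.add_argument("--limit", type=int)

    agent = sub.add_parser("agent", help="agent/dashboard eval")
    agent.add_argument("--prompts", help="directory of curated prompts")
    agent.add_argument("--prompt", help="single ad-hoc prompt")
    return parser

args = build_parser().parse_args(
    ["sql", "--backend", "raw_llm", "--model", "gpt-4o"]
)
print(args.eval_type, args.backend)  # sql raw_llm
```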
Reporting dimensions
SQL evals: backend, backend_metadata (model, provider, context level), schema, complexity, category.
Catalog evals: retrieval method, k value, schema.
Agent evals: model, prompt, overall/narrative/yaml/visual scores.
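A breakdown helper in the spirit of shared/reporting.py could group results by any one of these dimensions. A sketch, assuming a flat record shape with a boolean passed field (the real schema comes from task 3):

```python
# Hypothetical reporting helper: pass rate grouped by one dimension.
from collections import defaultdict

def pass_rate_by(records, dimension):
    """Group eval records by a reporting dimension and compute pass rates."""
    totals = defaultdict(lambda: [0, 0])  # value -> [passed_count, total]
    for rec in records:
        bucket = totals[rec[dimension]]
        bucket[0] += bool(rec["passed"])
        bucket[1] += 1
    return {key: passed / total for key, (passed, total) in totals.items()}

records = [
    {"backend": "raw_llm", "schema": "ecommerce", "passed": True},
    {"backend": "raw_llm", "schema": "ecommerce", "passed": False},
    {"backend": "agentic", "schema": "ecommerce", "passed": True},
]
print(pass_rate_by(records, "backend"))  # {'raw_llm': 0.5, 'agentic': 1.0}
```

The same helper covers catalog and agent dimensions as long as each record carries a comparable pass/score field.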
Dashboard ideas
- Overview — cross-type summary: SQL pass rate, catalog recall@10, agent average score. One page to see overall AI quality.
- SQL leaderboard — pass rate by backend/model, sortable by dimension.
- SQL failure analysis — table of failed cases with question, gold SQL, generated SQL, failure reason.
- Catalog leaderboard — recall@k and MRR by retrieval method.
- Agent leaderboard — overall score by model, with per-prompt drill-down.
- Context impact — lift per context level across SQL and agent evals.
- Schema difficulty — which schemas are hardest? Pass rate by schema, sorted.
Dependencies
- Depends on task 3 (SQL eval runner) for the SQL output schema.
- Does not need to wait on the analytics repo bootstrap flow. apps/evals/ can be created directly in this repo; the analytics repo remains the canonical dft init proving ground.
- DuckDB/file-backed querying already exists in Dataface. If JSONL needs one small missing piece, add it here.
- Agent eval dashboards depend on the A lIe eval migration (M2 agent eval task).
Possible Solutions
- Recommended: put a small DuckDB project config in apps/evals/dataface.yml, keep the SQL as raw read_json_auto() queries over apps/evals/output/sql/**, and factor shared leaderboard queries into an underscore-prefixed query file imported by the visible faces.
  - Pros: no adapter work, one unified eval project, easy to add new faces, and the SQL stays close to the output schema.
  - Cons: face validation has to tolerate a query-only helper file, and the dashboards only cover SQL until the catalog/agent outputs exist.
- Flatten everything into standalone face files with duplicated SQL.
  - Pros: simplest validation story.
  - Cons: repeated query logic and harder to keep leaderboard dimensions consistent.
- Build a new helper layer under apps/evals/ that materializes DuckDB views from JSONL before Dataface renders.
  - Pros: stronger separation between raw JSONL and dashboards.
  - Cons: unnecessary extra code for this task and violates the "minimal support changes" goal.
Plan
- Create apps/evals/dataface.yml with an in-memory DuckDB source as the project default.
- Add a shared underscore-prefixed SQL query file under apps/evals/faces/ that reads the SQL run summaries and results JSONL via read_json_auto().
- Add the visible leaderboard faces for overview, SQL leaderboard, SQL failure analysis, context impact, and schema difficulty.
- Add focused tests under tests/evals/ that validate the faces and render them against sample JSONL output in a temp project root.
- Run just task validate, the focused eval tests, and just ci.
Implementation Progress
- Started from a clean branch on codex/set-up-eval-leaderboard-dft-project-and-dashboards.
- Added apps/evals/dataface.yml with a DuckDB :memory: source and board defaults.
- Added shared SQL run queries in apps/evals/faces/_sql_eval_queries.yml using read_json_auto('output/sql/*/summary.json') and read_json_auto('output/sql/*/results.jsonl').
- Added faces for overview, SQL leaderboard, SQL failure analysis, context impact, and schema difficulty.
- Added tests/evals/test_leaderboard_dft.py to validate the face directory and smoke-render the dashboards against sample JSONL output in a temp project root.
- Scoped DuckDB external access through an explicit allowlist (enable_external_access) so eval dashboards can read local JSONL without reopening arbitrary config knobs.
- Loaded project sources from dataface.yml so the eval project source definition is visible to both compile and execute paths.
- Restored read-only sandbox tests alongside the eval opt-in path to preserve the default security model.
- Verified the dashboard project with a focused eval render test and the full repo just ci gate.
QA Exploration
- [x] QA exploration completed — verify dashboards render via dft serve with sample eval output
Review Feedback
- just review passed after the DuckDB config allowlist, project-source loading, and sandbox test adjustments.
- Review sign-off: approved.
- [x] Review cleared