Set up eval leaderboard dft project and dashboards
Problem
Create a dft project inside the eval output directory with dashboard faces that visualize eval results as a leaderboard. Compare models, prompt versions, and context configurations side-by-side. Serve via dft serve so results are browsable locally after any eval run. This is the concrete deliverable for task 5 (persist eval outputs) — not a persistence layer, but actual Dataface dashboards over eval JSONL.
Context
What this replaces
Task 5 (persist eval outputs) was originally a separate task framed as a persistence layer. That task is now cancelled and merged here. The "persistence" is trivial — JSONL files on disk. The real deliverable is dashboards.
Unified eval directory structure
All eval code lives under apps/evals/ — one home for all eval types, one CLI, one leaderboard:
apps/evals/
├── __main__.py # unified CLI: python -m apps.evals {sql,catalog,agent}
├── dataface.yml # DuckDB source over output/ JSONL
├── faces/ # leaderboard dashboards (all eval types)
│ ├── overview.yml # cross-type summary (pass rates per eval type)
│ ├── sql-leaderboard.yml
│ ├── catalog-leaderboard.yml
│ ├── agent-leaderboard.yml
│ └── failure-analysis.yml
├── data/ # benchmark artifacts (checked in)
│ ├── benchmark.jsonl # cleaned dbt SQL benchmark
│ └── canary.jsonl # stratified subset for fast runs
├── output/ # eval results (gitignored)
│ ├── sql/ # text-to-SQL results
│ ├── catalog/ # catalog discovery results
│ └── agent/ # agent eval results (screenshots, scores)
├── sql/ # text-to-SQL eval code
│ ├── runner.py
│ ├── scorer.py
│ ├── backends.py # factory functions (make_raw_llm_backend, etc.)
│ └── types.py # BenchmarkCase, GenerationResult, GenerateFn
├── catalog/ # catalog discovery eval code
│ ├── runner.py
│ ├── scorer.py # IR metrics (recall@k, MRR)
│ └── prep.py # extract expected tables from gold SQL
├── agent/ # agent/dashboard eval code (migrated from apps/a_lie/)
│ ├── runner.py # generate dashboard from prompt
│ ├── screenshotter.py # capture rendered output
│ ├── reviewer.py # vision LLM scoring
│ ├── rubric.md
│ └── prompts/ # curated eval prompts
└── shared/ # shared across eval types
├── prep.py # benchmark cleaning script
└── reporting.py # breakdown/aggregation helpers
The agent eval code (runner.py, screenshotter.py, reviewer.py, rubric.md) migrates from apps/a_lie/ — the A lIe app stays where it is as the demo app, but the eval infrastructure moves to its proper home in apps/evals/agent/.
Data flow
- Eval runners write results to apps/evals/output/{sql,catalog,agent}/ (gitignored).
- apps/evals/dataface.yml declares a DuckDB source. Dashboard queries use read_json_auto() to read JSONL from the output subdirs. This should work the same way file-backed CSV workflows already do. If a small adapter/helper gap shows up around JSONL, close it inside this task rather than treating it as a separate blocker.
- Dashboard faces in apps/evals/faces/ query across all eval types. dft serve from apps/evals/ renders the unified leaderboard.
If eval results need to be shared beyond the local machine, that's M2 scope.
Unified CLI
python -m apps.evals sql --backend raw_llm --model gpt-4o ...
python -m apps.evals catalog --limit 100 ...
python -m apps.evals agent --prompts apps/evals/agent/prompts/ ...
python -m apps.evals agent --prompt "show me revenue by region" ...
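The dispatcher behind those invocations could be a single argparse parser with one subcommand per eval type. This is a hypothetical sketch of apps/evals/__main__.py; flag names are taken from the usage lines above, defaults are invented for illustration.

```python
# Hypothetical shape of the unified CLI: one subparser per eval type.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="python -m apps.evals")
    sub = parser.add_subparsers(dest="eval_type", required=True)

    sql = sub.add_parser("sql", help="text-to-SQL eval")
    sql.add_argument("--backend")
    sql.add_argument("--model")

    catalog = sub.add_parser("catalog", help="catalog discovery eval")
    catalog.add_argument("--limit", type=int)

    agent = sub.add_parser("agent", help="agent/dashboard eval")
    agent.add_argument("--prompts", help="directory of curated prompts")
    agent.add_argument("--prompt", help="single ad-hoc prompt")
    return parser

args = build_parser().parse_args(
    ["sql", "--backend", "raw_llm", "--model", "gpt-4o"]
)
print(args.eval_type, args.backend)  # sql raw_llm
```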
Reporting dimensions
SQL evals: backend, backend_metadata (model, provider, context level), schema, complexity, category.
Catalog evals: retrieval method, k value, schema.
Agent evals: model, prompt, overall/narrative/yaml/visual scores.
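A breakdown helper in the spirit of shared/reporting.py could group results by any one of these dimensions. A sketch, assuming a flat record shape with a boolean passed field (the real schema comes from task 3):

```python
# Hypothetical reporting helper: pass rate grouped by one dimension.
from collections import defaultdict

def pass_rate_by(records, dimension):
    """Group eval records by a reporting dimension and compute pass rates."""
    totals = defaultdict(lambda: [0, 0])  # value -> [passed_count, total]
    for rec in records:
        bucket = totals[rec[dimension]]
        bucket[0] += bool(rec["passed"])
        bucket[1] += 1
    return {key: passed / total for key, (passed, total) in totals.items()}

records = [
    {"backend": "raw_llm", "schema": "ecommerce", "passed": True},
    {"backend": "raw_llm", "schema": "ecommerce", "passed": False},
    {"backend": "agentic", "schema": "ecommerce", "passed": True},
]
print(pass_rate_by(records, "backend"))  # {'raw_llm': 0.5, 'agentic': 1.0}
```

The same helper covers catalog and agent dimensions as long as each record carries a comparable pass/score field.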
Dashboard ideas
- Overview — cross-type summary: SQL pass rate, catalog recall@10, agent average score. One page to see overall AI quality.
- SQL leaderboard — pass rate by backend/model, sortable by dimension.
- SQL failure analysis — table of failed cases with question, gold SQL, generated SQL, failure reason.
- Catalog leaderboard — recall@k and MRR by retrieval method.
- Agent leaderboard — overall score by model, with per-prompt drill-down.
- Context impact — lift per context level across SQL and agent evals.
- Schema difficulty — which schemas are hardest? Pass rate by schema, sorted.
Dependencies
- Depends on task 3 (SQL eval runner) for the SQL output schema.
- Does not need to wait on the analytics repo bootstrap flow. apps/evals/ can be created directly in this repo; the analytics repo remains the canonical dft init proving ground.
- DuckDB/file-backed querying already exists in Dataface. If JSONL needs one small missing piece, add it here.
- Agent eval dashboards depend on the A lIe eval migration (M2 agent eval task).
Possible Solutions
- Recommended: put a small DuckDB project config in apps/evals/dataface.yml, keep the SQL as raw read_json_auto() queries over apps/evals/output/sql/**, and factor shared leaderboard queries into an underscore-prefixed query file imported by the visible faces.
  - Pros: no adapter work, one unified eval project, easy to add new faces, and the SQL stays close to the output schema.
  - Cons: face validation has to tolerate a query-only helper file, and the dashboards only cover SQL until the catalog/agent outputs exist.
- Flatten everything into standalone face files with duplicated SQL.
  - Pros: simplest validation story.
  - Cons: repeated query logic and harder to keep leaderboard dimensions consistent.
- Build a new helper layer under apps/evals/ that materializes DuckDB views from JSONL before Dataface renders.
  - Pros: stronger separation between raw JSONL and dashboards.
  - Cons: unnecessary extra code for this task and violates the "minimal support changes" goal.
Plan
- Create apps/evals/dataface.yml with an in-memory DuckDB source as the project default.
- Add a shared underscore-prefixed SQL query file under apps/evals/faces/ that reads the SQL run summaries and results JSONL via read_json_auto().
- Add the visible leaderboard faces for overview, SQL leaderboard, SQL failure analysis, context impact, and schema difficulty.
- Add focused tests under tests/evals/ that validate the faces and render them against sample JSONL output in a temp project root.
- Run just task validate, the focused eval tests, and just ci.
Implementation Progress
- Started from a clean branch on codex/set-up-eval-leaderboard-dft-project-and-dashboards.
- Added apps/evals/dataface.yml with a DuckDB :memory: source and board defaults.
- Added shared SQL run queries in apps/evals/faces/_sql_eval_queries.yml using read_json_auto('output/sql/*/summary.json') and read_json_auto('output/sql/*/results.jsonl').
- Added faces for overview, SQL leaderboard, SQL failure analysis, context impact, and schema difficulty.
- Added tests/evals/test_leaderboard_dft.py to validate the face directory and smoke-render the dashboards against sample JSONL output in a temp project root.
- Scoped DuckDB external access through an explicit allowlist (enable_external_access) so eval dashboards can read local JSONL without reopening arbitrary config knobs.
- Loaded project sources from dataface.yml so the eval project source definition is visible to both compile and execute paths.
- Restored read-only sandbox tests alongside the eval opt-in path to preserve the default security model.
- Verified the dashboard project with a focused eval render test and the full repo just ci gate.
QA Exploration
- [x] QA exploration completed — verify dashboards render via dft serve with sample eval output
Review Feedback
- just review passed after the DuckDB config allowlist, project-source loading, and sandbox test adjustments.
- Review sign-off: approved.
- [x] Review cleared