Dataface Tasks

Set up eval leaderboard dft project and dashboards

ID: MCP_ANALYST_AGENT-SET_UP_EVAL_LEADERBOARD_DFT_PROJECT_AND_DASHBOARDS
Status: completed
Priority: p1
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: benchmark-driven-text-to-sql-and-discovery-evals
Completed by: dave
Completed: 2026-03-18

Problem

Create a dft project inside the eval output directory with dashboard faces that visualize eval results as a leaderboard. Compare models, prompt versions, and context configurations side-by-side. Serve via dft serve so results are browsable locally after any eval run. This is the concrete deliverable for task 5 (persist eval outputs) — not a persistence layer, but actual Dataface dashboards over eval JSONL.

Context

What this replaces

Task 5 (persist eval outputs) was originally a separate task framed as a persistence layer. That task is now cancelled and merged here. The "persistence" is trivial — JSONL files on disk. The real deliverable is dashboards.

Unified eval directory structure

All eval code lives under apps/evals/ — one home for all eval types, one CLI, one leaderboard:

apps/evals/
├── __main__.py           # unified CLI: python -m apps.evals {sql,catalog,agent}
├── dataface.yml          # DuckDB source over output/ JSONL
├── faces/                # leaderboard dashboards (all eval types)
│   ├── overview.yml      # cross-type summary (pass rates per eval type)
│   ├── sql-leaderboard.yml
│   ├── catalog-leaderboard.yml
│   ├── agent-leaderboard.yml
│   └── failure-analysis.yml
├── data/                 # benchmark artifacts (checked in)
│   ├── benchmark.jsonl   # cleaned dbt SQL benchmark
│   └── canary.jsonl      # stratified subset for fast runs
├── output/               # eval results (gitignored)
│   ├── sql/              # text-to-SQL results
│   ├── catalog/          # catalog discovery results
│   └── agent/            # agent eval results (screenshots, scores)
├── sql/                  # text-to-SQL eval code
│   ├── runner.py
│   ├── scorer.py
│   ├── backends.py       # factory functions (make_raw_llm_backend, etc.)
│   └── types.py          # BenchmarkCase, GenerationResult, GenerateFn
├── catalog/              # catalog discovery eval code
│   ├── runner.py
│   ├── scorer.py         # IR metrics (recall@k, MRR)
│   └── prep.py           # extract expected tables from gold SQL
├── agent/                # agent/dashboard eval code (migrated from apps/a_lie/)
│   ├── runner.py         # generate dashboard from prompt
│   ├── screenshotter.py  # capture rendered output
│   ├── reviewer.py       # vision LLM scoring
│   ├── rubric.md
│   └── prompts/          # curated eval prompts
└── shared/               # shared across eval types
    ├── prep.py           # benchmark cleaning script
    └── reporting.py      # breakdown/aggregation helpers

The agent eval code (runner.py, screenshotter.py, reviewer.py, rubric.md) migrates from apps/a_lie/ — the A lIe app stays where it is as the demo app, but the eval infrastructure moves to its proper home in apps/evals/agent/.
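The IR metrics listed for catalog/scorer.py in the tree above (recall@k, MRR) can be sketched in plain Python. This is a minimal sketch — the function names and signatures are illustrative, not the actual scorer API:

```python
def recall_at_k(expected, retrieved, k):
    """Fraction of expected tables that appear in the top-k retrieved list."""
    if not expected:
        return 0.0
    top_k = set(retrieved[:k])
    return sum(1 for table in expected if table in top_k) / len(expected)


def mrr(expected, retrieved):
    """Reciprocal rank of the first retrieved item that is an expected table."""
    for rank, table in enumerate(retrieved, start=1):
        if table in expected:
            return 1.0 / rank
    return 0.0
```

For example, with expected tables {orders, customers} and retrieval order [orders, payments, customers], recall@2 is 0.5 and MRR is 1.0 (first hit at rank 1).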

Data flow

  1. Eval runners write results to apps/evals/output/{sql,catalog,agent}/ (gitignored).
  2. apps/evals/dataface.yml declares a DuckDB source. Dashboard queries use read_json_auto() to read JSONL from the output subdirs. This should work the same way file-backed CSV workflows already do. If a small adapter/helper gap shows up around JSONL, close it inside this task rather than treating it as a separate blocker.
  3. Dashboard faces in apps/evals/faces/ query across all eval types.
  4. dft serve from apps/evals/ renders the unified leaderboard.

If eval results need to be shared beyond the local machine, that's M2 scope.

Unified CLI

python -m apps.evals sql --backend raw_llm --model gpt-4o ...
python -m apps.evals catalog --limit 100 ...
python -m apps.evals agent --prompts apps/evals/agent/prompts/ ...
python -m apps.evals agent --prompt "show me revenue by region" ...
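The unified entry point above could be wired with stdlib argparse subcommands. A sketch of what apps/evals/__main__.py might look like — only the flags shown in the examples above are included, and the structure is an assumption, not the shipped CLI:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="python -m apps.evals")
    sub = parser.add_subparsers(dest="eval_type", required=True)

    sql = sub.add_parser("sql", help="text-to-SQL eval")
    sql.add_argument("--backend", default="raw_llm")
    sql.add_argument("--model")

    catalog = sub.add_parser("catalog", help="catalog discovery eval")
    catalog.add_argument("--limit", type=int)

    agent = sub.add_parser("agent", help="agent/dashboard eval")
    agent.add_argument("--prompts")  # directory of curated eval prompts
    agent.add_argument("--prompt")   # single ad-hoc prompt

    return parser
```

Each subcommand would then dispatch to its runner module (sql/runner.py, catalog/runner.py, agent/runner.py).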

Reporting dimensions

  • SQL evals: backend, backend_metadata (model, provider, context level), schema, complexity, category.
  • Catalog evals: retrieval method, k value, schema.
  • Agent evals: model, prompt, overall/narrative/yaml/visual scores.
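A breakdown across these dimensions is a plain group-by over the result records. A minimal sketch of the kind of aggregation helper shared/reporting.py might provide — the field names are assumptions about the runner's output schema:

```python
from collections import defaultdict


def breakdown(records, dimension):
    """Pass rate per value of a reporting dimension, e.g. 'backend' or 'schema'."""
    groups = defaultdict(list)
    for r in records:
        groups[r.get(dimension, "unknown")].append(bool(r.get("passed")))
    return {
        value: sum(outcomes) / len(outcomes)
        for value, outcomes in sorted(groups.items())
    }
```

The dashboard faces express the same group-by in SQL; keeping a Python twin in shared/reporting.py makes the numbers checkable outside Dataface.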

Dashboard ideas

  • Overview — cross-type summary: SQL pass rate, catalog recall@10, agent average score. One page to see overall AI quality.
  • SQL leaderboard — pass rate by backend/model, sortable by dimension.
  • SQL failure analysis — table of failed cases with question, gold SQL, generated SQL, failure reason.
  • Catalog leaderboard — recall@k and MRR by retrieval method.
  • Agent leaderboard — overall score by model, with per-prompt drill-down.
  • Context impact — lift per context level across SQL and agent evals.
  • Schema difficulty — which schemas are hardest? Pass rate by schema, sorted.

Dependencies

  • Depends on task 3 (SQL eval runner) for the SQL output schema.
  • Does not need to wait on the analytics repo bootstrap flow. apps/evals/ can be created directly in this repo; the analytics repo remains the canonical dft init proving ground.
  • DuckDB/file-backed querying already exists in Dataface. If JSONL needs one small missing piece, add it here.
  • Agent eval dashboards depend on the A lIe eval migration (M2 agent eval task).

Possible Solutions

  1. Recommended - Put a small DuckDB project config in apps/evals/dataface.yml, keep the SQL as raw read_json_auto() queries over apps/evals/output/sql/**, and factor shared leaderboard queries into an underscore-prefixed query file imported by the visible faces. - Pros: no adapter work, one unified eval project, easy to add new faces, and the SQL stays close to the output schema. - Cons: face validation has to tolerate a query-only helper file and the dashboards only cover SQL until the catalog/agent outputs exist.
  2. Flatten everything into standalone face files with duplicated SQL. - Pros: simplest validation story. - Cons: repeated query logic and harder to keep leaderboard dimensions consistent.
  3. Build a new helper layer under apps/evals/ that materializes DuckDB views from JSONL before Dataface renders. - Pros: stronger separation between raw JSONL and dashboards. - Cons: unnecessary extra code for this task and violates the "minimal support changes" goal.

Plan

  1. Create apps/evals/dataface.yml with an in-memory DuckDB source as the project default.
  2. Add a shared underscore-prefixed SQL query file under apps/evals/faces/ that reads the SQL run summaries and results JSONL via read_json_auto().
  3. Add the visible leaderboard faces for overview, SQL leaderboard, SQL failure analysis, context impact, and schema difficulty.
  4. Add focused tests under tests/evals/ that validate the faces and render them against sample JSONL output in a temp project root.
  5. Run just task validate, the focused eval tests, and just ci.

Implementation Progress

  • Started from a clean branch on codex/set-up-eval-leaderboard-dft-project-and-dashboards.
  • Added apps/evals/dataface.yml with a DuckDB :memory: source and board defaults.
  • Added shared SQL run queries in apps/evals/faces/_sql_eval_queries.yml using read_json_auto('output/sql/*/summary.json') and read_json_auto('output/sql/*/results.jsonl').
  • Added faces for overview, SQL leaderboard, SQL failure analysis, context impact, and schema difficulty.
  • Added tests/evals/test_leaderboard_dft.py to validate the face directory and smoke-render the dashboards against sample JSONL output in a temp project root.
  • Scoped DuckDB external access through an explicit allowlist (enable_external_access) so eval dashboards can read local JSONL without reopening arbitrary config knobs.
  • Loaded project sources from dataface.yml so the eval project source definition is visible to both compile and execute paths.
  • Restored read-only sandbox tests alongside the eval opt-in path to preserve the default security model.
  • Verified the dashboard project with a focused eval render test and the full repo just ci gate.

QA Exploration

  • [x] QA exploration completed — verify dashboards render via dft serve with sample eval output

Review Feedback

  • just review passed after the DuckDB config allowlist, project-source loading, and sandbox test adjustments.
  • Review sign-off: approved.

  • [x] Review cleared