Evals

The eval system measures how well Dataface's AI agents perform on real tasks. Each eval family targets a specific agent capability — text-to-SQL generation, catalog discovery, dashboard authoring — and scores agent outputs against a curated benchmark of expected results.

A run is a single execution of an eval family with a specific backend configuration (model, provider, context level). Runs produce scored results that feed the dashboards below, letting you compare models, context strategies, and code changes over time.
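To make the notion of a run's backend configuration concrete, here is a minimal sketch of how it could be modeled. The class name RunConfig, its field names, and the helper differing_fields are hypothetical illustrations, not Dataface's actual schema; only the three configuration dimensions (model, provider, context level) come from the description above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical description of one run's configuration (names assumed)."""

    family: str         # eval family, e.g. a text-to-SQL family
    model: str          # model identifier
    provider: str       # provider serving the model
    context_level: str  # how much context the agent receives


def differing_fields(a, b):
    """Return the sorted names of the fields on which two configs differ."""
    left, right = asdict(a), asdict(b)
    return sorted(name for name in left if left[name] != right[name])
```

Two runs that differ in exactly one field give a clean comparison: any score difference can be attributed to that one dimension rather than to a mix of changes.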

Dashboards & Artifacts

Overview

Top-line eval dashboard

SQL Leaderboard

Compare SQL runs

Artifacts

Raw run files and directories

Looker Migrate

Looker migration progress and comparison (start it with: just looker-migrate serve)

Latest runs

No eval run artifacts found under apps/evals/runs/.

Placement

Put generated eval run outputs under apps/evals/runs/<family>/<run_id>/. The tasks server picks them up automatically for the landing page and exposes them under /evals/artifacts/. Dataface eval faces are served under /evals/faces/.
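The directory convention above can be walked with a few lines of Python. This is a sketch under the stated layout only: the function name list_runs is hypothetical, and it is not the tasks server's actual discovery code, which may apply further rules.

```python
from pathlib import Path


def list_runs(runs_root):
    """Return sorted (family, run_id) pairs found under runs_root.

    Assumes the layout <runs_root>/<family>/<run_id>/ and ignores
    stray files at either level. Returns [] if the root is missing.
    """
    root = Path(runs_root)
    if not root.is_dir():
        return []
    runs = []
    for family_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for run_dir in sorted(p for p in family_dir.iterdir() if p.is_dir()):
            runs.append((family_dir.name, run_dir.name))
    return runs
```

Calling list_runs("apps/evals/runs") from the repository root returns an empty list when no runs have been generated yet, which corresponds to the empty state shown under "Latest runs".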

Current assumptions: the eval dashboards live in the existing Dataface project in apps/evals/, and most raw artifacts are JSON or JSONL files, plus any generated static files, inside each run directory.
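Since most raw artifacts are JSON or JSONL, a small reader covers both cases. This is a sketch only: the function name load_records is hypothetical, and it assumes nothing about the fields inside each record, because the artifact schema is not specified here.

```python
import json
from pathlib import Path


def load_records(path):
    """Load one artifact file as a list of records.

    .jsonl files yield one record per non-blank line; any other file is
    parsed as a single JSON document, wrapped in a list unless it already
    is a JSON array.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    return data if isinstance(data, list) else [data]
```

Returning a list in every case lets callers iterate over records without first checking which of the two formats a given file uses.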