Evals
The eval system measures how well Dataface's AI agents perform on real tasks. Each eval family targets a specific agent capability — text-to-SQL generation, catalog discovery, dashboard authoring — and scores agent outputs against a curated benchmark of expected results.
A run is a single execution of an eval family with a specific backend configuration (model, provider, context level). Runs produce scored results that feed the dashboards below, letting you compare models, context strategies, and code changes over time.
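As a rough illustration of what "a run" carries, the sketch below models one as an eval family plus a backend configuration and a set of per-task scores, then compares two runs of the same family. All class, field, and function names here (`BackendConfig`, `EvalRun`, `score_delta`) are hypothetical and are not taken from the Dataface codebase; the real run schema may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BackendConfig:
    """Hypothetical backend configuration for a run."""
    model: str
    provider: str
    context_level: str


@dataclass(frozen=True)
class EvalRun:
    """Hypothetical record of one execution of an eval family."""
    family: str
    run_id: str
    backend: BackendConfig
    scores: Tuple[float, ...]  # one score per benchmark task (assumed shape)

    def mean_score(self) -> float:
        if not self.scores:
            raise ValueError("run has no scored results")
        return sum(self.scores) / len(self.scores)


def score_delta(baseline: EvalRun, candidate: EvalRun) -> float:
    """Return candidate mean minus baseline mean for runs of the same family."""
    if baseline.family != candidate.family:
        raise ValueError("runs from different eval families are not comparable")
    return candidate.mean_score() - baseline.mean_score()
```

Keeping the backend configuration on the run record is what makes comparisons across models or context strategies meaningful: two runs differ only in the fields you chose to vary.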
Dashboards & Artifacts
just looker-migrate serve)
Latest runs
No eval run artifacts found under apps/evals/runs/.
Placement
Put generated eval run outputs under apps/evals/runs/<family>/<run_id>/.
The tasks server picks them up automatically for the landing page and exposes them under
/evals/artifacts/. Dataface eval faces are served under /evals/faces/.
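The placement convention can be sketched as a pair of path helpers. The directory layout comes from the text above; the assumption that the URL under /evals/artifacts/ mirrors the `<family>/<run_id>/` layout is a guess, and the helper names are hypothetical.

```python
from pathlib import PurePosixPath

# Directory root and URL prefix as stated in this section.
RUNS_ROOT = PurePosixPath("apps/evals/runs")
ARTIFACTS_URL_PREFIX = "/evals/artifacts/"


def run_dir(family: str, run_id: str) -> PurePosixPath:
    """Directory where a run's generated outputs should be placed."""
    return RUNS_ROOT / family / run_id


def artifact_url(family: str, run_id: str, filename: str) -> str:
    """URL for one artifact, ASSUMING the server mirrors the directory layout."""
    return f"{ARTIFACTS_URL_PREFIX}{family}/{run_id}/{filename}"
```

If the tasks server uses a different URL scheme, only `artifact_url` needs to change; the on-disk layout is the part the text actually specifies.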
Current assumptions: the eval dashboards live in the existing Dataface project at apps/evals/,
and most raw artifacts are JSON or JSONL files, plus any generated static files, inside each run directory.
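A minimal sketch of producing and discovering such artifacts, assuming one JSON object per line for JSONL files. The file name `results.jsonl` and both function names are illustrative only; the actual artifact names written by the eval runners are not specified here.

```python
import json
from pathlib import Path
from typing import Iterable, List


def write_results(run_dir: Path, rows: Iterable[dict],
                  filename: str = "results.jsonl") -> Path:
    """Write rows as JSONL (one JSON object per line) into the run directory."""
    run_dir.mkdir(parents=True, exist_ok=True)
    out = run_dir / filename
    with out.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return out


def list_raw_artifacts(run_dir: Path) -> List[Path]:
    """Return all JSON/JSONL files under a run directory, sorted by path."""
    return sorted(
        p for p in run_dir.rglob("*")
        if p.is_file() and p.suffix in {".json", ".jsonl"}
    )
```

For example, `write_results(Path("apps/evals/runs") / family / run_id, rows)` would place a results file where the tasks server is described as looking for run outputs.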