Run agent eval loop with internal analysts
Problem
There is no systematic way to measure whether MCP agent interactions are improving or regressing over time. Prompt changes, tool schema updates, and new rendering features are shipped without evaluating their impact on agent task success rates. Without a repeatable eval loop — a curated set of analyst prompts, expected outcomes, and automated scoring — improvements are anecdotal and regressions go undetected until users report them. Internal analyst sessions generate valuable signal about what works and what fails, but that signal is lost because there is no structured capture-and-evaluate pipeline.
Context
What's already built (M1 eval system)
The M1 benchmark initiative builds eval infrastructure for two narrow layers:
- Text-to-SQL eval — given a question + schema context, does the model produce correct SQL? Scored with deterministic structural checks plus LLM-as-judge semantic equivalence. Fast, cheap, runs offline against gold SQL. See build-text-to-sql-eval-runner-and-deterministic-scorer.md.
- Catalog discovery eval — given a question, can the search/catalog tools surface the right tables? Scored with IR metrics (recall@k, MRR). See add-catalog-discovery-evals-derived-from-sql-benchmark.md.
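The IR metrics the catalog eval reports can be sketched in a few lines. This is an illustrative implementation of recall@k and MRR, not the actual scorer code:

```python
def recall_at_k(relevant: set[str], ranked: list[str], k: int) -> float:
    """Fraction of the relevant tables that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for table in ranked[:k] if table in relevant)
    return hits / len(relevant)


def mrr(relevant: set[str], ranked: list[str]) -> float:
    """Reciprocal rank of the first relevant table; 0.0 if none retrieved."""
    for rank, table in enumerate(ranked, start=1):
        if table in relevant:
            return 1.0 / rank
    return 0.0
```

For example, if the gold tables are {"orders", "users"} and the catalog returns ["users", "events", "orders"], recall@2 is 0.5 (one of two gold tables in the top 2) and MRR is 1.0 (first result is relevant).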
These test SQL generation and retrieval in isolation. They do NOT test:
- Agent tool use — does the agent call the right tools in the right order? (catalog → ask_sql → render_dashboard?)
- Dashboard generation — does the full YAML output compile, render, and look good?
- Visual quality — correct chart types, labels, layout, design choices?
- End-to-end — "show me revenue by region" → working, correct, well-designed dashboard?
What's already built (A lIe eval skill)
The alie-eval skill (.codex/skills/alie-eval/SKILL.md) already runs end-to-end agent evals: generates dashboards from natural language prompts, screenshots the rendered output, and scores with a vision-capable LLM against a rubric. It scores on:
- Narrative quality
- YAML correctness
- Dashboard visual quality (from screenshots)
- Overall score
This is the right foundation for agent-level eval. It tests levels 3-6. But it currently:
- Uses a fixed set of 10 prompts (synthetic, not from real analyst sessions)
- Runs against mock/example data, not the analytics warehouse
- Has no connection to the M1 benchmark or structured scoring model
- Doesn't capture tool-use traces or SQL quality metrics
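Connecting these scores to a structured scoring model starts with a stable record shape. A minimal sketch of a per-prompt score record, assuming hypothetical field names (the actual alie-eval output schema may differ):

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AgentEvalScore:
    """One scored eval run for one prompt. Field names are illustrative."""
    prompt_id: str
    narrative: float      # 0-10 narrative quality
    yaml_correct: bool    # did the dashboard YAML compile?
    visual: float         # 0-10 vision-LLM score from the screenshot
    overall: float

    def to_jsonl(self) -> str:
        """Serialize as one JSONL line for the results directory."""
        return json.dumps(asdict(self))
```

Emitting one such line per prompt per run makes runs diffable and lets the leaderboard aggregate across suites.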
Migration: A lIe eval code → apps/evals/agent/
The eval infrastructure currently in apps/a_lie/ (run_evals.py, review_evals.py, eval_rubric.md) moves to apps/evals/agent/ as part of the unified eval directory. The A lIe app itself stays in apps/a_lie/ — it's the demo app, not the eval system. The eval code moves to its proper home alongside SQL and catalog evals.
After migration:
- apps/evals/agent/runner.py — generate dashboard from prompt
- apps/evals/agent/screenshotter.py — capture rendered output
- apps/evals/agent/reviewer.py — vision LLM scoring
- apps/evals/agent/rubric.md — scoring rubric
- apps/evals/agent/prompts/ — curated eval prompts (starting with the existing 10, growing with analyst sessions)
- apps/evals/output/agent/ — results (screenshots, scores, JSONL)
The unified CLI runs agent evals via python -m apps.evals agent ... and the leaderboard dashboards show agent scores alongside SQL and catalog metrics.
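A sketch of how the unified CLI entry point could dispatch the agent suite. The subcommand name matches the text above, but the flags and defaults here are assumptions, not the shipped interface:

```python
# Hypothetical sketch of apps/evals/__main__.py argument handling.
import argparse


def parse_eval_args(argv=None):
    """Parse `python -m apps.evals <suite> ...` style arguments."""
    parser = argparse.ArgumentParser(prog="python -m apps.evals")
    sub = parser.add_subparsers(dest="suite", required=True)

    agent = sub.add_parser("agent", help="run end-to-end agent evals")
    agent.add_argument("--prompts", default="apps/evals/agent/prompts/",
                       help="directory of curated eval prompts")
    agent.add_argument("--out", default="apps/evals/output/agent/",
                       help="where screenshots, scores, and JSONL land")

    return parser.parse_args(argv)
```

So `python -m apps.evals agent --out runs/2024-w12/` would run the agent suite and write results under a run-specific directory.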
This task: bridge the gap
This task connects the M1 eval system (SQL-level precision) with the A lIe eval skill (end-to-end visual scoring) and adds the analyst feedback loop:
| Layer | What's tested | Eval system | Speed |
|---|---|---|---|
| SQL generation | Question → SQL | M1 eval runner | Fast (100s/min) |
| Catalog retrieval | Question → tables | M1 catalog eval | Fast |
| Agent tool use | Prompt → tool call sequence | This task | Medium |
| Dashboard quality | Prompt → rendered dashboard | A lIe eval (existing) | Slow |
| End-to-end | Prompt → correct dashboard on real data | This task | Slow |
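The "agent tool use" layer needs a scorer that checks whether the expected tools were called in the expected order. A minimal sketch, treating the expected sequence as a subsequence of the observed trace (the function name and scoring rule are assumptions for illustration):

```python
def tool_sequence_score(trace: list[str], expected: list[str]) -> float:
    """Fraction of the expected tool sequence matched in order within the trace.

    Extra or repeated calls in the trace are ignored; only ordering of the
    expected calls matters. Returns 1.0 for an empty expectation.
    """
    matched = 0
    for call in trace:
        if matched < len(expected) and call == expected[matched]:
            matched += 1
    return matched / len(expected) if expected else 1.0
```

For example, a trace of ["catalog", "ask_sql", "ask_sql", "render_dashboard"] fully matches the expected ["catalog", "ask_sql", "render_dashboard"] and scores 1.0, while an agent that skips catalog discovery scores lower.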
Analyst session integration
The unique value of M2 is real analyst prompts against the real warehouse:
- Analysts try to build dashboards → capture their prompts and success/failure
- Failed sessions become eval cases (regression tests for agent behavior)
- Successful sessions validate the agent works for real use cases
- Weekly: run the growing prompt set, compare scores across code changes
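Turning a failed session into an eval case can be a small capture step. A sketch, assuming a hypothetical session dict shape and helper name:

```python
import json
import pathlib


def session_to_eval_case(session: dict,
                         prompts_dir: str = "apps/evals/agent/prompts") -> pathlib.Path:
    """Persist a captured analyst session as a regression eval case.

    The session dict shape ({"id", "prompt", "tables", "failure"}) is an
    assumption; the real capture format would come from session logging.
    """
    case = {
        "prompt": session["prompt"],
        "expected_tables": session.get("tables", []),
        "failure_note": session.get("failure", ""),
        "source": "analyst-session",
    }
    out = pathlib.Path(prompts_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{session['id']}.json"
    path.write_text(json.dumps(case, indent=2))
    return path
```

Each captured case then rides along in the weekly run automatically, so a fixed failure stays fixed.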
Key files
- apps/evals/agent/ — agent eval code (migrated from apps/a_lie/)
- apps/evals/agent/rubric.md — scoring rubric for visual review
- apps/evals/agent/prompts/ — curated eval prompts
- apps/evals/sql/ — M1 SQL eval infrastructure
- apps/evals/faces/ — unified leaderboard dashboards
- .codex/skills/alie-eval/SKILL.md — skill wrapper (update to point at new paths)
- dataface/ai/agent.py — agent loop implementation
- dataface/ai/tool_schemas.py — tool definitions (including ask_sql)
Dependencies
- M1 eval runner operational (for SQL-level metrics within agent evals)
- A lIe eval skill functional (for dashboard visual scoring)
- BQ connection wired (wire-dataface-to-internal-analytics-repo-and-bigquery-source.md) — agent evals against real warehouse data need this
- ask_sql MCP tool landed (extract-shared-text-to-sql-generation-function.md)
Possible Solutions
- A - Keep evaluating changes by manually trying a few prompts: quick for local work, but too inconsistent for trend tracking.
- B - Recommended: define a bounded v1 agent eval loop: choose representative tasks, reproducible runs, scoring/review artifacts, and a cadence that lets the team compare changes over time.
- C - Block agent changes until a full benchmark platform exists: safest in theory, but far too heavy for the current stage.
Plan
- Choose the first set of representative analyst tasks and the output artifacts needed to compare runs meaningfully.
- Define the v1 execution and review loop, including scoring rules, human-review hooks, and where results are stored.
- Wire the eval loop into at least one practical development workflow so it is used regularly rather than remaining a side artifact.
- Use the first evaluation cycles to refine case selection, scoring thresholds, and the follow-up process for detected regressions.
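The regression follow-up in the last step needs a comparison gate between runs. A minimal sketch, assuming per-prompt overall scores keyed by prompt id and an illustrative drop threshold:

```python
def detect_regressions(baseline: dict[str, float],
                       candidate: dict[str, float],
                       threshold: float = 0.5) -> list[str]:
    """Return prompt ids whose score dropped by more than `threshold`.

    `baseline` and `candidate` map prompt id -> overall score from two eval
    runs. The 0.5-point threshold is a placeholder to be tuned during the
    first evaluation cycles.
    """
    return [pid for pid, base in baseline.items()
            if pid in candidate and base - candidate[pid] > threshold]
```

Running this after each weekly cycle (or on a candidate branch before merge) gives a concrete list of prompts to triage rather than a single aggregate number.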
Implementation Progress
QA Exploration
- [ ] QA exploration completed
Review Feedback
- [ ] Review cleared