Run agent eval loop with internal analysts
Problem
There is no systematic way to measure whether MCP agent interactions are improving or regressing over time. Prompt changes, tool schema updates, and new rendering features are shipped without evaluating their impact on agent task success rates. Without a repeatable eval loop — a curated set of analyst prompts, expected outcomes, and automated scoring — improvements are anecdotal and regressions go undetected until users report them. Internal analyst sessions generate valuable signal about what works and what fails, but that signal is lost because there is no structured capture-and-evaluate pipeline.
Context
What's already built (M1 eval system)
The M1 benchmark initiative builds eval infrastructure for two narrow layers:
- Text-to-SQL eval — given a question + schema context, does the model produce correct SQL? Scored with deterministic structural checks plus LLM-as-judge semantic equivalence. Fast, cheap, runs offline against gold SQL. See build-text-to-sql-eval-runner-and-deterministic-scorer.md.
- Catalog discovery eval — given a question, can the search/catalog tools surface the right tables? Scored with IR metrics (recall@k, MRR). See add-catalog-discovery-evals-derived-from-sql-benchmark.md.
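The IR metrics the catalog eval reports can be sketched in a few lines. This is an illustrative implementation of recall@k and MRR, not the actual scorer code:

```python
def recall_at_k(relevant: set[str], ranked: list[str], k: int) -> float:
    """Fraction of the relevant tables that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for table in ranked[:k] if table in relevant)
    return hits / len(relevant)


def mrr(relevant: set[str], ranked: list[str]) -> float:
    """Reciprocal rank of the first relevant table; 0.0 if none retrieved."""
    for rank, table in enumerate(ranked, start=1):
        if table in relevant:
            return 1.0 / rank
    return 0.0
```

For example, if the gold tables are {"orders", "users"} and the catalog returns ["users", "events", "orders"], recall@2 is 0.5 (one of two gold tables in the top 2) and MRR is 1.0 (first result is relevant).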
These test SQL generation and retrieval in isolation. They do NOT test:
- Agent tool use — does the agent call the right tools in the right order? (catalog → ask_sql → render_dashboard?)
- Dashboard generation — does the full YAML output compile, render, and look good?
- Visual quality — correct chart types, labels, layout, design choices?
- End-to-end — "show me revenue by region" → working, correct, well-designed dashboard?
What's already built (A lIe eval skill)
The alie-eval skill (.codex/skills/alie-eval/SKILL.md) already runs end-to-end agent evals: generates dashboards from natural language prompts, screenshots the rendered output, and scores with a vision-capable LLM against a rubric. It scores on:
- Narrative quality
- YAML correctness
- Dashboard visual quality (from screenshots)
- Overall score
This is the right foundation for agent-level eval. It tests levels 3-6. But it currently:
- Uses a fixed set of 10 prompts (synthetic, not from real analyst sessions)
- Runs against mock/example data, not the analytics warehouse
- Has no connection to the M1 benchmark or structured scoring model
- Doesn't capture tool-use traces or SQL quality metrics
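Connecting these scores to a structured scoring model starts with a stable record shape. A minimal sketch of a per-prompt score record, assuming hypothetical field names (the actual alie-eval output schema may differ):

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AgentEvalScore:
    """One scored eval run for one prompt. Field names are illustrative."""
    prompt_id: str
    narrative: float      # 0-10 narrative quality
    yaml_correct: bool    # did the dashboard YAML compile?
    visual: float         # 0-10 vision-LLM score from the screenshot
    overall: float

    def to_jsonl(self) -> str:
        """Serialize as one JSONL line for the results directory."""
        return json.dumps(asdict(self))
```

Emitting one such line per prompt per run makes runs diffable and lets the leaderboard aggregate across suites.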
Migration: A lIe eval code → apps/evals/agent/
The eval infrastructure currently in apps/a_lie/ (run_evals.py, review_evals.py, eval_rubric.md) moves to apps/evals/agent/ as part of the unified eval directory. The A lIe app itself stays in apps/a_lie/ — it's the demo app, not the eval system. The eval code moves to its proper home alongside SQL and catalog evals.
After migration:
- apps/evals/agent/runner.py — generate dashboard from prompt
- apps/evals/agent/screenshotter.py — capture rendered output
- apps/evals/agent/reviewer.py — vision LLM scoring
- apps/evals/agent/rubric.md — scoring rubric
- apps/evals/agent/prompts/ — curated eval prompts (starting with the existing 10, growing with analyst sessions)
- apps/evals/output/agent/ — results (screenshots, scores, JSONL)
The unified CLI runs agent evals via python -m apps.evals agent ... and the leaderboard dashboards show agent scores alongside SQL and catalog metrics.
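A sketch of how the unified CLI entry point could dispatch the agent suite. The subcommand name matches the text above, but the flags and defaults here are assumptions, not the shipped interface:

```python
# Hypothetical sketch of apps/evals/__main__.py argument handling.
import argparse


def parse_eval_args(argv=None):
    """Parse `python -m apps.evals <suite> ...` style arguments."""
    parser = argparse.ArgumentParser(prog="python -m apps.evals")
    sub = parser.add_subparsers(dest="suite", required=True)

    agent = sub.add_parser("agent", help="run end-to-end agent evals")
    agent.add_argument("--prompts", default="apps/evals/agent/prompts/",
                       help="directory of curated eval prompts")
    agent.add_argument("--out", default="apps/evals/output/agent/",
                       help="where screenshots, scores, and JSONL land")

    return parser.parse_args(argv)
```

So `python -m apps.evals agent --out runs/2024-w12/` would run the agent suite and write results under a run-specific directory.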
This task: bridge the gap
This task connects the M1 eval system (SQL-level precision) with the A lIe eval skill (end-to-end visual scoring) and adds the analyst feedback loop:
| Layer | What's tested | Eval system | Speed |
|---|---|---|---|
| SQL generation | Question → SQL | M1 eval runner | Fast (100s/min) |
| Catalog retrieval | Question → tables | M1 catalog eval | Fast |
| Agent tool use | Prompt → tool call sequence | This task | Medium |
| Dashboard quality | Prompt → rendered dashboard | A lIe eval (existing) | Slow |
| End-to-end | Prompt → correct dashboard on real data | This task | Slow |
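The "agent tool use" layer needs a scorer that checks whether the expected tools were called in the expected order. A minimal sketch, treating the expected sequence as a subsequence of the observed trace (the function name and scoring rule are assumptions for illustration):

```python
def tool_sequence_score(trace: list[str], expected: list[str]) -> float:
    """Fraction of the expected tool sequence matched in order within the trace.

    Extra or repeated calls in the trace are ignored; only ordering of the
    expected calls matters. Returns 1.0 for an empty expectation.
    """
    matched = 0
    for call in trace:
        if matched < len(expected) and call == expected[matched]:
            matched += 1
    return matched / len(expected) if expected else 1.0
```

For example, a trace of ["catalog", "ask_sql", "ask_sql", "render_dashboard"] fully matches the expected ["catalog", "ask_sql", "render_dashboard"] and scores 1.0, while an agent that skips catalog discovery scores lower.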
Analyst session integration
The unique value of M2 is real analyst prompts against the real warehouse:
- Analysts try to build dashboards → capture their prompts and success/failure
- Failed sessions become eval cases (regression tests for agent behavior)
- Successful sessions validate the agent works for real use cases
- Weekly: run the growing prompt set, compare scores across code changes
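Turning a failed session into an eval case can be a small capture step. A sketch, assuming a hypothetical session dict shape and helper name:

```python
import json
import pathlib


def session_to_eval_case(session: dict,
                         prompts_dir: str = "apps/evals/agent/prompts") -> pathlib.Path:
    """Persist a captured analyst session as a regression eval case.

    The session dict shape ({"id", "prompt", "tables", "failure"}) is an
    assumption; the real capture format would come from session logging.
    """
    case = {
        "prompt": session["prompt"],
        "expected_tables": session.get("tables", []),
        "failure_note": session.get("failure", ""),
        "source": "analyst-session",
    }
    out = pathlib.Path(prompts_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{session['id']}.json"
    path.write_text(json.dumps(case, indent=2))
    return path
```

Each captured case then rides along in the weekly run automatically, so a fixed failure stays fixed.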
Key files
- apps/evals/agent/ — agent eval code (migrated from apps/a_lie/)
- apps/evals/agent/rubric.md — scoring rubric for visual review
- apps/evals/agent/prompts/ — curated eval prompts
- apps/evals/sql/ — M1 SQL eval infrastructure
- apps/evals/faces/ — unified leaderboard dashboards
- .codex/skills/alie-eval/SKILL.md — skill wrapper (update to point at new paths)
- dataface/ai/agent.py — agent loop implementation
- dataface/ai/tool_schemas.py — tool definitions (including ask_sql)
Dependencies
- M1 eval runner operational (for SQL-level metrics within agent evals)
- A lIe eval skill functional (for dashboard visual scoring)
- BQ connection wired (wire-dataface-to-internal-analytics-repo-and-bigquery-source.md) — agent evals against real warehouse data need this
- ask_sql MCP tool landed (extract-shared-text-to-sql-generation-function.md)
Possible Solutions
- A - Keep evaluating changes by manually trying a few prompts: quick for local work, but too inconsistent for trend tracking.
- B - Recommended: define a bounded v1 agent eval loop: choose representative tasks, reproducible runs, scoring/review artifacts, and a cadence that lets the team compare changes over time.
- C - Block agent changes until a full benchmark platform exists: safest in theory, but far too heavy for the current stage.
Plan
- Choose the first set of representative analyst tasks and the output artifacts needed to compare runs meaningfully.
- Define the v1 execution and review loop, including scoring rules, human-review hooks, and where results are stored.
- Wire the eval loop into at least one practical development workflow so it is used regularly rather than remaining a side artifact.
- Use the first evaluation cycles to refine case selection, scoring thresholds, and the follow-up process for detected regressions.
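The regression follow-up in the last step needs a comparison gate between runs. A minimal sketch, assuming per-prompt overall scores keyed by prompt id and an illustrative drop threshold:

```python
def detect_regressions(baseline: dict[str, float],
                       candidate: dict[str, float],
                       threshold: float = 0.5) -> list[str]:
    """Return prompt ids whose score dropped by more than `threshold`.

    `baseline` and `candidate` map prompt id -> overall score from two eval
    runs. The 0.5-point threshold is a placeholder to be tuned during the
    first evaluation cycles.
    """
    return [pid for pid, base in baseline.items()
            if pid in candidate and base - candidate[pid] > threshold]
```

Running this after each weekly cycle (or on a candidate branch before merge) gives a concrete list of prompts to triage rather than a single aggregate number.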
Implementation Progress
QA Exploration
- [ ] QA exploration completed
Review Feedback
- [ ] Review cleared