MCP Analyst Agent
Purpose
Agent tools and workflows for AI-assisted analysis using Dataface context. This workstream builds the MCP server, tool definitions, and prompt workflows that let AI agents (in Cursor, Claude, etc.) interact with Dataface — inspecting schemas, generating dashboards, running queries, and iterating on analysis. The goal is that an analyst can describe what they want in natural language and an AI agent produces a working dashboard or analysis. Adjacent to inspect-profiler (which provides the data context the agent uses) and context-catalog-nimble (which defines how context is structured and surfaced).
Owner
- Data AI Engineer Architect
Initiatives
- AI Agent Surfaces — Completed, M1 — 5T Internal Pilot Ready, 8 / 8 tasks complete (100%)
- AI Quality Experimentation and Context Optimization — Planned, M2 — Internal Adoption + Design Partners, 4 / 10 tasks complete (40%)
- Benchmark-Driven Text-to-SQL and Discovery Evals — Planned, M2 — Internal Adoption + Design Partners, 3 / 6 tasks complete (50%)
- Dashboard linking — Ready, M2 — Internal Adoption + Design Partners, 1 / 2 tasks complete (50%)
- External Text-to-SQL Benchmarks and SOTA Calibration — Planned, M2 — Internal Adoption + Design Partners, 0 / 6 tasks complete (0%)
- Quickstart dashboards — In Progress, M2 — Internal Adoption + Design Partners, 3 / 3 tasks complete (100%)
Tasks by Milestone
A runnable prototype path exists for AI agent tool interfaces, execution workflows, and eval-driven behavior tuning, with concrete artifacts that prove the flow works end-to-end in the current codebase. Core assumptions are documented, known constraints are explicit, and the team can explain what is real versus mocked without ambiguity.
- Prototype gaps and follow-on capture Completed — Document top gaps and risks in eval and guardrail framework that must be addressed next.
- Prototype implementation path Completed — Implement a runnable end-to-end prototype path for MCP tool execution model.
- Prototype validation and proof Completed — Validate agent prompt/workflow behavior with concrete proof artifacts and repeatable steps.
Internal analysts can execute at least one weekly real workflow that depends on AI agent tool interfaces, execution workflows, and eval-driven behavior tuning in the 5T Analytics environment, without bespoke engineering intervention for every run. Instrumentation and feedback capture are in place so failures, friction points, and adoption gaps are visible and triaged with owners.
- Extract shared chat.js and chat_stream SSE endpoint Completed — Extract the shared chat component chat.js and chat_stream SSE endpoint as a standalone M1 task. This resolves the depen…
- MCP tooling contract for extension + Copilot dashboard/query generation Completed — Define and harden MCP tool inputs/outputs so extension and Copilot can reliably generate dashboards and queries in pilo…
- Unify Cloud AI Tool Dispatch to Use Canonical MCP Tools (AI Agent Surfaces) Completed — Replace the bespoke _execute_tool_sync() in apps/cloud/apps/ai/views.py (which only supports 4 tools: validate_yaml, te…
- Wire Playground AI to use MCP tools instead of bespoke tool set Completed — The Playground app currently maintains its own bespoke AI tools - validate_yaml, test_yaml_execution, execute_query_res…
- Add JSON render output format Completed — Add format=json to the render pipeline that walks the layout tree, executes queries, resolves charts, and returns the r…
- Refactor Cloud AI chat stream into scoped execution services Completed — Refactor apps/cloud/apps/ai/views.py chat_stream into smaller scope-resolution, tool-execution, and SSE-streaming units…
- Replace AI tool dispatch switch with registry-backed handlers Completed — Refactor dataface/ai/tools.py so canonical tool schemas and handlers are registered in one place instead of maintained…
- Save dashboard MCP tool - persist agent work to project Completed (PR #806, 2026-03-25T13:03:44-07:00) — Add a save_dashboard MCP tool that writes agent-generated YAML to the project file system. Currently all tools are stat…
- Scope playground MCP surface to playground sources Completed — Refactor the shared AI/MCP surface to accept an injected context for adapter registry, dashboard directory, base dir, a…
- Wire Dataface to internal analytics repo and BigQuery source Cancelled — Set up the Dataface-side access path to the internal analytics warehouse and sibling analytics dbt repo. Use /Users/dav…
- Add resolved YAML render output format Completed — Add a format=yaml output that produces a resolved dataface YAML -- auto chart types filled in, auto-detected fields exp…
- Type terminal agent event protocol and provider stream adapters Completed — Refactor the terminal agent loop introduced in dataface/ai/agent.py and dataface/ai/llm.py to use explicit typed event…
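The registry-backed tool dispatch described above (replacing a hand-maintained switch in dataface/ai/tools.py) can be sketched roughly as follows. The decorator, schema shape, and tool behavior here are illustrative assumptions, not the actual Dataface API:

```python
# Minimal sketch of registry-backed tool dispatch: each tool registers its
# schema and handler in one place, so adding a tool never touches dispatch code.
from typing import Any, Callable, Dict

TOOL_REGISTRY: Dict[str, dict] = {}

def tool(name: str, schema: dict) -> Callable:
    """Register a handler plus its input schema under one canonical name."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOL_REGISTRY[name] = {"schema": schema, "handler": fn}
        return fn
    return register

@tool("validate_yaml", schema={"type": "object",
                               "properties": {"yaml_text": {"type": "string"}}})
def validate_yaml(yaml_text: str) -> dict:
    # Placeholder validation; the real tool would parse and lint the YAML.
    return {"valid": bool(yaml_text.strip())}

def dispatch(name: str, arguments: dict) -> Any:
    """Look up a tool by name and invoke its handler -- no switch statement."""
    entry = TOOL_REGISTRY.get(name)
    if entry is None:
        raise KeyError(f"unknown tool: {name}")
    return entry["handler"](**arguments)
```

With this shape, both the MCP server and any bespoke callers (Cloud chat, Playground) can share one canonical tool table instead of duplicating dispatch logic.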
AI agent tool interfaces, execution workflows, and eval-driven behavior tuning are hardened enough for regular use by multiple internal teams and initial design partners, with a predictable response loop for issues and requests. Quality expectations are documented, and prioritized improvements from real usage are actively incorporated into delivery.
- Add 'describe' or 'text' render output format for AI agents Completed — Add a `describe`/`text` render mode so AI agents can request compact textual dashboard outputs instead of visual payloa…
- Add BIRD mini-dev adapter and runner support (External Text-to-SQL Benchmarks and SOTA Calibration) Waiting on plan-external-text-to-sql-benchmark-adoption-order-and-constraints, define-normalized-external-benchmark-case-and-result-contracts-for-apps-evals — Integrate BIRD mini-dev into apps/evals with a loader, benchmark-specific runner settings, and reproducible local basel…
- Add external benchmark provenance dashboards and cross-benchmark slices (External Text-to-SQL Benchmarks and SOTA Calibration) Waiting on define-normalized-external-benchmark-case-and-result-contracts-for-apps-evals, add-bird-mini-dev-adapter-and-runner-support, add-spider-2-lite-adapter-and-runner-support — Extend the eval leaderboard so external benchmark runs are comparable by benchmark, split, dialect, scorer, and environ…
- Add Spider 2.0 Lite adapter and runner support (External Text-to-SQL Benchmarks and SOTA Calibration) Waiting on plan-external-text-to-sql-benchmark-adoption-order-and-constraints, define-normalized-external-benchmark-case-and-result-contracts-for-apps-evals — Integrate Spider 2.0-Lite into apps/evals with benchmark-aware loading, environment assumptions, and reproducible basel…
- Adoption hardening for internal teams — Harden MCP tool execution model for repeated use across multiple internal teams and first design partners.
- Build bounded non-one-shot text-to-SQL stack for local evals (Benchmark-Driven Text-to-SQL and Discovery Evals) — Build an experimental local-only text-to-SQL backend that wraps the existing shared generator in a bounded plan - gener…
- Build text-to-SQL eval runner and deterministic scorer Completed — Build a Dataface text-to-SQL eval harness that runs agent/model prompts against the cleaned benchmark and scores output…
- Chat-First Home Page - Conversational AI Interface for Dataface Cloud (AI Agent Surfaces) Completed — Replace the current org home page (dashboard grid) with a chat-first interface. The home screen shows existing dashboar…
- Create cleaned dbt SQL benchmark artifact Completed — Create a reproducible benchmark-prep step that imports the raw dbt dataset from cto-research, filters out AISQL rows, r…
- Define normalized external benchmark case and result contracts for apps/evals (External Text-to-SQL Benchmarks and SOTA Calibration) Waiting on plan-external-text-to-sql-benchmark-adoption-order-and-constraints — Define the internal contracts that map external benchmark tasks, metadata, and outputs into Dataface eval types without…
- Design-partner feedback loop operations — Operationalize rapid feedback-to-fix loop for agent prompt/workflow behavior with explicit decision logs.
- Embeddable Dashboards in Chat - Inline Preview, Modal Expand, and Save to Repo (AI Agent Surfaces) Completed — Dashboards generated during chat conversations can be embedded inline as interactive previews. Users click to expand in…
- Extract shared text-to-SQL generation function (Benchmark-Driven Text-to-SQL and Discovery Evals) Completed — Extract a shared generate_sql(question, context_provider, model) function, wire render_dashboard and cloud AIService to…
- MCP and skills auto-install across all AI clients Completed — Expand dft mcp init to cover VS Code, Claude Code, and GitHub Copilot Coding Agent. Register MCP server programmaticall…
- Plan external text-to-SQL benchmark adoption order and constraints (External Text-to-SQL Benchmarks and SOTA Calibration) — Decide which public benchmarks to adopt first, what environments and licenses they require, and what the phase-1 integr…
- Quality standards and guardrails — Define and enforce quality standards for eval and guardrail framework to keep output consistent as contributors expand.
- Run agent eval loop with internal analysts — Establish repeatable agent-level eval workflow that tests the full loop (prompt → tool use → SQL generation → dashboard…
- Set up eval leaderboard dft project and dashboards (Benchmark-Driven Text-to-SQL and Discovery Evals) Completed — Create a dft project inside the eval output directory with dashboard faces that visualize eval results as a leaderboard…
- Task M2 schema-aware query planning for cloud chat questions (AI Agent Surfaces) — The home-page/org chat can answer some data questions, but it sometimes generates SQL against columns that do not exist…
- Terminal Agent TUI - dft agent Completed — Build a Claude Code-like terminal AI agent as a dft subcommand. The agent comes pre-loaded with Dataface MCP tools and…
- Add catalog discovery evals derived from SQL benchmark Completed — Adapt the dbt SQL benchmark into search/catalog discovery eval cases by extracting expected tables from gold SQL and ge…
- Add LiveSQLBench adapter and release-tracking workflow (External Text-to-SQL Benchmarks and SOTA Calibration) Waiting on plan-external-text-to-sql-benchmark-adoption-order-and-constraints, define-normalized-external-benchmark-case-and-result-contracts-for-apps-evals — Add a second-wave integration for LiveSQLBench with explicit handling for release versions, hidden-vs-open splits, and…
- Add persistent analyst memories and learned context (AI Quality Experimentation and Context Optimization) Completed — Design and implement a memories file that accumulates knowledge from analyst queries — table quirks, column semantics,…
- Chat Conversation Persistence and History (AI Agent Surfaces) Completed — Add ChatSession and ChatMessage Django models so chat conversations survive page refreshes. Show recent conversations i…
- Curate schema and table scope for eval benchmark (AI Quality Experimentation and Context Optimization) Completed — Decide which schemas, tables, and data layers (raw, silver/staging, gold/marts) to include in the eval scope and catalo…
- Dashboard linking v1 (dashboard-root paths, render-time rewrite, docs) (Dashboard linking) Completed — Implement cross-board linking for YAML dashboard markdown (dashboard-root paths, ../ relative, suffix strip, render-tim…
- Experiment: Catalog tool access with vs without tool (AI Quality Experimentation and Context Optimization) — Compare runs with and without catalog/schema tool access to measure whether the tool itself materially improves SQL gen…
- Experiment: Context ablation L0 vs L1 vs L3 vs L5 (AI Quality Experimentation and Context Optimization) Completed (PR #801, 2026-03-25T09:17:39-07:00) — Measure how schema context layers affect SQL quality on the canary set, starting with names-only versus richer descript…
- Experiment: Layer scope all vs gold-only vs gold+silver (AI Quality Experimentation and Context Optimization) — Test whether exposing all tables, gold-only tables, or gold-plus-silver scope produces better SQL quality and lower noi…
- Experiment: Model comparison GPT-4o vs Claude Sonnet (AI Quality Experimentation and Context Optimization) — Compare GPT-4o and Claude Sonnet on the same canary set, prompt, and context configuration to isolate model effects on…
- Experiment: Model comparison GPT-4o vs GPT-5 (AI Quality Experimentation and Context Optimization) — Compare GPT-4o and GPT-5 on the same canary set, prompt, and context configuration to quantify quality and cost differe…
- Experiment: Model comparison GPT-5 vs Claude Sonnet (AI Quality Experimentation and Context Optimization) — Compare GPT-5 and Claude Sonnet on the same canary set, prompt, and context configuration to isolate model effects on S…
- Experiment: Schema tool strategy profiled vs filtered vs INFORMATION_SCHEMA vs none (AI Quality Experimentation and Context Optimization) — Compare schema acquisition strategies to determine whether profiled catalog fields, filtered fields, live INFORMATION_S…
- Persist eval outputs for Dataface analysis and boards Cancelled — Define the canonical eval artifact schema for run metadata, per-case results, retrieval results, and summaries. Add loa…
- Persist eval runs by default and add a quick eval dashboard serve command (Benchmark-Driven Text-to-SQL and Discovery Evals) Completed — Make eval runs durable by default instead of transient output, and add a one-command local entrypoint for browsing the…
- Polish debug tool activity styling in cloud chat (AI Agent Surfaces) — Keep tool activity visible in debug mode, but replace the current raw/emojified presentation with a cleaner system stat…
- Quickstart dashboard pack — Salesforce dbt project pilot (Quickstart dashboards) Completed (PR #718, 2026-03-22T21:55:00-07:00) — Pilot the quickstart dashboard process on the Salesforce quickstart dbt repo: checkout, dft init, run the product-resea…
- Quickstart dashboard pack — Zendesk dbt project pilot (Quickstart dashboards) Completed (PR #717, 2026-03-22T21:52:54-07:00) — Second pilot for the quickstart dashboard process on the Zendesk quickstart dbt repo: same workflow as Salesforce pilot…
- Quickstart dashboards — program setup (workspace, skill, pilot process) (Quickstart dashboards) Completed — Establish a repeatable program for Dataface dashboard packs on Fivetran quickstart open-source dbt projects: workspace…
- Run context and model ablation experiments (AI Quality Experimentation and Context Optimization) Completed — Define and execute the initial experiment matrix using the eval system. Compare models (GPT-4o, GPT-5, Claude Sonnet, e…
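The deterministic scoring used by the text-to-SQL eval harness above can be illustrated with a toy exact-match scorer; the normalization rules here (whitespace, case, trailing semicolons) are assumptions for illustration, not the harness's actual rules:

```python
# Toy deterministic scorer for text-to-SQL evals: normalize whitespace,
# case, and trailing semicolons before comparing generated SQL to gold SQL.
import re

def normalize_sql(sql: str) -> str:
    sql = sql.strip().rstrip(";")
    sql = re.sub(r"\s+", " ", sql)  # collapse runs of whitespace/newlines
    return sql.lower()

def exact_match(generated: str, gold: str) -> bool:
    """Deterministic pass/fail: identical after normalization."""
    return normalize_sql(generated) == normalize_sql(gold)
```

For example, `exact_match("SELECT  id\nFROM users;", "select id from users")` passes even though the raw strings differ, while any semantic difference still fails.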
Launch scope for AI agent tool interfaces, execution workflows, and eval-driven behavior tuning is complete, externally explainable, and supportable: user-facing behavior is stable, documentation is publishable, and operational ownership is explicit. Remaining gaps are non-blocking, risk-assessed, and tracked as post-launch follow-up rather than unresolved launch debt.
- Launch docs and external readiness — Publish external-facing documentation and examples for agent prompt/workflow behavior that are executable by new users.
- Launch operations and reliability readiness — Finalize operational readiness for eval and guardrail framework: telemetry, alerting, support ownership, and incident p…
- Public launch scope completion — Complete launch-critical scope for MCP tool execution model with production-safe behavior and rollback clarity.
- Add deterministic candidate selection baselines for bounded text-to-SQL Waiting on build-bounded-non-one-shot-text-to-sql-stack-for-local-evals — Compare simple winner-selection heuristics such as first valid parse, fewest grounding errors, and best plan overlap befo…
- Add latest regression delta dashboards for text-to-SQL evals Waiting on build-text-to-sql-failure-taxonomy-and-slice-dashboard — Show newly fixed, newly broken, and severity-shifted benchmark cases between runs so eval iteration can focus on concrete…
- Add retrieval-side gold labels to text-to-SQL benchmark — Augment benchmark cases with expected tables, columns, and semantic targets so retrieval quality can be measured directly…
- Add semantic SQL diff and partial-credit scoring — Score tables, filters, grouping, aggregation, ordering, and other semantic components separately so near misses are measurab…
- Add structured planning evals separate from final SQL quality Waiting on build-bounded-non-one-shot-text-to-sql-stack-for-local-evals — Evaluate planner outputs for table, filter, grain, and metric selection independently from final SQL so planning quality c…
- Build text-to-SQL failure taxonomy and slice dashboard — Classify benchmark failures by table, column, join, filter, aggregation, grain, ordering, and syntax mistakes, and surface thos…
- Capture candidate plan and repair telemetry in eval artifacts Waiting on build-bounded-non-one-shot-text-to-sql-stack-for-local-evals — Persist planner outputs, candidate rejection reasons, repair attempts, and winner selection metadata so bounded non-one-sh…
- Compare question bundle variants for text-to-SQL evals Waiting on compare-text-to-sql-evals-with-question-aware-retrieval-vs-full-context-prompting — Run structured evals comparing full context against multiple bundle variants such as relationships, plan hints, and value…
- Desktop app - lightweight wrapper around Dataface Cloud web UI — Build a desktop application that wraps the Dataface Cloud web interface. Provides native OS integration like menu bar,…
- Patch-based AI edits for dashboard YAML — Instead of AI regenerating entire YAML files when refining dashboards, support targeted YAML patches inspired by json-r…
- Schema-derived AI prompts from compiled types — Auto-generate LLM system prompts from the Dataface schema definition rather than hand-maintaining prompt templates. Ins…
- Skill and tool quality evaluation framework — Build a framework to A/B test whether individual MCP skills improve agent output quality vs raw tool access. Measure sk…
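The patch-based editing idea above can be sketched with a JSON-Pointer-style apply over the parsed dashboard tree; the slash-separated path syntax and the dashboard structure below are assumptions borrowed from JSON Pointer, not Dataface's actual patch format:

```python
# Sketch of patch-based dashboard edits: instead of regenerating the whole
# YAML document, the agent emits targeted (path, value) patches that are
# applied in place to the parsed tree.
from typing import Any

def apply_patch(doc: Any, path: str, value: Any) -> None:
    """Set `value` at a slash-separated path inside nested dicts/lists."""
    parts = [p for p in path.split("/") if p]
    node = doc
    for key in parts[:-1]:
        node = node[int(key)] if isinstance(node, list) else node[key]
    last = parts[-1]
    if isinstance(node, list):
        node[int(last)] = value
    else:
        node[last] = value

# Hypothetical dashboard tree: one targeted edit, everything else untouched.
dashboard = {"title": "Revenue", "charts": [{"type": "bar", "metric": "gross"}]}
apply_patch(dashboard, "/charts/0/metric", "net_revenue")
```

The appeal of this design is that small refinements become cheap and diffable: the rest of the document is guaranteed unchanged, which full regeneration cannot promise.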
Post-launch stabilization is complete for AI agent tool interfaces, execution workflows, and eval-driven behavior tuning: recurring incidents are reduced, support burden is lower, and quality gates are enforced consistently before release. The team has a repeatable operating model for maintenance, regression prevention, and measured reliability improvements.
- Regression prevention and quality gates — Add or enforce regression gates around agent prompt/workflow behavior so release quality is sustained automatically.
- Sustainable operating model — Document and adopt sustainable operating model for eval and guardrail framework across support, triage, and release cad…
- v1.0 stability and defect burn-down — Run stability program for MCP tool execution model with recurring defect burn-down and reliability trend tracking.
- Add bounded execution-guided smoke repair mode for text-to-SQL evals Waiting on build-bounded-non-one-shot-text-to-sql-stack-for-local-evals — Add an optional local-only execution-guided repair pass for already plausible candidates so cheap execution evidence ca…
- Add contamination-resistant benchmark splits and holdout guardrails (Benchmark-Driven Text-to-SQL and Discovery Evals) — Separate iteration canaries from held-out benchmark slices and add process guardrails so prompt and retrieval tuning do…
- Add judge calibration and disagreement audits for text-to-SQL evals — Audit where deterministic scoring and LLM judging disagree so leaderboard movements can distinguish real generation imp…
- Expand text-to-SQL benchmark with paraphrase clusters (Benchmark-Driven Text-to-SQL and Discovery Evals) — Add multiple phrasings for the same logical query intent so robustness can be measured separately from fitting one benc…
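The paraphrase-cluster idea above implies a cluster-level robustness metric: an intent only counts as robustly solved if every phrasing passes. A minimal sketch, assuming a simple `(cluster_id, passed)` result shape rather than the real eval schema:

```python
# Sketch of cluster-level robustness scoring for paraphrase clusters:
# group per-case pass/fail results by logical intent, then count only
# clusters where every paraphrase passed.
from collections import defaultdict

def cluster_robustness(results):
    """results: iterable of (cluster_id, passed) pairs.
    Returns the fraction of clusters where all paraphrases passed."""
    by_cluster = defaultdict(list)
    for cluster_id, passed in results:
        by_cluster[cluster_id].append(passed)
    if not by_cluster:
        return 0.0
    solid = sum(1 for passes in by_cluster.values() if all(passes))
    return solid / len(by_cluster)
```

This separates robustness from plain accuracy: a model can pass 3 of 4 cases (75% accuracy) yet score only 50% robustness if the failure lands in one of two clusters.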
v1.2 delivers meaningful depth improvements in AI agent tool interfaces, execution workflows, and eval-driven behavior tuning based on observed usage and retention signals, not just roadmap intent. Enhancements improve real customer outcomes, and release readiness is demonstrated through metrics, regression coverage, and clear migration guidance where relevant.
- Add eval loop for dashboard search and variable-scoped navigation — Build a repeatable eval loop for dashboard search that measures not only whether the right dashboard is retrieved, but…
- Expand dashboard search to return variable-scoped deep links — Extend dashboard search and handoff flows so agents and users can navigate to an existing dashboard with explicit varia…
- Quality and performance improvements — Ship measurable quality/performance improvements in agent prompt/workflow behavior tied to user-facing outcomes.
- v1.2 depth expansion — Deliver depth expansion in MCP tool execution model prioritized by observed usage and retention outcomes.
- v1.2 release and migration readiness — Prepare v1.2 release/migration readiness for eval and guardrail framework, including communication and upgrade guidance.
- Add business-semantic ambiguity slices to text-to-SQL benchmark Waiting on add-retrieval-side-gold-labels-to-text-to-sql-benchmark — Add benchmark cases where business meaning is ambiguous despite simple schema names such as gross versus net revenue or…
- Add schema-linking supervision to text-to-SQL benchmark cases Waiting on add-retrieval-side-gold-labels-to-text-to-sql-benchmark — Annotate question spans to expected tables, columns, metrics, and filters so retrieval and planning quality can be evaluat…
- Dashboard linking — canonical entity metadata (M5, only if needed) (Dashboard linking) — Revisit optional YAML metadata (canonical_for / entity+role) for default drill targets and agents. V1–M2 explicitly def…
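The variable-scoped deep links described above can be sketched as dashboard URLs with explicit variable bindings in the query string; the URL shape and the `var_` parameter prefix are assumptions for illustration only:

```python
# Sketch of a variable-scoped dashboard deep link: encode the dashboard
# path plus explicit variable bindings as query parameters so a handoff
# lands on the right dashboard with the right scope pre-applied.
from urllib.parse import urlencode

def dashboard_link(base_url: str, dashboard_path: str, variables: dict) -> str:
    # Sort bindings so the same scope always yields the same URL.
    params = {f"var_{name}": value for name, value in sorted(variables.items())}
    query = urlencode(params)
    return f"{base_url}/d/{dashboard_path}" + (f"?{query}" if query else "")
```

For example, `dashboard_link("https://cloud.example.com", "sales/pipeline", {"region": "EMEA"})` yields `https://cloud.example.com/d/sales/pipeline?var_region=EMEA`, which an agent can return instead of regenerating a dashboard that already exists.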
Long-horizon opportunities for AI agent tool interfaces, execution workflows, and eval-driven behavior tuning are captured as concrete hypotheses with user impact, prerequisites, and evaluation criteria. Ideas are ranked by strategic value and feasibility so future investment decisions can be made quickly with less rediscovery.
- Experiment design for future bets — Design validation experiments for eval and guardrail framework so future bets can be tested before major investment.
- Future opportunity research — Capture long-horizon opportunities for MCP tool execution model with user impact and strategic fit.
- Prerequisite and dependency mapping — Map enabling prerequisites and dependencies for agent prompt/workflow behavior to reduce future startup cost.
- Streaming YAML generation with early query execution — When AI generates a dashboard, stream the YAML and begin executing queries as they arrive rather than waiting for the f…
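The streaming-generation idea above hinges on detecting complete YAML blocks as they arrive so their queries can be dispatched early. A minimal sketch using a naive top-level-indentation heuristic, which is purely illustrative (a real implementation would need an incremental YAML parser):

```python
# Sketch of streaming dashboard generation: as YAML lines arrive from the
# model, each completed top-level block is yielded for early query
# execution instead of waiting for the full document.
def stream_blocks(lines):
    """Yield complete top-level blocks from a stream of YAML lines."""
    block = []
    for line in lines:
        top_level = line and not line.startswith((" ", "\t"))
        if top_level and block:
            yield "\n".join(block)  # previous block is complete; hand it off
            block = []
        if line.strip():
            block.append(line)
    if block:
        yield "\n".join(block)  # flush the final block at end of stream

chunks = ["charts:", "  - metric: revenue", "filters:", "  region: EMEA"]
blocks = list(stream_blocks(chunks))  # two blocks, available incrementally
```

Each yielded block could be handed to the query executor immediately, so by the time the model finishes the last section, earlier charts already have results.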