Add eval loop for dashboard search and variable-scoped navigation
Problem
We are about to make dashboard search more powerful by returning variable-scoped deep links, but we have no disciplined way to measure whether those changes actually improve outcomes.
Basic retrieval metrics are not enough here. A search result can be "correct" at the dashboard title level and still fail the real task if:
- it picks a dashboard that cannot express the requested scope through its variables
- it resolves the wrong variable values
- it silently drops a requested filter
- it generates a broken or lossy deep link
- it fails to explain partial or unsupported bindings
For this feature area, the real user task is:
"Given a natural-language request, can the system send me to the right dashboard already scoped to the right state?"
That means we need an eval loop that measures the full navigation handoff:
- dashboard retrieval quality
- variable-binding accuracy
- deep-link generation correctness
- partial-match behavior
- regression trends over time
Without this, improvements to search_dashboards, ranking, binding inference, or URL serialization will be judged anecdotally and are likely to regress quietly.
Context
Relevant existing work:
- tasks/workstreams/mcp-analyst-agent/tasks/expand-dashboard-search-to-return-variable-scoped-deep-links.md — feature task this eval should guard
- tasks/workstreams/mcp-analyst-agent/tasks/add-catalog-discovery-evals-derived-from-sql-benchmark.md — existing retrieval-eval pattern
- tasks/workstreams/mcp-analyst-agent/tasks/task-m2-agent-eval-loop-v1.md — broader end-to-end agent eval framing
- apps/evals/ — unified home for eval infrastructure and leaderboard dashboards
Relationship to the feature task:
- the deep-link task defines the product behavior we want
- this task exists to make that behavior measurable before and after ranking, binding, or URL-contract changes
Why catalog retrieval eval is not sufficient:
- catalog eval asks "did we retrieve the right tables?"
- this task asks "did we retrieve the right dashboard and encode the right dashboard state?"
Those are related but distinct problems. Dashboard search includes a retrieval layer plus a parameter-resolution layer.
Likely product boundary under test:
- dataface/ai/mcp/search.py — dashboard search and ranking
- dataface/ai/mcp/tools.py — tool-level result shape
- dashboard variable metadata extraction / indexing
- URL or route-state serialization for variable-scoped navigation
- receiving dashboard route behavior in Cloud
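One contract worth pinning down early is that variable bindings must survive a URL round trip, since "deep-link validity" is only testable against some encoding convention. A minimal stdlib sketch; the `var.<name>` query-parameter convention here is a hypothetical placeholder, not the real route contract:

```python
from urllib.parse import urlencode, urlsplit, parse_qsl

# Hypothetical convention: each dashboard variable is encoded as var.<name>=<value>.
def encode_deep_link(base_url: str, bindings: dict[str, str]) -> str:
    query = urlencode({f"var.{name}": value for name, value in sorted(bindings.items())})
    return f"{base_url}?{query}" if query else base_url

def decode_deep_link(url: str) -> dict[str, str]:
    # Recover only the var.* parameters; anything else on the URL is ignored.
    pairs = parse_qsl(urlsplit(url).query)
    return {key[len("var."):]: value for key, value in pairs if key.startswith("var.")}
```

A scorer can then check lossless round-trips (`decode(encode(b)) == b`) without any browser involvement.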
Eval case shape we need:
Each case should include:
- user request text
- expected dashboard slug or allowed dashboard set
- expected variable bindings
- optional allowed partial bindings
- expected failure mode when full mapping is impossible
Example:
{
  "prompt": "Open the renewal dashboard for Acme for Q4 2025",
  "expected_dashboard": "renewal-risk",
  "expected_bindings": {
    "customer": "Acme",
    "quarter": "2025-Q4"
  },
  "allowed_partial": [],
  "notes": "Should not choose account-overview even though it mentions renewal"
}
Scoring needs to be multi-part:
- top-k dashboard retrieval
- exact dashboard hit
- variable binding exact match
- per-variable precision/recall
- deep-link validity / round-trip decode
- explanation quality for partial or unsupported cases
Constraints:
- no hidden magic: the eval should reward explicit, declared variable mappings only
- support deterministic scoring where possible
- do not require full browser QA for every run; the core loop should stay cheap enough for regular regression runs
- reserve slower browser validation for a small smoke subset if needed
Possible Solutions
Option A: Standalone dashboard-search eval suite under apps/evals/ [Recommended]
Create a dedicated eval package, likely apps/evals/dashboard_search/, with:
- curated JSONL dataset
- runner that calls dashboard search / deep-link resolution
- deterministic scorer for retrieval + binding correctness
- per-run summaries and leaderboard-ready outputs
Pros: Clean boundary, repeatable, cheap to run, directly comparable across search/ranking changes.
Cons: Requires creating a new eval corpus rather than reusing the SQL benchmark wholesale.
Option B: Fold into existing catalog retrieval eval
Extend catalog eval to also evaluate dashboard search and variable bindings.
Pros: Fewer top-level eval packages.
Cons: Wrong abstraction. Table retrieval and dashboard navigation have different ground truth, different scorers, and different failure modes.
Option C: Only use end-to-end agent evals
Rely on broad agent eval prompts and judge whether the final user experience feels right.
Pros: Closest to real usage.
Cons: Too slow and too noisy. Hard to pinpoint whether a regression came from retrieval, binding inference, link serialization, or dashboard rendering.
Plan
Recommended approach: add a dedicated dashboard-search eval loop under apps/evals/ with deterministic scoring for retrieval and variable resolution.
Dataset
Create a small but high-signal benchmark corpus for dashboard navigation requests:
- start with 25-50 hand-authored cases
- cover direct dashboard lookup, customer/entity scoping, time scoping, multi-variable requests, ambiguous prompts, and unsupported-filter cases
- include negative/edge cases where the correct behavior is a partial match or an explicit failure
Store as JSONL under something like:
- apps/evals/data/dashboard_search_cases.jsonl
Runner
Add a runner, likely:
- apps/evals/dashboard_search/runner.py
Responsibilities:
- load eval cases
- call the dashboard search / deep-link resolution surface under test
- capture returned dashboard candidates, bindings, reasons, and open_url
- write per-case JSONL output under apps/evals/output/dashboard_search/
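Those runner responsibilities can be sketched as a small loop decoupled from the search surface itself. `search_fn` below stands in for the real dashboard-search call, and its result shape (candidates, bindings, open_url) is assumed from this task description, not an existing API:

```python
import json
from pathlib import Path
from typing import Callable

def run_cases(cases_path: Path, out_path: Path, search_fn: Callable[[str], dict]) -> int:
    # Load JSONL cases, call the surface under test, and write one result record per case.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with cases_path.open() as cases, out_path.open("w") as out:
        for line in cases:
            if not line.strip():
                continue
            case = json.loads(line)
            result = search_fn(case["prompt"])  # assumed shape: candidates, bindings, open_url
            out.write(json.dumps({"case": case, "result": result}) + "\n")
            count += 1
    return count
```

Keeping the surface injected this way means the same runner can later target a mocked search layer in tests or the live MCP tool.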
Scorer
Add deterministic scorer logic, likely:
- apps/evals/dashboard_search/scorer.py
Suggested metrics:
- dashboard_hit_at_k
- dashboard_mrr
- exact_dashboard_match_rate
- binding_exact_match_rate
- binding_field_precision
- binding_field_recall
- deep_link_valid_rate
- partial_match_explanation_rate
For unsupported or impossible cases:
- score success when the system explicitly reports the unresolved constraint instead of pretending it succeeded
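The retrieval and binding metrics above are fully deterministic and can be computed per case without any judge. A sketch of the core scoring functions, assuming the runner captures a ranked candidate list and a flat name-to-value binding dict:

```python
def dashboard_metrics(expected: str, ranked: list[str], k: int = 5) -> dict[str, float]:
    # hit@k and reciprocal rank against the ranked dashboard candidates.
    hit = 1.0 if expected in ranked[:k] else 0.0
    rr = 1.0 / (ranked.index(expected) + 1) if expected in ranked else 0.0
    return {"dashboard_hit_at_k": hit, "dashboard_mrr": rr}

def binding_metrics(expected: dict[str, str], actual: dict[str, str]) -> dict[str, float]:
    # A binding counts as correct only when both the variable name and its value match.
    correct = sum(1 for name, value in actual.items() if expected.get(name) == value)
    precision = correct / len(actual) if actual else 1.0
    recall = correct / len(expected) if expected else 1.0
    return {
        "binding_exact_match_rate": 1.0 if expected == actual else 0.0,
        "binding_field_precision": precision,
        "binding_field_recall": recall,
    }
```

Note the asymmetry this buys: a system that silently drops a requested filter loses recall but keeps precision, which is exactly the silent-fallback failure mode the worksheet also probes.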
Shared reporting
Emit leaderboard-ready summaries:
- overall metrics
- breakdown by case type: entity filter, time filter, multi-variable, ambiguity, unsupported
- trendable run metadata so search/ranking changes can be compared over time
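The per-case records can be rolled up into those summaries with a simple mean-by-case-type aggregation. A sketch, assuming each per-case record carries a `case_type` label alongside numeric metric fields (both names are assumptions about the record shape):

```python
from collections import defaultdict

def summarize(per_case: list[dict]) -> dict:
    overall: dict[str, list[float]] = defaultdict(list)
    by_type: dict[str, dict[str, list[float]]] = defaultdict(lambda: defaultdict(list))
    for record in per_case:
        case_type = record.get("case_type", "unknown")
        for name, value in record.items():
            # Aggregate only numeric metrics; skip labels and booleans.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                overall[name].append(float(value))
                by_type[case_type][name].append(float(value))
    def mean(values: list[float]) -> float:
        return sum(values) / len(values)
    return {
        "overall": {name: mean(vals) for name, vals in overall.items()},
        "by_case_type": {
            t: {name: mean(vals) for name, vals in metrics.items()}
            for t, metrics in by_type.items()
        },
    }
```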
Reviewer worksheet
Use a short templated checklist for the judgment-heavy parts of the eval. Deterministic scoring should remain the primary signal for retrieval, binding, and link validity, but reviewers should fill out a stable worksheet for cases where usefulness and trust matter more than exact-match logic.
The worksheet should be:
- short
- mostly yes/no or enum fields
- stable across runs
- paired with 1-3 short rationale fields
Suggested checklist fields:
- intent_understood — did the system appear to understand the request?
- correct_dashboard_selected — did it pick the right dashboard?
- better_dashboard_existed — was there a clearly better candidate it missed?
- entity_scope_applied — customer/account/entity filter applied correctly
- time_scope_applied — time filter/range applied correctly
- multi_variable_scope_preserved — multiple requested constraints survived together
- unsupported_constraints_reported — unsupported requests were surfaced clearly
- silent_fallback_present — any requested scope was dropped without clear explanation
- deep_link_useful — link is not just valid, but useful to a real user
- applied_scope_visible — receiving surface makes the active scope visible
- trustworthy_explanation — explanation of success/partial/failure is believable
- reviewer_verdict — pass, partial, or fail
Required short-text fields:
- primary_failure_reason
- suggested_fix_category
- review_notes
Proposed reviewer output schema
{
  "case_id": "dashboard_search_001",
  "run_id": "2026-03-18T18-10-00Z",
  "reviewer": "agent",
  "intent_understood": true,
  "correct_dashboard_selected": true,
  "better_dashboard_existed": false,
  "entity_scope_applied": true,
  "time_scope_applied": true,
  "multi_variable_scope_preserved": true,
  "unsupported_constraints_reported": true,
  "silent_fallback_present": false,
  "deep_link_useful": true,
  "applied_scope_visible": true,
  "trustworthy_explanation": true,
  "reviewer_verdict": "pass",
  "primary_failure_reason": "",
  "suggested_fix_category": "",
  "review_notes": "Good match. Correct dashboard and bindings were visible on load."
}
Suggested enums:
- reviewer_verdict: pass, partial, fail
- suggested_fix_category: ranking, binding_inference, serialization, receiving_route, ux_visibility, explanation, other
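Because worksheet rows may come from agents as well as humans, schema validation should be mechanical. A sketch of a validator over the fields and enums above; the rule that non-pass verdicts require a `primary_failure_reason` is an assumption, not stated policy:

```python
REVIEWER_VERDICTS = {"pass", "partial", "fail"}
FIX_CATEGORIES = {"ranking", "binding_inference", "serialization",
                  "receiving_route", "ux_visibility", "explanation", "other"}
BOOL_FIELDS = {
    "intent_understood", "correct_dashboard_selected", "better_dashboard_existed",
    "entity_scope_applied", "time_scope_applied", "multi_variable_scope_preserved",
    "unsupported_constraints_reported", "silent_fallback_present",
    "deep_link_useful", "applied_scope_visible", "trustworthy_explanation",
}

def validate_worksheet(row: dict) -> list[str]:
    # Return a list of problems; an empty list means the row is well-formed.
    errors = []
    for name in sorted(BOOL_FIELDS):
        if not isinstance(row.get(name), bool):
            errors.append(f"{name}: expected a boolean")
    if row.get("reviewer_verdict") not in REVIEWER_VERDICTS:
        errors.append("reviewer_verdict: must be pass, partial, or fail")
    elif row["reviewer_verdict"] != "pass" and not row.get("primary_failure_reason"):
        # Assumed rule: any non-pass verdict must name a failure reason.
        errors.append("primary_failure_reason: required for non-pass verdicts")
    fix = row.get("suggested_fix_category", "")
    if fix and fix not in FIX_CATEGORIES:
        errors.append(f"suggested_fix_category: unknown value {fix!r}")
    return errors
```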
The runner should support:
- deterministic-only mode
- deterministic plus worksheet mode
- storing worksheet outputs separately from the raw runner outputs so human/agent review can be rerun without recomputing the base retrieval step
Optional browser smoke slice
For a tiny subset of cases, add an optional smoke check that:
- opens the returned URL
- verifies the dashboard route loads
- verifies applied variables are visible in the UI
This should be a small follow-on or optional mode, not required for the main cheap loop.
Files likely involved
- apps/evals/dashboard_search/runner.py
- apps/evals/dashboard_search/scorer.py
- apps/evals/dashboard_search/types.py
- apps/evals/data/dashboard_search_cases.jsonl
- apps/evals/output/dashboard_search/ (gitignored output)
- tests/evals/dashboard_search/
- unified CLI wiring in apps/evals/__main__.py
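The CLI wiring can stay thin: a subcommand plus the mode and smoke flags described above. A hypothetical argparse sketch; the flag names and defaults are illustrative, not an existing interface:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Illustrative wiring for something like `python -m evals dashboard-search`.
    parser = argparse.ArgumentParser(prog="evals")
    sub = parser.add_subparsers(dest="command", required=True)
    ds = sub.add_parser("dashboard-search", help="run the dashboard-search eval loop")
    ds.add_argument("--cases", default="apps/evals/data/dashboard_search_cases.jsonl")
    ds.add_argument("--out", default="apps/evals/output/dashboard_search/")
    ds.add_argument("--mode", choices=["deterministic", "deterministic+worksheet"],
                    default="deterministic")
    ds.add_argument("--smoke", action="store_true",
                    help="also run the small browser smoke slice")
    return parser
```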
Implementation steps
- Define the eval case schema and write the initial benchmark corpus.
- Build a runner around the current dashboard search surface.
- Build deterministic scoring for dashboard retrieval, variable binding, and deep-link validity.
- Add reviewer worksheet schema, storage format, and a first-pass agent-review prompt/template.
- Add CLI entrypoint and output summaries.
- Add focused tests for scorer edge cases, worksheet schema validation, and case parsing.
- Optionally add a small browser-smoke mode once the receiving dashboard route behavior stabilizes.
Relationship to the feature task
This task is the regression harness for:
- expand-dashboard-search-to-return-variable-scoped-deep-links.md
The feature task changes behavior. This eval task makes that behavior measurable and keeps it from drifting.
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A for primary implementation. Main validation should be deterministic runner/scorer tests plus manual spot-checks of a few representative cases. If a browser smoke slice is added later, validate only a small curated subset.
Review Feedback
- [ ] Review cleared