Add eval loop for dashboard search and variable-scoped navigation
Problem
We are about to make dashboard search more powerful by returning variable-scoped deep links, but we have no disciplined way to measure whether those changes actually improve outcomes.
Basic retrieval metrics are not enough here. A search result can be "correct" at the dashboard title level and still fail the real task if:
- it picks a dashboard that cannot express the requested scope through its variables
- it resolves the wrong variable values
- it silently drops a requested filter
- it generates a broken or lossy deep link
- it fails to explain partial or unsupported bindings
For this feature area, the real user task is:
"Given a natural-language request, can the system send me to the right dashboard already scoped to the right state?"
That means we need an eval loop that measures the full navigation handoff:
- dashboard retrieval quality
- variable-binding accuracy
- deep-link generation correctness
- partial-match behavior
- regression trends over time
Without this, improvements to search_dashboards, ranking, binding inference, or URL serialization will be judged anecdotally and are likely to regress quietly.
Context
Relevant existing work:
- tasks/workstreams/mcp-analyst-agent/tasks/expand-dashboard-search-to-return-variable-scoped-deep-links.md — feature task this eval should guard
- tasks/workstreams/mcp-analyst-agent/tasks/add-catalog-discovery-evals-derived-from-sql-benchmark.md — existing retrieval-eval pattern
- tasks/workstreams/mcp-analyst-agent/tasks/task-m2-agent-eval-loop-v1.md — broader end-to-end agent eval framing
- apps/evals/ — unified home for eval infrastructure and leaderboard dashboards
Relationship to the feature task:
- the deep-link task defines the product behavior we want
- this task exists to make that behavior measurable before and after ranking, binding, or URL-contract changes
Why catalog retrieval eval is not sufficient:
- catalog eval asks "did we retrieve the right tables?"
- this task asks "did we retrieve the right dashboard and encode the right dashboard state?"
Those are related but distinct problems. Dashboard search includes a retrieval layer plus a parameter-resolution layer.
Likely product boundary under test:
- dataface/ai/mcp/search.py — dashboard search and ranking
- dataface/ai/mcp/tools.py — tool-level result shape
- dashboard variable metadata extraction / indexing
- URL or route-state serialization for variable-scoped navigation
- receiving dashboard route behavior in Cloud
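One contract worth pinning down early is that variable bindings must survive a URL round trip, since "deep-link validity" is only testable against some encoding convention. A minimal stdlib sketch; the `var.<name>` query-parameter convention here is a hypothetical placeholder, not the real route contract:

```python
from urllib.parse import urlencode, urlsplit, parse_qsl

# Hypothetical convention: each dashboard variable is encoded as var.<name>=<value>.
def encode_deep_link(base_url: str, bindings: dict[str, str]) -> str:
    query = urlencode({f"var.{name}": value for name, value in sorted(bindings.items())})
    return f"{base_url}?{query}" if query else base_url

def decode_deep_link(url: str) -> dict[str, str]:
    # Recover only the var.* parameters; anything else on the URL is ignored.
    pairs = parse_qsl(urlsplit(url).query)
    return {key[len("var."):]: value for key, value in pairs if key.startswith("var.")}
```

A scorer can then check lossless round-trips (`decode(encode(b)) == b`) without any browser involvement.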
Eval case shape we need:
Each case should include:
- user request text
- expected dashboard slug or allowed dashboard set
- expected variable bindings
- optional allowed partial bindings
- expected failure mode when full mapping is impossible
Example:
{
  "prompt": "Open the renewal dashboard for Acme for Q4 2025",
  "expected_dashboard": "renewal-risk",
  "expected_bindings": {
    "customer": "Acme",
    "quarter": "2025-Q4"
  },
  "allowed_partial": [],
  "notes": "Should not choose account-overview even though it mentions renewal"
}
Scoring needs to be multi-part:
- top-k dashboard retrieval
- exact dashboard hit
- variable binding exact match
- per-variable precision/recall
- deep-link validity / round-trip decode
- explanation quality for partial or unsupported cases
Constraints:
- no hidden magic: the eval should reward explicit, declared variable mappings only
- support deterministic scoring where possible
- do not require full browser QA for every run; the core loop should stay cheap enough for regular regression runs
- reserve slower browser validation for a small smoke subset if needed
Possible Solutions
Option A: Standalone dashboard-search eval suite under apps/evals/ [Recommended]
Create a dedicated eval package, likely apps/evals/dashboard_search/, with:
- curated JSONL dataset
- runner that calls dashboard search / deep-link resolution
- deterministic scorer for retrieval + binding correctness
- per-run summaries and leaderboard-ready outputs
Pros: Clean boundary, repeatable, cheap to run, directly comparable across search/ranking changes.
Cons: Requires creating a new eval corpus rather than reusing the SQL benchmark wholesale.
Option B: Fold into existing catalog retrieval eval
Extend catalog eval to also evaluate dashboard search and variable bindings.
Pros: Fewer top-level eval packages.
Cons: Wrong abstraction. Table retrieval and dashboard navigation have different ground truth, different scorers, and different failure modes.
Option C: Only use end-to-end agent evals
Rely on broad agent eval prompts and judge whether the final user experience feels right.
Pros: Closest to real usage.
Cons: Too slow and too noisy. Hard to pinpoint whether a regression came from retrieval, binding inference, link serialization, or dashboard rendering.
Plan
Recommended approach: add a dedicated dashboard-search eval loop under apps/evals/ with deterministic scoring for retrieval and variable resolution.
Dataset
Create a small but high-signal benchmark corpus for dashboard navigation requests:
- start with 25-50 hand-authored cases
- cover direct dashboard lookup, customer/entity scoping, time scoping, multi-variable requests, ambiguous prompts, and unsupported-filter cases
- include negative/edge cases where the correct behavior is a partial match or an explicit failure
Store as JSONL under something like:
- apps/evals/data/dashboard_search_cases.jsonl
Runner
Add a runner, likely:
- apps/evals/dashboard_search/runner.py
Responsibilities:
- load eval cases
- call the dashboard search / deep-link resolution surface under test
- capture returned dashboard candidates, bindings, reasons, and open_url
- write per-case JSONL output under apps/evals/output/dashboard_search/
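Those runner responsibilities can be sketched as a small loop decoupled from the search surface itself. `search_fn` below stands in for the real dashboard-search call, and its result shape (candidates, bindings, open_url) is assumed from this task description, not an existing API:

```python
import json
from pathlib import Path
from typing import Callable

def run_cases(cases_path: Path, out_path: Path, search_fn: Callable[[str], dict]) -> int:
    # Load JSONL cases, call the surface under test, and write one result record per case.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with cases_path.open() as cases, out_path.open("w") as out:
        for line in cases:
            if not line.strip():
                continue
            case = json.loads(line)
            result = search_fn(case["prompt"])  # assumed shape: candidates, bindings, open_url
            out.write(json.dumps({"case": case, "result": result}) + "\n")
            count += 1
    return count
```

Keeping the surface injected this way means the same runner can later target a mocked search layer in tests or the live MCP tool.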
Scorer
Add deterministic scorer logic, likely:
- apps/evals/dashboard_search/scorer.py
Suggested metrics:
- dashboard_hit_at_k
- dashboard_mrr
- exact_dashboard_match_rate
- binding_exact_match_rate
- binding_field_precision
- binding_field_recall
- deep_link_valid_rate
- partial_match_explanation_rate
For unsupported or impossible cases:
- score success when the system explicitly reports the unresolved constraint instead of pretending it succeeded
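The retrieval and binding metrics above are fully deterministic and can be computed per case without any judge. A sketch of the core scoring functions, assuming the runner captures a ranked candidate list and a flat name-to-value binding dict:

```python
def dashboard_metrics(expected: str, ranked: list[str], k: int = 5) -> dict[str, float]:
    # hit@k and reciprocal rank against the ranked dashboard candidates.
    hit = 1.0 if expected in ranked[:k] else 0.0
    rr = 1.0 / (ranked.index(expected) + 1) if expected in ranked else 0.0
    return {"dashboard_hit_at_k": hit, "dashboard_mrr": rr}

def binding_metrics(expected: dict[str, str], actual: dict[str, str]) -> dict[str, float]:
    # A binding counts as correct only when both the variable name and its value match.
    correct = sum(1 for name, value in actual.items() if expected.get(name) == value)
    precision = correct / len(actual) if actual else 1.0
    recall = correct / len(expected) if expected else 1.0
    return {
        "binding_exact_match_rate": 1.0 if expected == actual else 0.0,
        "binding_field_precision": precision,
        "binding_field_recall": recall,
    }
```

Note the asymmetry this buys: a system that silently drops a requested filter loses recall but keeps precision, which is exactly the silent-fallback failure mode the worksheet also probes.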
Shared reporting
Emit leaderboard-ready summaries:
- overall metrics
- breakdown by case type: entity filter, time filter, multi-variable, ambiguity, unsupported
- trendable run metadata so search/ranking changes can be compared over time
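The per-case records can be rolled up into those summaries with a simple mean-by-case-type aggregation. A sketch, assuming each per-case record carries a `case_type` label alongside numeric metric fields (both names are assumptions about the record shape):

```python
from collections import defaultdict

def summarize(per_case: list[dict]) -> dict:
    overall: dict[str, list[float]] = defaultdict(list)
    by_type: dict[str, dict[str, list[float]]] = defaultdict(lambda: defaultdict(list))
    for record in per_case:
        case_type = record.get("case_type", "unknown")
        for name, value in record.items():
            # Aggregate only numeric metrics; skip labels and booleans.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                overall[name].append(float(value))
                by_type[case_type][name].append(float(value))
    def mean(values: list[float]) -> float:
        return sum(values) / len(values)
    return {
        "overall": {name: mean(vals) for name, vals in overall.items()},
        "by_case_type": {
            t: {name: mean(vals) for name, vals in metrics.items()}
            for t, metrics in by_type.items()
        },
    }
```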
Reviewer worksheet
Use a short templated checklist for the judgment-heavy parts of the eval. Deterministic scoring should remain the primary signal for retrieval, binding, and link validity, but reviewers should fill out a stable worksheet for cases where usefulness and trust matter more than exact-match logic.
The worksheet should be:
- short
- mostly yes/no or enum fields
- stable across runs
- paired with 1-3 short rationale fields
Suggested checklist fields:
- intent_understood — did the system appear to understand the request?
- correct_dashboard_selected — did it pick the right dashboard?
- better_dashboard_existed — was there a clearly better candidate it missed?
- entity_scope_applied — customer/account/entity filter applied correctly
- time_scope_applied — time filter/range applied correctly
- multi_variable_scope_preserved — multiple requested constraints survived together
- unsupported_constraints_reported — unsupported requests were surfaced clearly
- silent_fallback_present — any requested scope was dropped without clear explanation
- deep_link_useful — link is not just valid, but useful to a real user
- applied_scope_visible — receiving surface makes the active scope visible
- trustworthy_explanation — explanation of success/partial/failure is believable
- reviewer_verdict — pass, partial, or fail
Required short-text fields:
- primary_failure_reason
- suggested_fix_category
- review_notes
Proposed reviewer output schema
{
  "case_id": "dashboard_search_001",
  "run_id": "2026-03-18T18-10-00Z",
  "reviewer": "agent",
  "intent_understood": true,
  "correct_dashboard_selected": true,
  "better_dashboard_existed": false,
  "entity_scope_applied": true,
  "time_scope_applied": true,
  "multi_variable_scope_preserved": true,
  "unsupported_constraints_reported": true,
  "silent_fallback_present": false,
  "deep_link_useful": true,
  "applied_scope_visible": true,
  "trustworthy_explanation": true,
  "reviewer_verdict": "pass",
  "primary_failure_reason": "",
  "suggested_fix_category": "",
  "review_notes": "Good match. Correct dashboard and bindings were visible on load."
}
Suggested enums:
- reviewer_verdict: pass, partial, fail
- suggested_fix_category: ranking, binding_inference, serialization, receiving_route, ux_visibility, explanation, other
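Because worksheet rows may come from agents as well as humans, schema validation should be mechanical. A sketch of a validator over the fields and enums above; the rule that non-pass verdicts require a `primary_failure_reason` is an assumption, not stated policy:

```python
REVIEWER_VERDICTS = {"pass", "partial", "fail"}
FIX_CATEGORIES = {"ranking", "binding_inference", "serialization",
                  "receiving_route", "ux_visibility", "explanation", "other"}
BOOL_FIELDS = {
    "intent_understood", "correct_dashboard_selected", "better_dashboard_existed",
    "entity_scope_applied", "time_scope_applied", "multi_variable_scope_preserved",
    "unsupported_constraints_reported", "silent_fallback_present",
    "deep_link_useful", "applied_scope_visible", "trustworthy_explanation",
}

def validate_worksheet(row: dict) -> list[str]:
    # Return a list of problems; an empty list means the row is well-formed.
    errors = []
    for name in sorted(BOOL_FIELDS):
        if not isinstance(row.get(name), bool):
            errors.append(f"{name}: expected a boolean")
    if row.get("reviewer_verdict") not in REVIEWER_VERDICTS:
        errors.append("reviewer_verdict: must be pass, partial, or fail")
    elif row["reviewer_verdict"] != "pass" and not row.get("primary_failure_reason"):
        # Assumed rule: any non-pass verdict must name a failure reason.
        errors.append("primary_failure_reason: required for non-pass verdicts")
    fix = row.get("suggested_fix_category", "")
    if fix and fix not in FIX_CATEGORIES:
        errors.append(f"suggested_fix_category: unknown value {fix!r}")
    return errors
```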
The runner should support:
- deterministic-only mode
- deterministic plus worksheet mode
- storing worksheet outputs separately from the raw runner outputs so human/agent review can be rerun without recomputing the base retrieval step
Optional browser smoke slice
For a tiny subset of cases, add an optional smoke check that:
- opens the returned URL
- verifies the dashboard route loads
- verifies applied variables are visible in the UI
This should be a small follow-on or optional mode, not required for the main cheap loop.
Files likely involved
- apps/evals/dashboard_search/runner.py
- apps/evals/dashboard_search/scorer.py
- apps/evals/dashboard_search/types.py
- apps/evals/data/dashboard_search_cases.jsonl
- apps/evals/output/dashboard_search/ (gitignored output)
- tests/evals/dashboard_search/
- unified CLI wiring in apps/evals/__main__.py
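The CLI wiring can stay thin: a subcommand plus the mode and smoke flags described above. A hypothetical argparse sketch; the flag names and defaults are illustrative, not an existing interface:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Illustrative wiring for something like `python -m evals dashboard-search`.
    parser = argparse.ArgumentParser(prog="evals")
    sub = parser.add_subparsers(dest="command", required=True)
    ds = sub.add_parser("dashboard-search", help="run the dashboard-search eval loop")
    ds.add_argument("--cases", default="apps/evals/data/dashboard_search_cases.jsonl")
    ds.add_argument("--out", default="apps/evals/output/dashboard_search/")
    ds.add_argument("--mode", choices=["deterministic", "deterministic+worksheet"],
                    default="deterministic")
    ds.add_argument("--smoke", action="store_true",
                    help="also run the small browser smoke slice")
    return parser
```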
Implementation steps
- Define the eval case schema and write the initial benchmark corpus.
- Build a runner around the current dashboard search surface.
- Build deterministic scoring for dashboard retrieval, variable binding, and deep-link validity.
- Add reviewer worksheet schema, storage format, and a first-pass agent-review prompt/template.
- Add CLI entrypoint and output summaries.
- Add focused tests for scorer edge cases, worksheet schema validation, and case parsing.
- Optionally add a small browser-smoke mode once the receiving dashboard route behavior stabilizes.
Relationship to the feature task
This task is the regression harness for:
- expand-dashboard-search-to-return-variable-scoped-deep-links.md
The feature task changes behavior. This eval task makes that behavior measurable and keeps it from drifting.
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A for primary implementation. Main validation should be deterministic runner/scorer tests plus manual spot-checks of a few representative cases. If a browser smoke slice is added later, validate only a small curated subset.
Review Feedback
- [ ] Review cleared