Skill and tool quality evaluation framework
Problem
We ship 7 product-facing skills (building-dataface-dashboards, dataface-dashboard-design, dataface-report-design, reviewing-dataface-dashboards, troubleshooting-dataface, configuring-dataface-mcp, dataface-skill-creator) alongside 6 MCP tools and 6 MCP resources. We have no systematic way to answer:
- Does each skill actually improve output? A skill written for an older model may actively constrain a newer, smarter one. Anthropic found improvement opportunities in 5 of their 6 internal skills when they A/B tested them.
- Do skills fire on the right requests? Description-based triggering is fragile — vague descriptions create "Silent Skills" that never fire, while overbroad ones hijack unrelated conversations.
- Do skills conflict with each other? With 7 product skills, trigger overlap is a real risk (e.g., "my chart isn't working" could match building-dataface-dashboards, troubleshooting-dataface, or reviewing-dataface-dashboards).
- Are tool descriptions optimal? MCP tool descriptions in tool_schemas.py drive how agents decide which tool to call. Suboptimal descriptions lead to wrong tool selection.
Without a quality framework, skill/tool quality degrades silently as models improve and the skill set grows.
Context
Existing infrastructure:
- alie-eval system (apps/a_lie/evals/) — generates dashboards from test prompts, screenshots, and LLM-scored review. Closest analog but evaluates end-to-end dashboard quality, not individual skill contribution.
- 7 product skills in dataface/ai/skills/ — all have YAML frontmatter with descriptions, anti-sycophancy tables, and progressive disclosure.
- 6 MCP tools defined in dataface/ai/tool_schemas.py — canonical tool definitions.
- 6 MCP resources defined in dataface/ai/mcp/server.py — dynamic data served on demand.
- dataface/ai/skills/dataface-skill-creator/SKILL.md — meta-skill with quality checklist for authoring skills.
Reference: The Ultimate Guide to Claude Skills — describes Skills 2.0 eval, A/B benchmarking, and description optimization patterns.
Related tasks:
- task-m2-agent-eval-loop-v1.md — general eval loop with internal analysts (broader scope)
- task-m2-internal-adoption-design-partners-03-quality-standards.md — quality guardrails (policy, not tooling)
- mcp-and-skills-auto-install-across-all-ai-clients.md — skill distribution (Phase 3: MCP Prompts)
Possible Solutions
Option A: Extend alie-eval with skill isolation [Recommended]
Add a --skills flag to the eval loop that runs each test prompt multiple times:
1. With all skills loaded (current behavior)
2. With no skills (raw MCP tools only)
3. With each skill individually
Score each run. Compare. Report which skills help, which hurt, which are neutral.
Trade-offs: Leverages existing eval infrastructure. Requires the eval runner to control which skills are loaded, which may need MCP server config changes. Most actionable output — directly identifies skills to retire, improve, or keep.
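The "compare and report" step of Option A reduces to a per-skill delta against the no-skills baseline. A minimal sketch, assuming LLM-scored quality in the 0–1 range; the function names and the 0.05 significance threshold are illustrative, not part of alie-eval:

```python
def classify_skill(baseline: float, with_skill: float, threshold: float = 0.05) -> str:
    """Label a skill by its quality delta versus the no-skills baseline run."""
    delta = with_skill - baseline
    if delta > threshold:
        return "helps"
    if delta < -threshold:
        return "hurts"
    return "neutral"

def compare_skills(baseline: float, skill_scores: dict) -> dict:
    """Map each skill to helps/hurts/neutral relative to the baseline run."""
    return {name: classify_skill(baseline, score) for name, score in skill_scores.items()}

# Example: baseline (raw MCP tools) scored 0.72 on one eval prompt.
report = compare_skills(0.72, {
    "building-dataface-dashboards": 0.81,
    "troubleshooting-dataface": 0.70,
    "reviewing-dataface-dashboards": 0.55,
})
```

A skill landing in "hurts" on most eval prompts is a retirement candidate (step 4 of Phase 1); "helps" skills are the high-value set to protect in CI.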
Option B: Dedicated trigger accuracy test suite
Maintain a fixture file of (prompt, expected_skill) pairs. For each prompt, check whether the agent selects the correct skill. Report false positives (wrong skill fires), false negatives (right skill doesn't fire), and conflicts (multiple skills compete).
Trade-offs: Focused on triggering accuracy only, not output quality. Faster to run. Requires mocking the skill selection mechanism, which varies by client. Good complement to Option A.
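The bookkeeping Option B describes can be sketched as a small tally function. The mock selector below stands in for the client's description-based matching (which this option would need to mock per client); all names are assumptions:

```python
def score_triggers(cases, select_skill):
    """Tally correct firings, false positives (wrong skill), and false negatives (no skill).

    cases: iterable of (prompt, expected_skill) pairs.
    select_skill: callable mapping a prompt to the fired skill name, or None.
    """
    tally = {"correct": 0, "false_positive": 0, "false_negative": 0}
    for prompt, expected in cases:
        fired = select_skill(prompt)
        if fired == expected:
            tally["correct"] += 1
        elif fired is None:
            tally["false_negative"] += 1
        else:
            tally["false_positive"] += 1
    return tally

# Toy stand-in for real description-based skill selection.
def mock_select(prompt):
    return "troubleshooting-dataface" if "broken" in prompt else None

tally = score_triggers(
    [("my chart is broken", "troubleshooting-dataface"),       # correct
     ("build me a dashboard", "building-dataface-dashboards"), # false negative
     ("something is broken", "reviewing-dataface-dashboards")],  # false positive
    mock_select,
)
```

Conflicts (multiple skills competing) would need the selector to return a set rather than a single name; the single-winner shape above is the simpler starting point.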
Option C: Description optimization via prompt testing
For each skill, generate 20+ prompts that should trigger it and 20+ that should not. Test trigger rates. Auto-suggest description improvements when trigger accuracy drops below threshold.
Trade-offs: Most surgical — only targets the description field. Cheap to run. Doesn't measure whether the skill's body instructions actually help output quality.
Recommended: A + B combined, C as a lightweight add-on
- A for quarterly skill audits: Do our skills actually improve dashboard quality?
- B for CI: Does every PR that changes a skill description maintain trigger accuracy?
- C as a just skill-optimize <name> command for ad-hoc description tuning.
Plan
Phase 1: Skill A/B benchmarking (extends alie-eval)
Files to modify:
- apps/a_lie/eval_runner.py — add --skills / --no-skills flags
- apps/a_lie/eval_rubric.md — add skill-attribution scoring criteria
- apps/a_lie/eval_review.py — add comparison report generation
Steps:
1. Add ability to run eval prompts with configurable skill sets
2. Run each prompt twice: with skills and without
3. Generate comparison report: per-skill delta in quality scores
4. Flag skills where raw Claude scores higher (candidates for retirement)
5. Flag skills where skill-loaded Claude scores much higher (high-value skills)
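Step 1's flag surface for eval_runner.py might look like the following argparse sketch; flag names match the plan, but the parser structure is an assumption about how the existing runner is wired:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Proposed flag additions for apps/a_lie/eval_runner.py (Phase 1, step 1)."""
    parser = argparse.ArgumentParser(prog="eval_runner")
    # --skills and --no-skills are contradictory, so make them exclusive.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--skills", nargs="*", metavar="NAME",
                       help="run with only the named skills loaded (default: all)")
    group.add_argument("--no-skills", action="store_true",
                       help="run with raw MCP tools only")
    parser.add_argument("--compare-skills", action="store_true",
                        help="run with/without each skill and report quality deltas")
    return parser

args = build_parser().parse_args(["--no-skills"])
```

`--compare-skills` deliberately sits outside the exclusive group so it can drive both run modes internally.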
Acceptance criteria:
- just alie-eval --compare-skills produces a report showing per-skill quality delta
- Can identify whether each skill helps, hurts, or is neutral for each eval prompt
Phase 2: Trigger accuracy test suite
Files to create:
- tests/ai/test_skill_triggers.py — trigger accuracy fixtures
- dataface/ai/skills/trigger_fixtures.yaml — (prompt, expected_skill, not_expected_skills) pairs
Steps:
1. Define 10+ prompts per skill with expected trigger mappings
2. Define 10+ negative prompts per skill that should NOT trigger it
3. Test that description-based matching selects correctly
4. Add to CI as a lightweight check
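A skeleton of what tests/ai/test_skill_triggers.py could assert. The fixture entries, keyword table, and trivial matcher are placeholders for illustration only; the real test would load dataface/ai/skills/trigger_fixtures.yaml and exercise actual description-based selection:

```python
# Inlined stand-in for trigger_fixtures.yaml entries.
FIXTURES = [
    {"prompt": "my dashboard chart is broken", "expected_skill": "troubleshooting-dataface"},
    {"prompt": "design a new sales dashboard", "expected_skill": "building-dataface-dashboards"},
]

# Toy keyword matcher standing in for real description-based triggering.
KEYWORDS = {
    "troubleshooting-dataface": ("broken", "error", "isn't working"),
    "building-dataface-dashboards": ("design", "build", "create"),
}

def select_skill(prompt):
    """Return the first skill whose keywords appear in the prompt, else None."""
    lowered = prompt.lower()
    for skill, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return skill
    return None

def test_trigger_fixtures():
    for case in FIXTURES:
        assert select_skill(case["prompt"]) == case["expected_skill"]
```

Keeping fixtures in YAML (step 1–2) rather than hard-coded lets skill authors add cases without touching test code.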
Phase 3: Description optimizer
Files to create:
- scripts/skill-optimize — CLI tool
Steps:
1. For a given skill, generate diverse trigger prompts programmatically
2. Test current description against them
3. Score trigger accuracy
4. Suggest revised description text if accuracy is below 80%
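The accuracy gate in steps 3–4 is simple to pin down: accuracy over both should-trigger and should-not-trigger prompts, compared to the 80% threshold from the plan. A sketch with assumed function names:

```python
def trigger_accuracy(positives, negatives):
    """positives: whether the skill fired on each should-trigger prompt;
    negatives: whether it fired on each should-NOT-trigger prompt."""
    hits = sum(positives) + sum(not fired for fired in negatives)
    return hits / (len(positives) + len(negatives))

def needs_revision(accuracy, threshold=0.80):
    """True when scripts/skill-optimize should propose new description text."""
    return accuracy < threshold

# 3/4 positives fired, 1/4 negatives misfired -> 6 of 8 correct.
acc = trigger_accuracy([True, True, True, False], [False, False, True, False])
```

Counting false positives and false negatives together keeps an overbroad description (the "hijacking" failure mode) from scoring as well as a precise one.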
Implementation Progress
Not yet started.
Review Feedback
- [ ] Review cleared