Dataface Tasks

Experiment: Layer scope all vs gold-only vs gold+silver

IDMCP_ANALYST_AGENT-EXPERIMENT_LAYER_SCOPE_ALL_VS_GOLD_ONLY_VS_GOLD_SILVER
Statusnot_started
Priorityp2
Milestonem2-internal-adoption-design-partners
Ownerdata-ai-engineer-architect
Initiativeai-quality-experimentation-and-context-optimization

Hypothesis

Reducing scope from all tables to gold-only or gold-plus-silver should reduce noise and hallucinated table usage. Gold-only may be the cleanest setting overall, but gold-plus-silver might preserve enough breadth for edge cases.

Method

Use the winning model and the chosen schema/tool configuration from the earlier waves, then vary only the exposed table scope across all, gold-only, and gold+silver. Run this together with the schema curation task so the results can directly inform which tables should be surfaced in the benchmark and catalog.

Variables

Variable Values
Independent table scope (all, gold-only, gold+silver)
Dependent pass rate, grounding failures, hallucinated tables, latency, token cost
Controlled model, prompt, canary subset, schema/tool strategy, scorer, seed, temperature

Execution Log

Run 1: all tables

  • Command: <exact command>
  • Config: fixed model, chosen schema/tool strategy, all tables visible
  • Output: <path to results JSONL>
  • Started: <timestamp>
  • Duration: <time>
  • Notes: <anything unexpected>

Run 2: gold-only

  • Command: <exact command>
  • Config: fixed model, chosen schema/tool strategy, gold-only tables visible
  • Output: <path to results JSONL>
  • Started: <timestamp>
  • Duration: <time>
  • Notes: <anything unexpected>

Run 3: gold+silver

  • Command: <exact command>
  • Config: fixed model, chosen schema/tool strategy, gold+silver tables visible
  • Output: <path to results JSONL>
  • Started: <timestamp>
  • Duration: <time>
  • Notes: <anything unexpected>

Results

Condition Pass Rate Avg Score Parse Grounding Intent Latency Tokens
all
gold-only
gold+silver

Breakdowns

Analysis

What do the results tell you? Was the hypothesis confirmed?

Conclusion

What's the decision? What changes, if any, should go to production?

Follow-up experiments