# Experiment: Layer scope: all vs gold-only vs gold+silver
## Hypothesis

Reducing scope from all tables to gold-only or gold+silver should reduce noise and hallucinated table usage. Gold-only may be the cleanest setting overall, but gold+silver might preserve enough breadth for edge cases.
## Method

Use the winning model and the chosen schema/tool configuration from the earlier waves, then vary only the exposed table scope across all, gold-only, and gold+silver. Run this together with the schema curation task so the results can directly inform which tables should be surfaced in the benchmark and catalog.
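The sweep above can be sketched as a small harness that holds everything fixed except the scope. This is a minimal sketch only: `run_eval`, its flags (`--table-scope`, `--schema-strategy`, etc.), and the model/strategy names are hypothetical stand-ins, since the actual command is elided in the log.

```python
# Hypothetical sweep driver: builds one argv per scope condition while
# holding model, strategy, seed, and temperature fixed (the controlled
# variables). The CLI name and flag names are assumptions, not the real tool.
import shlex

SCOPES = ["all", "gold-only", "gold+silver"]

def build_command(scope: str, out_dir: str = "results") -> list[str]:
    """Build the argv for one run; only the table scope varies."""
    cmd = (
        f"run_eval --model winning-model --schema-strategy chosen "
        f"--table-scope {scope} --seed 0 --temperature 0 "
        f"--out {out_dir}/scope_{scope}.jsonl"
    )
    return shlex.split(cmd)

commands = [build_command(s) for s in SCOPES]
```

Keeping the three commands identical except for one flag makes it easy to confirm, before launching, that nothing else drifted between conditions.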
## Variables
| Variable | Values |
|---|---|
| Independent | table scope (all, gold-only, gold+silver) |
| Dependent | pass rate, grounding failures, hallucinated tables, latency, token cost |
| Controlled | model, prompt, canary subset, schema/tool strategy, scorer, seed, temperature |
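One dependent variable, hallucinated tables, can be measured mechanically: any table a generated query references that was not in the exposed scope counts as a hallucination. A minimal sketch, assuming the queries are SQL; the regex extraction is a simplification, and a real SQL parser would be more robust.

```python
# Crude hallucinated-table detector: extract names after FROM/JOIN and
# subtract the tables that were actually visible in the current scope.
import re

def referenced_tables(sql: str) -> set[str]:
    """Naive extraction of table names following FROM or JOIN keywords."""
    return {m.lower() for m in re.findall(r"\b(?:from|join)\s+([\w.]+)", sql, re.I)}

def hallucinated_tables(sql: str, exposed: set[str]) -> set[str]:
    """Tables the query uses that were not exposed in this condition."""
    return referenced_tables(sql) - {t.lower() for t in exposed}

# Illustrative example with made-up table names:
exposed_gold = {"orders", "customers"}
query = "SELECT * FROM orders o JOIN shipments s ON o.id = s.order_id"
print(hallucinated_tables(query, exposed_gold))  # → {'shipments'}
```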
## Execution Log

### Run 1: all tables
- Command: <exact command>
- Config: fixed model, chosen schema/tool strategy, all tables visible
- Output: <path to results JSONL>
- Started: <timestamp>
- Duration: <time>
- Notes: <anything unexpected>
### Run 2: gold-only
- Command: <exact command>
- Config: fixed model, chosen schema/tool strategy, gold-only tables visible
- Output: <path to results JSONL>
- Started: <timestamp>
- Duration: <time>
- Notes: <anything unexpected>
### Run 3: gold+silver
- Command: <exact command>
- Config: fixed model, chosen schema/tool strategy, gold+silver tables visible
- Output: <path to results JSONL>
- Started: <timestamp>
- Duration: <time>
- Notes: <anything unexpected>
## Results

| Condition | Pass Rate | Avg Score | Parse | Grounding | Intent | Latency | Tokens |
|---|---|---|---|---|---|---|---|
| all | | | | | | | |
| gold-only | | | | | | | |
| gold+silver | | | | | | | |
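Filling this table from the per-run JSONL outputs can be sketched as below. The field names (`passed`, `failure_kind`) are assumptions about the scorer's output format and should be adjusted to whatever it actually emits.

```python
# Hypothetical aggregator for one condition's results JSONL: pass rate plus
# counts of the failure categories the table tracks. Field names are assumed.
import json
from collections import Counter

def summarize(jsonl_lines: list[str]) -> dict:
    records = [json.loads(line) for line in jsonl_lines]
    n = len(records)
    # Counter returns 0 for categories that never occurred.
    fails = Counter(r.get("failure_kind") for r in records if not r["passed"])
    return {
        "pass_rate": sum(r["passed"] for r in records) / n if n else 0.0,
        "parse": fails["parse"],
        "grounding": fails["grounding"],
        "intent": fails["intent"],
    }

# Tiny illustrative input (made up, not real results):
sample = [
    '{"passed": true}',
    '{"passed": false, "failure_kind": "grounding"}',
]
print(summarize(sample))  # → {'pass_rate': 0.5, 'parse': 0, 'grounding': 1, 'intent': 0}
```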
### Breakdowns
## Analysis
What do the results tell you? Was the hypothesis confirmed?
## Conclusion
What's the decision? What changes, if any, should go to production?