# Experiment: Catalog Tool Access (Enabled vs. Disabled)
## Hypothesis
Live catalog access will help the model on ambiguous or underspecified questions, but the gains may be smaller than the cost in latency and tool complexity once the preloaded context is already strong.
## Method
Hold the model, prompt, benchmark subset, and preloaded schema payload fixed. Compare runs with the catalog tool enabled versus disabled so the experiment isolates tool availability rather than the schema fields themselves. Use the same scorer and logging format as the other experiments.
## Variables
| Variable | Values |
|---|---|
| Independent | catalog tool access (enabled vs disabled) |
| Dependent | pass rate, grounding failures, tool-call count, latency, token cost |
| Controlled | model, prompt, canary subset, schema payload, scorer, seed, temperature |
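A minimal sketch of how the controlled variables and the single independent variable can be pinned down in the run configs. All field names and values here are hypothetical illustrations, not the harness's real schema:

```python
# Hypothetical run configs; every name and value is illustrative only.
BASE = {
    "model": "fixed-model",
    "prompt": "fixed-prompt",
    "schema_payload": "fixed-schema.json",
    "canary_subset": "fixed-subset",
    "scorer": "fixed-scorer",
    "seed": 0,
    "temperature": 0.0,
}

run_enabled = {**BASE, "catalog_tool": True}    # Run 1
run_disabled = {**BASE, "catalog_tool": False}  # Run 2

# Sanity check: the two conditions differ in exactly one field.
diff = {k for k in run_enabled if run_enabled[k] != run_disabled[k]}
assert diff == {"catalog_tool"}
```

Keeping the delta to one field is what lets the experiment attribute any metric shift to tool availability rather than to the schema payload or sampling settings.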
## Execution Log

### Run 1: catalog tool enabled

- Command: `<exact command>`
- Config: fixed model, fixed schema payload, catalog tool enabled
- Output: `<path to results JSONL>`
- Started: `<timestamp>`
- Duration: `<time>`
- Notes: `<anything unexpected>`
### Run 2: catalog tool disabled

- Command: `<exact command>`
- Config: fixed model, fixed schema payload, catalog tool disabled
- Output: `<path to results JSONL>`
- Started: `<timestamp>`
- Duration: `<time>`
- Notes: `<anything unexpected>`
## Results

| Condition | Pass Rate | Avg Score | Parse Fail | Grounding Fail | Intent Fail | Tool Calls | Latency | Tokens |
|---|---|---|---|---|---|---|---|---|
| enabled | | | | | | | | |
| disabled | | | | | | | | |
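Once both runs have produced results, the per-condition aggregates for the table above can be computed with a short script. The record field names used here (`pass`, `latency_ms`, `failure`) are assumptions about what the shared logging format emits, not confirmed names:

```python
import json

def summarize(rows):
    """Aggregate parsed result records into table-ready metrics.
    Field names ('pass', 'latency_ms', 'failure') are assumed, not confirmed."""
    n = len(rows)
    return {
        "pass_rate": sum(bool(r["pass"]) for r in rows) / n,
        "avg_latency_ms": sum(r["latency_ms"] for r in rows) / n,
        "grounding_failures": sum(r.get("failure") == "grounding" for r in rows),
    }

def load_jsonl(path):
    """Read one record per non-empty line from a results JSONL file."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Tiny in-memory example (fabricated records, for illustration only):
demo = [
    {"pass": True, "latency_ms": 900, "failure": None},
    {"pass": False, "latency_ms": 1200, "failure": "grounding"},
]
print(summarize(demo))
# {'pass_rate': 0.5, 'avg_latency_ms': 1050.0, 'grounding_failures': 1}
```

Running `summarize(load_jsonl(path))` on each run's output file yields one table row per condition.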
## Breakdowns
## Analysis
What do the results tell you? Was the hypothesis confirmed?
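Because both runs share the same canary subset and seed, the analysis can go beyond aggregate deltas to a per-item paired comparison, which is more sensitive on a small subset. A sketch, assuming each record carries an `id` and a boolean `pass` (both field names are assumptions):

```python
def paired_flips(enabled_rows, disabled_rows):
    """Count items the tool fixed (wins) vs. broke (losses).
    Assumes both runs cover the same items, keyed by an 'id' field."""
    baseline = {r["id"]: bool(r["pass"]) for r in disabled_rows}
    wins = losses = 0
    for r in enabled_rows:
        with_tool, without_tool = bool(r["pass"]), baseline[r["id"]]
        wins += with_tool and not without_tool
        losses += without_tool and not with_tool
    return wins, losses

# Fabricated example: item "b" is fixed by the tool, item "c" regresses.
enabled = [{"id": "a", "pass": True}, {"id": "b", "pass": True}, {"id": "c", "pass": False}]
disabled = [{"id": "a", "pass": True}, {"id": "b", "pass": False}, {"id": "c", "pass": True}]
print(paired_flips(enabled, disabled))  # (1, 1)
```

A large win count concentrated on ambiguous or underspecified questions would support the hypothesis; a near-equal flip count in both directions would suggest noise rather than a real tool effect.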
## Conclusion
What's the decision? What changes, if any, should go to production?