# Experiment: Model comparison, GPT-4o vs. Claude Sonnet
## Hypothesis

Claude Sonnet may be competitive on question interpretation, but GPT-4o is likely to hold up better on schema grounding and SQL reliability for the same fixed context stack.
## Method

Run the same canary subset with the same prompt, scorer, and provisional baseline context/tool policy. Change only the model identifier between gpt-4o and claude-sonnet. Keep everything else fixed so the result isolates model behavior rather than prompt or context changes.
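The A/B setup above can be sketched as a small driver. `run_eval` and the config field names are hypothetical stand-ins for the project's actual harness, not its real API:

```python
# Sketch of the A/B driver. `run_eval` and all config field names are
# hypothetical placeholders for the real harness; only the model
# identifier varies between conditions.
SHARED = {
    "subset": "canary",
    "context_policy": "baseline",  # provisional baseline context/tool policy
    "scorer": "default",
    "seed": 42,
    "temperature": 0.0,
}

def run_eval(model: str, **shared) -> dict:
    # Placeholder: the real harness would call the model and write one
    # JSONL record per canary question; here we just echo the config.
    return {"model": model, **shared}

results = {m: run_eval(m, **SHARED) for m in ("gpt-4o", "claude-sonnet")}
```

Everything in `SHARED` stays fixed across both runs, so any score difference is attributable to the model.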
## Variables
| Variable | Values |
|---|---|
| Independent | model (gpt-4o vs claude-sonnet) |
| Dependent | pass rate, equivalence rate, latency, token cost |
| Controlled | prompt, canary subset, baseline context/tool policy, scorer, seed, temperature |
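One cheap guard on the controlled variables: diff the two run configs and assert that only the independent variable differs. The field names here are illustrative, not the harness's actual schema:

```python
def diff_keys(a: dict, b: dict) -> set:
    """Return the keys whose values differ between two run configs."""
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}

# Illustrative configs; real ones would carry the full context/tool policy.
base = {"model": "gpt-4o", "subset": "canary", "seed": 42, "temperature": 0.0}
treatment = {**base, "model": "claude-sonnet"}

# Only the independent variable should differ between conditions.
assert diff_keys(base, treatment) == {"model"}
```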
## Execution Log

### Run 1: GPT-4o (baseline)

- Command: <exact command>
- Config: gpt-4o, canary subset, fixed baseline context/tool policy
- Output: <path to results JSONL>
- Started: <timestamp>
- Duration: <time>
- Notes: <anything unexpected>
### Run 2: Claude Sonnet (treatment)

- Command: <exact command>
- Config: claude-sonnet, canary subset, fixed baseline context/tool policy
- Output: <path to results JSONL>
- Started: <timestamp>
- Duration: <time>
- Notes: <anything unexpected>
## Results

| Condition | Pass Rate | Avg Score | Parse | Grounding | Intent | Latency | Tokens |
|---|---|---|---|---|---|---|---|
| GPT-4o | | | | | | | |
| Claude Sonnet | | | | | | | |
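The table's row metrics can be aggregated from the per-question results JSONL along these lines; the record field names (`passed`, `score`, `latency_ms`, `tokens`) are assumptions about the output schema, not the harness's actual format:

```python
def summarize(records: list[dict]) -> dict:
    """Aggregate per-question records into one results-table row.
    Field names are assumed, not the harness's actual JSONL schema."""
    n = len(records)
    return {
        "pass_rate": sum(r["passed"] for r in records) / n,
        "avg_score": sum(r["score"] for r in records) / n,
        "avg_latency_ms": sum(r["latency_ms"] for r in records) / n,
        "total_tokens": sum(r["tokens"] for r in records),
    }

demo = [
    {"passed": True, "score": 0.9, "latency_ms": 800, "tokens": 1200},
    {"passed": False, "score": 0.4, "latency_ms": 950, "tokens": 1100},
]
row = summarize(demo)  # pass_rate comes out to 0.5 on this demo
```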
## Breakdowns
## Analysis

What do the results show? Was the hypothesis confirmed, refuted, or inconclusive?
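On a small canary subset, raw pass-rate gaps can be noise. A rough two-sided two-proportion z-test (normal approximation; an addition of mine, not part of the protocol) can help sanity-check the comparison before drawing conclusions:

```python
import math

def two_prop_pvalue(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in pass rates (normal
    approximation). A rough guide only; with a small canary subset,
    prefer an exact test such as Fisher's."""
    p_pool = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # identical degenerate rates: no evidence of a difference
    z = (pass_a / n_a - pass_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))
```

For example, 18/20 vs. 10/20 yields a p-value well below 0.05, while 10/20 vs. 10/20 yields exactly 1.0.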
## Conclusion

What is the decision? What changes, if any, should go to production?