Experiment: Model comparison GPT-5 vs Claude Sonnet
Hypothesis
GPT-5 will likely lead on hard SQL synthesis and grounding, while Claude Sonnet may stay close enough on simpler cases that cost and latency could matter more than raw quality.
Method
Run the same canary subset with the same prompt, scorer, and provisional baseline context/tool policy, changing only the model identifier between `gpt-5` and `claude-sonnet`. This pair serves as the higher-end model comparison once the GPT-4o baseline has been measured.
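To keep the comparison clean, every knob except the model identifier should come from one shared config. A minimal sketch of that discipline, assuming a hypothetical `run_eval()` harness entry point (not a real API):

```python
# Hypothetical sketch: run both conditions from one shared config so that
# only the model identifier differs. run_eval() and its parameters are
# assumptions, not a real harness API.
def run_eval(model: str, subset: str, policy: str, seed: int, temperature: float) -> dict:
    # Placeholder: a real harness would call the model and score its outputs.
    return {"model": model, "subset": subset, "policy": policy,
            "seed": seed, "temperature": temperature}

# Controlled variables live in exactly one place.
SHARED = dict(subset="canary", policy="baseline", seed=42, temperature=0.0)

runs = {m: run_eval(model=m, **SHARED) for m in ("gpt-5", "claude-sonnet")}

# Sanity check: everything except the model identifier matches across conditions.
a, b = runs["gpt-5"], runs["claude-sonnet"]
assert {k: v for k, v in a.items() if k != "model"} == \
       {k: v for k, v in b.items() if k != "model"}
```

Pinning seed and temperature in one shared dict makes an accidental confound (e.g. a different temperature per condition) a loud failure rather than a silent skew.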
Variables
| Variable | Values |
|---|---|
| Independent | model (gpt-5 vs claude-sonnet) |
| Dependent | pass rate, equivalence rate, latency, token cost |
| Controlled | prompt, canary subset, baseline context/tool policy, scorer, seed, temperature |
Execution Log
Run 1: GPT-5 baseline
- Command: <exact command>
- Config: gpt-5, canary subset, fixed baseline context/tool policy
- Output: <path to results JSONL>
- Started: <timestamp>
- Duration: <time>
- Notes: <anything unexpected>
Run 2: Claude Sonnet treatment
- Command: <exact command>
- Config: claude-sonnet, canary subset, fixed baseline context/tool policy
- Output: <path to results JSONL>
- Started: <timestamp>
- Duration: <time>
- Notes: <anything unexpected>
Results
| Condition | Pass Rate | Avg Score | Parse | Grounding | Intent | Latency | Tokens |
|---|---|---|---|---|---|---|---|
| GPT-5 | | | | | | | |
| Claude Sonnet | | | | | | | |
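Once both runs finish, the table above can be filled from each results JSONL. A minimal sketch, assuming hypothetical per-record fields (`passed`, `score`, `latency_s`, `tokens`); match these names to whatever the scorer actually emits:

```python
import json
from statistics import mean

def summarize(jsonl_path: str) -> dict:
    """Aggregate one condition's results JSONL into table-ready metrics.

    Assumes each line is a JSON object with "passed" (bool), "score" (float),
    "latency_s" (float), and "tokens" (int) fields -- an illustrative schema,
    not the harness's confirmed output format.
    """
    with open(jsonl_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    return {
        "pass_rate": mean(1.0 if r["passed"] else 0.0 for r in records),
        "avg_score": mean(r["score"] for r in records),
        "avg_latency_s": mean(r["latency_s"] for r in records),
        "total_tokens": sum(r["tokens"] for r in records),
    }
```

Running `summarize()` once per condition keeps the two table rows on identical definitions of each metric.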
Breakdowns
Analysis
What do the results tell you? Was the hypothesis confirmed?
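When judging whether the hypothesis held, a quick two-proportion z-test helps separate a real pass-rate gap from canary-subset noise. A minimal standard-library sketch; the counts fed into it would come from the runs, and nothing here is a real result:

```python
from math import erf, sqrt

def two_prop_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """Two-sided p-value for H0: the two pass rates are equal.

    p1, p2 are observed pass rates; n1, n2 are trial counts per condition.
    Assumes independent trials and proportions strictly between 0 and 1.
    """
    p = (p1 * n1 + p2 * n2) / (n1 + n2)          # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))   # pooled standard error
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF, via erf.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
```

On a small canary subset the test will be underpowered, so a non-significant result argues for deferring to cost and latency rather than declaring quality parity.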
Conclusion
What's the decision? What changes, if any, should go to production?