Experiment: Model comparison GPT-5 vs Claude Sonnet

ID: MCP_ANALYST_AGENT-EXPERIMENT_MODEL_COMPARISON_GPT_5_VS_CLAUDE_SONNET
Status: not_started
Priority: p2
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: ai-quality-experimentation-and-context-optimization

Hypothesis

GPT-5 will lead on hard SQL synthesis and grounding, while Claude Sonnet will stay close enough on simpler cases that cost and latency matter more than raw quality.

Method

Run the same canary subset with the same prompt, scorer, and provisional baseline context/tool policy, changing only the model identifier between gpt-5 and claude-sonnet. This pair is the higher-end model comparison, to be run once the GPT-4o baseline has been measured.
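
A minimal harness sketch of this protocol, assuming a hypothetical run_one(case, model, ...) entry point and illustrative values for the pinned settings, not the real baseline config:

```python
import json
import time

# Everything except the model identifier is held fixed; these values are
# assumptions standing in for the real baseline settings.
CONTROLLED = {
    "prompt": "baseline-v1",       # shared prompt (assumed label)
    "context_policy": "baseline",  # provisional baseline context/tool policy
    "scorer": "baseline-scorer",   # shared scorer (assumed label)
    "seed": 42,                    # fixed seed (assumed value)
    "temperature": 0.0,            # fixed temperature (assumed value)
}

def run_condition(model: str, cases: list, run_one) -> str:
    """Run one condition over the canary subset; write one JSONL line per case."""
    out_path = f"results_{model}.jsonl"
    with open(out_path, "w") as f:
        for case in cases:
            t0 = time.perf_counter()
            result = run_one(case, model=model, **CONTROLLED)  # hypothetical runner
            result["model"] = model
            result["latency_s"] = time.perf_counter() - t0
            f.write(json.dumps(result) + "\n")
    return out_path

# Identical cases and controlled settings; only the model id changes:
# for model in ("gpt-5", "claude-sonnet"):
#     run_condition(model, canary_cases, run_one)
```

Writing one JSONL line per case keeps the two output files directly joinable by case, which the paired analysis below relies on.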

Variables

  • Independent: model (gpt-5 vs claude-sonnet)
  • Dependent: pass rate, equivalence rate, latency, token cost
  • Controlled: prompt, canary subset, baseline context/tool policy, scorer, seed, temperature

Execution Log

Run 1: GPT-5 baseline

  • Command: <exact command>
  • Config: gpt-5, canary subset, fixed baseline context/tool policy
  • Output: <path to results JSONL>
  • Started: <timestamp>
  • Duration: <time>
  • Notes: <anything unexpected>

Run 2: Claude Sonnet treatment

  • Command: <exact command>
  • Config: claude-sonnet, canary subset, fixed baseline context/tool policy
  • Output: <path to results JSONL>
  • Started: <timestamp>
  • Duration: <time>
  • Notes: <anything unexpected>

Results

Condition     | Pass Rate | Avg Score | Parse | Grounding | Intent | Latency | Tokens
GPT-5         |           |           |       |           |        |         |
Claude Sonnet |           |           |       |           |        |         |
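
The headline columns can be filled from each run's JSONL. A minimal sketch, assuming each record carries hypothetical passed, score, latency_s, and tokens fields (the per-dimension Parse/Grounding/Intent scores would aggregate the same way):

```python
import json

def summarize(path: str) -> dict:
    """Aggregate one condition's results JSONL into the table's headline columns."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    n = len(records)
    return {
        "pass_rate": sum(r["passed"] for r in records) / n,
        "avg_score": sum(r["score"] for r in records) / n,
        "avg_latency_s": sum(r["latency_s"] for r in records) / n,
        "total_tokens": sum(r["tokens"] for r in records),
    }

# summarize("results_gpt-5.jsonl")
# summarize("results_claude-sonnet.jsonl")
```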

Breakdowns

Analysis

What do the results tell you? Was the hypothesis confirmed?
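
One concrete check worth running: because both conditions score the same canary cases, pass/fail outcomes are paired, and on a small subset an exact McNemar test on the discordant cases is a more honest read than eyeballing the pass-rate delta. A sketch, again assuming hypothetical case_id and passed fields in the results JSONL:

```python
import json
from math import comb

def pass_map(path: str) -> dict:
    """Map case_id -> passed for one condition (field names are assumptions)."""
    with open(path) as f:
        return {r["case_id"]: bool(r["passed"]) for r in map(json.loads, f)}

def mcnemar_exact(path_a: str, path_b: str) -> float:
    """Two-sided exact McNemar p-value over the discordant canary cases."""
    a, b = pass_map(path_a), pass_map(path_b)
    shared = a.keys() & b.keys()
    only_a = sum(1 for k in shared if a[k] and not b[k])  # A passed, B failed
    only_b = sum(1 for k in shared if b[k] and not a[k])  # B passed, A failed
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant cases: the conditions are indistinguishable here
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n  # P(X <= k), X ~ Bin(n, 1/2)
    return min(1.0, 2 * tail)
```

If only a handful of cases are discordant, treat the pass-rate gap as noise and lean on latency and token cost for the decision, which is exactly the trade-off the hypothesis anticipates.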

Conclusion

What's the decision? What changes, if any, should go to production?

Follow-up experiments