Dataface Tasks

Experiment: Model comparison GPT-4o vs GPT-5

IDMCP_ANALYST_AGENT-EXPERIMENT_MODEL_COMPARISON_GPT_4O_VS_GPT_5
Status: not_started
Priority: p2
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: ai-quality-experimentation-and-context-optimization

Hypothesis

GPT-5 will outperform GPT-4o on pass rate and semantic equivalence for harder SQL questions, while GPT-4o may remain competitive on simple cases and may be cheaper or faster to run.

Method

Run the same canary subset with the same prompt, same scorer, and the same provisional baseline context/tool policy. Change only the model identifier between gpt-4o and gpt-5. Record pass rate, parse rate, grounding rate, equivalence, latency, and token cost so the model decision is not based on a single headline number.
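The "change only the model identifier" constraint can be made checkable in code. The sketch below is illustrative, not the real harness API: the config keys and values (BASE_CONFIG, "canary_subset", "baseline_v0", and so on) are assumed names standing in for whatever the actual runner uses.

```python
# Shared config for both runs; every field except "model" is held fixed.
# All names and values here are illustrative placeholders, not the harness schema.
BASE_CONFIG = {
    "prompt": "canary_prompt_v1",      # same prompt (assumed name)
    "dataset": "canary_subset",        # same canary subset
    "context_policy": "baseline_v0",   # provisional baseline context/tool policy
    "scorer": "semantic_equivalence",  # same scorer
    "seed": 42,                        # fixed seed (illustrative value)
    "temperature": 0.0,                # fixed temperature
}

run_gpt4o = {**BASE_CONFIG, "model": "gpt-4o"}
run_gpt5 = {**BASE_CONFIG, "model": "gpt-5"}

# Sanity check before launching: the two configs differ only in the model id.
diff = {k for k in run_gpt4o if run_gpt4o[k] != run_gpt5[k]}
assert diff == {"model"}
```

A check like this in the run script prevents an accidental confound (e.g. a seed or temperature drifting between runs) from silently invalidating the comparison.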

Variables

Variable     Values
Independent  model (gpt-4o vs gpt-5)
Dependent    pass rate, equivalence rate, latency, token cost
Controlled   prompt, canary subset, baseline context/tool policy, scorer, seed, temperature

Execution Log

Run 1: GPT-4o baseline

  • Command: <exact command>
  • Config: gpt-4o, canary subset, fixed baseline context/tool policy
  • Output: <path to results JSONL>
  • Started: <timestamp>
  • Duration: <time>
  • Notes: <anything unexpected>

Run 2: GPT-5 treatment

  • Command: <exact command>
  • Config: gpt-5, canary subset, fixed baseline context/tool policy
  • Output: <path to results JSONL>
  • Started: <timestamp>
  • Duration: <time>
  • Notes: <anything unexpected>

Results

Condition  Pass Rate  Avg Score  Parse Rate  Grounding Rate  Intent  Latency  Tokens
GPT-4o
GPT-5
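Once a run's results JSONL exists, the table rows above can be computed rather than transcribed by hand. The sketch below assumes one JSON record per question with "passed", "score", "latency_ms", and "tokens" fields; those field names are assumptions about the harness output, not its actual schema, and the inline sample records are fabricated for illustration only.

```python
import json

# Three fake per-question records standing in for a real results JSONL file.
sample_jsonl = "\n".join([
    '{"passed": true,  "score": 1.0, "latency_ms": 820, "tokens": 410}',
    '{"passed": false, "score": 0.4, "latency_ms": 790, "tokens": 395}',
    '{"passed": true,  "score": 0.9, "latency_ms": 910, "tokens": 430}',
])

def summarize(jsonl_text):
    """Aggregate per-question records into the headline metrics for one condition."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    n = len(rows)
    return {
        "pass_rate": sum(r["passed"] for r in rows) / n,
        "avg_score": sum(r["score"] for r in rows) / n,
        "avg_latency_ms": sum(r["latency_ms"] for r in rows) / n,
        "avg_tokens": sum(r["tokens"] for r in rows) / n,
    }

summary = summarize(sample_jsonl)
```

Running the same `summarize` over both conditions' output files keeps the GPT-4o and GPT-5 rows directly comparable, since any scoring or averaging quirk applies to both.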

Breakdowns

Analysis

What do the results tell you? Was the hypothesis confirmed?

Conclusion

What's the decision? What changes, if any, should go to production?

Follow-up experiments