Run context and model ablation experiments
Problem
Define and execute the initial experiment matrix using the eval system. Compare models (GPT-4o, GPT-5, Claude Sonnet, etc.), schema-tool strategies, context levels (table/column names only, +types, +descriptions, +profile stats, +sample values), and with/without catalog tool access. Measure which context fields actually improve SQL quality and which are noise. Capture results in the eval leaderboard dashboards. This is where the eval system proves its value — the experiments are the point.
Context
This is a planning task, not a giant one-off execution task
This task defines the experiment matrix and spawns individual experiment tasks. Each experiment gets its own task file using the experiment worksheet from the `run-experiment` skill (`.codex/skills/run-experiment/SKILL.md`). This keeps a clean log of hypothesis, method, results, and conclusions per experiment.
It also owns the decision to run schema-scope curation together with context/tool ablations, not as a separate disconnected track.
Per-experiment task pattern
When ready to run an experiment:
- Create a task:
  `just task create --workstream mcp-analyst-agent --title "Experiment: <description>" --initiative ai-quality-experimentation-and-context-optimization`
- Replace the generated body with the experiment worksheet template from the `run-experiment` skill.
- Fill in hypothesis and method before running.
- Execute, analyze, conclude.
Experiment matrix (planned)
Model comparison:
- GPT-4o vs GPT-5 vs Claude Sonnet — same context, same prompt, same benchmark subset
- One task per model pair comparison
Context field ablation:
- L0: no schema tool / no preloaded context
- L1: table + column names only
- L2: L1 + types
- L3: L2 + descriptions (from dbt schema.yml)
- L4: L3 + profile stats (row counts, distributions, nulls)
- L5: L4 + sample/top values
- Each level is a run against the canary set; start with L0 vs L1 vs L3 vs L5 to find the big jumps
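The cumulative ladder above can be sketched as a small helper. This is a minimal illustration, assuming made-up field names ("table", "dtype", etc.) rather than the real catalog schema:

```python
# Illustrative sketch of the L0-L5 context ladder. Field names are
# hypothetical placeholders, not the actual catalog fields.
CONTEXT_FIELDS = {
    1: ["table", "column"],   # L1: table + column names only
    2: ["dtype"],             # L2: + types
    3: ["description"],       # L3: + dbt schema.yml descriptions
    4: ["profile"],           # L4: + row counts, distributions, nulls
    5: ["samples"],           # L5: + sample/top values
}

def fields_for_level(level: int) -> list[str]:
    """Cumulative field list for a context level; L0 yields no context."""
    fields: list[str] = []
    for lvl in range(1, level + 1):
        fields += CONTEXT_FIELDS[lvl]
    return fields
```

The point of the ladder being cumulative is that each level's result isolates the marginal value of exactly one field group.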
Schema tool strategy:
- Profiled catalog tool (full fields)
- Profiled catalog tool with filtered fields
- Live INFORMATION_SCHEMA path
- Checked-in memory file / snapshot
- No tool
This is a core question, not an implementation detail. The eval system should tell us whether richer profiling helps enough to justify its cost and complexity.
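The five strategies can be encoded as an explicit enum so every run is tagged with its schema source; the labels below are hypothetical, not the real tool names:

```python
from enum import Enum

# Hypothetical labels for the five schema-acquisition strategies;
# the actual tool/config names may differ.
class SchemaSource(Enum):
    PROFILED_FULL = "profiled-catalog"         # profiled catalog tool, full fields
    PROFILED_FILTERED = "profiled-filtered"    # profiled catalog tool, filtered fields
    INFORMATION_SCHEMA = "information-schema"  # live INFORMATION_SCHEMA path
    SNAPSHOT = "snapshot-file"                 # checked-in memory file / snapshot
    NONE = "none"                              # no schema tool

def run_id(model: str, source: SchemaSource) -> str:
    """Tag an eval run so the leaderboard can group results by strategy."""
    return f"{model}/{source.value}"
```

Tagging runs this way keeps the leaderboard comparable across waves, since the strategy is part of the run identity rather than buried in free-text notes.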
Catalog tool access:
- With vs without the catalog/schema tool entirely
- Preloaded context only vs on-demand tool use
Layer scope (run together with schema curation task):
- All tables vs gold-only vs gold+silver — does seeing raw/staging tables help or hurt?
- Measure both quality and latency/token cost, not just pass rate
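One way to keep quality and cost in the same comparison is a simple penalized score. This is a sketch under stated assumptions: the record fields and the penalty weight are made up for illustration, not a measured exchange rate between tokens and pass rate:

```python
from dataclasses import dataclass

@dataclass
class ScopeRun:
    scope: str          # e.g. "all", "gold", "gold+silver" (labels assumed)
    pass_rate: float    # fraction of canary cases passing the scorer
    prompt_tokens: int  # average context cost per case

def cost_adjusted(run: ScopeRun, penalty_per_1k_tokens: float = 0.005) -> float:
    """Pass rate minus a token penalty, so 'all tables' must earn its cost."""
    return run.pass_rate - penalty_per_1k_tokens * run.prompt_tokens / 1000
```

Even a crude penalty like this prevents the widest scope from winning by default when it only ties on pass rate while doubling the context bill.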
What not to do yet
- Do not turn M1 into a full regression-suite project.
- Broad automated eval gates can wait until M4 once the benchmark, scorer, and experiment matrix stabilize.
- For now, prioritize canary runs and targeted experiment tasks that answer product questions.
Dependencies
- Depends on eval runner (task 3), cleaned benchmark (task 2), extracted generate_sql function, and leaderboard dashboards being operational.
- The schema curation task is closely related — layer scope experiments feed into curation decisions.
Possible Solutions
- Broad factorial sweep across models, context levels, tool strategies, and scope settings in one run. Rejected because it creates too many interactions and makes the results hard to interpret.
- Sequenced single-variable ablations with a fixed baseline. Recommended. Use one provisional model first, then vary context, tool strategy, and scope one dimension at a time so each result answers a specific product question.
Plan
Planned matrix
| Wave | Experiment task | Independent variable(s) | Fixed baseline | Decision supported |
|---|---|---|---|---|
| 1 | Model comparison GPT-4o vs GPT-5 | model | canary subset, same prompt, same provisional context/tool stack | pick the provisional default model |
| 1 | Model comparison GPT-4o vs Claude Sonnet | model | canary subset, same prompt, same provisional context/tool stack | confirm whether Sonnet is competitive |
| 1 | Model comparison GPT-5 vs Claude Sonnet | model | canary subset, same prompt, same provisional context/tool stack | determine whether GPT-5 materially outperforms Sonnet |
| 2 | Context ablation L0 vs L1 vs L3 vs L5 | schema context level | winning model from Wave 1, same prompt, same canary subset, no tool-policy changes | identify which schema fields are signal vs noise |
| 3 | Schema tool strategy profiled vs filtered vs INFORMATION_SCHEMA vs none | schema acquisition path | winning model from Waves 1-2, fixed context policy, same canary subset | choose the best schema source strategy |
| 3 | Catalog tool access with vs without tool | catalog tool availability | winning model and fixed schema payload, same canary subset | decide whether live tool access is worth the overhead |
| 4 | Layer scope all vs gold-only vs gold+silver | table-scope policy | winning model and chosen schema/tool strategy, same canary subset | inform schema curation and benchmark scope |
Sequencing
- Run the three model pairwise comparisons first so the rest of the matrix can use the best model as a constant.
- Run context ablation second, holding model and tool policy fixed, because context-field value is the primary question.
- Run schema-tool strategy and catalog-tool-access experiments next so we can separate payload quality from tool availability.
- Run layer-scope last, in lockstep with schema curation work, because scope decisions feed directly into what the benchmark should expose.
- Keep all runs on the canary subset for iteration speed; only promote a final confirmation run to the full benchmark if the early signals are clear.
Execution plan
- Use the canary benchmark in `apps/evals/data/canary.jsonl` and keep the prompt/scorer/versioning fixed across a wave.
- Capture each experiment in its own task worksheet using the `run-experiment` template, with hypothesis and method filled in before the first run.
- Log exact commands, backend metadata, and output paths in each experiment task so the leaderboard can be reproduced later.
- After each wave, compare the dashboard summaries and only advance the winning configuration to the next wave.
- Stop once the matrix answers the product questions above; do not expand into a broad regression program yet.
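The per-wave loop above can be sketched as a thin driver. `generate_sql` and the scorer exist per the dependencies section, but their exact signatures and the JSONL field names ("question", "expected_sql") are assumptions here:

```python
import json
from pathlib import Path

# Hedged sketch of a wave driver. The canary.jsonl path comes from the task;
# generate_sql/score signatures and the case field names are assumed.
def run_wave(canary_path: str, generate_sql, score) -> float:
    """Return the pass rate of one configuration over the canary set."""
    lines = Path(canary_path).read_text().splitlines()
    cases = [json.loads(line) for line in lines if line.strip()]
    passed = sum(
        1 for case in cases
        if score(generate_sql(case["question"]), case["expected_sql"])
    )
    return passed / len(cases)
```

Because the prompt, scorer, and benchmark stay fixed within a wave, a single number per configuration is enough for the dashboard comparison that gates promotion to the next wave.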
Implementation Progress
Experiments will be logged as individual tasks. Link them here as they're created:
- [ ] Experiment: Model comparison GPT-4o vs GPT-5
- [ ] Experiment: Model comparison GPT-4o vs Claude Sonnet
- [ ] Experiment: Model comparison GPT-5 vs Claude Sonnet
- [ ] Experiment: Context ablation L0 vs L1 vs L3 vs L5
- [ ] Experiment: Schema tool strategy profiled vs filtered vs INFORMATION_SCHEMA vs none
- [ ] Experiment: Catalog tool access with vs without tool
- [ ] Experiment: Layer scope all vs gold-only vs gold+silver
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A — this is a planning and coordination task.
Review Feedback
- [ ] Review cleared