Run context and model ablation experiments
Problem
Define and execute the initial experiment matrix using the eval system. Compare models (GPT-4o, GPT-5, Claude Sonnet, etc.), schema-tool strategies, context levels (table/column names only, +types, +descriptions, +profile stats, +sample values), and with/without catalog tool access. Measure which context fields actually improve SQL quality and which are noise. Capture results in the eval leaderboard dashboards. This is where the eval system proves its value — the experiments are the point.
Context
This is a planning task, not a giant one-off execution task
This task defines the experiment matrix and spawns individual experiment tasks. Each experiment gets its own task file using the experiment worksheet from the `run-experiment` skill (`.codex/skills/run-experiment/SKILL.md`). This keeps a clean log of hypothesis, method, results, and conclusions per experiment.
It also owns the decision to run schema-scope curation together with context/tool ablations, not as a separate disconnected track.
Per-experiment task pattern
When ready to run an experiment:
- Create a task:
  `just task create --workstream mcp-analyst-agent --title "Experiment: <description>" --initiative ai-quality-experimentation-and-context-optimization`
- Replace the generated body with the experiment worksheet template from the `run-experiment` skill.
- Fill in hypothesis and method before running.
- Execute, analyze, conclude.
Experiment matrix (planned)
Model comparison:
- GPT-4o vs GPT-5 vs Claude Sonnet — same context, same prompt, same benchmark subset
- One task per model pair comparison
Context field ablation:
- L0: no schema tool / no preloaded context
- L1: table + column names only
- L2: L1 + types
- L3: L2 + descriptions (from dbt schema.yml)
- L4: L3 + profile stats (row counts, distributions, nulls)
- L5: L4 + sample/top values
- Each level is a run against the canary set; start with L0 vs L1 vs L3 vs L5 to find the big jumps
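The cumulative ladder above can be sketched as a small helper. This is a minimal illustration, assuming made-up field names ("table", "dtype", etc.) rather than the real catalog schema:

```python
# Illustrative sketch of the L0-L5 context ladder. Field names are
# hypothetical placeholders, not the actual catalog fields.
CONTEXT_FIELDS = {
    1: ["table", "column"],   # L1: table + column names only
    2: ["dtype"],             # L2: + types
    3: ["description"],       # L3: + dbt schema.yml descriptions
    4: ["profile"],           # L4: + row counts, distributions, nulls
    5: ["samples"],           # L5: + sample/top values
}

def fields_for_level(level: int) -> list[str]:
    """Cumulative field list for a context level; L0 yields no context."""
    fields: list[str] = []
    for lvl in range(1, level + 1):
        fields += CONTEXT_FIELDS[lvl]
    return fields
```

The point of the ladder being cumulative is that each level's result isolates the marginal value of exactly one field group.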
Schema tool strategy:
- Profiled catalog tool (full fields)
- Profiled catalog tool with filtered fields
- Live INFORMATION_SCHEMA path
- Checked-in memory file / snapshot
- No tool
This is a core question, not an implementation detail. The eval system should tell us whether richer profiling helps enough to justify its cost and complexity.
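The five strategies can be encoded as an explicit enum so every run is tagged with its schema source; the labels below are hypothetical, not the real tool names:

```python
from enum import Enum

# Hypothetical labels for the five schema-acquisition strategies;
# the actual tool/config names may differ.
class SchemaSource(Enum):
    PROFILED_FULL = "profiled-catalog"         # profiled catalog tool, full fields
    PROFILED_FILTERED = "profiled-filtered"    # profiled catalog tool, filtered fields
    INFORMATION_SCHEMA = "information-schema"  # live INFORMATION_SCHEMA path
    SNAPSHOT = "snapshot-file"                 # checked-in memory file / snapshot
    NONE = "none"                              # no schema tool

def run_id(model: str, source: SchemaSource) -> str:
    """Tag an eval run so the leaderboard can group results by strategy."""
    return f"{model}/{source.value}"
```

Tagging runs this way keeps the leaderboard comparable across waves, since the strategy is part of the run identity rather than buried in free-text notes.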
Catalog tool access:
- With vs without the catalog/schema tool entirely
- Preloaded context only vs on-demand tool use
Layer scope (run together with schema curation task):
- All tables vs gold-only vs gold+silver — does seeing raw/staging tables help or hurt?
- Measure both quality and latency/token cost, not just pass rate
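One way to keep quality and cost in the same comparison is a simple penalized score. This is a sketch under stated assumptions: the record fields and the penalty weight are made up for illustration, not a measured exchange rate between tokens and pass rate:

```python
from dataclasses import dataclass

@dataclass
class ScopeRun:
    scope: str          # e.g. "all", "gold", "gold+silver" (labels assumed)
    pass_rate: float    # fraction of canary cases passing the scorer
    prompt_tokens: int  # average context cost per case

def cost_adjusted(run: ScopeRun, penalty_per_1k_tokens: float = 0.005) -> float:
    """Pass rate minus a token penalty, so 'all tables' must earn its cost."""
    return run.pass_rate - penalty_per_1k_tokens * run.prompt_tokens / 1000
```

Even a crude penalty like this prevents the widest scope from winning by default when it only ties on pass rate while doubling the context bill.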
What not to do yet
- Do not turn M1 into a full regression-suite project.
- Broad automated eval gates can wait until M4 once the benchmark, scorer, and experiment matrix stabilize.
- For now, prioritize canary runs and targeted experiment tasks that answer product questions.
Dependencies
- Depends on eval runner (task 3), cleaned benchmark (task 2), extracted generate_sql function, and leaderboard dashboards being operational.
- The schema curation task is closely related — layer scope experiments feed into curation decisions.
Possible Solutions
- Broad factorial sweep across models, context levels, tool strategies, and scope settings in one run. Rejected because it creates too many interactions and makes the results hard to interpret.
- Sequenced single-variable ablations with a fixed baseline. Recommended. Use one provisional model first, then vary context, tool strategy, and scope one dimension at a time so each result answers a specific product question.
Plan
Planned matrix
| Wave | Experiment task | Independent variable(s) | Fixed baseline | Decision supported |
|---|---|---|---|---|
| 1 | Model comparison GPT-4o vs GPT-5 | model | canary subset, same prompt, same provisional context/tool stack | pick the provisional default model |
| 1 | Model comparison GPT-4o vs Claude Sonnet | model | canary subset, same prompt, same provisional context/tool stack | confirm whether Sonnet is competitive |
| 1 | Model comparison GPT-5 vs Claude Sonnet | model | canary subset, same prompt, same provisional context/tool stack | determine whether GPT-5 materially outperforms Sonnet |
| 2 | Context ablation L0 vs L1 vs L3 vs L5 | schema context level | winning model from Wave 1, same prompt, same canary subset, no tool-policy changes | identify which schema fields are signal vs noise |
| 3 | Schema tool strategy profiled vs filtered vs INFORMATION_SCHEMA vs none | schema acquisition path | winning model from Waves 1-2, fixed context policy, same canary subset | choose the best schema source strategy |
| 3 | Catalog tool access with vs without tool | catalog tool availability | winning model and fixed schema payload, same canary subset | decide whether live tool access is worth the overhead |
| 4 | Layer scope all vs gold-only vs gold+silver | table-scope policy | winning model and chosen schema/tool strategy, same canary subset | inform schema curation and benchmark scope |
Sequencing
- Run the three model pairwise comparisons first so the rest of the matrix can use the best model as a constant.
- Run context ablation second, holding model and tool policy fixed, because context-field value is the primary question.
- Run schema-tool strategy and catalog-tool-access experiments next so we can separate payload quality from tool availability.
- Run layer-scope last, in lockstep with schema curation work, because scope decisions feed directly into what the benchmark should expose.
- Keep all runs on the canary subset for iteration speed; only promote a final confirmation run to the full benchmark if the early signals are clear.
Execution plan
- Use the canary benchmark in `apps/evals/data/canary.jsonl` and keep the prompt/scorer/versioning fixed across a wave.
- Capture each experiment in its own task worksheet using the `run-experiment` template, with hypothesis and method filled in before the first run.
- Log exact commands, backend metadata, and output paths in each experiment task so the leaderboard can be reproduced later.
- After each wave, compare the dashboard summaries and only advance the winning configuration to the next wave.
- Stop once the matrix answers the product questions above; do not expand into a broad regression program yet.
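The per-wave loop above can be sketched as a thin driver. `generate_sql` and the scorer exist per the dependencies section, but their exact signatures and the JSONL field names ("question", "expected_sql") are assumptions here:

```python
import json
from pathlib import Path

# Hedged sketch of a wave driver. The canary.jsonl path comes from the task;
# generate_sql/score signatures and the case field names are assumed.
def run_wave(canary_path: str, generate_sql, score) -> float:
    """Return the pass rate of one configuration over the canary set."""
    lines = Path(canary_path).read_text().splitlines()
    cases = [json.loads(line) for line in lines if line.strip()]
    passed = sum(
        1 for case in cases
        if score(generate_sql(case["question"]), case["expected_sql"])
    )
    return passed / len(cases)
```

Because the prompt, scorer, and benchmark stay fixed within a wave, a single number per configuration is enough for the dashboard comparison that gates promotion to the next wave.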
Implementation Progress
Experiments will be logged as individual tasks. Link them here as they're created:
- [ ] Experiment: Model comparison GPT-4o vs GPT-5
- [ ] Experiment: Model comparison GPT-4o vs Claude Sonnet
- [ ] Experiment: Model comparison GPT-5 vs Claude Sonnet
- [ ] Experiment: Context ablation L0 vs L1 vs L3 vs L5
- [ ] Experiment: Schema tool strategy profiled vs filtered vs INFORMATION_SCHEMA vs none
- [ ] Experiment: Catalog tool access with vs without tool
- [ ] Experiment: Layer scope all vs gold-only vs gold+silver
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A — this is a planning and coordination task.
Review Feedback
- [ ] Review cleared