AI Quality Experimentation and Context Optimization
Objective
Use the eval system to systematically discover what improves AI-generated SQL quality. Run model comparisons, context ablation experiments, schema curation studies, and memory strategy evals. The benchmark initiative builds the measurement system; this initiative uses it to find the right configuration for production.
Milestone placement
This initiative also belongs in M2, not M1. It assumes the eval system exists and asks optimization questions about text-to-SQL quality, schema tooling, and memory strategies. That is hardening work for internal adoption, not a prerequisite for the initial analyst pilot.
How experiments work
Each experiment gets its own task file using the experiment worksheet template from the run-experiment skill (.codex/skills/run-experiment/SKILL.md). This produces a structured log per experiment: hypothesis, method, execution log, results, analysis, and conclusion. The planning task defines the experiment matrix; individual experiment tasks capture execution and findings.
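As a sketch of what one of those task files might contain (the section names come from the worksheet structure described above; the field contents are illustrative placeholders, not a real experiment):

```markdown
# Experiment: <short name>

## Hypothesis
<what change is expected to improve SQL quality, and why>

## Method
<model/context configuration under test, benchmark subset, metrics>

## Execution log
<commands run, eval runner invocations, dates>

## Results
<scores vs. baseline>

## Analysis
<interpretation, confounds, follow-ups>

## Conclusion
<accept/reject hypothesis; recommendation for production config>
```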
Tasks
Planning
- Run context and model ablation experiments — P2 — Defines the experiment matrix and spawns per-experiment tasks
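One way the planning task could define the matrix is as a cross product of model and context-configuration axes, with each combination spawning one experiment task. A minimal sketch; the axis names and values below are illustrative assumptions, not the actual matrix:

```python
from itertools import product

# Illustrative axes for the ablation matrix. The real models and
# context variants are chosen in the planning task (assumptions here).
models = ["model-a", "model-b"]
context_variants = ["schema-only", "schema+examples", "schema+examples+memories"]

# Each combination becomes one per-experiment task with its own worksheet.
experiment_matrix = [
    {"model": m, "context": c, "task": f"eval-{m}-{c}"}
    for m, c in product(models, context_variants)
]

for exp in experiment_matrix:
    print(exp["task"])
```

Enumerating the matrix up front makes it easy to see coverage gaps and to prioritize which cells to run first.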
Experiments (created as executed)
Individual experiment tasks will be created under this initiative as they're run. See the planning task for the initial matrix.
Supporting
- Curate schema and table scope for eval benchmark — P2 — Decide which schemas/tables/layers to include; partly experimental
- Add persistent analyst memories and learned context — P2 — Design and eval memory strategies
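The memory-strategy eval in the last task can be framed as another context ablation: run the same benchmark with and without learned memories prepended to the prompt context, and compare scores. A minimal sketch, assuming a hypothetical `build_context` helper and a placeholder scoring function (the real eval would call the extracted `generate_sql` function against the benchmark):

```python
def build_context(schema: str, memories: list[str]) -> str:
    """Assemble prompt context; memories are optional learned analyst notes."""
    parts = [schema]
    if memories:
        parts.append("Analyst notes:\n" + "\n".join(f"- {m}" for m in memories))
    return "\n\n".join(parts)

def run_benchmark(context: str, questions: list[str]) -> float:
    """Placeholder: the real eval runner generates SQL per question and
    scores it against the benchmark; here it only illustrates the shape."""
    return 0.0  # hypothetical score

# Illustrative inputs (assumptions, not real benchmark data).
schema = "CREATE TABLE orders (id INT, total NUMERIC);"
memories = ["Revenue excludes refunded orders.", "Dates are stored in UTC."]
questions = ["total revenue last month"]

baseline = run_benchmark(build_context(schema, []), questions)
with_memory = run_benchmark(build_context(schema, memories), questions)
```

Keeping memory injection behind a single context-assembly function makes the with/without comparison a one-line ablation rather than a separate code path.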
Dependency on benchmark initiative
This initiative depends on the Benchmark-Driven Text-to-SQL and Discovery Evals initiative for the eval runner, cleaned benchmark, extracted generate_sql function, and leaderboard dashboards; experiments cannot be run until those artifacts exist.