AI Quality Experimentation and Context Optimization
Objective
Use the eval system to systematically discover what improves AI-generated SQL quality. Run model comparisons, context ablation experiments, schema curation studies, and memory strategy evals. The benchmark initiative builds the measurement system; this initiative uses it to find the right configuration for production.
Milestone placement
This initiative also belongs in M2, not M1. It assumes the eval system exists and asks optimization questions about text-to-SQL quality, schema tooling, and memory strategies. That is hardening work for internal adoption, not a prerequisite for the initial analyst pilot.
How experiments work
Each experiment gets its own task file using the experiment worksheet template from the run-experiment skill (.codex/skills/run-experiment/SKILL.md). This produces a structured log per experiment: hypothesis, method, execution log, results, analysis, and conclusion. The planning task defines the experiment matrix; individual experiment tasks capture execution and findings.
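As a sketch of what one of those task files might contain (the section names come from the worksheet structure described above; the field contents are illustrative placeholders, not a real experiment):

```markdown
# Experiment: <short name>

## Hypothesis
<what change is expected to improve SQL quality, and why>

## Method
<model/context configuration under test, benchmark subset, metrics>

## Execution log
<commands run, eval runner invocations, dates>

## Results
<scores vs. baseline>

## Analysis
<interpretation, confounds, follow-ups>

## Conclusion
<accept/reject hypothesis; recommendation for production config>
```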
Tasks
Planning
- Run context and model ablation experiments — P2 — Defines the experiment matrix and spawns per-experiment tasks
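One way the planning task could define the matrix is as a cross product of model and context-configuration axes, with each combination spawning one experiment task. A minimal sketch; the axis names and values below are illustrative assumptions, not the actual matrix:

```python
from itertools import product

# Illustrative axes for the ablation matrix. The real models and
# context variants are chosen in the planning task (assumptions here).
models = ["model-a", "model-b"]
context_variants = ["schema-only", "schema+examples", "schema+examples+memories"]

# Each combination becomes one per-experiment task with its own worksheet.
experiment_matrix = [
    {"model": m, "context": c, "task": f"eval-{m}-{c}"}
    for m, c in product(models, context_variants)
]

for exp in experiment_matrix:
    print(exp["task"])
```

Enumerating the matrix up front makes it easy to see coverage gaps and to prioritize which cells to run first.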
Experiments (created as executed)
Individual experiment tasks will be created under this initiative as they're run. See the planning task for the initial matrix.
Supporting
- Curate schema and table scope for eval benchmark — P2 — Decide which schemas/tables/layers to include; partly experimental
- Add persistent analyst memories and learned context — P2 — Design and eval memory strategies
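The memory-strategy eval in the last task can be framed as another context ablation: run the same benchmark with and without learned memories prepended to the prompt context, and compare scores. A minimal sketch, assuming a hypothetical `build_context` helper and a placeholder scoring function (the real eval would call the extracted `generate_sql` function against the benchmark):

```python
def build_context(schema: str, memories: list[str]) -> str:
    """Assemble prompt context; memories are optional learned analyst notes."""
    parts = [schema]
    if memories:
        parts.append("Analyst notes:\n" + "\n".join(f"- {m}" for m in memories))
    return "\n\n".join(parts)

def run_benchmark(context: str, questions: list[str]) -> float:
    """Placeholder: the real eval runner generates SQL per question and
    scores it against the benchmark; here it only illustrates the shape."""
    return 0.0  # hypothetical score

# Illustrative inputs (assumptions, not real benchmark data).
schema = "CREATE TABLE orders (id INT, total NUMERIC);"
memories = ["Revenue excludes refunded orders.", "Dates are stored in UTC."]
questions = ["total revenue last month"]

baseline = run_benchmark(build_context(schema, []), questions)
with_memory = run_benchmark(build_context(schema, memories), questions)
```

Keeping memory injection behind a single context-assembly function makes the with/without comparison a one-line ablation rather than a separate code path.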
Dependency on benchmark initiative
This initiative depends on the Benchmark-Driven Text-to-SQL and Discovery Evals initiative for the eval runner, cleaned benchmark, extracted generate_sql function, and leaderboard dashboards; experiments cannot be run until those artifacts exist.