Dataface Tasks

Experiment: Context ablation L0 vs L1 vs L3 vs L5

ID: MCP_ANALYST_AGENT-EXPERIMENT_CONTEXT_ABLATION_L0_VS_L1_VS_L3_VS_L5
Status: completed
Priority: p2
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: ai-quality-experimentation-and-context-optimization
Completed by: dave
Completed: 2026-03-26

Hypothesis

Adding descriptions should produce the largest jump in SQL quality, with profile stats and sample values helping on more ambiguous or aggregation-heavy cases. Names-only context should be the weakest setting.

Method

Use the winning model from the model-comparison wave, keep the prompt and benchmark subset fixed, and vary only the schema context level across L0, L1, L3, and L5. Keep tool policy unchanged during this experiment so the effect of the preloaded context itself is isolated.

Current execution note: the cleaned benchmark does not carry precomputed schema_context, so richer levels have to be constructed at backend time.

  • This branch now has a dbt-package context provider that fetches model YAML + docs blocks from the public Fivetran dbt package repos, which is enough for L1 and L3.
  • It also has an inspect.json-backed provider for true L5 rich profile context. The remaining limitation is no longer code-path support; it is data availability: the current dbt-package benchmark does not yet have matching inspection artifacts per benchmark schema, so the checked-in dundersign example profile can prove the provider works but cannot be used to score the existing cto-research canary rows.
  • As of 2026-03-25, the unified apps.evals CLI also loads a local .env from the worktree root so provider credentials resolve consistently inside task worktrees.
  • As of 2026-03-26, the scorer also records judge attempts, parse failures, and fallback usage in both per-case records and top-line run summaries.
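The backend-time providers described above can be thought of as implementations of a small shared interface. This is an illustrative sketch only; the class and function names are assumptions, not the actual module layout in this branch:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class SchemaContext:
    """Preloaded schema text handed to the generator prompt."""
    level: str      # "l0" | "l1" | "l2" | "l3" | "l5"
    rendered: str   # final text block injected into the prompt


class ContextProvider(Protocol):
    """Anything that can build schema context for a benchmark case."""

    def build(self, tables_used: list[str], level: str) -> SchemaContext:
        ...


def null_provider_build(tables_used: list[str], level: str) -> SchemaContext:
    # L0: no schema context at all, matching the no-context baseline runs.
    return SchemaContext(level="l0", rendered="")
```

Under this shape, the dbt-package, information_schema, and inspect.json providers differ only in where `rendered` comes from, which is what lets the CLI swap them via --context-provider without touching the benchmark files.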

Variables

Independent: schema context level (L0, L1, L3, L5)
Dependent: pass rate, equivalence rate, grounding failures, latency, token cost
Controlled: model, prompt, canary subset, tool policy, scorer, seed, temperature

Execution Log

Run 1: L0 names only

  • Command: uv run python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --judge-model claude-sonnet-4-6 --judge-provider anthropic --limit 5
  • Config: provisional baseline model claude-sonnet-4-6, L0 / no schema context, canary subset limited to 5 cases
  • Output: apps/evals/output/sql/20260318_215106_raw_llm/results.jsonl
  • Started: 2026-03-18
  • Duration: <not captured>
  • Notes: First successful end-to-end run on this task branch. Pass and parse were both 0.000, which is directionally expected for a no-context baseline. Setup discoveries: (1) the Anthropic account does not expose claude-3-5-sonnet-latest, so the experiment branch now uses claude-sonnet-4-6; (2) the eval scorer needed a guard for tokenizer failures when models return prose instead of SQL.
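The tokenizer guard mentioned in (2) amounts to wrapping the scorer's SQL parser so prose answers record a parse failure instead of crashing the run. A minimal sketch, with the parser injected as a callable since the scorer's actual parser API is not shown here:

```python
from typing import Callable


def guarded_parse(parse: Callable[[str], object], candidate: str) -> bool:
    """Return True only if the candidate parses as SQL.

    Models sometimes return prose instead of SQL at L0; without a guard
    like this, tokenizer/parser errors crashed the scorer instead of
    being counted as parse=0 for the case.
    """
    if not candidate or not candidate.strip():
        return False
    try:
        parse(candidate)
        return True
    except Exception:
        # Any tokenizer or parser failure counts as a parse failure.
        return False
```

The broad `except Exception` is deliberate for a scorer: any failure mode of the underlying parser should degrade to a recorded metric, never a crashed run.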

Run 2: L1 names-only package context

  • Command: uv run python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l1 --judge-model claude-sonnet-4-6 --judge-provider anthropic --limit 5
  • Config: provisional baseline model claude-sonnet-4-6, L1 via public dbt package metadata (table + column names only), same 5-case canary slice
  • Output: apps/evals/output/sql/20260318_230727_raw_llm/results.jsonl
  • Started: 2026-03-18
  • Duration: <not captured>
  • Notes: This run used a new backend-time context provider instead of baking schema payloads into the benchmark. It improved the same 5-case slice from 0.000 pass / 0.000 parse to 0.600 pass / 1.000 parse. Two judge responses were malformed JSON in logs, but the run completed and wrote artifacts.

Run 3: L3 package context with descriptions

  • Command: uv run python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l3 --judge-model claude-sonnet-4-6 --judge-provider anthropic --limit 5
  • Config: provisional baseline model claude-sonnet-4-6, L3 via public dbt package metadata (table names, column names, descriptions/docs), same 5-case canary slice
  • Output: apps/evals/output/sql/20260318_230743_raw_llm/results.jsonl
  • Started: 2026-03-18
  • Duration: <not captured>
  • Notes: L3 matched L1 on pass rate (0.600) and parse rate (1.000) on this slice, but did not improve equivalence (0.400 vs 0.600 on L1). This suggests descriptions are not an obvious win on the tiny ad-reporting sample; they may add noise, or the comparison may simply need a broader slice before a call can be made.

Run 4: L5 full context

  • Command: uv run python - <<'PY' ... make_inspect_cache_context_provider(inspect_path='examples/target/inspect.json', context_level='l5') ... PY
  • Config: direct provider verification against the checked-in example inspection artifact, using a matching tables_used=['users'] example case
  • Output: in-process provider output only (no benchmark run artifact yet)
  • Started: 2026-03-18
  • Duration: <not captured>
  • Notes: The L5 provider path is now implemented and produces real rich context from inspect.json artifacts, including row_count, grain, primary_date, semantic tags, and categorical values. What remains blocked is a benchmark-aligned L5 eval run, because the current dbt-package benchmark cases do not yet have corresponding profiled inspection artifacts for their source schemas.
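For illustration, here is roughly what the inspect.json-backed L5 provider renders, using the fields named in the notes (row_count, grain, primary_date, semantic tags, categorical values). The artifact shape and function name here are guesses for sketching purposes, not the checked-in format:

```python
import json
from pathlib import Path


def render_l5_context(inspect_path: str, tables_used: list[str]) -> str:
    """Render rich profile context for only the tables a case uses."""
    artifact = json.loads(Path(inspect_path).read_text())
    lines: list[str] = []
    for table in artifact.get("tables", []):
        if table["name"] not in tables_used:
            continue
        lines.append(
            f"table {table['name']} (rows={table.get('row_count', '?')}, "
            f"grain={table.get('grain', '?')}, "
            f"primary_date={table.get('primary_date', '?')})"
        )
        for col in table.get("columns", []):
            tags = ",".join(col.get("semantic_tags", []))
            vals = col.get("categorical_values")
            extra = f" values={vals}" if vals else ""
            lines.append(f"  {col['name']} [{tags}]{extra}")
    return "\n".join(lines)
```

Filtering on tables_used is the key design point: L5 context is much larger per table than L1/L3, so rendering only the case's own tables keeps prompt size bounded.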

Run 5: L1 package context, 20-case canary slice

  • Command: uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l1 --judge-model claude-sonnet-4-6 --judge-provider anthropic --limit 20 --concurrency 2
  • Config: baseline model claude-sonnet-4-6, L1 package metadata, 20-case canary slice, authenticated GitHub metadata fetch, judge retry enabled
  • Output: apps/evals/runs/sql/20260325_071633_raw_llm/results.jsonl
  • Started: 2026-03-25
  • Duration: <not captured>
  • Notes: This run was the first clean rerun after two harness fixes discovered during execution: (1) the scorer now retries malformed/empty judge JSON up to 3 times; (2) the dbt-package context provider now falls back to gh auth token when GH_TOKEN / GITHUB_TOKEN are unset, which avoided unauthenticated GitHub API rate limits. Parse rate held at 0.400, pass rate improved to 0.250, and malformed-judge fallbacks on the same slice dropped from 6 to 3.

Run 6: L3 package context with descriptions, 20-case canary slice

  • Command: uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l3 --judge-model claude-sonnet-4-6 --judge-provider anthropic --limit 20 --concurrency 2
  • Config: baseline model claude-sonnet-4-6, L3 package metadata with descriptions/docs, same 20-case canary slice and harness settings as Run 5
  • Output: apps/evals/runs/sql/20260325_071753_raw_llm/results.jsonl
  • Started: 2026-03-25
  • Duration: <not captured>
  • Notes: L3 preserved the same parse rate as L1 (0.400) and improved pass rate to 0.300 plus equivalence rate to 0.250. One case still exhausted all 3 judge retries, so the scorer is more resilient but not yet fully deterministic with Anthropic-as-judge.

Run 7: L0 no-context baseline, 20-case canary slice

  • Command: uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-level l0 --judge-model claude-sonnet-4-6 --judge-provider anthropic --limit 20 --concurrency 2
  • Config: baseline model claude-sonnet-4-6, L0 / no schema context, same 20-case canary slice and scorer settings
  • Output: apps/evals/runs/sql/20260325_071827_raw_llm/results.jsonl
  • Started: 2026-03-25
  • Duration: <not captured>
  • Notes: Confirmed the no-context floor on a larger slice: 0.000 parse and 0.000 pass. This reinforces that the model is not reliably generating parseable SQL on this benchmark without preloaded schema context.

Run 8: L0 no-context baseline, full 400-case canary

  • Command: uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-level l0 --judge-model claude-sonnet-4-6 --judge-provider anthropic --concurrency 2
  • Config: same claude-sonnet-4-6 baseline with no schema context, full stratified canary
  • Output: apps/evals/runs/sql/20260325_074847_raw_llm/results.jsonl
  • Started: 2026-03-25
  • Duration: <not captured>
  • Notes: Full-canary control confirmed the floor from the smaller slice: 0.000 parse and 0.000 pass across all 400 cases.

Run 9: L1 package context, 50-case slice, Anthropic judge diagnostic

  • Command: uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l1 --judge-model claude-sonnet-4-6 --judge-provider anthropic --limit 50 --concurrency 2
  • Config: same generator as prior runs, first attempt at a medium slice
  • Output: apps/evals/runs/sql/20260325_154124_raw_llm/results.jsonl
  • Started: 2026-03-25
  • Duration: <not captured>
  • Notes: The run completed with 0.220 pass and 0.440 parse, but it logged 11/50 judge parse fallbacks. This run is useful as a harness diagnostic, not as the primary comparison artifact.

Run 10: L1 package context, 50-case slice, OpenAI judge

  • Command: uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l1 --judge-model gpt-4.1-mini --judge-provider openai --limit 50 --concurrency 2
  • Config: claude-sonnet-4-6 generation with gpt-4.1-mini as judge, 50-case medium slice
  • Output: apps/evals/runs/sql/20260325_154418_raw_llm/results.jsonl
  • Started: 2026-03-25
  • Duration: <not captured>
  • Notes: This became the primary L1 comparison artifact. It completed with 0.400 pass / 0.440 parse / 0.240 equivalence and 0/50 judge parse fallbacks.

Run 11: L3 package context with descriptions, 50-case slice, OpenAI judge

  • Command: uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l3 --judge-model gpt-4.1-mini --judge-provider openai --limit 50 --concurrency 2
  • Config: same 50-case medium slice and judge as Run 10, with L3 descriptions/docs enabled
  • Output: apps/evals/runs/sql/20260325_154655_raw_llm/results.jsonl
  • Started: 2026-03-25
  • Duration: <not captured>
  • Notes: L3 landed at 0.380 pass / 0.420 parse / 0.280 equivalence with 0/50 judge parse fallbacks.

Run 12: L1 package context repeatability rerun, 50-case slice, OpenAI judge

  • Command: uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l1 --judge-model gpt-4.1-mini --judge-provider openai --limit 50 --concurrency 2
  • Config: exact rerun of Run 10 to measure judge/run stability
  • Output: apps/evals/runs/sql/20260325_154943_raw_llm/results.jsonl
  • Started: 2026-03-25
  • Duration: <not captured>
  • Notes: The rerun reproduced the headline L1 numbers exactly: 0.400 pass / 0.440 parse, with equivalence improving to 0.320 and again 0/50 judge parse fallbacks.

Run 13: Dundersign live-source smoke, L0 no-context

  • Command: uv run --extra agent python -m apps.evals sql --benchmark .tmp/dundersign_sql_smoke.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-level l0 --judge-model gpt-4.1-mini --judge-provider openai --dialect duckdb --concurrency 2
  • Config: 8-case source-matched smoke benchmark against a real local DuckDB source generated from faketran fake_companies/dundersign
  • Output: apps/evals/runs/sql/20260325_164313_raw_llm/results.jsonl
  • Started: 2026-03-25
  • Duration: <not captured>
  • Notes: This was the first honest live-source control. With no context, the generator again failed completely: 0.000 parse / 0.000 pass.

Run 14: Dundersign live-source smoke, information_schema

  • Command: uv run --extra agent python -m apps.evals sql --benchmark .tmp/dundersign_sql_smoke.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider information_schema --source dundersign_duckdb --context-level l2 --judge-model gpt-4.1-mini --judge-provider openai --dialect duckdb --concurrency 2
  • Config: same 8-case source-matched smoke benchmark, now using a real live source and the new information_schema context provider
  • Output: apps/evals/runs/sql/20260325_164341_raw_llm/results.jsonl
  • Started: 2026-03-25
  • Duration: <not captured>
  • Notes: This run proved the live-source path itself. It landed at 1.000 parse / 1.000 pass / 0.750 equivalence. That is a source-matched smoke result, not evidence about the cto-research schemas.

Run 15: Dundersign tracked smoke benchmark, L0 no-context

  • Command: uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dundersign_sql_smoke.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-level l0 --judge-model gpt-4.1-mini --judge-provider openai --dialect duckdb --concurrency 2
  • Config: tracked 19-case Dundersign smoke benchmark, all gold SQL pre-validated against the local DuckDB source
  • Output: apps/evals/runs/sql/20260325_165531_raw_llm/results.jsonl
  • Started: 2026-03-25
  • Duration: <not captured>
  • Notes: The larger Dundersign control reproduced the same floor: 0.000 parse / 0.000 pass across all 19 cases.

Run 16: Dundersign tracked smoke benchmark, information_schema

  • Command: uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dundersign_sql_smoke.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider information_schema --source dundersign_duckdb --context-level l2 --judge-model gpt-4.1-mini --judge-provider openai --dialect duckdb --concurrency 2
  • Config: same tracked 19-case Dundersign benchmark and live DuckDB source as Run 15
  • Output: apps/evals/runs/sql/20260325_165627_raw_llm/results.jsonl
  • Started: 2026-03-25
  • Duration: <not captured>
  • Notes: This is the clearest live-source result on the branch so far: 1.000 parse / 1.000 pass on the full tracked Dundersign smoke benchmark. It validates the information_schema provider and source wiring for a real source-matched benchmark.
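The information_schema provider validated here can be sketched in two parts: the query it runs against the live source (DuckDB exposes the standard information_schema views) and a renderer that turns the rows into L2-style context. The rendering function and its exact output format are assumptions for illustration:

```python
# Run against the live DuckDB source; `?` is DuckDB's positional placeholder.
INFORMATION_SCHEMA_QUERY = """
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = ?
    ORDER BY table_name, ordinal_position
"""


def render_information_schema_context(rows: list[tuple[str, str, str]]) -> str:
    """Render L2-style context from (table, column, type) rows."""
    out: list[str] = []
    current = None
    for table, column, dtype in rows:
        if table != current:
            out.append(f"table {table}")
            current = table
        out.append(f"  {column} {dtype}")
    return "\n".join(out)
```

Keeping the renderer pure (rows in, text out) is what makes this path easy to smoke-test without a live connection, which matters given how much of this experiment was harness debugging.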

Run 17: L1 package context, 50-case slice, Anthropic judge after judge-metrics upgrade

  • Command: uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l1 --judge-model claude-sonnet-4-6 --judge-provider anthropic --limit 50 --concurrency 2
  • Config: same 50-case L1 medium slice as Run 9, now with judge retry backoff plus explicit judge-attempt / parse-failure / fallback metrics in the run summary
  • Output: apps/evals/runs/sql/20260327_055939_raw_llm/results.jsonl
  • Started: 2026-03-26
  • Duration: <not captured>
  • Notes: This rerun showed that the scorer upgrade improved observability, not Anthropic judge quality. The judge still fell back on 11 cases, with 35 total parse failures across 22 judged cases, judge_retry_case_rate=0.545, and judge_avg_attempts=2.091. Pass/parse also regressed slightly to 0.200 / 0.440.
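The summary fields reported here can be rolled up from per-case records with a small aggregation. The record keys below mirror the metric names and are an assumption about the artifact shape, not the scorer's actual schema:

```python
def summarize_judge_metrics(cases: list[dict]) -> dict:
    """Roll per-case judge telemetry up into run-summary fields."""
    judged = [c for c in cases if c.get("judge_attempts")]
    if not judged:
        return {"judge_retry_case_rate": 0.0, "judge_avg_attempts": 0.0,
                "judge_parse_failure_count": 0, "judge_fallback_count": 0}
    retried = sum(1 for c in judged if c["judge_attempts"] > 1)
    return {
        # Fraction of judged cases that needed more than one attempt.
        "judge_retry_case_rate": retried / len(judged),
        "judge_avg_attempts": sum(c["judge_attempts"] for c in judged) / len(judged),
        "judge_parse_failure_count": sum(c.get("judge_parse_failures", 0) for c in judged),
        "judge_fallback_count": sum(1 for c in judged if c.get("judge_fallback")),
    }
```

Note the denominator is judged cases only (cases that parsed and reached the judge), which is consistent with this run's numbers: 22 judged cases out of 50, with retry rate and average attempts computed over those 22.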

Run 18: L1 package context, 50-case slice, OpenAI judge after judge-metrics upgrade

  • Command: uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l1 --judge-model gpt-4.1-mini --judge-provider openai --limit 50 --concurrency 2
  • Config: same 50-case L1 slice and generator as Run 17, but with gpt-4.1-mini judge under the new scorer/reporting path
  • Output: apps/evals/runs/sql/20260327_060144_raw_llm/results.jsonl
  • Started: 2026-03-26
  • Duration: <not captured>
  • Notes: This control stayed clean: judge_retry_case_rate=0.0, judge_parse_failure_count=0, judge_fallback_count=0, judge_avg_attempts=1.0. Pass/parse came in at 0.420 / 0.420, with 0.240 strict equivalence.

Run 19: L1 package context, 50-case slice, OpenAI judge repeat after judge-metrics upgrade

  • Command: uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l1 --judge-model gpt-4.1-mini --judge-provider openai --limit 50 --concurrency 2
  • Config: exact rerun of Run 18 to distinguish judge stability from ordinary generator variance
  • Output: apps/evals/runs/sql/20260327_060351_raw_llm/results.jsonl
  • Started: 2026-03-26
  • Duration: <not captured>
  • Notes: The repeat stayed clean on judge behavior: zero retries, zero parse failures, zero fallbacks again, with judge_avg_attempts=1.0. Pass/parse moved modestly to 0.440 / 0.440 while strict equivalence held at 0.240, which suggests the remaining variance is generator-side rather than judge-side.

Results

Condition | Pass Rate | Avg Score | Parse | Grounding | Intent | Latency | Tokens
L0 (20-case slice) | 0.000 | n/a | 0.000 | n/a | 0.000 | not recorded | not recorded
L0 (400-case canary) | 0.000 | n/a | 0.000 | n/a | 0.000 | not recorded | not recorded
L1 (20-case slice, Anthropic judge) | 0.250 | n/a | 0.400 | n/a | 0.150 | not recorded | not recorded
L3 (20-case slice, Anthropic judge) | 0.300 | n/a | 0.400 | n/a | 0.250 | not recorded | not recorded
L1 (50-case slice, Anthropic judge diagnostic) | 0.220 | n/a | 0.440 | n/a | 0.180 | not recorded | not recorded
L1 (50-case slice, OpenAI judge) | 0.400 | n/a | 0.440 | n/a | 0.240 | not recorded | not recorded
L3 (50-case slice, OpenAI judge) | 0.380 | n/a | 0.420 | n/a | 0.280 | not recorded | not recorded
L1 (50-case slice, OpenAI judge repeat) | 0.400 | n/a | 0.440 | n/a | 0.320 | not recorded | not recorded
L1 (50-case slice, Anthropic judge + metrics upgrade) | 0.200 | n/a | 0.440 | n/a | 0.160 | not recorded | not recorded
L1 (50-case slice, OpenAI judge + metrics upgrade) | 0.420 | n/a | 0.420 | n/a | 0.240 | not recorded | not recorded
L1 (50-case slice, OpenAI judge + metrics upgrade repeat) | 0.440 | n/a | 0.440 | n/a | 0.240 | not recorded | not recorded
Dundersign live source, L0 (8-case smoke) | 0.000 | n/a | 0.000 | n/a | 0.000 | not recorded | not recorded
Dundersign live source, information_schema (8-case smoke) | 1.000 | n/a | 1.000 | n/a | 0.750 | not recorded | not recorded
Dundersign live source, L0 (19-case tracked smoke) | 0.000 | n/a | 0.000 | n/a | 0.000 | not recorded | not recorded
Dundersign live source, information_schema (19-case tracked smoke) | 1.000 | n/a | 1.000 | n/a | not recorded | not recorded | not recorded
L5 (provider wired; benchmark artifact missing) | n/a | n/a | n/a | n/a | n/a | n/a | n/a

Breakdowns

  • Historical 5-case smoke runs on 2026-03-18 still matter as a quick sanity check: both L1 and L3 reached 0.600 pass / 1.000 parse on that tiny ad_reporting slice, while L0 stayed at 0.000.
  • On the 20-case slice, L0 remained a hard floor, L1 and L3 tied on parse rate, and L3 slightly outperformed L1 on judged correctness rather than parseability.
  • On the newer 50-case OpenAI-judged slice, L1 and L3 were effectively tied: L1 averaged 0.400 pass / 0.440 parse across two identical reruns, while L3 landed at 0.380 pass / 0.420 parse with slightly higher equivalence (0.280 vs 0.240 on the first L1 run).
  • The full 400-case L0 control confirms this is not just a small-sample effect: without schema context, the current baseline generator is unusable on the canary benchmark.
  • The Dundersign live-source smoke benchmark supplies the control that was missing earlier in the day: with a real source configured, information_schema moved the model from 0.000 parse / 0.000 pass to 1.000 parse / 1.000 pass on both an 8-case smoke set and a tracked 19-case source-matched benchmark.
  • The scorer now makes judge quality visible in the artifacts. On the same 50-case L1 slice, Anthropic judge recorded 35 parse failures, 11 fallback cases, and judge_retry_case_rate=0.545, while two fresh OpenAI judge reruns both recorded 0 retries and 0 fallbacks with judge_avg_attempts=1.0.

Analysis

The experiment is no longer blocked at L1/L3, and the new 20-case plus 50-case reruns exposed and fixed three harness issues that would have made the comparison noisy or flaky on current main:

  1. Anthropic judge responses sometimes came back empty or non-JSON. The scorer now retries malformed judge output up to 3 times before recording a fallback failure.
  2. The dbt-package context provider was using unauthenticated GitHub API requests unless GH_TOKEN / GITHUB_TOKEN were manually exported. It now falls back to gh auth token, which eliminated the GitHub 403 rate-limit failure seen during the first 20-case rerun.
  3. The unified apps.evals CLI did not load .env from the task worktree, which made provider availability look inconsistent across sessions. It now loads a local .env from Path.cwd() when present.

What the early result suggests:

  1. Preloaded schema context clearly matters. L0 is 0.000 parse / 0.000 pass on both the 20-case slice and the full 400-case canary.
  2. Once the judge was stabilized with gpt-4.1-mini, L1 and L3 became close enough that there is no honest “winner” yet. L3 still looks slightly better on semantic equivalence, but L1 is slightly better on pass and parse rate on the 50-case slice.
  3. The OpenAI-judged L1 rerun reproduced the headline metrics exactly (0.400 pass / 0.440 parse), which is the first convincing evidence that the medium-slice comparison is stable enough to use as a current-state snapshot.
  4. The Anthropic-judged 50-case run is still useful as a diagnosis: 11/50 fallback judgements is too much noise for a primary comparison artifact, even though the scorer now fails soft instead of crashing.
  5. The experiment now has a true L5 code path through inspect.json artifacts, but the current benchmark still lacks matching inspection artifacts for its schemas.
  6. The new Dundersign runs close a different gap than the cto-research runs. They validate that the live-source information_schema path works end to end on a real source, but they do not change the fact that there is still no live-source source coverage for the cto-research schemas themselves.
  7. The judge upgrades helped in two concrete ways: there is now retry backoff instead of hammering the provider, and every run now exposes judge retry/fallback metrics directly in summary.json. They did not rescue Anthropic-as-judge on this slice. The fresh Anthropic rerun still fell back on the same order of magnitude of cases (11), which makes OpenAI the only honest default judge for medium-size comparisons right now.
  8. The two new OpenAI-judged reruns suggest the remaining variation is mostly generation-side, not judge-side. Pass/parse moved from 0.420/0.420 to 0.440/0.440, but strict equivalence stayed fixed at 0.240 and judge retries stayed at zero.

Conclusion

L0 is the floor, and that result now holds in both settings we can honestly measure today: the cto-research canary and a source-matched live Dundersign benchmark. With a reliable OpenAI judge, the current cto-research 50-case snapshot still says L1 and L3 are effectively tied. Separately, the live-source Dundersign benchmark shows that information_schema is now a real, working control path rather than a theoretical one.

The judge-focused reruns make the judge decision much clearer: Anthropic-as-judge is still too noisy for primary evals even with backoff and retries, while the gpt-4.1-mini judge remained clean across two fresh 50-case reruns with zero retries and zero fallbacks.

The practical decisions:

  • Keep schema context mandatory.
  • Treat L1 and L3 as both viable on the benchmark we actually have.
  • Keep gpt-4.1-mini as the default SQL eval judge.
  • Treat information_schema as validated for live-source evals when a source-matched benchmark exists.

L5 remains blocked on missing benchmark-aligned inspection artifacts, and there is still no honest live-source coverage for the cto-research schemas.

Follow-up experiments

  • Expand the same OpenAI-judged L1 vs L3 comparison from 50 cases to 100+ or the full canary now that the harness is stable enough.
  • Create or import a source-matched live benchmark for one of the real cto-research schemas if we want a direct dbt_package versus information_schema comparison on that dataset family.
  • Run a true model-comparison pass now that worktree-local .env loading exposes both Anthropic and OpenAI credentials consistently.
  • Produce or collect per-schema inspect.json / context-catalog artifacts for benchmark schemas so the new L5 provider can be run against the benchmark honestly.

2026-03-22 Triage Decision

  • Status set to blocked (not obsolete): L0/L1/L3 runs and provider wiring are done (PRs #674, #686), but benchmark-aligned L5 cannot be executed until matching inspection artifacts exist for benchmark schemas.