Experiment: Context ablation L0 vs L1 vs L3 vs L5
Hypothesis
Adding descriptions should produce the largest jump in SQL quality, with profile stats and sample values helping on more ambiguous or aggregation-heavy cases. Names-only context should be the weakest setting.
Method
Use the winning model from the model-comparison wave, keep the prompt and benchmark subset fixed, and vary only the schema context level across L0, L1, L3, and L5. Keep tool policy unchanged during this experiment so the effect of the preloaded context itself is isolated.
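The isolation described here can be made concrete as a command matrix. This is a sketch only: the flag spellings are copied from the run commands logged in the execution log, the L5 invocation is a placeholder (no benchmark-aligned L5 run exists yet), and whether L0 passes an explicit `--context-level` flag varies between actual runs.

```python
# Hypothetical helper: builds the four CLI invocations for this ablation.
# Only the context flags vary; model, prompt, benchmark, and judge are fixed.
BASE = [
    "uv", "run", "python", "-m", "apps.evals", "sql",
    "--benchmark", "apps/evals/data/dbt_sql_canary.jsonl",
    "--backend", "raw_llm",
    "--model", "claude-sonnet-4-6", "--provider", "anthropic",
    "--judge-model", "claude-sonnet-4-6", "--judge-provider", "anthropic",
    "--limit", "5",
]

def command_for(level: str) -> list[str]:
    """Return the invocation for one context level (L0, L1, L3, L5)."""
    cmd = [*BASE, "--context-level", level.lower()]
    if level in ("L1", "L3"):
        # Package-metadata levels are served by the backend-time provider.
        cmd += ["--context-provider", "dbt_package"]
    return cmd

commands = {level: command_for(level) for level in ("L0", "L1", "L3", "L5")}
```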
Current execution note: the cleaned benchmark does not carry precomputed `schema_context`, so richer levels have to be constructed at backend time. This branch now has a dbt-package context provider that fetches model YAML + docs blocks from the public Fivetran dbt package repos, which is enough for L1 and L3. It also now has an `inspect.json`-backed provider for true L5 rich profile context. The remaining limitation is no longer code-path support; it is data availability. The current dbt-package benchmark does not yet have matching inspection artifacts per benchmark schema, so the checked-in dundersign example profile can prove the provider works but cannot be used to score the existing cto-research canary rows.

As of 2026-03-25, the unified `apps.evals` CLI also loads a local `.env` from the worktree root so provider credentials resolve consistently inside task worktrees. As of 2026-03-26, the scorer also records judge attempts, parse failures, and fallback usage into both per-case records and top-line run summaries.
Variables
| Variable | Values |
|---|---|
| Independent | schema context level (L0, L1, L3, L5) |
| Dependent | pass rate, equivalence rate, grounding failures, latency, token cost |
| Controlled | model, prompt, canary subset, tool policy, scorer, seed, temperature |
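The dependent variables are per-case booleans aggregated into rates. A sketch of how a run's `results.jsonl` could be rolled up; the per-case field names (`passed`, `parsed`, `equivalent`) are assumptions for illustration, not confirmed from the harness.

```python
import json
from pathlib import Path

def summarize(results_path: Path) -> dict[str, float]:
    """Aggregate per-case boolean outcomes into run-level rates."""
    cases = [
        json.loads(line)
        for line in results_path.read_text().splitlines()
        if line.strip()
    ]
    n = len(cases) or 1  # avoid division by zero on an empty file
    return {
        "pass_rate": sum(bool(c.get("passed")) for c in cases) / n,
        "parse_rate": sum(bool(c.get("parsed")) for c in cases) / n,
        "equivalence_rate": sum(bool(c.get("equivalent")) for c in cases) / n,
    }
```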
Execution Log
Run 1: L0 names only
- Command: `uv run python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --judge-model claude-sonnet-4-6 --judge-provider anthropic --limit 5`
- Config: provisional baseline model `claude-sonnet-4-6`, `L0` / no schema context, canary subset limited to 5 cases
- Output: `apps/evals/output/sql/20260318_215106_raw_llm/results.jsonl`
- Started: 2026-03-18
- Duration: <not captured>
- Notes: First successful end-to-end run on this task branch. Pass and parse were both `0.000`, which is directionally expected for a no-context baseline. Setup discoveries: (1) the Anthropic account does not expose `claude-3-5-sonnet-latest`, so the experiment branch now uses `claude-sonnet-4-6`; (2) the eval scorer needed a guard for tokenizer failures when models return prose instead of SQL.
Run 2: L1 names-only package context
- Command: `uv run python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l1 --judge-model claude-sonnet-4-6 --judge-provider anthropic --limit 5`
- Config: provisional baseline model `claude-sonnet-4-6`, `L1` via public dbt package metadata (table + column names only), same 5-case canary slice
- Output: `apps/evals/output/sql/20260318_230727_raw_llm/results.jsonl`
- Started: 2026-03-18
- Duration: <not captured>
- Notes: This run used a new backend-time context provider instead of baking schema payloads into the benchmark. It improved the same 5-case slice from `0.000` pass / `0.000` parse to `0.600` pass / `1.000` parse. Two judge responses were malformed JSON in logs, but the run completed and wrote artifacts.
Run 3: L3 package context with descriptions
- Command: `uv run python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l3 --judge-model claude-sonnet-4-6 --judge-provider anthropic --limit 5`
- Config: provisional baseline model `claude-sonnet-4-6`, `L3` via public dbt package metadata (table names, column names, descriptions/docs), same 5-case canary slice
- Output: `apps/evals/output/sql/20260318_230743_raw_llm/results.jsonl`
- Started: 2026-03-18
- Duration: <not captured>
- Notes: `L3` matched `L1` on pass rate (`0.600`) and parse rate (`1.000`) on this slice, but did not improve equivalence (`0.400` vs `0.600` on `L1`). This suggests descriptions are not an obvious win on the tiny ad-reporting sample, and may add noise or simply need a broader slice before deciding.
Run 4: L5 full context
- Command: `uv run python - <<'PY' ... make_inspect_cache_context_provider(inspect_path='examples/target/inspect.json', context_level='l5') ... PY`
- Config: direct provider verification against the checked-in example inspection artifact, using a matching `tables_used=['users']` example case
- Output: in-process provider output only (no benchmark run artifact yet)
- Started: 2026-03-18
- Duration: <not captured>
- Notes: The `L5` provider path is now implemented and produces real rich context from `inspect.json` artifacts, including `row_count`, `grain`, `primary_date`, semantic tags, and categorical values. What remains blocked is a benchmark-aligned `L5` eval run, because the current dbt-package benchmark cases do not yet have corresponding profiled inspection artifacts for their source schemas.
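To show what the `L5` provider's job looks like, here is a sketch of rendering rich context from an `inspect.json` profile. The artifact layout assumed here (tables keyed by name, with `row_count`, `grain`, `primary_date`, and per-column `categorical_values`) is an illustration, not the provider's confirmed schema.

```python
import json
from pathlib import Path

def render_l5_context(inspect_path: Path, tables: list[str]) -> str:
    """Render an L5-style rich context block from a profile artifact.

    Hypothetical artifact shape, for illustration only."""
    profile = json.loads(inspect_path.read_text())
    blocks = []
    for name in tables:
        t = profile["tables"][name]
        lines = [
            f"table {name} (rows={t['row_count']}, grain={t['grain']}, "
            f"primary_date={t['primary_date']})"
        ]
        for col in t["columns"]:
            desc = f"  {col['name']} {col['type']}"
            if col.get("categorical_values"):
                # Sample values help the model pick valid filter literals.
                desc += " values=" + ",".join(col["categorical_values"])
            lines.append(desc)
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```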
Run 5: L1 package context, 20-case canary slice
- Command: `uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l1 --judge-model claude-sonnet-4-6 --judge-provider anthropic --limit 20 --concurrency 2`
- Config: baseline model `claude-sonnet-4-6`, `L1` package metadata, 20-case canary slice, authenticated GitHub metadata fetch, judge retry enabled
- Output: `apps/evals/runs/sql/20260325_071633_raw_llm/results.jsonl`
- Started: 2026-03-25
- Duration: <not captured>
- Notes: This run was the first clean rerun after two harness fixes discovered during execution: (1) the scorer now retries malformed/empty judge JSON up to 3 times; (2) the dbt-package context provider now falls back to `gh auth token` when `GH_TOKEN`/`GITHUB_TOKEN` are unset, which avoided unauthenticated GitHub API rate limits. Parse rate held at `0.400`, pass rate improved to `0.250`, and malformed-judge fallbacks on the same slice dropped from 6 to 3.
Run 6: L3 package context with descriptions, 20-case canary slice
- Command: `uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l3 --judge-model claude-sonnet-4-6 --judge-provider anthropic --limit 20 --concurrency 2`
- Config: baseline model `claude-sonnet-4-6`, `L3` package metadata with descriptions/docs, same 20-case canary slice and harness settings as Run 5
- Output: `apps/evals/runs/sql/20260325_071753_raw_llm/results.jsonl`
- Started: 2026-03-25
- Duration: <not captured>
- Notes: `L3` preserved the same parse rate as `L1` (`0.400`) and improved pass rate to `0.300` and equivalence rate to `0.250`. One case still exhausted all 3 judge retries, so the scorer is more resilient but not yet fully deterministic with Anthropic-as-judge.
Run 7: L0 no-context baseline, 20-case canary slice
- Command: `uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-level l0 --judge-model claude-sonnet-4-6 --judge-provider anthropic --limit 20 --concurrency 2`
- Config: baseline model `claude-sonnet-4-6`, `L0` / no schema context, same 20-case canary slice and scorer settings
- Output: `apps/evals/runs/sql/20260325_071827_raw_llm/results.jsonl`
- Started: 2026-03-25
- Duration: <not captured>
- Notes: Confirmed the no-context floor on a larger slice: `0.000` parse and `0.000` pass. This reinforces that the model is not reliably generating parseable SQL on this benchmark without preloaded schema context.
Run 8: L0 no-context baseline, full 400-case canary
- Command: `uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-level l0 --judge-model claude-sonnet-4-6 --judge-provider anthropic --concurrency 2`
- Config: same `claude-sonnet-4-6` baseline with no schema context, full stratified canary
- Output: `apps/evals/runs/sql/20260325_074847_raw_llm/results.jsonl`
- Started: 2026-03-25
- Duration: <not captured>
- Notes: Full-canary control confirmed the floor from the smaller slice: `0.000` parse and `0.000` pass across all 400 cases.
Run 9: L1 package context, 50-case slice, Anthropic judge diagnostic
- Command: `uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l1 --judge-model claude-sonnet-4-6 --judge-provider anthropic --limit 50 --concurrency 2`
- Config: same generator as prior runs, first attempt at a medium slice
- Output: `apps/evals/runs/sql/20260325_154124_raw_llm/results.jsonl`
- Started: 2026-03-25
- Duration: <not captured>
- Notes: The run completed with `0.220` pass and `0.440` parse, but it logged `11/50` judge parse fallbacks. This run is useful as a harness diagnostic, not as the primary comparison artifact.
Run 10: L1 package context, 50-case slice, OpenAI judge
- Command: `uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l1 --judge-model gpt-4.1-mini --judge-provider openai --limit 50 --concurrency 2`
- Config: `claude-sonnet-4-6` generation with `gpt-4.1-mini` as judge, 50-case medium slice
- Output: `apps/evals/runs/sql/20260325_154418_raw_llm/results.jsonl`
- Started: 2026-03-25
- Duration: <not captured>
- Notes: This became the primary `L1` comparison artifact. It completed with `0.400` pass / `0.440` parse / `0.240` equivalence and `0/50` judge parse fallbacks.
Run 11: L3 package context with descriptions, 50-case slice, OpenAI judge
- Command: `uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l3 --judge-model gpt-4.1-mini --judge-provider openai --limit 50 --concurrency 2`
- Config: same 50-case medium slice and judge as Run 10, with `L3` descriptions/docs enabled
- Output: `apps/evals/runs/sql/20260325_154655_raw_llm/results.jsonl`
- Started: 2026-03-25
- Duration: <not captured>
- Notes: `L3` landed at `0.380` pass / `0.420` parse / `0.280` equivalence with `0/50` judge parse fallbacks.
Run 12: L1 package context repeatability rerun, 50-case slice, OpenAI judge
- Command: `uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l1 --judge-model gpt-4.1-mini --judge-provider openai --limit 50 --concurrency 2`
- Config: exact rerun of Run 10 to measure judge/run stability
- Output: `apps/evals/runs/sql/20260325_154943_raw_llm/results.jsonl`
- Started: 2026-03-25
- Duration: <not captured>
- Notes: The rerun reproduced the headline `L1` numbers exactly: `0.400` pass / `0.440` parse, with equivalence improving to `0.320` and again `0/50` judge parse fallbacks.
Run 13: Dundersign live-source smoke, L0 no-context
- Command: `uv run --extra agent python -m apps.evals sql --benchmark .tmp/dundersign_sql_smoke.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-level l0 --judge-model gpt-4.1-mini --judge-provider openai --dialect duckdb --concurrency 2`
- Config: 8-case source-matched smoke benchmark against a real local DuckDB source generated from `faketran fake_companies/dundersign`
- Output: `apps/evals/runs/sql/20260325_164313_raw_llm/results.jsonl`
- Started: 2026-03-25
- Duration: <not captured>
- Notes: This was the first honest live-source control. With no context, the generator again failed completely: `0.000` parse / `0.000` pass.
Run 14: Dundersign live-source smoke, information_schema
- Command: `uv run --extra agent python -m apps.evals sql --benchmark .tmp/dundersign_sql_smoke.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider information_schema --source dundersign_duckdb --context-level l2 --judge-model gpt-4.1-mini --judge-provider openai --dialect duckdb --concurrency 2`
- Config: same 8-case source-matched smoke benchmark, now using a real live source and the new `information_schema` context provider
- Output: `apps/evals/runs/sql/20260325_164341_raw_llm/results.jsonl`
- Started: 2026-03-25
- Duration: <not captured>
- Notes: This run proved the live-source path itself. It landed at `1.000` parse / `1.000` pass / `0.750` equivalence. That is a source-matched smoke result, not evidence about the `cto-research` schemas.
Run 15: Dundersign tracked smoke benchmark, L0 no-context
- Command: `uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dundersign_sql_smoke.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-level l0 --judge-model gpt-4.1-mini --judge-provider openai --dialect duckdb --concurrency 2`
- Config: tracked 19-case Dundersign smoke benchmark, all gold SQL pre-validated against the local DuckDB source
- Output: `apps/evals/runs/sql/20260325_165531_raw_llm/results.jsonl`
- Started: 2026-03-25
- Duration: <not captured>
- Notes: The larger Dundersign control reproduced the same floor: `0.000` parse / `0.000` pass across all 19 cases.
Run 16: Dundersign tracked smoke benchmark, information_schema
- Command: `uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dundersign_sql_smoke.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider information_schema --source dundersign_duckdb --context-level l2 --judge-model gpt-4.1-mini --judge-provider openai --dialect duckdb --concurrency 2`
- Config: same tracked 19-case Dundersign benchmark and live DuckDB source as Run 15
- Output: `apps/evals/runs/sql/20260325_165627_raw_llm/results.jsonl`
- Started: 2026-03-25
- Duration: <not captured>
- Notes: This is the clearest live-source result on the branch so far: `1.000` parse / `1.000` pass on the full tracked Dundersign smoke benchmark. It validates the `information_schema` provider and source wiring for a real source-matched benchmark.
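The `information_schema` provider's job can be illustrated with a minimal sketch: one standard catalog query (DuckDB ships `information_schema` views) plus a formatter that groups rows into an L2-style names-and-types context block. The function names are illustrative, not the provider's actual API.

```python
def information_schema_query(schema: str = "main") -> str:
    """Build the catalog query a live-source provider could run.

    information_schema.columns is standard SQL and is available in
    DuckDB, Postgres, and most warehouses."""
    return (
        "SELECT table_name, column_name, data_type "
        "FROM information_schema.columns "
        f"WHERE table_schema = '{schema}' "
        "ORDER BY table_name, ordinal_position"
    )

def format_context(rows: list[tuple[str, str, str]]) -> str:
    """Group (table, column, type) rows into one line per table."""
    tables: dict[str, list[str]] = {}
    for table, column, dtype in rows:
        tables.setdefault(table, []).append(f"{column} {dtype}")
    return "\n".join(
        f"table {name}: " + ", ".join(cols)
        for name, cols in sorted(tables.items())
    )
```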
Run 17: L1 package context, 50-case slice, Anthropic judge after judge-metrics upgrade
- Command: `uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l1 --judge-model claude-sonnet-4-6 --judge-provider anthropic --limit 50 --concurrency 2`
- Config: same 50-case `L1` medium slice as Run 9, now with judge retry backoff plus explicit judge-attempt / parse-failure / fallback metrics in the run summary
- Output: `apps/evals/runs/sql/20260327_055939_raw_llm/results.jsonl`
- Started: 2026-03-26
- Duration: <not captured>
- Notes: This rerun showed that the scorer upgrade improved observability, not Anthropic judge quality. The judge still fell back on `11` cases, with `35` total parse failures across `22` judged cases, `judge_retry_case_rate=0.545`, and `judge_avg_attempts=2.091`. Pass/parse also regressed slightly to `0.200`/`0.440`.
Run 18: L1 package context, 50-case slice, OpenAI judge after judge-metrics upgrade
- Command: `uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l1 --judge-model gpt-4.1-mini --judge-provider openai --limit 50 --concurrency 2`
- Config: same 50-case `L1` slice and generator as Run 17, but with a `gpt-4.1-mini` judge under the new scorer/reporting path
- Output: `apps/evals/runs/sql/20260327_060144_raw_llm/results.jsonl`
- Started: 2026-03-26
- Duration: <not captured>
- Notes: This control stayed clean: `judge_retry_case_rate=0.0`, `judge_parse_failure_count=0`, `judge_fallback_count=0`, `judge_avg_attempts=1.0`. Pass/parse came in at `0.420`/`0.420`, with `0.240` strict equivalence.
Run 19: L1 package context, 50-case slice, OpenAI judge repeat after judge-metrics upgrade
- Command: `uv run --extra agent python -m apps.evals sql --benchmark apps/evals/data/dbt_sql_canary.jsonl --backend raw_llm --model claude-sonnet-4-6 --provider anthropic --context-provider dbt_package --context-level l1 --judge-model gpt-4.1-mini --judge-provider openai --limit 50 --concurrency 2`
- Config: exact rerun of Run 18 to distinguish judge stability from ordinary generator variance
- Output: `apps/evals/runs/sql/20260327_060351_raw_llm/results.jsonl`
- Started: 2026-03-26
- Duration: <not captured>
- Notes: The repeat stayed clean on judge behavior: zero retries, zero parse failures, zero fallbacks again, with `judge_avg_attempts=1.0`. Pass/parse moved modestly to `0.440`/`0.440` while strict equivalence held at `0.240`, which suggests the remaining variance is generator-side rather than judge-side.
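The judge metrics reported in Runs 17-19 can be rolled up from per-case records roughly like this. The per-case keys assumed here (`judge_attempts`, `judge_parse_failures`, `judge_fallback`) are illustrative, not the scorer's confirmed field names, and the real scorer may normalize rates over judged cases rather than all cases.

```python
def judge_metrics(cases: list[dict]) -> dict[str, float]:
    """Aggregate per-case judge bookkeeping into summary metrics."""
    n = len(cases) or 1  # avoid division by zero
    retried = sum(1 for c in cases if c.get("judge_attempts", 1) > 1)
    return {
        # Fraction of cases where the judge needed more than one attempt.
        "judge_retry_case_rate": retried / n,
        # Total malformed/empty judge responses across the run.
        "judge_parse_failure_count": float(
            sum(c.get("judge_parse_failures", 0) for c in cases)
        ),
        # Cases where all retries were exhausted and a fallback was recorded.
        "judge_fallback_count": float(
            sum(1 for c in cases if c.get("judge_fallback"))
        ),
        "judge_avg_attempts": sum(c.get("judge_attempts", 1) for c in cases) / n,
    }
```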
Results
| Condition | Pass Rate | Avg Score | Parse | Grounding | Equivalence | Latency | Tokens |
|---|---|---|---|---|---|---|---|
| L0 (20-case slice) | 0.000 | n/a | 0.000 | n/a | 0.000 | not recorded | not recorded |
| L0 (400-case canary) | 0.000 | n/a | 0.000 | n/a | 0.000 | not recorded | not recorded |
| L1 (20-case slice, Anthropic judge) | 0.250 | n/a | 0.400 | n/a | 0.150 | not recorded | not recorded |
| L3 (20-case slice, Anthropic judge) | 0.300 | n/a | 0.400 | n/a | 0.250 | not recorded | not recorded |
| L1 (50-case slice, Anthropic judge diagnostic) | 0.220 | n/a | 0.440 | n/a | 0.180 | not recorded | not recorded |
| L1 (50-case slice, OpenAI judge) | 0.400 | n/a | 0.440 | n/a | 0.240 | not recorded | not recorded |
| L3 (50-case slice, OpenAI judge) | 0.380 | n/a | 0.420 | n/a | 0.280 | not recorded | not recorded |
| L1 (50-case slice, OpenAI judge repeat) | 0.400 | n/a | 0.440 | n/a | 0.320 | not recorded | not recorded |
| L1 (50-case slice, Anthropic judge + metrics upgrade) | 0.200 | n/a | 0.440 | n/a | 0.160 | not recorded | not recorded |
| L1 (50-case slice, OpenAI judge + metrics upgrade) | 0.420 | n/a | 0.420 | n/a | 0.240 | not recorded | not recorded |
| L1 (50-case slice, OpenAI judge + metrics upgrade repeat) | 0.440 | n/a | 0.440 | n/a | 0.240 | not recorded | not recorded |
| Dundersign live source, L0 (8-case smoke) | 0.000 | n/a | 0.000 | n/a | 0.000 | not recorded | not recorded |
| Dundersign live source, information_schema (8-case smoke) | 1.000 | n/a | 1.000 | n/a | 0.750 | not recorded | not recorded |
| Dundersign live source, L0 (19-case tracked smoke) | 0.000 | n/a | 0.000 | n/a | 0.000 | not recorded | not recorded |
| Dundersign live source, information_schema (19-case tracked smoke) | 1.000 | n/a | 1.000 | n/a | not recorded | not recorded | not recorded |
| L5 | provider wired; benchmark artifact missing | n/a | n/a | n/a | n/a | n/a | n/a |
Breakdowns
- Historical 5-case smoke runs on 2026-03-18 still matter as a quick sanity check: both `L1` and `L3` reached `0.600` pass / `1.000` parse on that tiny `ad_reporting` slice, while `L0` stayed at `0.000`.
- On the 20-case slice, `L0` remained a hard floor, `L1` and `L3` tied on parse rate, and `L3` slightly outperformed `L1` on judged correctness rather than parseability.
- On the newer 50-case OpenAI-judged slice, `L1` and `L3` were effectively tied: `L1` averaged `0.400` pass / `0.440` parse across two identical reruns, while `L3` landed at `0.380` pass / `0.420` parse with slightly higher equivalence (`0.280` vs `0.240` on the first `L1` run).
- The full 400-case `L0` control confirms this is not just a small-sample effect: without schema context, the current baseline generator is unusable on the canary benchmark.
- The Dundersign live-source smoke benchmark now supplies the control that was missing earlier: with a real source configured, `information_schema` moved the model from `0.000` parse / `0.000` pass to `1.000` parse / `1.000` pass on both an 8-case smoke set and a tracked 19-case source-matched benchmark.
- The scorer now makes judge quality visible in the artifacts. On the same 50-case `L1` slice, the Anthropic judge recorded `35` parse failures, `11` fallback cases, and `judge_retry_case_rate=0.545`, while two fresh OpenAI judge reruns both recorded `0` retries and `0` fallbacks with `judge_avg_attempts=1.0`.
Analysis
The experiment is no longer blocked at L1/L3, and the new 20-case plus 50-case reruns exposed and fixed three harness issues that would have made the comparison noisy or flaky on current main:
- Anthropic judge responses sometimes came back empty or non-JSON. The scorer now retries malformed judge output up to 3 times before recording a fallback failure.
- The dbt-package context provider was using unauthenticated GitHub API requests unless `GH_TOKEN`/`GITHUB_TOKEN` were manually exported. It now falls back to `gh auth token`, which eliminated the GitHub 403 rate-limit failure seen during the first 20-case rerun.
- The unified `apps.evals` CLI did not load `.env` from the task worktree, which made provider availability look inconsistent across sessions. It now loads a local `.env` from `Path.cwd()` when present.
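The judge-retry fix follows a retry-until-parse pattern. A sketch, with the fallback verdict shape invented for illustration (the real scorer's fallback record may differ):

```python
import json

def judge_with_retries(call_judge, max_attempts: int = 3):
    """Call a judge until its output parses as a JSON object.

    call_judge: zero-arg callable returning the judge's raw text.
    Returns (verdict, attempts, fell_back). On exhaustion, records a
    conservative fallback verdict instead of crashing the run."""
    attempts = 0
    for _ in range(max_attempts):
        attempts += 1
        raw = call_judge()
        try:
            verdict = json.loads(raw)
        except (TypeError, json.JSONDecodeError):
            continue  # empty or malformed response: retry
        if isinstance(verdict, dict):
            return verdict, attempts, False
    # Fail soft: score the case as not passing, flag the fallback.
    return {"pass": False, "reason": "judge_unparseable"}, attempts, True
```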
What the early results suggest:
- Preloaded schema context clearly matters: `L0` is `0.000` parse / `0.000` pass on both the 20-case slice and the full 400-case canary.
- Once the judge was stabilized with `gpt-4.1-mini`, `L1` and `L3` became close enough that there is no honest "winner" yet. `L3` still looks slightly better on semantic equivalence, but `L1` is slightly better on pass and parse rate on the 50-case slice.
- The OpenAI-judged `L1` rerun reproduced the headline metrics exactly (`0.400` pass / `0.440` parse), which is the first convincing evidence that the medium-slice comparison is stable enough to use as a current-state snapshot.
- The Anthropic-judged 50-case run is still useful as a diagnostic: `11/50` fallback judgements is too much noise for a primary comparison artifact, even though the scorer now fails soft instead of crashing.
- The experiment now has a true `L5` code path through `inspect.json` artifacts, but the current benchmark still lacks matching inspection artifacts for its schemas.
- The new Dundersign runs close a different gap than the `cto-research` runs. They validate that the live-source `information_schema` path works end to end on a real source, but they do not change the fact that there is still no live-source coverage for the `cto-research` schemas themselves.
- The judge upgrades helped in two concrete ways: there is now retry backoff instead of hammering the provider, and every run exposes judge retry/fallback metrics directly in `summary.json`. They did not rescue Anthropic-as-judge on this slice. The fresh Anthropic rerun still fell back on the same order of magnitude of cases (`11`), which makes OpenAI the only honest default judge for medium-size comparisons right now.
- The two new OpenAI-judged reruns suggest the remaining variation is mostly generation-side, not judge-side. Pass/parse moved from `0.420`/`0.420` to `0.440`/`0.440`, but strict equivalence stayed fixed at `0.240` and judge retries stayed at zero.
Conclusion
L0 is the floor, and that result now holds in both settings we can honestly measure today: the cto-research canary and a source-matched live Dundersign benchmark. With a reliable OpenAI judge, the current cto-research 50-case snapshot still says L1 and L3 are effectively tied. Separately, the live-source Dundersign benchmark shows that `information_schema` is now a real, working control path rather than a theoretical one.

The judge-focused reruns make the judge decision much clearer: Anthropic-as-judge is still too noisy for primary evals even with backoff and retries, while a `gpt-4.1-mini` judge remained clean across two fresh 50-case reruns with zero retries and zero fallbacks.

The practical decision: keep schema context mandatory, treat L1 and L3 as both viable on the benchmark we actually have, keep `gpt-4.1-mini` as the default SQL eval judge, and treat `information_schema` as validated for live-source evals when a source-matched benchmark exists. L5 remains blocked on missing benchmark-aligned inspection artifacts, and there is still no honest live-source coverage for the cto-research schemas.
Follow-up experiments
- Expand the same OpenAI-judged `L1` vs `L3` comparison from 50 cases to 100+ or the full canary, now that the harness is stable enough.
- Create or import a source-matched live benchmark for one of the real `cto-research` schemas if we want a direct `dbt_package` versus `information_schema` comparison on that dataset family.
- Run a true model-comparison pass now that worktree-local `.env` loading exposes both Anthropic and OpenAI credentials consistently.
- Produce or collect per-schema `inspect.json` / context-catalog artifacts for benchmark schemas so the new `L5` provider can be run against the benchmark honestly.
2026-03-22 Triage Decision
- Status set to `blocked` (not obsolete): `L0`/`L1`/`L3` runs and provider wiring are done (PRs `#674`, `#686`), but benchmark-aligned `L5` cannot be executed until matching inspection artifacts exist for benchmark schemas.