Dataface Tasks

Curate schema and table scope for eval benchmark

ID: MCP_ANALYST_AGENT-CURATE_SCHEMA_AND_TABLE_SCOPE_FOR_EVAL_BENCHMARK
Status: completed
Priority: p2
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: ai-quality-experimentation-and-context-optimization
Completed by: dave
Completed: 2026-03-18

Problem

Decide which schemas, tables, and data layers (raw, silver/staging, gold/marts) to include in the eval scope and catalog context. The analytics warehouse has raw, staging, and transform layers — including everything adds noise, but restricting to only the gold layer may hurt agent performance by hiding useful context. Run experiments comparing agent quality across different table scope configurations. Produce a curated allowlist and document the reasoning. This feeds back into both the eval benchmark filtering and the catalog context the agent sees in production.

Context

The problem with "just use the gold layer"

A typical dbt warehouse has three layers:

  • Raw/staging (stg_*) — direct copies from sources, minimally transformed
  • Silver/intermediate (int_*) — cleaned, joined, business logic applied
  • Gold/marts (fct_*, dim_*) — curated analytical models, the "right" tables for analysts
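For the inventory step, layer tags can often be derived mechanically from these naming conventions. A minimal sketch; the prefix-to-layer mapping below is an assumption based on the conventions above, not an existing repo utility:

```python
# Hypothetical helper: tag tables with a warehouse layer based on
# dbt-style name prefixes (assumed convention, not an existing API).
LAYER_PREFIXES = {
    "stg_": "staging",
    "int_": "intermediate",
    "fct_": "gold",
    "dim_": "gold",
}

def tag_layer(table_name: str) -> str:
    """Return the layer for a table name, defaulting to 'raw'."""
    for prefix, layer in LAYER_PREFIXES.items():
        if table_name.startswith(prefix):
            return layer
    return "raw"

def inventory(tables: list[str]) -> dict[str, list[str]]:
    """Group table names by layer tag for the scope inventory."""
    groups: dict[str, list[str]] = {}
    for t in tables:
        groups.setdefault(tag_layer(t), []).append(t)
    return groups
```

Anything the prefixes miss falls into the raw bucket, which is the conservative default for a curation pass.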

The instinct is to only expose gold to the agent. But this may hurt more than help:

  • Analysts often query staging tables for data that hasn't been modeled yet
  • Raw tables contain columns and values that gold tables abstract away
  • Some questions genuinely need raw data (debugging, freshness checks, data quality)
  • Restricting to gold reduces the table count the agent sees (less noise) but also reduces its ability to answer questions that don't fit the gold model

The benchmark dataset (from cto-research) uses schemas from open-source dbt packages. These have their own layer conventions. The Fivetran analytics warehouse has its own. Both need curation.

This should be done together with the ablation work

The "right" scope is an empirical question, not a design decision. Do not make this a separate upfront decision that later gets "validated." Run it together with the context/schema-tool ablation work:

  • layer scope (gold-only vs gold+silver vs all)
  • schema tool choice (profiled catalog vs filtered catalog vs live INFORMATION_SCHEMA vs no tool)
  • field filtering within a tool (names only vs +types vs +descriptions vs +profile stats vs +sample values)

Then produce the allowlist based on measured quality and latency, not intuition.
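The three axes above define a small run matrix. A sketch of enumerating it; the axis values come from the bullets, while the config-dict shape and identifier spellings are illustrative:

```python
from itertools import product

# Axis values from the ablation plan; the exact identifiers and the
# dict shape are illustrative, not an existing runner config format.
LAYER_SCOPES = ["gold_only", "gold_plus_silver", "all"]
SCHEMA_TOOLS = ["profiled_catalog", "filtered_catalog", "information_schema", "none"]
FIELD_LEVELS = ["names", "names_types", "names_types_descriptions",
                "plus_profile_stats", "plus_sample_values"]

def run_matrix() -> list[dict[str, str]]:
    """Enumerate every scope x tool x field-filter combination."""
    return [
        {"layer_scope": s, "schema_tool": t, "fields": f}
        for s, t, f in product(LAYER_SCOPES, SCHEMA_TOOLS, FIELD_LEVELS)
    ]
```

Even fully crossed this is 60 runs, small enough to measure quality and latency per cell rather than sampling.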

Repo context already available

The eval project already has a clear home in this repo:

  • apps/evals/dataface.yml is the DuckDB-backed eval project config.
  • apps/evals/faces/ is the dashboard/query layer for leaderboard analysis.
  • apps/evals/output/ is the run artifact sink and should stay transient.
  • apps/evals/data/ holds the benchmark JSONL and canary slice.
  • dataface/core/inspect/connection.py::list_tables(schema_filter=...) is an existing runtime scoping hook.

For a stable allowlist/scope artifact, the most natural checked-in location is apps/evals/data/. That keeps curation adjacent to the benchmark rather than burying it in the runner or dashboard layer.
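How a checked-in allowlist could plug into that scoping hook is sketched below. The glob-matching semantics and bundle patterns are assumptions for illustration; only the `schema_filter=` keyword on `list_tables` is known from the repo:

```python
from fnmatch import fnmatch

# Hypothetical bundle definitions mirroring a checked-in allowlist.
# Glob semantics are an assumption, not list_tables' documented behavior.
SCOPE_BUNDLES = {
    "gold_only": ["bi_core.*"],
    "gold_plus_silver": ["bi_core.*", "staging.*", "intermediate.*"],
    "all": ["*"],
}

def in_scope(qualified_name: str, bundle: str) -> bool:
    """True if a schema.table name matches any pattern in the bundle."""
    return any(fnmatch(qualified_name, pat) for pat in SCOPE_BUNDLES[bundle])

def filter_tables(tables: list[str], bundle: str) -> list[str]:
    """Apply a scope bundle to an inventory of qualified table names."""
    return [t for t in tables if in_scope(t, bundle)]
```

Keeping the patterns in a data file under apps/evals/data/ and the matching logic in the inspector keeps the curation artifact and the runtime hook decoupled.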

Deliverables

  1. Inventory the analytics warehouse schemas/tables with layer tags
  2. Run layer-scope + schema-tool experiments using the eval system
  3. Produce a curated allowlist with documented reasoning
  4. Apply the allowlist to both the eval benchmark (filter cases) and the catalog context/tool defaults (filter tables and context fields)

Dependencies

  • Depends on having a working analytics warehouse inspection path (the consolidated analytics repo + BigQuery bootstrap work), so we can inspect the real internal table landscape
  • Depends on the eval runner (task 3) to measure impact of different scopes
  • Results feed back into the ablation experiments task

Possible Solutions

  1. Recommended: derive the final scope from experiment results, then persist a small allowlist manifest alongside the eval data.
     • Pros: grounded in measured quality/latency, reproducible, easy to wire into benchmark filtering and catalog defaults.
     • Cons: requires the ablation work first, so there is no instant answer.
  2. Hardcode a gold-only allowlist up front.
     • Pros: simple and low-risk to implement.
     • Cons: likely too restrictive for real analyst questions, and rests on intuition rather than evidence.
  3. Leave scope fully dynamic and never persist a curated manifest.
     • Pros: no extra artifact to maintain.
     • Cons: no stable benchmark input and no durable default for downstream consumers.

Plan

  1. Inventory the schema families exposed by the benchmark and analytics context.
     • Benchmark schemas currently cluster around a long tail of source packages plus internal main.
     • Likely layer groupings in the analytics repo are gold (bi_core), staging (staging / stg_*), intermediate (int_*), and raw/source schemas.
  2. Use the eval runner's existing --schema, --category, and --complexity filters to compare scope bundles:
     • gold-only
     • gold+silver/staging
     • all tables
  3. Keep the schema-tool ablations aligned with the scope runs:
     • profiled catalog vs filtered catalog
     • live INFORMATION_SCHEMA vs no tool
     • names only vs names+types vs richer field payloads
  4. Persist the curation decision as a checked-in manifest under apps/evals/data/ so it travels with the benchmark data rather than the transient run output.
  5. Feed the resulting scope manifest into the follow-on catalog defaults / benchmark filtering work instead of overfitting this task to a single backend.

Implementation Progress

  • Started repo inventory for scope curation.
  • Confirmed the eval project already exists under apps/evals/ with dataface.yml, faces/, data/, and output/ boundaries.
  • Confirmed the inspector has a live schema_filter hook in dataface/core/inspect/connection.py, so scope curation can be applied without inventing a new runtime shape.
  • Sampled apps/evals/data/dbt_sql_benchmark.jsonl to establish the benchmark distribution.
    • Largest schemas include main, stripe, shopify, greenhouse, hubspot, pendo, quickbooks, sap, workday, jira, zuora, iterable, twilio, google_play, marketo, sage_intacct, snapchat_ads, amazon_selling_partner, linkedin_ads, mailchimp, recurly, servicenow, zendesk, and pardot.
    • Category mix is dominated by group_by, filter_agg, join, window_function, and subquery.
  • Cross-checked the preserved analytics-repo bootstrap context to pin down a realistic internal warehouse shape for the eventual scope bundles.
    • bi_core/ is the current gold layer with roughly 45 analytical models.
    • staging/ is materially larger at roughly 143 models across many source packages.
    • intermediate/ is comparatively small.
  • That internal shape sharpens the likely experiment bundles:
    • gold-only should map to bi_core
    • gold+silver should map to bi_core plus staging / limited intermediate support
    • all should remain the stress test rather than the default
  • The context-ablation workstream is no longer blocked on provider code for rich context.
    • L1 / L3 already run from dbt package metadata.
    • L5 now has an inspect.json-backed provider path for projects that already have profiled artifacts.
  • Remaining gap for this scope task is data and decision-making, not eval plumbing: the benchmark still needs schema-aligned rich artifacts and then scope runs against realistic layer bundles.
  • Working assumption: gold-only should be the baseline, but staging/intermediate should be measured as additive context rather than excluded by default.
  • Added apps/evals/data/schema_scope_manifest.yml as the checked-in curation artifact for downstream eval/config consumers.
    • gold_only maps to bi_core/** and is the recommended default bundle.
    • gold_plus_silver expands to bi_core/** + staging/** + intermediate/** and is the recommended escalation/eval comparison bundle.
    • all remains a stress-test bundle, not a default.
  • This task now closes the curation decision and artifact. Future work is refinement: replacing directory-level bundle definitions with concrete table allowlists once the dashboard-factory analytics bootstrap is live.
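A plausible shape for that manifest, inferred from the bundle descriptions above; the field names and layout are illustrative, not the actual file contents:

```yaml
# apps/evals/data/schema_scope_manifest.yml — illustrative shape only;
# field names are assumptions, the bundle globs come from the task notes.
default_bundle: gold_only
bundles:
  gold_only:
    include: ["bi_core/**"]
  gold_plus_silver:
    include: ["bi_core/**", "staging/**", "intermediate/**"]
  all:
    include: ["**"]
    note: stress-test bundle, not a default
```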

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A — this is analysis and curation work, not UI.

Review Feedback

  • [x] Review cleared