Dataface Tasks

Research deterministic column fanout risk signals and AI context surfacing

ID: CONTEXT_CATALOG_NIMBLE-RESEARCH_DETERMINISTIC_COLUMN_FANOUT_RISK_SIGNALS_AND_AI_CONTEXT_SURFACING
Status: completed
Priority: p2
Milestone: m1-ft-analytics-analyst-pilot
Owner: data-ai-engineer-architect
Initiative: grain-fanout-risk
Completed by: dave
Completed: 2026-03-25

Problem

Product and AI teams need a clear model for where fanout risk “lives” (column vs join vs query), what can be inferred deterministically from profiling and dbt, and how to surface that to LLM agents without drowning prompts or lying about confidence.

This task captures research and recommendations; implementation is split across existing grain/fanout work, schema context formatting, and future dbt ingestion.

Context

Repo canon (read these first)

| Artifact | What it says |
| --- | --- |
| nimble/problems/fanout_deep_dive.md | Fanout = join multiplies rows away from intended base grain; GROUP BY does not auto-fix; common patterns (1:N detail, M:N, snapshots, duplicate dimension keys). |
| tasks/.../grain-fanout-risk/spec.md | Phased plan: grain candidate → join multiplicity from uniqueness ratios → fanout risk scoring → compile/MCP integration. |
| nimble/problems/fivetran_dbt_packages_fanout_analysis.md | Real packages: safe “aggregate-then-join” vs latent risks (e.g. multi-row per key when grouping dimensions are wider than join key). |
| nimble/practices/policies/table_classes_and_keys.md | Surrogate IDs, hidden fanout, relationship lint mental model. |
| ai_notes/ai/AI_CONTEXT_ARCHITECTURE.md | Profiling layers 1–6, catalog → MCP, schema summary format; TODO block for grain + fanout signals. |
| dataface/core/inspect/join_multiplicity.py | Deterministic edge classification: classify_multiplicity(left_uniqueness, right_uniqueness) with threshold 0.99; fanout_factor from row counts + multiplicity. |
| dataface/core/inspect/fanout_risk.py | Risk levels from multiplicity + aggregation hint + fanout factor. |
| dataface/ai/schema_context.py | Token-efficient schema string; table grain in header; no relationship/fanout section yet (see task surface-join-multiplicity-in-ai-schema-context-and-clarify-fk-cardinality-contract.md). |
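The edge classification described for join_multiplicity.py can be sketched roughly as follows. Only the classify_multiplicity name and the 0.99 threshold come from the table above; everything else is illustrative, not the real module's implementation.

```python
# Sketch of the edge-classification idea: a join side is "one" when its
# join-key uniqueness ratio (distinct values / non-null rows) clears the
# threshold noted for join_multiplicity.py (0.99); otherwise it is "many".
UNIQUENESS_THRESHOLD = 0.99

def classify_multiplicity(left_uniqueness: float, right_uniqueness: float) -> str:
    left_one = left_uniqueness >= UNIQUENESS_THRESHOLD
    right_one = right_uniqueness >= UNIQUENESS_THRESHOLD
    if left_one and right_one:
        return "one-to-one"
    if left_one:
        return "one-to-many"   # right side can emit multiple rows per key
    if right_one:
        return "many-to-one"
    return "many-to-many"      # neither side is unique on the join key
```

For example, classify_multiplicity(1.0, 0.42) classifies the edge as one-to-many: the left side is unique on the key, the right side repeats it.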

Prior implementation task


Research findings

2026-03-26 addendum: the newer query-validator direction shifts the primary detector for fanout from catalog-side edge scoring to query-structural analysis. The findings below still hold for context design, but they should now be read mainly as support for severity, explanation, and agent ergonomics rather than as the first detector of query risk.

1. Are certain columns “more at risk” of fanout?

Fanout is not a property of a single column in isolation. It is a property of a join path relative to an intended grain and a query shape (especially aggregates on measures defined at another grain).

That said, column-level signals are strong priors for which joins will be dangerous:

| Signal (deterministic from profiler) | Why it matters |
| --- | --- |
| Uniqueness ratio below ~1.0 on a join key | That side can emit multiple rows per key value → classic 1:N or N:M “many” side. |
| key_role / naming FK heuristics | Marks columns that commonly participate in multiplicity-changing joins. |
| role == measure on a table whose grain is coarser than a joined table | Summing those measures after joining “down” to finer grain → inflation (needs query + grain + graph to flag, not column alone). |
| Composite / surrogate keys (Nimble table-class guidance) | Can hide duplicates or wrong grain if uniqueness tests are missing. |
| Bridge-style tables (multiple FKs, none unique) | Often N:M or ambiguous paths. |
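The first signal in the table is cheap to derive from profiling stats already on hand. A minimal sketch, assuming the profiler exposes distinct and non-null counts per column (function and parameter names here are illustrative):

```python
def uniqueness_ratio(distinct_count: int, non_null_count: int) -> float:
    # Ratio of distinct values to non-null rows: 1.0 means the column is
    # unique on this table; values well below 1.0 mark a potential "many" side.
    if non_null_count == 0:
        return 0.0
    return distinct_count / non_null_count

def is_many_side(distinct_count: int, non_null_count: int, threshold: float = 0.99) -> bool:
    # Prior only: a join key that repeats values CAN fan out a join,
    # but whether it does depends on the edge and query, per the text above.
    return uniqueness_ratio(distinct_count, non_null_count) < threshold
```

This yields a per-column prior, not a verdict; as the conclusion below the table notes, the edge remains the source of truth.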

Conclusion: For catalog design, expose edges as source of truth for multiplicity and fanout factor; derive per-column annotations as a projection (“this column appears as the non-unique side of these edges”) for agent ergonomics.

For query validation, do not start from column or edge risk alone. Start from query structure:

  • is there a join?
  • is there aggregation?
  • are aggregate expressions owned by multiple tables?

Then use edges and column hints to strengthen or explain the warning.
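The three structural questions above can be sketched against a pre-parsed query shape. The ParsedQuery dataclass here is hypothetical; a real validator would populate it from an actual SQL parser rather than by hand.

```python
from dataclasses import dataclass

@dataclass
class ParsedQuery:
    # Hypothetical pre-parsed shape, stand-in for real SQL-parser output.
    joined_tables: list        # tables appearing in FROM / JOIN clauses
    aggregate_columns: list    # (table, column) pairs inside SUM/COUNT/AVG/...

def fanout_warning(q: ParsedQuery) -> bool:
    has_join = len(q.joined_tables) > 1
    has_agg = bool(q.aggregate_columns)
    owners = {table for table, _ in q.aggregate_columns}
    # Structural trigger from the bullets above:
    # join present + aggregation present + aggregates owned by 2+ tables.
    return has_join and has_agg and len(owners) > 1
```

Edge multiplicity and column hints would then adjust severity or wording of the warning, not decide whether it fires.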

2. What can be identified deterministically today?

Already available (stats + relationship list, no extra warehouse queries):

  • Table grain candidate (PK / composite heuristics).
  • Per relationship: multiplicity (one-to-one / one-to-many / many-to-one / many-to-many), fanout_factor, fanout_risk tier, coverage from null rates.
  • Compile-time warnings when compiled SQL mentions tables that participate in high-risk scored relationships (string/table-name matching — see tests in tests/core/test_fanout_integration.py).

Also now recognized as deterministically available from SQL + schema alone:

  • joined aggregate queries whose aggregate expressions are owned by columns from 2+ tables
  • cross-table aggregate expressions that cannot be cleanly bucketed to one table
  • some joined COUNT(*) grain ambiguity cases

Not fully deterministic without more structure:

  • “This query sums the wrong measure” — needs SQL analysis (e.g. which columns are aggregated, join order) or execution-layer checks.
  • Perfect N:M fanout magnitude — current fanout_factor for N:M is a rough bound; true fanout depends on key distribution (documented in join_multiplicity.py).
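The "rough bound" character of the N:M fanout_factor can be illustrated with average rows per distinct key on each side. This is a sketch of the idea, not the actual computation in join_multiplicity.py:

```python
def nm_fanout_factor_bound(left_rows: int, left_distinct: int,
                           right_rows: int, right_distinct: int) -> float:
    # Average rows per distinct key value on each side; their product is a
    # rough per-matched-key fanout estimate. It is not the true magnitude:
    # real N:M fanout depends on how key frequencies line up across sides
    # (a skewed hot key can blow far past this average-based figure).
    left_avg = left_rows / max(left_distinct, 1)
    right_avg = right_rows / max(right_distinct, 1)
    return left_avg * right_avg
```

For instance, 100 left rows over 10 keys joined to 300 right rows over 10 keys gives an average-based estimate of 10 × 30 = 300 output rows per key.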

dbt / semantic metadata (when ingested):

  • unique + relationships tests → corroborate grain and FK direction.
  • Declared metrics / entities (MetricFlow, etc.) → measure home grain (highest confidence for “safe aggregation”).
  • Optional project conventions: meta on models/columns for grain, fanout policy, or Nimble rule IDs — not standard dbt, but machine-readable.
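A hypothetical schema.yml shape for that optional meta namespace might look like the following. The unique, not_null, and relationships tests are standard dbt; everything under meta.nimble is an invented convention for illustration, not standard dbt:

```yaml
version: 2

models:
  - name: fct_orders
    description: One row per order.
    meta:
      nimble:
        grain: order_id        # declared home grain (hypothetical key)
        fanout_policy: deny    # hypothetical per-model fanout policy
    columns:
      - name: order_id
        tests: [unique, not_null]   # corroborates grain deterministically
      - name: customer_id
        tests:
          - relationships:          # corroborates FK direction
              to: ref('dim_customers')
              field: customer_id
```

The tests feed the deterministic tier; the meta block is narrative-adjacent structure that a catalog could ingest without parsing free-text descriptions.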

3. dbt comments vs summary vs structured meta?

| Mechanism | Pros | Cons |
| --- | --- | --- |
| Human dbt descriptions | High trust, readable in PRs and docs | Not structured; hard to lint; easy to drift |
| dbt meta / YAML tags | Machine-readable; can drive catalog and lints | Requires convention + docs; not universal across projects |
| Database column comments (persist_docs) | Single source visible in warehouse | Same drift issue; column-level ingestion still incomplete per AI_CONTEXT_ARCHITECTURE |
| Compiled catalog fields (relationships, fanout_risk, derived column tags) | Deterministic from profiling + graph; versioned with inspect cache | Needs UI/prompt discipline to avoid token blowup |
| Query-time / compile-time warnings | Grounded in actual dashboard SQL | Only as good as SQL parsing / table extraction |

Recommended split:

  1. Deterministic catalog core — keep multiplicity and optional risk labels on relationships in inspect JSON / catalog.
  2. Agent ergonomics — add derived column_join_risk_hints (or similar) projecting edges onto columns for one-line schema tags.
  3. Narrative — use dbt descriptions for nuances (“latent fanout if both sale and capture exist”) where stats cannot know business rules.
  4. Prompt surface — extend format_schema_context with a capped Relationships appendix + optional per-table one-liners; keep MCP catalog tool as deep drill-down.
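The capped, deduped Relationships appendix in point 4 could be rendered roughly like this. The line format, cap default, and relationship-dict shape are all illustrative; format_schema_context's real contract lives in schema_context.py:

```python
def format_relationships_appendix(relationships: list, cap: int = 10) -> str:
    # relationships: dicts like {"left": "orders.id", "right": "items.order_id",
    #                            "multiplicity": "one-to-many", "risk": "high"}
    # (hypothetical shape, for illustration only)
    seen, lines = set(), []
    for rel in relationships:
        key = (rel["left"], rel["right"])
        if key in seen:
            continue              # dedupe repeated edges
        seen.add(key)
        lines.append(
            f'{rel["left"]} -> {rel["right"]} '
            f'({rel["multiplicity"]}, risk={rel["risk"]})'
        )
        if len(lines) >= cap:     # hard cap keeps the prompt token-lean
            break
    return "Relationships:\n" + "\n".join(lines)
```

Each edge becomes a single line, so the appendix cost is bounded and predictable regardless of how dense the relationship graph gets.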

Important refinement: catalog fanout_risk should be treated as a supporting fact, not the validator's primary detector for query correctness.


Possible Solutions

Recommended (staged):

  1. Treat relationship edges as canonical for catalog-side multiplicity facts — Continue to align with grain-fanout-risk/spec.md and existing modules; do not claim “this column is always safe” without edge context.
  2. Add derived column-level hints in catalog JSON — For each column, list { partner_table, partner_column, multiplicity, fanout_risk.level } where this column is the join key on the non-unique side (or both sides for N:M). Pure derivation from existing relationship array.
  3. Query validator — Detect risky fanout patterns from SQL structure first: join + aggregation + aggregate ownership across multiple tables.
  4. Schema summary — Implement or finish the Relationships section + contract notes (tracked separately in inspect-profiler task above); cap rows and dedupe.
  5. dbt ingestion (separate initiative) — When schema.yml and manifest are merged, boost confidence: declared tests override weak heuristics; optional meta namespace for grain and fanout policy.
  6. AI context + validator docs — State explicitly: catalog fanout signals are edge-primary for context, but validator fanout detection starts from query structure and uses catalog facts to refine severity.
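The derivation in option 2 could be a pure projection over the existing relationship array, roughly as below. The field names follow the shape sketched in that option but are assumptions about the catalog JSON, not its actual schema:

```python
def column_join_risk_hints(relationships: list) -> dict:
    # Project edge facts onto columns: a (table, column) pair gets a hint
    # whenever it is the join key on a non-unique ("many") side, or either
    # side of an N:M edge. No new facts are computed; this is a pure view.
    hints = {}
    for rel in relationships:
        many_sides = {
            "one-to-many": ["right"],
            "many-to-one": ["left"],
            "many-to-many": ["left", "right"],
        }.get(rel["multiplicity"], [])
        for side in many_sides:
            table, column = rel[side]["table"], rel[side]["column"]
            partner = rel["right" if side == "left" else "left"]
            hints.setdefault((table, column), []).append({
                "partner_table": partner["table"],
                "partner_column": partner["column"],
                "multiplicity": rel["multiplicity"],
                "fanout_risk": rel["fanout_risk"]["level"],
            })
    return hints
```

Because it is derived, the projection can be recomputed whenever the relationship array changes, with no risk of the two drifting apart.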

Alternatives considered:

  • Column-only risk score without graph — Rejected; produces false positives/negatives (e.g. PK column safe in isolation but dangerous if someone joins another table incorrectly).
  • LLM-inferred fanout from names only — Rejected for deterministic tier; OK as optional narrative on top of stats.

Plan

  1. Socialize this doc with context-catalog + MCP agent owners; confirm cap rules for Relationships in schema context.
  2. If approved, spawn implementation tasks: (A) catalog JSON projection column_join_risk_hints, (B) format_schema_context Relationships appendix, (C) optional dbt meta convention doc in Nimble or AI notes.
  3. Align with Layer 6 relationship mapping when it lands (richer edge list → richer column hints).
  4. Close this research task once the recommendations are captured in repo docs and at least one implementation follow-on is queued.

Implementation Progress

  • [x] Research findings captured in the task body and linked from AI_CONTEXT_ARCHITECTURE.md
  • [x] Follow-on implementation work queued in query-validator-foundation-and-first-integrations.md
  • [x] 2026-03-26 addendum captured: validator should prefer query-structural fanout detection over catalog-first edge matching

Research complete: 2026-03-24 — synthesized from Nimble fanout docs, grain-fanout spec, Fivetran package analysis, and current join_multiplicity / fanout_risk / schema_context code paths.

Task status updated: 2026-03-25 — marked complete because the research artifact exists in-repo, is already referenced from architecture notes, and has a concrete implementation follow-on queued.


QA Exploration

N/A — research / architecture task only.

  • [x] QA exploration completed (or N/A for non-UI tasks)

Review Feedback

  • [ ] Review cleared