Grain Inference and Fanout Risk — Specification
Status: ready-for-eng Owner: data-ai-engineer-architect Initiative: grain-fanout-risk Workstream: context-catalog-nimble
Problem Statement
AI agents and analysts generating SQL against profiled tables have no grain awareness. They don't know whether a table is one-row-per-order or one-row-per-order-line, and they can't detect when a join will multiply rows (fanout). This produces silent aggregate inflation — wrong numbers that look right.
The inspector already profiles individual columns (types, distributions, keys). But it says nothing about:
1. Table grain — what entity does each row represent?
2. Join multiplicity — will joining table A to table B produce 1:1, 1:N, or N:M rows?
3. Fanout risk — which query patterns are dangerous for a given table pair?
Goal
Add grain candidate and fanout risk metadata to the inspector output so that downstream consumers (MCP tools, compile-time warnings, dashboard generation) can:
- Know a table's grain before writing queries
- Assess join safety between table pairs
- Get warnings on risky aggregate patterns
Non-Goals
These are explicitly out of scope for this initiative:
- Full semantic layer / metrics definitions — We don't define business metrics (revenue = SUM(amount)). That's dbt Semantic Layer territory. We only flag when a metric would be wrong due to grain issues.
- Automatic query rewriting — We surface warnings and metadata. We don't rewrite user SQL to add pre-aggregation CTEs or change join order. That's a future concern.
- Cross-database relationship discovery — Relationships are inferred within a single database/schema. Cross-database joins (e.g., Snowflake share → BigQuery) are out of scope.
- dbt manifest parsing for declared relationships — Layer 6 relationship mapping (separate initiative) handles declared FK ingestion from dbt. This initiative consumes that output but doesn't own dbt manifest parsing.
- Real-time / streaming grain tracking — Grain is assessed at profile time. We don't monitor grain changes in real-time as data loads.
- Enforcing grain in user SQL — We warn, we don't block. No compile-time errors that prevent queries from running — only warnings with severity levels.
- Multi-hop join path analysis — Phase 1 considers direct joins (A→B). Transitive join paths (A→B→C) are deferred to Phase 3.
Phased Module Plan
Phase 1: Grain Candidate Inference (single-table)
Goal: For each profiled table, emit a grain object in the inspector output that describes what entity each row represents.
Inputs (already available from inspector):
- Column key_role (primary_key, foreign_key, none)
- Column uniqueness_ratio and distinct_count
- Column role (identifier, dimension, time, measure, text)
- Column names (naming convention signals)
- row_count
Algorithm:
1. Primary key detection — already exists (key_role=primary_key). The PK column(s) define the grain.
2. Composite grain detection — when no single PK exists, find the minimal set of columns where DISTINCT(col1, col2, ...) == row_count. Use heuristics:
- Prefer columns with key_role=foreign_key + one role=time column (classic fact table grain: order_id + date)
- Prefer columns with role=identifier over role=dimension
- Prefer fewer columns (Occam's razor)
3. Grain confidence — score 0.0–1.0 based on:
- 1.0: single PK column, 100% unique, name matches id or {table}_id
- 0.9: composite key, all FK/identifier columns, uniqueness confirmed
- 0.7: heuristic-based (naming + stats, not confirmed via query)
- 0.5: ambiguous (multiple candidates, no clear winner)
4. Grain label — human-readable string: "one row per order", "one row per (order_id, line_item_id)"
Output schema (added to TableInspection.to_dict()):
{
"grain": {
"columns": ["order_id"],
"label": "one row per order",
"confidence": 0.95,
"source": "primary_key",
"is_composite": false
}
}
New module: dataface/core/inspect/grain_detector.py
- GrainCandidate dataclass
- detect_grain(columns: list[ColumnInspection], row_count: int, table_name: str) -> GrainCandidate
- Pure function, no DB queries in Phase 1 (uses already-profiled stats)
Verification query (optional, off by default):
SELECT COUNT(*) = COUNT(DISTINCT (col1, col2)) FROM table
When enabled (config flag), runs post-profiling to confirm grain hypothesis.
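A sketch of the optional verification-query builder (function name assumed). Note the multi-column `COUNT(DISTINCT (...))` form is dialect-dependent — it works in Postgres via row comparison, but some warehouses need a subquery or concatenation fallback:

```python
def build_grain_check_sql(table: str, grain_columns: list[str]) -> str:
    """Build the off-by-default grain verification query.

    The single boolean column is true when the candidate columns
    uniquely identify every row (i.e., the grain hypothesis holds).
    """
    cols = ", ".join(grain_columns)
    return f"SELECT COUNT(*) = COUNT(DISTINCT ({cols})) AS grain_ok FROM {table}"
```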
Tests:
- Single PK table (orders with order_id) → confidence 1.0
- Composite grain (order_lines with order_id + line_id) → confidence 0.9
- Fact table with date grain (events with user_id + event_date) → confidence 0.7–0.9
- Ambiguous table (wide denormalized mart) → confidence 0.5
- Table with no clear grain (all low-uniqueness columns) → confidence < 0.5, grain = null
Phase 2: Join Multiplicity Profiling (table pairs)
Goal: For each pair of tables that share a relationship (from Layer 6 or naming convention), compute join multiplicity metadata.
Dependency: Layer 6 relationship mapping initiative (provides the list of candidate relationships). Phase 2 can start with naming-convention-based relationships as a fallback.
Algorithm:
1. Relationship source — consume Layer 6 output: {left_table, left_column, right_table, right_column, confidence, source}
2. Multiplicity classification — for each relationship, determine:
- 1:1 — both sides unique on the join key
- 1:N — left side unique, right side has duplicates
- N:1 — right side unique, left side has duplicates
- N:M — neither side unique (bridge table needed)
3. Stats-based detection (no extra queries needed):
- Left uniqueness: left_column.uniqueness_ratio (from profiler)
- Right uniqueness: right_column.uniqueness_ratio
- If left uniqueness ≈ 1.0 and right < 1.0 → 1:N
- Threshold: uniqueness > 0.99 counts as "unique side"
4. Fanout factor estimate:
- For 1:N joins: right_table.row_count / left_table.row_count gives average fanout
- For N:M: product of both sides' average group sizes
5. Coverage — what percentage of left-side rows have a match on the right? (null FK ratio from profiler stats)
Output schema (new top-level section in inspect.json):
{
"relationships": [
{
"left_table": "orders",
"left_column": "order_id",
"right_table": "order_items",
"right_column": "order_id",
"multiplicity": "1:N",
"fanout_factor": 3.2,
"left_coverage": 1.0,
"right_coverage": 0.98,
"confidence": 0.9,
"source": "naming_convention"
}
]
}
New module: dataface/core/inspect/join_multiplicity.py
- JoinProfile dataclass
- profile_join(left_col: ColumnInspection, right_col: ColumnInspection, left_rows: int, right_rows: int) -> JoinProfile
- classify_multiplicity(left_uniqueness: float, right_uniqueness: float) -> str
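The classification and fanout estimate reduce to pure functions over profiler stats. A sketch under stated assumptions — the 0.99 threshold comes from the spec; the handling of non-1:N cases in `estimate_fanout` is an assumption (N:M needs per-key group sizes, which row counts alone don't give):

```python
UNIQUE_THRESHOLD = 0.99  # spec: uniqueness > 0.99 counts as the "unique side"

def classify_multiplicity(left_uniqueness: float, right_uniqueness: float) -> str:
    left_unique = left_uniqueness > UNIQUE_THRESHOLD
    right_unique = right_uniqueness > UNIQUE_THRESHOLD
    if left_unique and right_unique:
        return "1:1"
    if left_unique:
        return "1:N"   # left side unique, right side has duplicates
    if right_unique:
        return "N:1"   # lookup/dimension join
    return "N:M"       # neither side unique; bridge table needed

def estimate_fanout(multiplicity: str, left_rows: int, right_rows: int) -> float:
    # Spec step 4: for 1:N, average fanout is many-side rows per one-side row.
    if multiplicity in ("1:1", "N:1"):
        return 1.0     # each left row matches at most one right row
    if multiplicity == "1:N":
        return right_rows / left_rows
    # N:M requires the product of both sides' average group sizes,
    # which needs distinct-count stats, not just row counts.
    raise ValueError("N:M fanout needs per-key group sizes")
```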
Tests:
- orders ↔ order_items → 1:N, fanout_factor ≈ N
- orders ↔ customers → N:1
- users ↔ users (self-join on same PK) → 1:1
- order_tags ↔ tag_definitions → N:M
Phase 3: Fanout Risk Scoring and Warnings
Goal: Given a query pattern (or a proposed join), score the fanout risk and surface warnings.
Risk scoring heuristic:
| Signal | Risk contribution |
|---|---|
| Join is 1:N and query SUMs a left-side measure | HIGH — classic fanout inflation |
| Join is N:M | HIGH — almost always needs bridge or pre-aggregation |
| Join is 1:N but no aggregation | LOW — row multiplication is expected (detail query) |
| Join is 1:1 | NONE — safe |
| Join is N:1 (lookup/dimension join) | NONE — safe |
| Grain-breaking join: joining below the table's declared grain | MEDIUM — may be intentional |
| High fanout factor (>10x row multiplication) | Multiplier on existing risk |
Risk levels:
- none — safe join, no action needed
- low — join changes grain but may be intentional
- medium — aggregation after this join may be incorrect
- high — strong likelihood of aggregate inflation
- critical — N:M join with aggregation, almost certainly wrong
Output schema (per-relationship extension):
{
"fanout_risk": {
"level": "high",
"reason": "1:N join with left-side measure aggregation",
"recommendation": "Pre-aggregate order_items to order_id grain before joining"
}
}
Integration points:
1. MCP tools — get_table_context includes grain + fanout metadata
2. Compile-time warnings — when a dashboard query joins tables with fanout risk
3. Inspector dashboard — visual indicator on relationship panel
New module: dataface/core/inspect/fanout_risk.py
- FanoutRisk dataclass (level, reason, recommendation)
- score_fanout_risk(join_profile: JoinProfile, query_has_aggregation: bool) -> FanoutRisk
Tests:
- 1:1 join → risk = none
- N:1 dimension lookup → risk = none
- 1:N with SUM on left measure → risk = high
- 1:N detail query (no aggregation) → risk = low
- N:M with any aggregation → risk = critical
Phase 4: Multi-Hop Path Analysis (future)
Goal: Extend risk scoring to multi-table join paths (A→B→C).
Deferred — not in scope for M1 pilot. Tracked as a future opportunity in mx-far-future-ideas.
Key considerations for when this is picked up:
- Transitive fanout: 1:N × 1:N = 1:N² — risk compounds
- Join ordering matters: aggregating between hops changes the grain
- Graph traversal: shortest safe path between two tables
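The transitive-fanout rule amounts to multiplying per-hop fanout factors unless an aggregation between hops resets the grain. A sketch for the future phase, not part of M1:

```python
from math import prod

def compound_fanout(hop_fanouts: list[float]) -> float:
    """Estimate total row multiplication along a join path A→B→C→…

    Each 1:N hop multiplies rows: a 3x hop followed by a 4x hop
    yields ~12x total fanout, so risk compounds per hop.
    """
    return prod(hop_fanouts)
```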
Implementation Sequencing
| Phase | Module | Depends on | Estimated effort | Milestone |
|---|---|---|---|---|
| 1 | grain_detector.py | Existing profiler output | 2–3 days | M1 |
| 2 | join_multiplicity.py | Phase 1 + Layer 6 (or naming fallback) | 3–4 days | M1 |
| 3 | fanout_risk.py | Phase 2 | 2–3 days | M1 |
| 4 | Multi-hop paths | Phase 3 | TBD | M2+ |
Phases 1–3 target M1 (pilot). Phase 4 is deferred.
Contract Changes
- `PROFILER_CONTRACT_VERSION` bumps from `1.0` → `1.1` (additive: new optional `grain` and `relationships` keys)
- Existing consumers are unaffected (new keys are optional per versioning policy)
- `inspector_schema.md` updated with new sections
Acceptance Criteria
- `inspect.json` includes a `grain` object for every profiled table
- `inspect.json` includes a `relationships` array with multiplicity and fanout metadata
- Grain detection has ≥ 80% accuracy on pilot datasets (measured by manual review)
- Fanout risk warnings fire correctly for the known-risky patterns from the Fivetran dbt packages analysis
- MCP `get_table_context` tool returns grain + fanout data
- At least one compile-time warning path is wired up for dashboard queries