Dataface Tasks

Grain Inference and Fanout Risk — Specification

Status: ready-for-eng · Owner: data-ai-engineer-architect · Initiative: grain-fanout-risk · Workstream: context-catalog-nimble


Problem Statement

AI agents and analysts generating SQL against profiled tables have no grain awareness. They don't know whether a table is one-row-per-order or one-row-per-order-line, and they can't detect when a join will multiply rows (fanout). This produces silent aggregate inflation — wrong numbers that look right.

The inspector already profiles individual columns (types, distributions, keys). But it says nothing about:

  1. Table grain — what entity does each row represent?
  2. Join multiplicity — will joining table A to table B produce 1:1, 1:N, or N:M rows?
  3. Fanout risk — which query patterns are dangerous for a given table pair?

Goal

Add grain candidate and fanout risk metadata to the inspector output so that downstream consumers (MCP tools, compile-time warnings, dashboard generation) can:

  • Know a table's grain before writing queries
  • Assess join safety between table pairs
  • Get warnings on risky aggregate patterns


Non-Goals

These are explicitly out of scope for this initiative:

  1. Full semantic layer / metrics definitions — We don't define business metrics (revenue = SUM(amount)). That's dbt Semantic Layer territory. We only flag when a metric would be wrong due to grain issues.

  2. Automatic query rewriting — We surface warnings and metadata. We don't rewrite user SQL to add pre-aggregation CTEs or change join order. That's a future concern.

  3. Cross-database relationship discovery — Relationships are inferred within a single database/schema. Cross-database joins (e.g., Snowflake share → BigQuery) are out of scope.

  4. dbt manifest parsing for declared relationships — Layer 6 relationship mapping (separate initiative) handles declared FK ingestion from dbt. This initiative consumes that output but doesn't own dbt manifest parsing.

  5. Real-time / streaming grain tracking — Grain is assessed at profile time. We don't monitor grain changes in real-time as data loads.

  6. Enforcing grain in user SQL — We warn, we don't block. No compile-time errors that prevent queries from running — only warnings with severity levels.

  7. Multi-hop join path analysis — Phase 1 considers direct joins (A→B). Transitive join paths (A→B→C) are deferred to Phase 3.


Phased Module Plan

Phase 1: Grain Candidate Inference (single-table)

Goal: For each profiled table, emit a grain object in the inspector output that describes what entity each row represents.

Inputs (already available from inspector):

  • Column key_role (primary_key, foreign_key, none)
  • Column uniqueness_ratio and distinct_count
  • Column role (identifier, dimension, time, measure, text)
  • Column names (naming convention signals)
  • row_count

Algorithm:

  1. Primary key detection — already exists (key_role=primary_key). The PK column(s) define the grain.

  2. Composite grain detection — when no single PK exists, find the minimal set of columns where DISTINCT(col1, col2, ...) == row_count. Heuristics:
     • Prefer columns with key_role=foreign_key plus one role=time column (classic fact table grain: order_id + date)
     • Prefer columns with role=identifier over role=dimension
     • Prefer fewer columns (Occam's razor)

  3. Grain confidence — score 0.0–1.0 based on:
     • 1.0: single PK column, 100% unique, name matches id or {table}_id
     • 0.9: composite key, all FK/identifier columns, uniqueness confirmed
     • 0.7: heuristic-based (naming + stats, not confirmed via query)
     • 0.5: ambiguous (multiple candidates, no clear winner)

  4. Grain label — human-readable string: "one row per order", "one row per (order_id, line_item_id)"
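The confidence tiers above can be sketched as a pure function over profiled column stats. This is a minimal illustration, not the final design: plain dicts stand in for ColumnInspection objects, and the singular/plural table-name handling is an assumption beyond the spec.

```python
# Sketch of the Phase 1 confidence tiers. Column dicts stand in for
# ColumnInspection; attribute names (key_role, uniqueness_ratio, role)
# follow the spec. The rstrip('s') name check is an assumption.

def grain_confidence(columns: list[dict], table_name: str,
                     confirmed_by_query: bool = False) -> float:
    """Map grain-candidate evidence onto the spec's 0.5-1.0 tiers."""
    if len(columns) == 1:
        col = columns[0]
        pk_names = ("id", f"{table_name}_id", f"{table_name.rstrip('s')}_id")
        if (col["key_role"] == "primary_key"
                and col["uniqueness_ratio"] >= 1.0
                and col["name"] in pk_names):
            return 1.0  # single PK, 100% unique, conventional name
    if all(c["key_role"] == "foreign_key" or c["role"] == "identifier"
           for c in columns):
        # Composite key of FK/identifier columns: 0.9 once uniqueness is
        # confirmed via the verification query, 0.7 on stats alone.
        return 0.9 if confirmed_by_query else 0.7
    return 0.5  # ambiguous: no clear winner
```

A dimension-only candidate falls through to 0.5, matching the "ambiguous" tier.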

Output schema (added to TableInspection.to_dict()):

{
  "grain": {
    "columns": ["order_id"],
    "label": "one row per order",
    "confidence": 0.95,
    "source": "primary_key",
    "is_composite": false
  }
}

New module: dataface/core/inspect/grain_detector.py

  • GrainCandidate dataclass
  • detect_grain(columns: list[ColumnInspection], row_count: int, table_name: str) -> GrainCandidate
  • Pure function, no DB queries in Phase 1 (uses already-profiled stats)
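One plausible shape for GrainCandidate, assuming the field names from the output schema above. The label formatting (stripping an _id suffix to recover the entity name) is a naming-convention assumption, not part of the spec.

```python
from dataclasses import dataclass

# Hedged sketch of GrainCandidate; fields mirror the "grain" JSON keys.

@dataclass
class GrainCandidate:
    columns: list[str]
    confidence: float
    source: str  # e.g. "primary_key", "composite_key", "heuristic"

    @property
    def is_composite(self) -> bool:
        return len(self.columns) > 1

    @property
    def label(self) -> str:
        # "one row per order" for a single order_id column; the
        # _id-suffix stripping is a naming-convention assumption.
        if not self.is_composite:
            entity = self.columns[0].removesuffix("_id") or self.columns[0]
            return f"one row per {entity}"
        return "one row per (" + ", ".join(self.columns) + ")"

    def to_dict(self) -> dict:
        return {
            "columns": self.columns,
            "label": self.label,
            "confidence": self.confidence,
            "source": self.source,
            "is_composite": self.is_composite,
        }
```

to_dict produces exactly the grain object shown in the output schema.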

Verification query (optional, off by default):

SELECT COUNT(*) = COUNT(DISTINCT (col1, col2)) FROM table

When enabled (config flag), runs post-profiling to confirm grain hypothesis.
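One portability note worth capturing: the row-value form COUNT(DISTINCT (col1, col2)) is not accepted by every engine (SQLite rejects it, for example). A dialect-neutral equivalent counts a DISTINCT subquery, sketched here against an in-memory SQLite table with illustrative data:

```python
import sqlite3

# Dialect-neutral form of the grain verification query: compare the raw
# row count against the count of a DISTINCT subquery over the candidate
# grain columns. Table and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_lines (order_id INT, line_id INT);
    INSERT INTO order_lines VALUES (1, 1), (1, 2), (2, 1);
""")

sql = """
    SELECT COUNT(*) = (
        SELECT COUNT(*)
        FROM (SELECT DISTINCT order_id, line_id FROM order_lines)
    )
    FROM order_lines
"""
(grain_holds,) = conn.execute(sql).fetchone()
print(bool(grain_holds))  # True: (order_id, line_id) is a valid grain
```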

Tests:

  • Single PK table (orders with order_id) → confidence 1.0
  • Composite grain (order_lines with order_id + line_id) → confidence 0.9
  • Fact table with date grain (events with user_id + event_date) → confidence 0.7–0.9
  • Ambiguous table (wide denormalized mart) → confidence 0.5
  • Table with no clear grain (all low-uniqueness columns) → confidence < 0.5, grain = null


Phase 2: Join Multiplicity Profiling (table pairs)

Goal: For each pair of tables that share a relationship (from Layer 6 or naming convention), compute join multiplicity metadata.

Dependency: Layer 6 relationship mapping initiative (provides the list of candidate relationships). Phase 2 can start with naming-convention-based relationships as a fallback.

Algorithm:

  1. Relationship source — consume Layer 6 output: {left_table, left_column, right_table, right_column, confidence, source}

  2. Multiplicity classification — for each relationship, determine:
     • 1:1 — both sides unique on the join key
     • 1:N — left side unique, right side has duplicates
     • N:1 — right side unique, left side has duplicates
     • N:M — neither side unique (bridge table needed)

  3. Stats-based detection (no extra queries needed):
     • Left uniqueness: left_column.uniqueness_ratio (from profiler)
     • Right uniqueness: right_column.uniqueness_ratio
     • If left uniqueness ≈ 1.0 and right < 1.0 → 1:N
     • Threshold: uniqueness > 0.99 counts as "unique side"

  4. Fanout factor estimate:
     • For 1:N joins: right_table.row_count / left_table.row_count gives the average fanout
     • For N:M: product of both sides' average group sizes

  5. Coverage — what percentage of left-side rows have a match on the right? (null FK ratio from profiler stats)
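Steps 2–4 reduce to pure functions over profiler stats. The 0.99 threshold comes from the spec; the signatures are a guess at the join_multiplicity.py API, and the N:M fallback in estimate_fanout is an assumption (average group sizes are not recoverable from row counts alone).

```python
UNIQUE_THRESHOLD = 0.99  # spec: uniqueness above this counts as a "unique side"

def classify_multiplicity(left_uniqueness: float, right_uniqueness: float) -> str:
    """Classify a relationship from per-column uniqueness ratios (step 2-3)."""
    left_unique = left_uniqueness > UNIQUE_THRESHOLD
    right_unique = right_uniqueness > UNIQUE_THRESHOLD
    if left_unique and right_unique:
        return "1:1"
    if left_unique:
        return "1:N"
    if right_unique:
        return "N:1"
    return "N:M"  # neither side unique: bridge table territory

def estimate_fanout(multiplicity: str, left_rows: int, right_rows: int) -> float:
    """Average row multiplication of the left side (spec step 4)."""
    if multiplicity == "1:N":
        return right_rows / left_rows
    # 1:1 and N:1 do not multiply left rows. N:M needs the average group
    # sizes of both sides, which plain row counts cannot provide.
    return 1.0

# orders (10k rows, unique order_id) joined to order_items (32k rows):
print(classify_multiplicity(1.0, 0.31))        # 1:N
print(estimate_fanout("1:N", 10_000, 32_000))  # 3.2
```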

Output schema (new top-level section in inspect.json):

{
  "relationships": [
    {
      "left_table": "orders",
      "left_column": "order_id",
      "right_table": "order_items",
      "right_column": "order_id",
      "multiplicity": "1:N",
      "fanout_factor": 3.2,
      "left_coverage": 1.0,
      "right_coverage": 0.98,
      "confidence": 0.9,
      "source": "naming_convention"
    }
  ]
}

New module: dataface/core/inspect/join_multiplicity.py

  • JoinProfile dataclass
  • profile_join(left_col: ColumnInspection, right_col: ColumnInspection, left_rows: int, right_rows: int) -> JoinProfile
  • classify_multiplicity(left_uniqueness: float, right_uniqueness: float) -> str

Tests:

  • orders ↔ order_items → 1:N, fanout_factor ≈ N
  • orders ↔ customers → N:1
  • users ↔ users (self-join on same PK) → 1:1
  • order_tags ↔ tag_definitions → N:M


Phase 3: Fanout Risk Scoring and Warnings

Goal: Given a query pattern (or a proposed join), score the fanout risk and surface warnings.

Risk scoring heuristic:

| Signal | Risk contribution |
| --- | --- |
| Join is 1:N and query SUMs a left-side measure | HIGH — classic fanout inflation |
| Join is N:M | HIGH — almost always needs a bridge or pre-aggregation |
| Join is 1:N but no aggregation | LOW — row multiplication is expected (detail query) |
| Join is 1:1 | NONE — safe |
| Join is N:1 (lookup/dimension join) | NONE — safe |
| Grain-breaking join: joining below the table's declared grain | MEDIUM — may be intentional |
| High fanout factor (>10x row multiplication) | Multiplier on existing risk |

Risk levels:

  • none — safe join, no action needed
  • low — join changes grain but may be intentional
  • medium — aggregation after this join may be incorrect
  • high — strong likelihood of aggregate inflation
  • critical — N:M join with aggregation, almost certainly wrong
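A sketch of how the scoring might encode the signal table and levels above. The level names come from the spec; treating the >10x fanout multiplier as a one-level escalation, and the exact reason/recommendation strings, are assumptions.

```python
from dataclasses import dataclass

@dataclass
class FanoutRisk:
    level: str           # one of: none, low, medium, high, critical
    reason: str
    recommendation: str

def score_fanout_risk(multiplicity: str, query_has_aggregation: bool,
                      fanout_factor: float = 1.0) -> FanoutRisk:
    """Score a single direct join per the Phase 3 heuristic table."""
    if multiplicity in ("1:1", "N:1"):
        return FanoutRisk("none", f"{multiplicity} join is safe",
                          "No action needed")
    if multiplicity == "N:M":
        level = "critical" if query_has_aggregation else "high"
        return FanoutRisk(level, "N:M join",
                          "Use a bridge table or pre-aggregate both sides")
    # multiplicity == "1:N"
    level = "high" if query_has_aggregation else "low"
    if query_has_aggregation and fanout_factor > 10:
        level = "critical"  # assumption: >10x fanout escalates one level
    reason = ("1:N join with left-side measure aggregation"
              if query_has_aggregation else "1:N detail query")
    return FanoutRisk(level, reason,
                      "Pre-aggregate the N side to the left grain before joining")
```

The five cases match the Phase 3 test list: 1:1 and N:1 score none, 1:N scores high with aggregation and low without, and N:M with aggregation scores critical.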

Output schema (per-relationship extension):

{
  "fanout_risk": {
    "level": "high",
    "reason": "1:N join with left-side measure aggregation",
    "recommendation": "Pre-aggregate order_items to order_id grain before joining"
  }
}

Integration points:

  1. MCP tools — get_table_context includes grain + fanout metadata
  2. Compile-time warnings — fire when a dashboard query joins tables with fanout risk
  3. Inspector dashboard — visual indicator on the relationship panel

New module: dataface/core/inspect/fanout_risk.py

  • FanoutRisk dataclass (level, reason, recommendation)
  • score_fanout_risk(join_profile: JoinProfile, query_has_aggregation: bool) -> FanoutRisk

Tests:

  • 1:1 join → risk = none
  • N:1 dimension lookup → risk = none
  • 1:N with SUM on left measure → risk = high
  • 1:N detail query (no aggregation) → risk = low
  • N:M with any aggregation → risk = critical


Phase 4: Multi-Hop Path Analysis (future)

Goal: Extend risk scoring to multi-table join paths (A→B→C).

Deferred — not in scope for M1 pilot. Tracked as a future opportunity in mx-far-future-ideas.

Key considerations for when this is picked up:

  • Transitive fanout: 1:N × 1:N = 1:N² — risk compounds
  • Join ordering matters: aggregating between hops changes the grain
  • Graph traversal: shortest safe path between two tables
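The compounding arithmetic is simple enough to pin down now as a note for the future work: per-hop fanout factors multiply along the path, and an intermediate aggregation resets the running total. This sketch assumes a hypothetical per-hop dict shape (fanout_factor, pre_aggregated), which is not part of any current schema.

```python
# Hedged sketch of transitive fanout for Phase 4. Hop dicts are a
# hypothetical shape; only the multiply-and-reset arithmetic is the point.

def path_fanout(hops: list[dict]) -> float:
    """Total left-side row multiplication along a join path A -> B -> C -> ..."""
    total = 1.0
    for hop in hops:
        if hop.get("pre_aggregated"):
            total = 1.0  # aggregating between hops resets the running fanout
        total *= hop["fanout_factor"]
    return total

# orders -> order_items (3.2x) -> item_taxes (2.0x): risk compounds to 6.4x
print(path_fanout([{"fanout_factor": 3.2}, {"fanout_factor": 2.0}]))  # 6.4
```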


Implementation Sequencing

| Phase | Module | Depends on | Estimated effort | Milestone |
| --- | --- | --- | --- | --- |
| 1 | grain_detector.py | Existing profiler output | 2–3 days | M1 |
| 2 | join_multiplicity.py | Phase 1 + Layer 6 (or naming fallback) | 3–4 days | M1 |
| 3 | fanout_risk.py | Phase 2 | 2–3 days | M1 |
| 4 | Multi-hop paths | Phase 3 | TBD | M2+ |

Phases 1–3 target M1 (pilot). Phase 4 is deferred.

Contract Changes

  • PROFILER_CONTRACT_VERSION bumps from 1.0 → 1.1 (additive: new optional grain and relationships keys)
  • Existing consumers are unaffected (new keys are optional per versioning policy)
  • inspector_schema.md updated with new sections

Acceptance Criteria

  1. inspect.json includes grain object for every profiled table
  2. inspect.json includes relationships array with multiplicity and fanout metadata
  3. Grain detection has ≥ 80% accuracy on pilot datasets (measured by manual review)
  4. Fanout risk warnings fire correctly for the known-risky patterns from the Fivetran dbt packages analysis
  5. MCP get_table_context tool returns grain + fanout data
  6. At least one compile-time warning path is wired up for dashboard queries