AI_CONTEXT grain and fanout risk signals (beta subset)
Problem
AI agents generating SQL queries have no metadata about table grain, join multiplicity, or fanout risk. When an agent joins two tables without understanding their cardinality relationship, it can silently produce duplicated rows that inflate aggregates — a dangerous class of error because the results look plausible but are wrong. Without grain and fanout signals in AI_CONTEXT, every AI-generated join is a potential correctness risk that analysts must manually verify.
Context
- Every profiled table has a
grainobject (columns, label, confidence, source). - Table pairs with relationships have
multiplicityandfanout_factormetadata. - Fanout risk levels (none/low/medium/high/critical) with actionable recommendations.
- MCP
get_table_contextreturns grain data. - At least one compile-time warning path fires for high-risk joins.
- 2026-03-26 follow-on direction: these catalog signals remain useful, but validator-quality fanout detection should start from SQL structure rather than from edge risk alone.
Possible Solutions
Recommended: Pure-function modules consuming existing profiler stats (no extra DB queries). Join multiplicity and fanout risk are computed from Layer 6 relationship edges + column uniqueness ratios already in TableInspection. Compile-time warnings use simple SQL table-name matching against scored relationships.
Alternative considered: Live DB queries for exact row-count verification — rejected per no-extra-queries constraint and profiler-stats-only design.
Plan
- Create
join_multiplicity.pywithJoinProfiledataclass andclassify_multiplicity()/profile_join()/enrich_relationships()pure functions. - Create
fanout_risk.pywithFanoutRiskdataclass,score_fanout_risk(), andcheck_query_fanout_warnings()for compile-time warnings. - Wire
enrich_relationships()+score_fanout_risk()into MCP_detect_catalog_relationships(). - Add
enrich_compile_warnings()to compiler for post-compile fanout warning injection. - Update
inspector_schema.mdwith grain, relationship, join_profile, and fanout_risk documentation.
Implementation Progress
Implementation phases
Phase 1: Grain candidate inference (~2–3 days)
- Create
dataface/core/inspect/grain_detector.pywithGrainCandidatedataclass. - Implement
detect_grain()— PK detection, composite grain heuristics, confidence scoring. - Wire into
TableInspector.inspect_table()to populategrainonTableInspection. - Add
graintoTableInspection.to_dict()andto_json_dict(). - Tests: single PK, composite grain, fact table with date, ambiguous tables, no-grain tables.
- Optional: verification query behind config flag to confirm grain via
COUNT(DISTINCT ...).
Phase 2: Join multiplicity profiling (~3–4 days)
- Create
dataface/core/inspect/join_multiplicity.pywithJoinProfiledataclass. - Implement
classify_multiplicity()from uniqueness ratios (1:1, 1:N, N:1, N:M). - Implement
profile_join()to compute fanout factor and coverage from existing stats. - Consume Layer 6 relationship output (or naming-convention fallback if Layer 6 not ready).
- Add
relationshipsarray to inspect.json output. - Tests: 1:N orders→items, N:1 orders→customers, 1:1 self-join, N:M bridge scenarios.
Phase 3: Fanout risk scoring and warnings (~2–3 days)
- Create
dataface/core/inspect/fanout_risk.pywithFanoutRiskdataclass. - Implement
score_fanout_risk()— risk level from join profile + aggregation context. - Wire risk scores into relationship output.
- Add compile-time warning hook for dashboard queries with high-risk joins.
- Surface in MCP tool responses.
- Tests: safe joins (1:1, N:1), risky joins (1:N + SUM), critical joins (N:M + agg).
- Validate against known-risky patterns from Fivetran dbt packages analysis.
Contract changes
PROFILER_CONTRACT_VERSIONbump:1.0→1.1(additive, optional keys only).inspector_schema.mdupdated withgrainandrelationshipssections.-
Existing consumers unaffected per versioning policy.
-
Layer 6 relationship mapping task (M1-AICONTEXT-002) — Phase 2 can start with naming fallback.
- dft-core compile warning pathways — Phase 3 integration point.
-
AI_CONTEXT contract v1 (M1-AICONTEXT-001) — grain/fanout fields added to schema.
-
inspect.jsonincludesgrainobject for every profiled table. inspect.jsonincludesrelationshipsarray with multiplicity and fanout metadata.- Grain detection ≥ 80% accuracy on pilot datasets (manual review).
- Fanout warnings fire for known-risky Fivetran dbt package patterns.
- MCP
get_table_contextreturns grain data. - One compile-time warning path wired up for dashboard queries.
Execution narrative
Phase 1 (grain candidate inference) delivered end-to-end:
- grain_detector.py: pure function detect_grain() with 3-tier priority (PK → uniqueness → composite heuristic), named confidence constants, GrainCandidate frozen dataclass with to_dict()/to_json_dict().
- Wired into TableInspector.inspect_table() (step 12), TableInspection.to_dict()/to_json_dict() as optional key, MCP _enrich_table_from_profile, and schema_context._format_table_header.
- Contract version bumped to 1.1 (additive, backward compatible).
- 20 unit tests (detector) + 7 integration tests (full pipeline) + 3 contract tests (serialization). All 331 inspect-related tests pass. cbox review: APPROVED.
- Phase 2/3 deferred: Layer 6 relationship mapping (M1-AICONTEXT-002) not yet available. Creating modules without production wiring would violate the no-dead-code constraint.
Phase 2 (join multiplicity profiling) delivered:
- join_multiplicity.py: JoinProfile frozen dataclass, classify_multiplicity() for 4-way cardinality (1:1/1:N/N:1/N:M) from uniqueness ratios, profile_join() computing fanout factor + coverage, enrich_relationships() to extend relationship edges with join profiles.
- Uniqueness threshold at 0.99 (consistent with relationship detector). Fanout factor: 1.0 for safe joins (1:1, N:1), right_rows/left_rows for 1:N, max/min ratio for N:M.
- 22 unit tests: dataclass serialization, all 4 multiplicity classifications, boundary conditions, fanout computation, coverage from null percentages, zero-row safety, enrich_relationships integration.
Phase 3 (fanout risk scoring and warnings) delivered:
- fanout_risk.py: FanoutRisk frozen dataclass, score_fanout_risk() implementing the spec's risk heuristic table (1:1/N:1 → none, 1:N no-agg → low, 1:N + agg → medium/high, N:M → high/critical), check_query_fanout_warnings() for compile-time SQL-level warnings.
- Risk levels: none/low/medium/high/critical with human-readable reasons and actionable recommendations (e.g., "Pre-aggregate the many-side to the join key grain before joining").
- High fanout threshold at 10x for escalation from medium to high.
- 20 unit tests: dataclass, all risk level combinations, compile-time warning generation, table-name matching in SQL, edge cases.
Known limitation captured after delivery: - The compile warning path is relationship-first and SQL-string based. It is useful as coarse context surfacing, but it is weaker than a validator that detects joined multi-table aggregate ownership directly from SQL structure.
Integration wiring:
- MCP _detect_catalog_relationships(): enriches relationship edges with join_profile and fanout_risk sub-dicts using enrich_relationships() + score_fanout_risk().
- Compiler enrich_compile_warnings(): post-compile hook that injects fanout warnings into CompileResult.warnings by checking SQL queries for high/critical risk table pairs.
- inspector_schema.md updated with full documentation for grain, relationships, join_profile, and fanout_risk sections.
- 6 integration tests: full detect→enrich→score pipeline, compile warning injection (risky/safe/failed states).
All 48 new tests + 2268 existing tests pass. No regressions. cbox review pending.
- Full spec: initiatives/grain-fanout-risk/spec.md
- Fanout research: nimble/problems/fanout_deep_dive.md
- Fivetran analysis: nimble/problems/fivetran_dbt_packages_fanout_analysis.md
Review Feedback
- [ ] Review cleared