Dataface Tasks

AI_CONTEXT grain and fanout risk signals (beta subset)

IDM1-AICONTEXT-006
Statuscompleted
Priorityp1
Milestonem1-ft-analytics-analyst-pilot
Ownerdata-ai-engineer-architect
Initiativegrain-fanout-risk

Problem

AI agents generating SQL queries have no metadata about table grain, join multiplicity, or fanout risk. When an agent joins two tables without understanding their cardinality relationship, it can silently produce duplicated rows that inflate aggregates — a dangerous class of error because the results look plausible but are wrong. Without grain and fanout signals in AI_CONTEXT, every AI-generated join is a potential correctness risk that analysts must manually verify.

Context

  • Every profiled table has a grain object (columns, label, confidence, source).
  • Table pairs with relationships have multiplicity and fanout_factor metadata.
  • Fanout risk levels (none/low/medium/high/critical) with actionable recommendations.
  • MCP get_table_context returns grain data.
  • At least one compile-time warning path fires for high-risk joins.
  • 2026-03-26 follow-on direction: these catalog signals remain useful, but validator-quality fanout detection should start from SQL structure rather than from edge risk alone.

Possible Solutions

Recommended: Pure-function modules consuming existing profiler stats (no extra DB queries). Join multiplicity and fanout risk are computed from Layer 6 relationship edges + column uniqueness ratios already in TableInspection. Compile-time warnings use simple SQL table-name matching against scored relationships.

Alternative considered: Live DB queries for exact row-count verification — rejected per no-extra-queries constraint and profiler-stats-only design.

Plan

  1. Create join_multiplicity.py with JoinProfile dataclass and classify_multiplicity()/profile_join()/enrich_relationships() pure functions.
  2. Create fanout_risk.py with FanoutRisk dataclass, score_fanout_risk(), and check_query_fanout_warnings() for compile-time warnings.
  3. Wire enrich_relationships() + score_fanout_risk() into MCP _detect_catalog_relationships().
  4. Add enrich_compile_warnings() to compiler for post-compile fanout warning injection.
  5. Update inspector_schema.md with grain, relationship, join_profile, and fanout_risk documentation.

Implementation Progress

Implementation phases

Phase 1: Grain candidate inference (~2–3 days)

  • Create dataface/core/inspect/grain_detector.py with GrainCandidate dataclass.
  • Implement detect_grain() — PK detection, composite grain heuristics, confidence scoring.
  • Wire into TableInspector.inspect_table() to populate grain on TableInspection.
  • Add grain to TableInspection.to_dict() and to_json_dict().
  • Tests: single PK, composite grain, fact table with date, ambiguous tables, no-grain tables.
  • Optional: verification query behind config flag to confirm grain via COUNT(DISTINCT ...).

Phase 2: Join multiplicity profiling (~3–4 days)

  • Create dataface/core/inspect/join_multiplicity.py with JoinProfile dataclass.
  • Implement classify_multiplicity() from uniqueness ratios (1:1, 1:N, N:1, N:M).
  • Implement profile_join() to compute fanout factor and coverage from existing stats.
  • Consume Layer 6 relationship output (or naming-convention fallback if Layer 6 not ready).
  • Add relationships array to inspect.json output.
  • Tests: 1:N orders→items, N:1 orders→customers, 1:1 self-join, N:M bridge scenarios.

Phase 3: Fanout risk scoring and warnings (~2–3 days)

  • Create dataface/core/inspect/fanout_risk.py with FanoutRisk dataclass.
  • Implement score_fanout_risk() — risk level from join profile + aggregation context.
  • Wire risk scores into relationship output.
  • Add compile-time warning hook for dashboard queries with high-risk joins.
  • Surface in MCP tool responses.
  • Tests: safe joins (1:1, N:1), risky joins (1:N + SUM), critical joins (N:M + agg).
  • Validate against known-risky patterns from Fivetran dbt packages analysis.

Contract changes

  • PROFILER_CONTRACT_VERSION bump: 1.01.1 (additive, optional keys only).
  • inspector_schema.md updated with grain and relationships sections.
  • Existing consumers unaffected per versioning policy.

  • Layer 6 relationship mapping task (M1-AICONTEXT-002) — Phase 2 can start with naming fallback.

  • dft-core compile warning pathways — Phase 3 integration point.
  • AI_CONTEXT contract v1 (M1-AICONTEXT-001) — grain/fanout fields added to schema.

  • inspect.json includes grain object for every profiled table.

  • inspect.json includes relationships array with multiplicity and fanout metadata.
  • Grain detection ≥ 80% accuracy on pilot datasets (manual review).
  • Fanout warnings fire for known-risky Fivetran dbt package patterns.
  • MCP get_table_context returns grain data.
  • One compile-time warning path wired up for dashboard queries.

Execution narrative

Phase 1 (grain candidate inference) delivered end-to-end: - grain_detector.py: pure function detect_grain() with 3-tier priority (PK → uniqueness → composite heuristic), named confidence constants, GrainCandidate frozen dataclass with to_dict()/to_json_dict(). - Wired into TableInspector.inspect_table() (step 12), TableInspection.to_dict()/to_json_dict() as optional key, MCP _enrich_table_from_profile, and schema_context._format_table_header. - Contract version bumped to 1.1 (additive, backward compatible). - 20 unit tests (detector) + 7 integration tests (full pipeline) + 3 contract tests (serialization). All 331 inspect-related tests pass. cbox review: APPROVED. - Phase 2/3 deferred: Layer 6 relationship mapping (M1-AICONTEXT-002) not yet available. Creating modules without production wiring would violate the no-dead-code constraint.

Phase 2 (join multiplicity profiling) delivered: - join_multiplicity.py: JoinProfile frozen dataclass, classify_multiplicity() for 4-way cardinality (1:1/1:N/N:1/N:M) from uniqueness ratios, profile_join() computing fanout factor + coverage, enrich_relationships() to extend relationship edges with join profiles. - Uniqueness threshold at 0.99 (consistent with relationship detector). Fanout factor: 1.0 for safe joins (1:1, N:1), right_rows/left_rows for 1:N, max/min ratio for N:M. - 22 unit tests: dataclass serialization, all 4 multiplicity classifications, boundary conditions, fanout computation, coverage from null percentages, zero-row safety, enrich_relationships integration.

Phase 3 (fanout risk scoring and warnings) delivered: - fanout_risk.py: FanoutRisk frozen dataclass, score_fanout_risk() implementing the spec's risk heuristic table (1:1/N:1 → none, 1:N no-agg → low, 1:N + agg → medium/high, N:M → high/critical), check_query_fanout_warnings() for compile-time SQL-level warnings. - Risk levels: none/low/medium/high/critical with human-readable reasons and actionable recommendations (e.g., "Pre-aggregate the many-side to the join key grain before joining"). - High fanout threshold at 10x for escalation from medium to high. - 20 unit tests: dataclass, all risk level combinations, compile-time warning generation, table-name matching in SQL, edge cases.

Known limitation captured after delivery: - The compile warning path is relationship-first and SQL-string based. It is useful as coarse context surfacing, but it is weaker than a validator that detects joined multi-table aggregate ownership directly from SQL structure.

Integration wiring: - MCP _detect_catalog_relationships(): enriches relationship edges with join_profile and fanout_risk sub-dicts using enrich_relationships() + score_fanout_risk(). - Compiler enrich_compile_warnings(): post-compile hook that injects fanout warnings into CompileResult.warnings by checking SQL queries for high/critical risk table pairs. - inspector_schema.md updated with full documentation for grain, relationships, join_profile, and fanout_risk sections. - 6 integration tests: full detect→enrich→score pipeline, compile warning injection (risky/safe/failed states).

All 48 new tests + 2268 existing tests pass. No regressions. cbox review pending.

Review Feedback

  • [ ] Review cleared