Dataface Tasks

Float confidence scores for statistical column characteristics

ID: INSPECT_PROFILER-FLOAT_CONFIDENCE_SCORES_FOR_STATISTICAL_COLUMN_CHARACTERISTICS
Status: not_started
Priority: p2
Milestone: m2-internal-adoption-design-partners
Owners: r-engineer-architect

Problem

The profiler's statistical characteristics (is_sequential, has_outliers, is_skewed, etc.) are currently binary booleans. This throws away useful information — a column with 2 gaps in 10k rows is very different from one with a perfect contiguous range, but both get is_sequential = True. Binary flags also can't degrade through SQL transforms: after a WHERE filter that removes 15% of rows, is the column still sequential? Binary gives no answer; a float score naturally drops from 0.99 to 0.84.
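The degradation behavior can be sketched with a gap-fraction score. This is a hypothetical helper (not the real quality_detector.py code) using the 1.0 - gaps/count rule proposed later in the Plan:

```python
def sequential_score(values: list[int]) -> float:
    """Gap-fraction sequentiality score: 1.0 for a perfectly contiguous
    range, degrading smoothly as gaps appear. Hypothetical sketch."""
    if not values:
        return 0.0
    count = len(values)
    gaps = (max(values) - min(values) + 1) - count
    return min(1.0, max(0.0, 1.0 - gaps / count))

full = list(range(1, 101))               # contiguous id column, 1..100
filtered = [v for v in full if v % 7]    # a WHERE filter drops ~14% of rows

print(sequential_score(full))      # 1.0
print(sequential_score(filtered))  # ≈ 0.84 — still high, no longer perfect
```

The binary version of the same check flips from True to "unknown" the moment the first gap appears; the float version carries the degradation through.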

Float scores also eliminate the need for a separate "confidence": "inherited" metadata field when propagating through dbt DAGs — the score is the confidence. Different consumers can pick appropriate thresholds: a data quality lint requires is_sequential > 0.95, while a chart renderer just needs > 0.7.
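The per-consumer thresholds might look like this in practice (threshold constants and function names are illustrative, not an existing API):

```python
SEQUENTIAL_FOR_LINT = 0.95   # data quality lint: strict
SEQUENTIAL_FOR_CHART = 0.70  # chart renderer: lenient

def lint_flags_sequential(score: float) -> bool:
    # Strict consumer: only a near-perfect range counts
    return score > SEQUENTIAL_FOR_LINT

def chart_treats_as_sequential(score: float) -> bool:
    # Lenient consumer: mostly-sequential is good enough for axis choice
    return score > SEQUENTIAL_FOR_CHART

score = 0.84  # e.g., after a filter removed some rows
print(lint_flags_sequential(score), chart_treats_as_sequential(score))
# False True — same score, different decisions
```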

Context

  • Current implementation: quality_detector.py computes binary flags in ColumnClassification, copied onto ColumnInspection. Each flag uses threshold logic that already produces intermediate continuous values (e.g., is_sequential checks max - min + 1 == count, has_outliers checks > 3σ from mean).
  • Downstream consumers: schema_context.py (AI context), decisions.py (chart formatting), IDE extension (table profiling display), and the planned static propagation engine.
  • Design doc: ai_notes/features/SEMANTIC_TYPE_PROPAGATION.md — Tier 3 section covers degradation rules and consumer threshold examples.
  • Related tasks: Static semantic type propagation (context-catalog-nimble, m2), On-demand result-set profiling (dft-core, m2).
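For reference, the current binary threshold logic described in the first bullet is roughly of this shape (a simplified sketch; names and details differ from the actual quality_detector.py):

```python
import statistics

def is_sequential_binary(values: list[int]) -> bool:
    # Current-style check: True only for a perfectly contiguous range
    return bool(values) and (max(values) - min(values) + 1) == len(values)

def has_outliers_binary(values: list[float]) -> bool:
    # Current-style check: any value more than 3 sigma from the mean
    if len(values) < 2:
        return False
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return False
    return any(abs(v - mean) > 3 * sd for v in values)
```

Note that both already compute continuous intermediates (the gap count, the max deviation) and then discard them at the final comparison; the proposal is to return those intermediates as scores instead.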

Possible Solutions

Recommended: Change flag types from bool to float in ColumnInspection and ColumnClassification. The detectors already compute intermediate continuous values — expose them instead of thresholding to binary. Backward-compatible: consumers that only check truthiness (if col.has_outliers) still work since 0.0 is falsy and any positive float is truthy.
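The backward-compatibility claim can be checked directly (Col is a minimal stand-in for ColumnInspection, not the real class):

```python
from dataclasses import dataclass

@dataclass
class Col:
    has_outliers: float  # was bool, now a 0.0-1.0 score

def old_consumer(col: Col) -> str:
    # Pre-existing truthiness check keeps working:
    # 0.0 is falsy, any positive score is truthy
    return "flag outliers" if col.has_outliers else "clean"

print(old_consumer(Col(has_outliers=0.0)))   # clean
print(old_consumer(Col(has_outliers=0.12)))  # flag outliers
```

One caveat worth noting: a weak score like 0.12 is truthy, so truthiness-only consumers become more trigger-happy than before; that is why the plan moves downstream consumers to explicit threshold checks.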

Alternative: Keep booleans, add separate *_confidence fields alongside each flag. Doubles the field count, clutters the API, and consumers still have to check two fields. Worse in every way.

Plan

Files to modify:

  1. dataface/core/inspect/quality_detector.py - Change ColumnClassification flags from bool to float - Update each detector to return a 0.0–1.0 score instead of thresholding:

    • is_sequential: max(0.0, 1.0 - (gaps / count)) where gaps = max - min + 1 - count (clamped so sparse columns can't go negative)
    • is_incremental: fraction of consecutive value pairs that are strictly increasing
    • has_outliers: normalized distance of most extreme value from mean, capped at 1.0 (e.g., min(1.0, max_deviation / (10 * stddev)))
    • is_skewed: min(1.0, |mean - median| / stddev)
    • is_zero_inflated: zero-fraction mapped to score (0.3–0.9 → 0.0–1.0)
    • is_normalized: tightness of clustering in [0, 1] range
    • is_non_negative: 1.0 if min >= 0, else 0.0
  2. dataface/core/inspect/inspector.py - Update ColumnInspection flag types from bool to float - Update to_dict() / to_json_dict() serialization

  3. dataface/ai/schema_context.py - Update context formatting to use thresholds for display (e.g., show flag label only if score > 0.5)

  4. dataface/core/render/chart/decisions.py - If consuming flags, use threshold checks instead of truthiness

  5. Tests - Update all flag assertions from assert ci.is_sequential is True to assert ci.is_sequential > 0.9 - Add tests for intermediate score values (e.g., column with some gaps should get 0.7–0.9, not 1.0)
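A couple of the proposed scoring rules, with test-style range assertions in the spirit of step 5 (hypothetical helpers; the real detectors live in quality_detector.py):

```python
import statistics

def skew_score(values: list[float]) -> float:
    """Plan rule for is_skewed: min(1.0, |mean - median| / stddev).
    Returns 0.0 for constant columns (stddev == 0)."""
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    return min(1.0, abs(statistics.fmean(values) - statistics.median(values)) / sd)

def non_negative_score(values: list[float]) -> float:
    """Plan rule for is_non_negative: 1.0 if min >= 0, else 0.0."""
    return 1.0 if values and min(values) >= 0 else 0.0

# Range assertions rather than identity checks against True/False:
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
skewed = [1.0, 1.0, 1.0, 1.0, 50.0]
assert skew_score(symmetric) == 0.0        # mean == median
assert skew_score(skewed) > 0.3            # clearly skewed, but not asserted == 1.0
assert non_negative_score([0.0, 2.0]) == 1.0
assert non_negative_score([-1.0, 2.0]) == 0.0
```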

Implementation Progress

Review Feedback

  • [ ] Review cleared