Float confidence scores for statistical column characteristics
Problem
The profiler's statistical characteristics (is_sequential, has_outliers, is_skewed, etc.) are currently binary booleans. This throws away useful information — a column with 2 gaps in 10k rows is very different from one with a perfect contiguous range, but both get is_sequential = True. Binary flags also can't degrade through SQL transforms: after a WHERE filter that removes 15% of rows, is the column still sequential? Binary gives no answer; a float score naturally drops from 0.99 to 0.84.
Float scores also eliminate the need for a separate `"confidence": "inherited"` metadata field when propagating through dbt DAGs — the score is the confidence. Different consumers can pick appropriate thresholds: a data quality lint requires `is_sequential > 0.95`, while a chart renderer just needs `> 0.7`.
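A minimal sketch of the degradation behavior described above, assuming a `1.0 - gaps/count` scoring rule. The function name and exact formula here are illustrative only, not the profiler's actual implementation:

```python
# Hypothetical continuous "sequential" score (illustrative, not the real detector).
def sequential_score(values: list[int]) -> float:
    """Score in [0, 1]: 1.0 for a perfectly contiguous range, lower as gaps appear."""
    if len(values) < 2:
        return 0.0
    distinct = len(set(values))
    span = max(values) - min(values) + 1   # size of the covered range
    gaps = span - distinct                 # missing values inside the range
    return max(0.0, 1.0 - gaps / distinct)

contiguous = list(range(10_000))
with_gaps = [v for v in contiguous if v not in (1234, 5678)]  # 2 gaps in 10k rows
filtered = [v for v in contiguous if v % 7 != 0]              # WHERE filter drops ~14%

sequential_score(contiguous)  # 1.0
sequential_score(with_gaps)   # ~0.9998 — still clearly sequential
sequential_score(filtered)    # ~0.83 — degraded, but the signal survives
```

A binary flag would report `True`, `True`, `False` (or worse, `True` for all three) for these cases; the float score preserves the distinction.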
Context
- Current implementation: `quality_detector.py` computes binary flags in `ColumnClassification`, copied onto `ColumnInspection`. Each flag uses threshold logic that already produces intermediate continuous values (e.g., `is_sequential` checks `max - min + 1 == count`; `has_outliers` checks for values > 3σ from the mean).
- Downstream consumers: `schema_context.py` (AI context), `decisions.py` (chart formatting), the IDE extension (table profiling display), and the planned static propagation engine.
- Design doc: `ai_notes/features/SEMANTIC_TYPE_PROPAGATION.md` — the Tier 3 section covers degradation rules and consumer threshold examples.
- Related tasks: Static semantic type propagation (context-catalog-nimble, m2), On-demand result-set profiling (dft-core, m2).
Possible Solutions
Recommended: Change flag types from `bool` to `float` in `ColumnInspection` and `ColumnClassification`. The detectors already compute intermediate continuous values — expose them instead of thresholding to binary. Backward-compatible: consumers that only check truthiness (`if col.has_outliers`) still work, since `0.0` is falsy and any positive float is truthy.
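The backward-compatibility claim can be sketched with a hypothetical minimal version of the class (field names mirror this ticket; this is not the real `ColumnClassification`):

```python
from dataclasses import dataclass

# Illustrative stand-in for the real class in quality_detector.py.
@dataclass
class ColumnClassification:
    is_sequential: float = 0.0
    has_outliers: float = 0.0

col = ColumnClassification(is_sequential=0.84, has_outliers=0.0)

# Old truthiness-style consumers keep working unchanged:
if col.is_sequential:      # any positive float is truthy
    ...
if not col.has_outliers:   # 0.0 is falsy
    ...

# New consumers pick their own thresholds:
lint_ok = col.is_sequential > 0.95   # strict data-quality check
chart_ok = col.is_sequential > 0.7   # looser chart-renderer check
```

Note the one behavioral difference: identity checks like `flag is True` break and must be migrated (see the test updates in the plan below).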
Alternative: Keep booleans and add separate `*_confidence` fields alongside each flag. This doubles the field count, clutters the API, and consumers still have to check two fields. Worse in every way.
Plan
Files to modify:
- `dataface/core/inspect/quality_detector.py`
  - Change `ColumnClassification` flags from `bool` to `float`
  - Update each detector to return a 0.0–1.0 score instead of thresholding:
    - `is_sequential`: `1.0 - (gaps / count)`, where gaps = `max - min + 1 - count`
    - `is_incremental`: fraction of consecutive value pairs that are strictly increasing
    - `has_outliers`: normalized distance of the most extreme value from the mean, capped at 1.0 (e.g., `min(1.0, max_deviation / (10 * stddev))`)
    - `is_skewed`: `min(1.0, |mean - median| / stddev)`, normalized
    - `is_zero_inflated`: zero-fraction mapped to a score (0.3–0.9 → 0.0–1.0)
    - `is_normalized`: tightness of clustering in the [0, 1] range
    - `is_non_negative`: 1.0 if `min >= 0`, else 0.0
- `dataface/core/inspect/inspector.py`
  - Update `ColumnInspection` flag types from `bool` to `float`
  - Update `to_dict()` / `to_json_dict()` serialization
- `dataface/ai/schema_context.py`
  - Update context formatting to use thresholds for display (e.g., show a flag label only if its score > 0.5)
- `dataface/core/render/chart/decisions.py`
  - If consuming flags, use threshold checks instead of truthiness
- Tests
  - Update all flag assertions from `assert ci.is_sequential is True` to `assert ci.is_sequential > 0.9`
  - Add tests for intermediate score values (e.g., a column with some gaps should score 0.7–0.9, not 1.0)
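Two of the scoring rules above, sketched as standalone helpers. The function names and edge-case handling (empty/constant columns scoring 0.0) are assumptions for illustration, not the actual `quality_detector.py` code:

```python
import statistics

def outlier_score(values: list[float]) -> float:
    """Normalized distance of the most extreme value from the mean, capped at 1.0."""
    if len(values) < 2:
        return 0.0
    mean = statistics.fmean(values)
    stddev = statistics.pstdev(values)
    if stddev == 0:
        return 0.0  # constant column: no outliers by definition
    max_deviation = max(abs(v - mean) for v in values)
    return min(1.0, max_deviation / (10 * stddev))

def skew_score(values: list[float]) -> float:
    """min(1.0, |mean - median| / stddev): 0.0 for symmetric distributions."""
    if len(values) < 2:
        return 0.0
    stddev = statistics.pstdev(values)
    if stddev == 0:
        return 0.0
    return min(1.0, abs(statistics.fmean(values) - statistics.median(values)) / stddev)

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
skewed = [1.0, 1.0, 1.0, 1.0, 100.0]

skew_score(symmetric)  # 0.0 — mean equals median
skew_score(skewed)     # 0.5 — mean pulled far from the median
```

Both return 0.0 for degenerate inputs rather than raising, so downstream truthiness checks treat empty or constant columns as "flag off", matching the binary behavior.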
Implementation Progress
Review Feedback
- [ ] Review cleared