On-demand result-set profiling in chart decisions pipeline
Problem
The chart decisions pipeline (decisions.py and chart_type_detection.py) already scans query result data to classify columns and pick formatting. But it computes only basic characteristics — dtype, cardinality, min/max, is_integer — and uses column-name regex for semantic detection (currency, percentage). This leads to:
- Wrong axis sort order — ordinal columns like priority (low/medium/high) or T-shirt sizes (S/M/L/XL) sort alphabetically, producing confusing bar charts.
- Blunt chart type detection — uses hard cardinality cutoffs (>20 → scatter) without understanding if a numeric column is truly continuous or categorical-encoded.
- Sparse columns as axis candidates — a 95%-null column can get picked as the y-axis because nulls aren't checked.
- Brittle semantic detection: the _CURRENCY_PATTERNS regex matches column names only; it misses currency columns with non-standard names and false-positives on columns that happen to contain "amount".
The inspector already solves all of these with is_ordinal, distribution, completeness, and value-pattern-based semantic detection — but those run during full table profiling. The same logic can run cheaply on the 50-500 rows already in memory after query execution.
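To make the sort-order symptom concrete, the sketch below compares plain alphabetical sorting with a vocabulary-based ordinal sort on a small in-memory sample. The ORDINAL_VOCABULARIES table and both helper functions are hypothetical names chosen for illustration; the inspector's actual vocabularies and API may differ.

```python
from typing import Optional, Sequence

# Hypothetical vocabularies for illustration; the inspector's real lists may differ.
ORDINAL_VOCABULARIES: list[list[str]] = [
    ["low", "medium", "high", "critical"],
    ["xs", "s", "m", "l", "xl", "xxl"],
]


def find_vocabulary(values: Sequence[str]) -> Optional[list[str]]:
    """Return the first vocabulary that covers every distinct value, else None."""
    distinct = {v.strip().lower() for v in values}
    if len(distinct) < 2:
        return None
    for vocab in ORDINAL_VOCABULARIES:
        if distinct <= set(vocab):
            return vocab
    return None


def ordinal_sort(values: Sequence[str]) -> list[str]:
    """Sort by vocabulary rank when a vocabulary matches, otherwise alphabetically."""
    vocab = find_vocabulary(values)
    if vocab is None:
        return sorted(values)
    rank = {label: i for i, label in enumerate(vocab)}
    return sorted(values, key=lambda v: rank[v.strip().lower()])


priorities = ["medium", "high", "low"]
print(sorted(priorities))        # alphabetical: ['high', 'low', 'medium']
print(ordinal_sort(priorities))  # ordinal: ['low', 'medium', 'high']
```

Because the lookup only touches the distinct values of the sample, it stays cheap on result sets of a few hundred rows.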
Context
- Current profiling: decisions.py:_profile_columns() (line 196) builds ColumnProfile per column. chart_type_detection.py:_analyze_column_types() (line 279) classifies columns, and _analyze_cardinality() counts distinct values.
- Inspector detectors: quality_detector.py computes is_ordinal by matching values against known ordinal vocabularies. semantic_detector.py does value-pattern analysis for currency, percentage, etc.
- Relationship to static propagation: Static propagation carries semantic types through SQL ASTs (via SQLGlot, independently of ASQL) without re-scanning. On-demand profiling is complementary: it handles any query whose result data is in memory, and computes statistical properties (distribution, completeness) that inherently change after transforms and can't be propagated statically.
Possible Solutions
Recommended: Expand ColumnProfile and _profile_columns() in decisions.py. Add flags directly, reusing detection logic from the inspector's quality_detector.py and semantic_detector.py. Keep it lightweight — no new module, no dependency on the inspector, just the same pattern-matching logic inlined or extracted into a shared utility.
Alternative: Call the full inspector on the result set. Overkill — the inspector builds ColumnInspection with 30+ fields, runs SQL queries, etc. We just need 3-4 cheap derived flags.
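A minimal sketch of what the recommended flags could look like when computed from the in-memory sample. ColumnProfileFlags, profile_values, the vocabulary table, and the 0.5 ratio cutoff are all assumptions made for illustration; they are not the existing ColumnProfile or the inspector's detectors.

```python
import math
from dataclasses import dataclass
from typing import Any, Sequence

# Hypothetical vocabularies; a shared utility could hold the inspector's real lists.
_ORDINAL_VOCABULARIES = (
    ("low", "medium", "high", "critical"),
    ("xs", "s", "m", "l", "xl", "xxl"),
    ("monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"),
)


@dataclass
class ColumnProfileFlags:
    """Illustrative subset of the proposed ColumnProfile additions."""
    is_ordinal: bool
    null_ratio: float
    distribution: str  # "categorical" | "continuous" | "unique" | "constant"


def _is_null(value: Any) -> bool:
    return value is None or (isinstance(value, float) and math.isnan(value))


def _is_ordinal(distinct: set) -> bool:
    labels = {str(v).strip().lower() for v in distinct}
    return len(labels) >= 2 and any(labels <= set(vocab) for vocab in _ORDINAL_VOCABULARIES)


def _classify_distribution(distinct_count: int, non_null_count: int, numeric: bool) -> str:
    if distinct_count <= 1:
        return "constant"  # also covers all-null columns
    if distinct_count == non_null_count:
        return "unique"
    # Assumed cutoff: numeric columns with mostly distinct values count as continuous.
    if numeric and distinct_count / non_null_count > 0.5:
        return "continuous"
    return "categorical"


def profile_values(values: Sequence[Any]) -> ColumnProfileFlags:
    """Derive the three flags from one column of an in-memory result set."""
    total = len(values)
    non_null = [v for v in values if not _is_null(v)]
    null_ratio = 1.0 if total == 0 else (total - len(non_null)) / total
    distinct = set(non_null)
    numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    return ColumnProfileFlags(
        is_ordinal=_is_ordinal(distinct),
        null_ratio=null_ratio,
        distribution=_classify_distribution(len(distinct), len(non_null), numeric),
    )


print(profile_values(["low", "high", None, "medium", "low"]))
```

Each flag needs only one pass over the column plus a set of distinct values, which is why this can run on the rows already in memory without calling the inspector.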
Plan
Files to modify:
- dataface/core/render/chart/decisions.py
  - Expand ColumnProfile with: is_ordinal: bool, null_ratio: float, distribution: str (categorical/continuous/unique/constant)
  - In _profile_columns(): compute null_ratio from sample; detect is_ordinal by checking distinct values against known ordinal patterns (days of week, months, T-shirt sizes, low/medium/high/critical, Likert scales); classify distribution using distinct-count/row-count ratio
  - Replace _CURRENCY_PATTERNS regex with value-pattern analysis (check for $ prefix, consistent decimal places, non-negative range) as primary signal; keep regex as secondary
  - Replace _PERCENTAGE_PATTERNS regex with value-range analysis (all values 0-1 or 0-100)
- dataface/core/compile/chart_type_detection.py
  - Use distribution to improve chart type heuristics (categorical int column → bar, not scatter)
  - Skip high-null columns when picking axis candidates
- dataface/core/render/chart/decisions.py (axis settings)
  - When is_ordinal, apply ordinal sort order to axis instead of alphabetical
  - When null_ratio > 0.9, deprioritize column as axis candidate
- Tests: ordinal sort for known vocabularies, distribution classification, null-column skipping, currency/percentage detection without regex
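The value-based semantic checks in the plan could be sketched roughly as follows. The function names, the name-hint regex, and the exact rules are assumptions for illustration, not the current _CURRENCY_PATTERNS or _PERCENTAGE_PATTERNS definitions. Range checks alone are ambiguous (ages and counts also fall in 0-100), so percentage_scale returns a candidate scale rather than a verdict.

```python
import re
from typing import Any, Optional, Sequence

# Hypothetical stand-in for the existing column-name regex (secondary signal).
_CURRENCY_NAME_HINT = re.compile(r"(price|revenue|cost|amount)", re.IGNORECASE)
# Strings such as "$1,200.00" or "$15" (primary signal for text columns).
_CURRENCY_STRING = re.compile(r"^\$\d[\d,]*(\.\d{2})?$")


def _is_number(value: Any) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def looks_like_currency(name: str, values: Sequence[Any]) -> bool:
    """Values decide first; the column name only confirms ambiguous numeric data."""
    sample = [v for v in values if v is not None]  # NaN handling omitted for brevity
    if not sample:
        return False
    if all(isinstance(v, str) for v in sample):
        return all(_CURRENCY_STRING.match(v.strip()) for v in sample)
    if all(_is_number(v) for v in sample):
        non_negative = all(v >= 0 for v in sample)
        at_most_two_decimals = all(round(v, 2) == v for v in sample)
        # The value checks can veto a name match, but never pass on their own.
        return non_negative and at_most_two_decimals and bool(_CURRENCY_NAME_HINT.search(name))
    return False


def percentage_scale(values: Sequence[Any]) -> Optional[str]:
    """Return "fraction" for values in 0-1, "percent" for 0-100, else None."""
    sample = [v for v in values if v is not None]
    if not sample or not all(_is_number(v) for v in sample):
        return None
    lowest, highest = min(sample), max(sample)
    if lowest < 0:
        return None
    if highest <= 1:
        # Require a fractional value so 0/1 flag columns are not misread.
        return "fraction" if any(v != int(v) for v in sample) else None
    if highest <= 100:
        return "percent"
    return None


print(looks_like_currency("total", ["$1,200.00", "$15.50", None]))
print(looks_like_currency("amount", [0.3333, 1.25]))
print(percentage_scale([0.1, 0.55, 1.0]))
```

In this sketch a column named "amount" holding values like 0.3333 is rejected by the decimal-place check, which is the false positive a name-only regex cannot avoid.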
Implementation Progress
Review Feedback
- [ ] Review cleared