Dataface Tasks

On-demand result-set profiling in chart decisions pipeline

ID: DFT_CORE-ON_DEMAND_RESULT_SET_PROFILING_IN_CHART_DECISIONS_PIPELINE
Status: not_started
Priority: p2
Milestone: m2-internal-adoption-design-partners
Owner: sr-engineer-architect

Problem

The chart decisions pipeline (decisions.py and chart_type_detection.py) already scans query result data to classify columns and pick formatting. But it computes only basic characteristics — dtype, cardinality, min/max, is_integer — and uses column-name regex for semantic detection (currency, percentage). This leads to:

  1. Wrong axis sort order — ordinal columns like priority (low/medium/high) or T-shirt sizes (S/M/L/XL) sort alphabetically, producing confusing bar charts.
  2. Blunt chart type detection — uses hard cardinality cutoffs (>20 → scatter) without understanding if a numeric column is truly continuous or categorical-encoded.
  3. Sparse columns as axis candidates — a 95%-null column can get picked as the y-axis because nulls aren't checked.
  4. Brittle semantic detection: the _CURRENCY_PATTERNS regex matches column names only; it misses currency columns with non-standard names and false-positives on columns that happen to contain "amount".

The inspector already solves all of these with is_ordinal, distribution, completeness, and value-pattern-based semantic detection — but those run during full table profiling. The same logic can run cheaply on the 50-500 rows already in memory after query execution.
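To make the ordinal problem concrete, here is a minimal sketch of vocabulary-based ordinal detection on in-memory values. The vocabularies and the function name detect_ordinal_order are hypothetical placeholders; the inspector's actual lists in quality_detector.py are not reproduced here.

```python
from __future__ import annotations

from typing import Optional, Sequence

# Hypothetical vocabularies for illustration only; the inspector's real
# lists are not shown in this task. Each list is in its natural order.
ORDINAL_VOCABULARIES = [
    ["low", "medium", "high", "critical"],
    ["xs", "s", "m", "l", "xl", "xxl"],
    ["monday", "tuesday", "wednesday", "thursday",
     "friday", "saturday", "sunday"],
]


def detect_ordinal_order(values: Sequence[object]) -> Optional[list[str]]:
    """Return the distinct non-null values in ordinal order, or None.

    A column counts as ordinal when every distinct value (compared
    case-insensitively) belongs to a single known vocabulary.
    """
    distinct = {str(v).strip() for v in values if v is not None}
    if len(distinct) < 2:
        return None  # nothing to order
    by_lower = {v.lower(): v for v in distinct}
    for vocab in ORDINAL_VOCABULARIES:
        if set(by_lower) <= set(vocab):
            # Keep the original casing, but follow the vocabulary's order.
            return [by_lower[term] for term in vocab if term in by_lower]
    return None


priorities = ["high", "low", None, "medium", "low"]
print(sorted({p for p in priorities if p is not None}))  # alphabetical order
print(detect_ordinal_order(priorities))                  # vocabulary order
```

A returned list doubles as the is_ordinal flag (not None) and as the sort order an axis could use in place of alphabetical sorting.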

Context

  • Current profiling: decisions.py:_profile_columns() (line 196) builds ColumnProfile per column. chart_type_detection.py:_analyze_column_types() (line 279) classifies columns, _analyze_cardinality() counts distinct values.
  • Inspector detectors: quality_detector.py computes is_ordinal by matching values against known ordinal vocabularies. semantic_detector.py does value-pattern analysis for currency, percentage, etc.
  • Relationship to static propagation: Static propagation carries semantic types through SQL ASTs (via SQLGlot, independently of ASQL) without re-scanning. On-demand profiling is complementary — it handles any query whose result data is in memory, and computes statistical properties (distribution, completeness) that inherently change after transforms and can't be propagated statically.
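For reference, value-pattern analysis on a small result set might look like the sketch below. It does not copy semantic_detector.py; the function names, the name-hint regex, and the exact rules are assumptions chosen to mirror the signals named in this task ($ prefix, consistent decimal places, non-negative range, 0-1 or 0-100 value range).

```python
import math
import re
from typing import List, Optional, Sequence

# Hypothetical name hint; the real _CURRENCY_PATTERNS regex is not shown here.
_CURRENCY_NAME_HINT = re.compile(r"amount|price|cost|revenue", re.IGNORECASE)


def _finite_numbers(values: Sequence[object]) -> Optional[List[float]]:
    """Return non-null values as floats, or None if any is not a finite number."""
    numbers = []
    for v in values:
        if v is None:
            continue
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            return None
        if not math.isfinite(v):
            return None
        numbers.append(float(v))
    return numbers


def looks_like_currency(name: str, values: Sequence[object]) -> bool:
    """Value pattern first; the column name only confirms numeric columns."""
    present = [v for v in values if v is not None]
    if not present:
        return False
    # Primary signal: every value is a string starting with "$".
    if all(isinstance(v, str) and v.lstrip().startswith("$") for v in present):
        return True
    numbers = _finite_numbers(present)
    if numbers is None:
        return False
    value_shape = (
        all(n >= 0 for n in numbers)                # non-negative range
        and all(round(n, 2) == n for n in numbers)  # at most two decimals
        and any(n != int(n) for n in numbers)       # not a plain integer count
    )
    # Secondary signal: the name hint must agree for numeric-only columns.
    return value_shape and bool(_CURRENCY_NAME_HINT.search(name))


def percentage_scale(values: Sequence[object]) -> Optional[str]:
    """Return "fraction" for 0-1 data, "percent" for 0-100 data, else None."""
    numbers = _finite_numbers(values)
    if not numbers:
        return None
    if all(0 <= n <= 1 for n in numbers):
        return "fraction"
    if all(0 <= n <= 100 for n in numbers):
        return "percent"
    return None
```

Under these assumed rules, a column named amount_of_items holding small integers is rejected even though its name contains "amount", while "$"-prefixed strings are accepted under any column name. A value range alone is weak evidence for percentages, so a real implementation would likely combine it with other signals.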

Possible Solutions

Recommended: Expand ColumnProfile and _profile_columns() in decisions.py. Add flags directly, reusing detection logic from the inspector's quality_detector.py and semantic_detector.py. Keep it lightweight — no new module, no dependency on the inspector, just the same pattern-matching logic inlined or extracted into a shared utility.

Alternative: Call the full inspector on the result set. Overkill — the inspector builds ColumnInspection with 30+ fields, runs SQL queries, etc. We just need 3-4 cheap derived flags.
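A minimal sketch of how cheap the recommended flags could be to compute on rows already in memory. ColumnProfileSketch, profile_column, and the 0.5 ratio threshold are hypothetical; the real ColumnProfile in decisions.py has other fields not shown here, and is_ordinal is left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ColumnProfileSketch:
    """Hypothetical stand-in for the extra fields on ColumnProfile."""
    name: str
    null_ratio: float
    distribution: str  # "categorical" | "continuous" | "unique" | "constant"


def classify_distribution(distinct_count: int, non_null_count: int,
                          categorical_max_ratio: float = 0.5) -> str:
    """Classify a column by its distinct-count / row-count ratio.

    The 0.5 default is an assumed threshold, not a value from the codebase.
    """
    if distinct_count <= 1:
        return "constant"
    if distinct_count == non_null_count:
        return "unique"
    if distinct_count / non_null_count <= categorical_max_ratio:
        return "categorical"
    return "continuous"


def profile_column(name: str, values: Sequence[object]) -> ColumnProfileSketch:
    """Profile one column of an in-memory result set (hashable scalars assumed)."""
    total = len(values)
    present = [v for v in values if v is not None]
    # An empty column is treated as fully null.
    null_ratio = 1.0 if total == 0 else 1 - len(present) / total
    distinct_count = len(set(present))
    return ColumnProfileSketch(
        name=name,
        null_ratio=null_ratio,
        distribution=classify_distribution(distinct_count, len(present)),
    )
```

Everything here is a single pass over the sample plus one set construction, which is why no inspector dependency or extra SQL query is needed for 50-500 rows.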

Plan

Files to modify:

  1. dataface/core/render/chart/decisions.py
     - Expand ColumnProfile with: is_ordinal: bool, null_ratio: float, distribution: str (categorical/continuous/unique/constant)
     - In _profile_columns(): compute null_ratio from the sample; detect is_ordinal by checking distinct values against known ordinal patterns (days of week, months, T-shirt sizes, low/medium/high/critical, Likert scales); classify distribution using the distinct-count/row-count ratio
     - Replace the _CURRENCY_PATTERNS regex with value-pattern analysis (check for $ prefix, consistent decimal places, non-negative range) as the primary signal; keep the regex as a secondary signal
     - Replace the _PERCENTAGE_PATTERNS regex with value-range analysis (all values 0-1 or 0-100)

  2. dataface/core/compile/chart_type_detection.py
     - Use distribution to improve chart type heuristics (categorical int column → bar, not scatter)
     - Skip high-null columns when picking axis candidates

  3. dataface/core/render/chart/decisions.py (axis settings)
     - When is_ordinal, apply ordinal sort order to the axis instead of alphabetical
     - When null_ratio > 0.9, deprioritize the column as an axis candidate

  4. Tests — ordinal sort for known vocabularies, distribution classification, null-column skipping, currency/percentage detection without regex
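The chart-type and axis-candidate changes in items 2 and 3 could be sketched as follows. Profile, suggest_chart, and axis_candidates are hypothetical names, and the rule is deliberately reduced to the one case the plan names (a categorical integer column should yield a bar chart, not a scatter plot); the existing heuristics in chart_type_detection.py are not reproduced.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Profile:
    """Hypothetical stand-in for the expanded ColumnProfile."""
    name: str
    is_numeric: bool
    null_ratio: float
    distribution: str  # "categorical" | "continuous" | "unique" | "constant"


MAX_NULL_RATIO = 0.9  # threshold named in the plan


def axis_candidates(profiles: Sequence[Profile]) -> List[Profile]:
    """Order columns so that sparse ones (null_ratio > 0.9) come last.

    sorted() is stable, so the original order is kept within each group.
    """
    return sorted(profiles, key=lambda p: p.null_ratio > MAX_NULL_RATIO)


def suggest_chart(x: Profile, y: Profile) -> str:
    """Pick scatter only when the x column is numeric and truly continuous."""
    if x.is_numeric and y.is_numeric and x.distribution == "continuous":
        return "scatter"
    # Categorical-encoded integers (ratings, status codes) fall through to bar.
    return "bar"


rating = Profile("rating", True, 0.0, "categorical")
weight = Profile("weight_kg", True, 0.0, "continuous")
revenue = Profile("revenue", True, 0.0, "continuous")
print(suggest_chart(rating, revenue))  # bar
print(suggest_chart(weight, revenue))  # scatter
```

Sorting sparse columns to the end implements "deprioritize" rather than a hard skip, so a sparse column can still be chosen when nothing denser is available.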

Implementation Progress

Review Feedback

  • [ ] Review cleared