Dataface Tasks

Static semantic type propagation through SQL queries via SQLGlot

IDCONTEXT_CATALOG_NIMBLE-STATIC_SEMANTIC_TYPE_PROPAGATION_THROUGH_SQL_QUERIES_VIA_SQLGLOT
Statusnot_started
Priorityp1
Milestonem2-internal-adoption-design-partners
Ownerdata-ai-engineer-architect

Problem

The profiler (Layers 1–5) detects rich semantic types on base table columns — currency_amount, email, created_at, status, etc. — but this knowledge is trapped at the base table level. When a dft YAML query transforms those columns (SUM(order_total), MONTH(created_at)), the output columns lose their semantic types. The chart renderer must re-guess types via regex patterns on column names (_CURRENCY_PATTERNS in decisions.py), which is brittle and limited to two types (currency and percentage).

This means charts cannot automatically format currency, detect temporal axes, render emails as links, or apply ordinal sort — unless the column name happens to match a regex.

Context

  • Profiler semantic types: 43+ types detected by semantic_detector.py — see ai_notes/ai/AI_CONTEXT_ARCHITECTURE.md#layer-5
  • SQLGlot Expression.meta: Every AST node has meta: dict[str, Any] that survives deep copies and is excluded from hash/equality. Already used internally for "nonnull", "query_type".
  • SQLGlot annotate_types: Propagates SQL types (INT, DECIMAL, etc.) through the AST via post-order walk. Does NOT propagate meta.
  • Current chart type guessing: decisions.py uses _CURRENCY_PATTERNS regex and _PERCENTAGE_PATTERNS regex on column names + value heuristics.
  • Independence from ASQL: The propagation engine uses SQLGlot directly to parse any SQL dialect. It does not depend on ASQL's schema or compiler — it's a standalone utility in dataface/core/inspect/.
  • Related: Issue #377 (SQLGlot column lineage) covers lineage tracking; this task covers semantic type propagation specifically.
  • Companion task: On-demand result-set profiling in chart decisions pipeline (dft-core workstream, m2) — handles the rendering boundary where data is in memory.
  • Full design doc: ai_notes/features/SEMANTIC_TYPE_PROPAGATION.md — includes three-tier propagation model, Tier 2 rules for distribution/completeness/is_non_negative, and the static-vs-on-demand architecture.

Possible Solutions

Recommended: Expression.meta + custom post-pass. Ride alongside SQLGlot's existing type system rather than extending DataType.Type (which is a closed enum). Seed semantic metadata onto column nodes from a profiled schema (dict of inspector output), then propagate with a custom propagate_profiles() pass that understands which SQL operations preserve semantic types. Works with any SQL dialect SQLGlot supports — no dependency on ASQL.

Alternative: Extend DataType.Type with custom types (requires forking or monkeypatching SQLGlot — fragile, not recommended by Toby).

Alternative: Do all propagation at the dbt DAG level by running the profiler on every model (expensive, defeats the purpose).

Plan

Three-tier propagation model (Tier 1: semantic identity, Tier 2: structural properties with transform-aware rules, Tier 3: float confidence scores that degrade through transforms). See design doc for full detail.

Phase 1: Tier 1 (semantic identity) 1. New: dataface/core/inspect/propagation.pyProfiledColumn dataclass, ProfiledSchema type alias, schema_from_inspections() factory, propagate_profiles() entry point with Tier 1 rules. Uses SQLGlot directly — no ASQL dependency. 2. dataface/core/execute/adapters/base.py — Add column_semantics: dict[str, dict] | None to QueryResult. 3. dataface/core/execute/executor.py — Cache and expose column_semantics alongside column_descriptions. 4. dataface/core/render/chart/decisions.py — Check column_semantics first, fall back to regex heuristics. 5. dataface/core/render/chart/pipeline.py and rendering.py — Thread column_semantics through resolve_chart().

Phase 2: Tier 2 (structural properties) 6. Add GROUP BY → distribution rules (grouped column becomes unique, aggregated measure stays continuous). 7. Add JOIN → completeness rules (LEFT JOIN right-side may downgrade). 8. Add WHERE → completeness upgrade rules (IS NOT NULL → complete).

Phase 3: Tier 3 (float confidence scores) 9. Carry has_outliers, is_skewed, is_sequential, is_incremental as float scores that degrade through transforms (× penalty factors).

Tests — Propagation rules per tier; GROUP BY distribution change; LEFT JOIN completeness downgrade; end-to-end from profiled schema through SQLGlot to chart formatting.

Implementation Progress

Review Feedback

  • [ ] Review cleared