Dataface Tasks

Inspector cleanup wave 1 architectural decomposition and contract hardening

ID: INSPECT_PROFILER-INSPECTOR_CLEANUP_WAVE_1_ARCHITECTURAL_DECOMPOSITION_AND_CONTRACT_HARDENING
Status: completed
Priority: p1
Milestone: m1-ft-analytics-analyst-pilot
Owner: sr-engineer-architect
Initiative: inspector-cleanup-and-open-source-hardening
Completed by: dave
Completed: 2026-03-24

Problem

Plan and execute a deeper inspector cleanup pass that decomposes oversized inspector modules, tightens internal APIs, improves naming and helper boundaries, and strengthens contract-oriented regression coverage to meet the quality bar for the upcoming open-source release.

Context

  • The first maintainability pass reduced TableInspector._inspect_table_inner() into a smaller orchestration flow, but the inspector subsystem still has several oversized hotspots that will attract scrutiny during open-source review.
  • Current high-value complexity targets from a local scan:
  • dataface/core/inspect/semantic_detector.py: detector catalog is large and highly branchy, especially _detect_ip_address, _detect_region, and _detect_status.
  • dataface/core/inspect/connection.py: get_table_enrichment() centralizes multiple warehouse-specific behaviors in one method.
  • dataface/core/inspect/query_builder.py: build_approximate_stats_query() and _build_comprehensive_stats_query() still carry a lot of dialect-specific branching and query-shape logic.
  • dataface/core/inspect/quality_detector.py: classify() and _detect_role() still own a dense set of heuristics and output mutations.
  • The open-source quality bar here is not just “works”; the inspector internals should read like a clean reference implementation with clear boundaries, stable contracts, and obvious extension points.
  • Constraints:
  • Preserve existing inspect.json/API behavior unless a contract change is explicitly planned and documented.
  • Avoid deep behavior changes mixed with structural refactors in the same patch set.
  • Prefer small, reviewable cleanup waves with contract tests over a single repo-wide rewrite.
  • Keep the data-shape boundary intact: detection/inspection code owns semantics and profiling policy; renderers consume structured outputs only.

Possible Solutions

  • Big-bang inspector rewrite across modules in one branch.
  • Pros: maximum freedom to redesign boundaries.
  • Cons: high regression risk, difficult review, and poor fit for a subsystem that already has broad integration coverage and external consumers.
  • Wave-based cleanup initiative with contract-first refactors across the main hotspots. (Recommended)
  • Pros: lets us isolate architectural improvements by seam, preserve behavior with targeted contract tests, and keep each cleanup pass understandable to future open-source contributors.
  • Cons: requires discipline to keep each wave bounded and avoid letting “cleanup” turn into feature work.
  • Documentation-only hardening without code cleanup.
  • Pros: low immediate risk.
  • Cons: does not address the actual readability/extensibility debt in the subsystem internals.

Plan

  • Treat this as wave 1 of a broader inspector cleanup initiative.
  • Scope the first deep refactor pass around the most review-sensitive seams:
  • detector organization and repeated pattern helpers in semantic_detector.py
  • warehouse enrichment boundary cleanup in connection.py
  • query assembly decomposition in query_builder.py
  • classification pipeline cleanup in quality_detector.py
  • For each selected area:
  • identify the contract surface and current regression coverage
  • extract repeated logic into narrowly named helpers or internal structures
  • improve naming, helper locality, and mutation boundaries
  • add focused tests that lock down the refactor seam before or alongside the code change
  • Keep task execution on a fresh branch/worktree from latest main after the current PR is resolved.
  • Update this task worksheet during execution with concrete wave selection, touched files, and validation results.

Implementation Progress

  • 2026-03-13: Created under initiative inspector-cleanup-and-open-source-hardening to capture the broader refactor program rather than another isolated cleanup.
  • 2026-03-13: Initial scoping identified the next major cleanup candidates as semantic_detector.py, connection.py, query_builder.py, and quality_detector.py.
  • 2026-03-13: Deferred code execution until the current inspector refactor PR flow is fully resolved so the next wave can start from a clean branch and latest main.
  • 2026-03-22: Wave 1 execution started on branch codex/inspector-cleanup-wave-1-architectural-decomposition-and-contract-hardening.

Wave 1 changes (2026-03-22)

Contract tests added (tests/core/test_inspect_internal_contracts.py, 40 tests):
  • TestSemanticDetectorContract: return shape, tier ordering, known-type stability
  • TestQualityDetectorContract: axis domains, flag types, role detection
  • TestQueryBuilderContract: SQL shape validation across 4 dialects, batching, all query types
  • TestConnectionEnrichmentContract: schema shape, enrichment dict, dialect mapping
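For readers unfamiliar with the contract-test style named above, here is a minimal sketch of a return-shape check. It is not taken from test_inspect_internal_contracts.py: fake_detect, REQUIRED_KEYS, and the result keys "semantic_type" and "confidence" are hypothetical stand-ins, since the real detector return shape is not reproduced in this worksheet.

```python
from typing import Any

# Hypothetical contract surface: keys every detector result is assumed to carry.
REQUIRED_KEYS = frozenset({"semantic_type", "confidence"})


def fake_detect(column_name: str, samples: list[Any]) -> dict[str, Any]:
    """Toy detector that exists only to make this example runnable."""
    if column_name.endswith("_email"):
        return {"semantic_type": "email", "confidence": 0.9}
    return {"semantic_type": "unknown", "confidence": 0.0}


def assert_result_contract(result: dict[str, Any]) -> None:
    """Fail if the result drifts from the assumed return shape."""
    missing = REQUIRED_KEYS - set(result)
    assert not missing, f"missing keys: {sorted(missing)}"
    assert isinstance(result["semantic_type"], str)
    assert 0.0 <= result["confidence"] <= 1.0


class TestSemanticDetectorContractSketch:
    """pytest-style class; the methods use plain asserts only."""

    def test_return_shape(self) -> None:
        assert_result_contract(fake_detect("user_email", ["a@b.co"]))
        assert_result_contract(fake_detect("notes", ["free text"]))

    def test_known_type_stability(self) -> None:
        # A refactor must not change the type reported for a known input.
        assert fake_detect("user_email", [])["semantic_type"] == "email"
```

The point of checking shape rather than implementation details is that helpers can move or be renamed during a refactor without touching the test.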

semantic_detector.py (1766→1466 lines, −300):
  • Moved 4 large reference data sets (_COMMON_COUNTRY_CODES, _COMMON_CURRENCIES, _COMMON_LANGUAGES, _US_STATES) from class attributes to compact module-level frozensets
  • Moved 6 temporal hint lists and _STATUS_HINTS/_STATUS_VALUES from class to module-level tuples/frozensets
  • Added _TEMPORAL_SUBTYPES lookup table for data-driven temporal subtype dispatch
  • Added type-safe helpers: _type_guard(), _name_hints(), _name_exact_or_suffix(), _str_samples(), _match_ratio()
  • Added per-type frozen type sets (_STRING_DB_TYPES, _INTEGER_DB_TYPES, etc.) for reuse

connection.py (954→972 lines, dialect dispatch refactor):
  • Extracted inline BigQuery enrichment into _get_bigquery_enrichment()
  • Extracted inline PostgreSQL enrichment into _get_postgres_enrichment()
  • Introduced _ENRICHMENT_DISPATCH class dict mapping dialect → enrichment method name
  • Introduced _PARTITION_ENRICHMENT_DIALECTS frozenset for post-enrichment partition logic
  • get_table_enrichment() now uses dispatch table (3 lines of dispatch logic) instead of 8-branch if/elif chain
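The dispatch-table shape described above could look roughly like the following. This is a sketch, not the code in connection.py: the ConnectionSketch class, the dialect keys, the membership of _PARTITION_ENRICHMENT_DIALECTS, the _add_partition_info helper, and the placeholder payloads are all assumptions made for illustration.

```python
from typing import Any


class ConnectionSketch:
    # Dialect -> enrichment method name. The keys are assumed, not confirmed.
    _ENRICHMENT_DISPATCH = {
        "bigquery": "_get_bigquery_enrichment",
        "postgres": "_get_postgres_enrichment",
    }
    # Dialects that receive the post-enrichment partition step (membership assumed).
    _PARTITION_ENRICHMENT_DIALECTS = frozenset({"bigquery"})

    def __init__(self, dialect: str) -> None:
        self.dialect = dialect

    def _get_bigquery_enrichment(self, table: str) -> dict[str, Any]:
        return {"table": table, "source": "bigquery"}  # placeholder payload

    def _get_postgres_enrichment(self, table: str) -> dict[str, Any]:
        return {"table": table, "source": "postgres"}  # placeholder payload

    def _add_partition_info(self, enrichment: dict[str, Any]) -> None:
        # Hypothetical helper standing in for the partition logic.
        enrichment["partitioning"] = None

    def get_table_enrichment(self, table: str) -> dict[str, Any]:
        # Look up the handler by dialect; unknown dialects get no enrichment.
        method_name = self._ENRICHMENT_DISPATCH.get(self.dialect)
        enrichment = getattr(self, method_name)(table) if method_name else {}
        if self.dialect in self._PARTITION_ENRICHMENT_DIALECTS:
            self._add_partition_info(enrichment)
        return enrichment
```

Storing method names rather than bound functions keeps the table a plain class-level constant, and adding a dialect becomes one new method plus one new dict entry.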

query_builder.py (723→658 lines, −65):
  • Extracted _string_length_exprs(), _temporal_range_exprs(), _base_count_exprs(), _numeric_range_exprs() shared expression builders
  • _build_comprehensive_stats_query() and build_approximate_stats_query() now share expression helpers instead of duplicating column-type dispatching
  • No SQL output changes; all existing tests pass

quality_detector.py: No changes needed — already clean after prior passes.

Validation: 580 tests pass (31 warehouse-specific tests skipped), 0 lint errors, 0 type errors.

Review Feedback

  • [ ] Review cleared