Inspector cleanup wave 1 architectural decomposition and contract hardening

ID	INSPECT_PROFILER-INSPECTOR_CLEANUP_WAVE_1_ARCHITECTURAL_DECOMPOSITION_AND_CONTRACT_HARDENING
Status	completed
Priority	p1
Milestone	m1-ft-analytics-analyst-pilot
Owner	sr-engineer-architect
Initiative	inspector-cleanup-and-open-source-hardening
Completed by	dave
Completed	2026-03-24

Problem

Plan and execute a deeper inspector cleanup pass that decomposes oversized inspector modules, tightens internal APIs, improves naming and helper boundaries, and strengthens contract-oriented regression coverage for upcoming open-source release quality.

Context

The first maintainability pass reduced TableInspector._inspect_table_inner() into a smaller orchestration flow, but the inspector subsystem still has several oversized hotspots that will attract scrutiny during open-source review.
Current high-value complexity targets from a local scan:
dataface/core/inspect/semantic_detector.py — detector catalog is large and highly branchy, especially _detect_ip_address, _detect_region, and _detect_status.
dataface/core/inspect/connection.py — get_table_enrichment() centralizes multiple warehouse-specific behaviors in one method.
dataface/core/inspect/query_builder.py — build_approximate_stats_query() and _build_comprehensive_stats_query() still carry a lot of dialect-specific branching and query-shape logic.
dataface/core/inspect/quality_detector.py — classify() and _detect_role() still own a dense set of heuristics and output mutations.
The open-source quality bar here is not just “works”; the inspector internals should read like a clean reference implementation with clear boundaries, stable contracts, and obvious extension points.
Constraints:
Preserve existing inspect.json/API behavior unless a contract change is explicitly planned and documented.
Avoid deep behavior changes mixed with structural refactors in the same patch set.
Prefer small, reviewable cleanup waves with contract tests over a single repo-wide rewrite.
Keep the data-shape boundary intact: detection/inspection code owns semantics and profiling policy; renderers consume structured outputs only.

Possible Solutions

Big-bang inspector rewrite across modules in one branch.
Pros: maximum freedom to redesign boundaries.
Cons: high regression risk, difficult review, and poor fit for a subsystem that already has broad integration coverage and external consumers.
Wave-based cleanup initiative with contract-first refactors across the main hotspots. (Recommended)
Pros: lets us isolate architectural improvements by seam, preserve behavior with targeted contract tests, and keep each cleanup pass understandable to future open-source contributors.
Cons: requires discipline to keep each wave bounded and avoid letting “cleanup” turn into feature work.
Documentation-only hardening without code cleanup.
Pros: low immediate risk.
Cons: does not address the actual readability/extensibility debt in the subsystem internals.

Plan

Treat this as wave 1 of a broader inspector cleanup initiative.
Scope the first deep refactor pass around the most review-sensitive seams:
detector organization and repeated pattern helpers in semantic_detector.py
warehouse enrichment boundary cleanup in connection.py
query assembly decomposition in query_builder.py
classification pipeline cleanup in quality_detector.py
For each selected area:
identify the contract surface and current regression coverage
extract repeated logic into narrowly named helpers or internal structures
improve naming, helper locality, and mutation boundaries
add focused tests that lock down the refactor seam before or alongside the code change
Keep task execution on a fresh branch/worktree from latest main after the current PR is resolved.
Update this task worksheet during execution with concrete wave selection, touched files, and validation results.

Implementation Progress

2026-03-13: Created under initiative inspector-cleanup-and-open-source-hardening to capture the broader refactor program rather than another isolated cleanup.
2026-03-13: Initial scoping identified the next major cleanup candidates as semantic_detector.py, connection.py, query_builder.py, and quality_detector.py.
2026-03-13: Deferred code execution until the current inspector refactor PR flow is fully resolved so the next wave can start from a clean branch and latest main.
2026-03-22: Wave 1 execution started on branch codex/inspector-cleanup-wave-1-architectural-decomposition-and-contract-hardening.

Wave 1 changes (2026-03-22)

Contract tests added (tests/core/test_inspect_internal_contracts.py, 40 tests):
- TestSemanticDetectorContract: return shape, tier ordering, known-type stability
- TestQualityDetectorContract: axis domains, flag types, role detection
- TestQueryBuilderContract: SQL shape validation across 4 dialects, batching, all query types
- TestConnectionEnrichmentContract: schema shape, enrichment dict, dialect mapping
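
A minimal sketch of what a return-shape contract check can look like. The key names, the value ranges, and `fake_detect` itself are assumptions made for illustration; they are not taken from the actual test file or from the real detector.

```python
from typing import Any

# Hypothetical contract: every detection result carries these keys.
REQUIRED_KEYS = frozenset({"semantic_type", "confidence", "tier"})


def fake_detect(column_name: str, samples: list[str]) -> dict[str, Any]:
    """Stand-in for the real detector (illustrative only)."""
    if column_name.endswith("_email") or (samples and all("@" in s for s in samples)):
        return {"semantic_type": "email", "confidence": 0.9, "tier": 1}
    return {"semantic_type": "unknown", "confidence": 0.0, "tier": 3}


def check_return_shape(result: dict[str, Any]) -> list[str]:
    """Return a list of contract violations; an empty list means the shape holds."""
    problems = []
    missing = REQUIRED_KEYS - set(result)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    conf = result.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        problems.append("confidence must be a number in [0, 1]")
    return problems
```

A check of this kind constrains only the shape of the output, so helpers can move freely behind it during a structural refactor.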

semantic_detector.py (1766→1466 lines, −300):
- Moved 4 large reference data sets (_COMMON_COUNTRY_CODES, _COMMON_CURRENCIES, _COMMON_LANGUAGES, _US_STATES) from class attributes to compact module-level frozensets
- Moved 6 temporal hint lists and _STATUS_HINTS/_STATUS_VALUES from class to module-level tuples/frozensets
- Added _TEMPORAL_SUBTYPES lookup table for data-driven temporal subtype dispatch
- Added type-safe helpers: _type_guard(), _name_hints(), _name_exact_or_suffix(), _str_samples(), _match_ratio()
- Added per-type frozen type sets (_STRING_DB_TYPES, _INTEGER_DB_TYPES, etc.) for reuse

connection.py (954→972 lines, dialect dispatch refactor):
- Extracted inline BigQuery enrichment into _get_bigquery_enrichment()
- Extracted inline PostgreSQL enrichment into _get_postgres_enrichment()
- Introduced _ENRICHMENT_DISPATCH class dict mapping dialect → enrichment method name
- Introduced _PARTITION_ENRICHMENT_DIALECTS frozenset for post-enrichment partition logic
- get_table_enrichment() now uses dispatch table (3 lines of dispatch logic) instead of 8-branch if/elif chain
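
A rough sketch of a name-based dispatch table, under stated assumptions: the class name, the dialect keys, the contents of the partition set, and the returned dict shapes are all hypothetical. Only the attribute and method names come from the notes above.

```python
class EnrichmentSketch:
    # Maps dialect -> method *name*. Resolving by name with getattr at call
    # time lets subclasses override individual enrichment methods.
    _ENRICHMENT_DISPATCH = {
        "bigquery": "_get_bigquery_enrichment",
        "postgres": "_get_postgres_enrichment",
    }
    # Assumed membership; the real set is not given in this worksheet.
    _PARTITION_ENRICHMENT_DIALECTS = frozenset({"bigquery"})

    def __init__(self, dialect):
        self.dialect = dialect

    def get_table_enrichment(self, table):
        method_name = self._ENRICHMENT_DISPATCH.get(self.dialect)
        enrichment = getattr(self, method_name)(table) if method_name else {}
        if self.dialect in self._PARTITION_ENRICHMENT_DIALECTS:
            enrichment["partition_checked"] = True  # placeholder post-step
        return enrichment

    def _get_bigquery_enrichment(self, table):
        return {"source": "bigquery", "table": table}

    def _get_postgres_enrichment(self, table):
        return {"source": "postgres", "table": table}
```

In this shape, supporting another warehouse means adding one method and one dictionary entry, and a dialect with no entry falls through to an empty enrichment dict rather than another branch.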

query_builder.py (723→658 lines, −65):
- Extracted _string_length_exprs(), _temporal_range_exprs(), _base_count_exprs(), _numeric_range_exprs() shared expression builders
- _build_comprehensive_stats_query() and build_approximate_stats_query() now share expression helpers instead of duplicating column-type dispatching
- No SQL output changes; all existing tests pass
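
To illustrate the idea of shared expression builders, here is a simplified sketch in which two query builders reuse the same helpers. The helper names echo the ones above, but the signatures, the alias scheme, the `_quote` helper, and the generated SQL are assumptions and do not reproduce the project's actual output.

```python
def _quote(identifier):
    """Double-quote an identifier, escaping embedded quotes (ANSI style)."""
    return '"' + identifier.replace('"', '""') + '"'


def _base_count_exprs(column):
    q = _quote(column)
    return [
        f"COUNT({q}) AS {_quote(column + '__non_null')}",
        f"COUNT(DISTINCT {q}) AS {_quote(column + '__distinct')}",
    ]


def _numeric_range_exprs(column):
    q = _quote(column)
    return [
        f"MIN({q}) AS {_quote(column + '__min')}",
        f"MAX({q}) AS {_quote(column + '__max')}",
    ]


def _select(table, exprs):
    return "SELECT " + ", ".join(exprs) + " FROM " + _quote(table)


def build_comprehensive(table, columns):
    """columns: list of (name, kind) pairs; kind 'numeric' adds range stats."""
    exprs = ["COUNT(*) AS row_count"]
    for name, kind in columns:
        exprs.extend(_base_count_exprs(name))
        if kind == "numeric":
            exprs.extend(_numeric_range_exprs(name))
    return _select(table, exprs)


def build_approximate(table, columns):
    """Cheaper variant: base counts only, built from the same helper."""
    exprs = ["COUNT(*) AS row_count"]
    for name, _kind in columns:
        exprs.extend(_base_count_exprs(name))
    return _select(table, exprs)
```

Because both builders assemble their SELECT list from the same helpers, a change to how a count expression is rendered happens in one place.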

quality_detector.py: No changes needed — already clean after prior passes.

Validation: 580 tests pass (31 warehouse-specific tests skipped), 0 lint errors, 0 type errors.

Review Feedback

[ ] Review cleared