Inspector cleanup wave 1 architectural decomposition and contract hardening
Problem
Plan and execute a deeper inspector cleanup pass that decomposes oversized inspector modules, tightens internal APIs, improves naming and helper boundaries, and strengthens contract-oriented regression coverage for upcoming open-source release quality.
Context
- The first maintainability pass reduced TableInspector._inspect_table_inner() into a smaller orchestration flow, but the inspector subsystem still has several oversized hotspots that will attract scrutiny during open-source review.
- Current high-value complexity targets from a local scan:
- dataface/core/inspect/semantic_detector.py: the detector catalog is large and highly branchy, especially _detect_ip_address, _detect_region, and _detect_status.
- dataface/core/inspect/connection.py: get_table_enrichment() centralizes multiple warehouse-specific behaviors in one method.
- dataface/core/inspect/query_builder.py: build_approximate_stats_query() and _build_comprehensive_stats_query() still carry a lot of dialect-specific branching and query-shape logic.
- dataface/core/inspect/quality_detector.py: classify() and _detect_role() still own a dense set of heuristics and output mutations.
- The open-source quality bar here is not just “works”; the inspector internals should read like a clean reference implementation with clear boundaries, stable contracts, and obvious extension points.
- Constraints:
- Preserve existing inspect.json/API behavior unless a contract change is explicitly planned and documented.
- Avoid deep behavior changes mixed with structural refactors in the same patch set.
- Prefer small, reviewable cleanup waves with contract tests over a single repo-wide rewrite.
- Keep the data-shape boundary intact: detection/inspection code owns semantics and profiling policy; renderers consume structured outputs only.
Possible Solutions
- Big-bang inspector rewrite across modules in one branch.
- Pros: maximum freedom to redesign boundaries.
- Cons: high regression risk, difficult review, and poor fit for a subsystem that already has broad integration coverage and external consumers.
- Wave-based cleanup initiative with contract-first refactors across the main hotspots. (Recommended)
- Pros: lets us isolate architectural improvements by seam, preserve behavior with targeted contract tests, and keep each cleanup pass understandable to future open-source contributors.
- Cons: requires discipline to keep each wave bounded and avoid letting “cleanup” turn into feature work.
- Documentation-only hardening without code cleanup.
- Pros: low immediate risk.
- Cons: does not address the actual readability/extensibility debt in the subsystem internals.
Plan
- Treat this as wave 1 of a broader inspector cleanup initiative.
- Scope the first deep refactor pass around the most review-sensitive seams:
- detector organization and repeated pattern helpers in semantic_detector.py
- warehouse enrichment boundary cleanup in connection.py
- query assembly decomposition in query_builder.py
- classification pipeline cleanup in quality_detector.py
- For each selected area:
- identify the contract surface and current regression coverage
- extract repeated logic into narrowly named helpers or internal structures
- improve naming, helper locality, and mutation boundaries
- add focused tests that lock down the refactor seam before or alongside the code change
- Keep task execution on a fresh branch/worktree from latest main after the current PR is resolved.
- Update this task worksheet during execution with concrete wave selection, touched files, and validation results.
Implementation Progress
- 2026-03-13: Created under initiative inspector-cleanup-and-open-source-hardening to capture the broader refactor program rather than another isolated cleanup.
- 2026-03-13: Initial scoping identified the next major cleanup candidates as semantic_detector.py, connection.py, query_builder.py, and quality_detector.py.
- 2026-03-13: Deferred code execution until the current inspector refactor PR flow is fully resolved so the next wave can start from a clean branch and latest main.
- 2026-03-22: Wave 1 execution started on branch codex/inspector-cleanup-wave-1-architectural-decomposition-and-contract-hardening.
Wave 1 changes (2026-03-22)
Contract tests added (tests/core/test_inspect_internal_contracts.py, 40 tests):
- TestSemanticDetectorContract: return shape, tier ordering, known-type stability
- TestQualityDetectorContract: axis domains, flag types, role detection
- TestQueryBuilderContract: SQL shape validation across 4 dialects, batching, all query types
- TestConnectionEnrichmentContract: schema shape, enrichment dict, dialect mapping
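For readers unfamiliar with the contract-test style listed above, the following is a minimal sketch of a return-shape check. It does not reproduce tests/core/test_inspect_internal_contracts.py: the Detection dataclass, the detect() function, and the meaning of tier are all hypothetical stand-ins chosen so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


# Hypothetical stand-in for a detector result; the real return type in
# semantic_detector.py may look different.
@dataclass(frozen=True)
class Detection:
    semantic_type: str
    confidence: float
    tier: int  # assumption: tiers are positive integers


def detect(column_name: str, samples: Sequence[str]) -> Optional[Detection]:
    """Toy detector that exists only to give the contract check a target."""
    looks_like_email = bool(samples) and all("@" in s for s in samples)
    if column_name.lower().endswith("email") and looks_like_email:
        return Detection("email", 0.95, tier=1)
    return None


def check_return_shape(result: Optional[Detection]) -> None:
    """Contract: a result is either None or a well-formed Detection."""
    if result is None:
        return
    assert isinstance(result.semantic_type, str) and result.semantic_type
    assert 0.0 <= result.confidence <= 1.0
    assert result.tier >= 1


# The contract holds for both a match and a non-match.
check_return_shape(detect("user_email", ["a@x.io", "b@y.io"]))
check_return_shape(detect("notes", ["hello"]))
```

The point of this style is that the assertions describe the shape of the output rather than the heuristics that produced it, so internal helpers can move without the test changing.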
semantic_detector.py (1766→1466 lines, −300):
- Moved 4 large reference data sets (_COMMON_COUNTRY_CODES, _COMMON_CURRENCIES, _COMMON_LANGUAGES, _US_STATES) from class attributes to compact module-level frozensets
- Moved 6 temporal hint lists and _STATUS_HINTS/_STATUS_VALUES from class to module-level tuples/frozensets
- Added _TEMPORAL_SUBTYPES lookup table for data-driven temporal subtype dispatch
- Added type-safe helpers: _type_guard(), _name_hints(), _name_exact_or_suffix(), _str_samples(), _match_ratio()
- Added per-type frozen type sets (_STRING_DB_TYPES, _INTEGER_DB_TYPES, etc.) for reuse
connection.py (954→972 lines, dialect dispatch refactor):
- Extracted inline BigQuery enrichment into _get_bigquery_enrichment()
- Extracted inline PostgreSQL enrichment into _get_postgres_enrichment()
- Introduced _ENRICHMENT_DISPATCH class dict mapping dialect → enrichment method name
- Introduced _PARTITION_ENRICHMENT_DIALECTS frozenset for post-enrichment partition logic
- get_table_enrichment() now uses dispatch table (3 lines of dispatch logic) instead of 8-branch if/elif chain
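A minimal sketch of the dispatch-table shape described above, assuming the table maps a dialect string to a method name resolved with getattr. The Connection class, the dialect keys, the returned dict contents, and the membership of _PARTITION_ENRICHMENT_DIALECTS are illustrative assumptions, not the actual connection.py code.

```python
from typing import Any, Dict


class Connection:
    """Hypothetical stand-in for the inspector connection class."""

    # dialect -> name of the method that performs enrichment
    _ENRICHMENT_DISPATCH = {
        "bigquery": "_get_bigquery_enrichment",
        "postgres": "_get_postgres_enrichment",
    }
    # Assumed membership, for illustration only.
    _PARTITION_ENRICHMENT_DIALECTS = frozenset({"bigquery"})

    def __init__(self, dialect: str) -> None:
        self.dialect = dialect

    def _get_bigquery_enrichment(self, table: str) -> Dict[str, Any]:
        return {"source": "bigquery", "table": table}

    def _get_postgres_enrichment(self, table: str) -> Dict[str, Any]:
        return {"source": "postgres", "table": table}

    def get_table_enrichment(self, table: str) -> Dict[str, Any]:
        # Dialects without an entry fall through to an empty enrichment.
        method_name = self._ENRICHMENT_DISPATCH.get(self.dialect)
        enrichment = getattr(self, method_name)(table) if method_name else {}
        if self.dialect in self._PARTITION_ENRICHMENT_DIALECTS:
            enrichment["partition_checked"] = True  # placeholder for partition logic
        return enrichment
```

Storing method names rather than bound functions keeps the table a plain class attribute; supporting a new warehouse then means adding one method and one table entry.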
query_builder.py (723→658 lines, −65):
- Extracted _string_length_exprs(), _temporal_range_exprs(), _base_count_exprs(), _numeric_range_exprs() shared expression builders
- _build_comprehensive_stats_query() and build_approximate_stats_query() now share expression helpers instead of duplicating column-type dispatching
- No SQL output changes; all existing tests pass
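To illustrate how two query builders can share expression helpers, here is a small sketch. The helper names mirror the ones listed above, but their signatures, the column aliases, and the build_stats_query() wrapper are hypothetical; identifier quoting and dialect handling are deliberately left out.

```python
from typing import Dict, List


def _base_count_exprs(col: str) -> List[str]:
    return [f"COUNT({col}) AS {col}_non_null"]


def _numeric_range_exprs(col: str) -> List[str]:
    return [f"MIN({col}) AS {col}_min", f"MAX({col}) AS {col}_max"]


def _string_length_exprs(col: str) -> List[str]:
    return [
        f"MIN(LENGTH({col})) AS {col}_min_len",
        f"MAX(LENGTH({col})) AS {col}_max_len",
    ]


def build_stats_query(
    table: str, columns: Dict[str, str], approximate: bool = False
) -> str:
    """Assemble one SELECT from shared per-column expression builders.

    `columns` maps column name -> coarse kind ("numeric" or "string").
    Only the distinct-count expression differs between the two modes.
    """
    exprs = ["COUNT(*) AS row_count"]
    for col, kind in columns.items():
        exprs += _base_count_exprs(col)
        if kind == "numeric":
            exprs += _numeric_range_exprs(col)
        elif kind == "string":
            exprs += _string_length_exprs(col)
        distinct = (
            f"APPROX_COUNT_DISTINCT({col})" if approximate
            else f"COUNT(DISTINCT {col})"
        )
        exprs.append(f"{distinct} AS {col}_distinct")
    return "SELECT " + ", ".join(exprs) + f" FROM {table}"
```

Because both modes go through the same helpers, a test that pins the generated SQL string for each mode is enough to confirm that a refactor left the output unchanged.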
quality_detector.py: No changes needed — already clean after prior passes.
Validation: 580 tests pass (31 skipped as warehouse-specific), 0 lint errors, 0 type errors.
Review Feedback
- [ ] Review cleared