Increase profiler semantic coverage
Problem
The profiler's semantic type detection includes 27 detectors as of M1, but significant gaps remain: geographic coordinates, JSON-structured strings, encoded IDs (UUIDs in non-standard formats), currency codes, and domain-specific patterns (e.g., Fivetran connector IDs, Stripe charge IDs) are all classified as generic string or numeric types. For analysts, this under-classification means the profile report provides no useful context for semantically rich columns: a column of lat/long pairs looks the same as a column of random floats. For AI agent workflows that consume profile data to generate queries or suggest charts, low semantic coverage means the agent works from impoverished metadata. Expanding detector coverage and improving detection confidence for ambiguous columns are both necessary for the profiler to deliver value beyond basic statistics.
Context
- The profiler already detects a substantial baseline of semantic types, but important analyst- and agent-relevant categories still collapse into generic string or numeric buckets.
- Core detector logic lives in the inspect subsystem, and downstream consumers include profile UIs, schema-context formatting, chart decisions, and agent workflows.
- M2 should improve semantic depth in a disciplined way: add high-value detectors, confidence handling, and tests without turning the profiler into an unbounded pattern catalog.
Possible Solutions
- A - Add many new regex-heavy detectors quickly across all missed categories: broad coverage, but likely brittle and hard to maintain.
- B - Recommended: prioritize a small set of high-signal semantic gaps, add detector and confidence logic with fixtures, and validate downstream usefulness before expanding further.
- C - Push semantic depth entirely into AI post-processing instead of the profiler: lower implementation cost, but loses deterministic reusable metadata.
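To make option B more concrete, here is a minimal sketch of what one detector with confidence logic might look like. It is not the profiler's actual API: the `Detection` type, the `detect_currency_code` function, the `min_confidence` threshold, and the small currency-code set are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical, deliberately incomplete subset of ISO 4217 codes (illustration only).
KNOWN_CURRENCY_CODES = frozenset({"USD", "EUR", "GBP", "JPY", "CHF", "CAD", "AUD"})


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector result: a semantic label plus a confidence score."""
    semantic_type: str
    confidence: float  # fraction of non-empty samples that matched


def detect_currency_code(
    samples: Sequence[Optional[str]],
    min_confidence: float = 0.9,  # assumed threshold, not a value from the profiler
) -> Optional[Detection]:
    """Return a Detection if enough non-empty samples look like currency codes."""
    # Ignore nulls and blank strings so sparse columns are not penalized.
    values = [s.strip() for s in samples if s is not None and s.strip()]
    if not values:
        return None
    matches = sum(1 for v in values if v.upper() in KNOWN_CURRENCY_CODES)
    confidence = matches / len(values)
    if confidence < min_confidence:
        return None  # fall back to the generic string type
    return Detection("currency_code", confidence)
```

Returning the match fraction rather than a bare boolean lets downstream consumers decide how much weight to give a label, which is the kind of deterministic, reusable metadata that option C would give up.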
Plan
- Rank the highest-value missing semantic categories by analyst impact, agent usefulness, and detection feasibility.
- Implement the first detector wave in the inspect semantic-detection path with confidence-aware outputs and clear precedence rules.
- Update downstream formatting or consumer code only where richer semantic labels materially improve profile or agent output.
- Add representative fixtures and tests for true positives, common false positives, and ambiguous cases to keep detector growth controlled.
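The plan items on precedence rules and on fixtures for ambiguous cases could be sketched roughly as follows. This assumes a simple design in which each detector returns a candidate with a confidence score and a resolver picks the winner; `Candidate`, `PRECEDENCE`, `detect_latitude`, `resolve`, and the specific confidence values are hypothetical and not taken from the inspect subsystem.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    semantic_type: str
    confidence: float


# Hypothetical precedence order: earlier entries win ties on confidence.
PRECEDENCE = ["uuid", "latitude", "longitude", "currency_code"]


def detect_latitude(column_name: str, values: Sequence[float]) -> Optional[Candidate]:
    """Value range alone is weak evidence; a name hint raises confidence."""
    if not values or not all(-90.0 <= v <= 90.0 for v in values):
        return None
    # Match whole name tokens so "plate_score" does not count as a "lat" hint.
    tokens = set(re.split(r"[^a-z]+", column_name.lower()))
    hinted = bool(tokens & {"lat", "latitude"})
    return Candidate("latitude", 0.9 if hinted else 0.3)  # assumed scores


def resolve(candidates: Sequence[Candidate], threshold: float = 0.8) -> Optional[Candidate]:
    """Pick the most confident candidate above threshold; break ties by precedence."""
    eligible = [c for c in candidates if c.confidence >= threshold]
    if not eligible:
        return None  # ambiguous column keeps its generic type

    def rank(c: Candidate):
        order = PRECEDENCE.index(c.semantic_type) if c.semantic_type in PRECEDENCE else len(PRECEDENCE)
        return (-c.confidence, order)

    return min(eligible, key=rank)
```

Fixtures for this sketch would cover one true positive (`pickup_lat` with in-range values), one common false positive (`plate_score`, which contains the substring "lat" but not the token), and one ambiguous case (in-range floats with no name hint, which stay below the threshold and remain generic).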
Implementation Progress
- Confirm scope and acceptance with milestone owner.
- Milestone readiness signal is updated.
- Track blockers and mitigation owner.
Review Feedback
- [ ] Review cleared