Incremental dft inspect with lineage-aware change detection
Problem
On a large dbt project, dft inspect profiles every table on every run, even if nothing changed. For warehouses with 50+ models this is slow and wasteful. It should detect which tables are stale and re-profile only those.
Context
- Each profile in inspect.json already has a profiled_at timestamp
- dbt target/manifest.json contains the full DAG with parent/child relationships per model
- dbt model files have filesystem modification dates
- If a parent model changes, all downstream children are potentially stale even if their own SQL didn't change
- Non-dbt projects have no lineage info — fall back to simpler staleness checks
- Depends on: dft inspect complete catalog builder task (batch profiling must exist first)
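As a rough illustration of the lineage information available in the manifest, the sketch below builds a model name -> direct parents index. It assumes the usual manifest layout, where each entry under nodes has resource_type, name, and depends_on.nodes. The function name load_parent_index is hypothetical and not part of dataface.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_parent_index(manifest_path: Path) -> dict[str, list[str]]:
    """Map each dbt model name to the names of its direct parent models."""
    manifest = json.loads(manifest_path.read_text())
    nodes = manifest.get("nodes", {})

    # unique_id (e.g. "model.my_project.orders") -> short model name
    model_names = {
        unique_id: node["name"]
        for unique_id, node in nodes.items()
        if node.get("resource_type") == "model"
    }

    parents: dict[str, list[str]] = {}
    for unique_id, name in model_names.items():
        upstream_ids = nodes[unique_id].get("depends_on", {}).get("nodes", [])
        # Keep only parents that are models; sources, seeds, etc. are dropped.
        parents[name] = [model_names[u] for u in upstream_ids if u in model_names]
    return parents
```

Keying by short name is a simplification: if two packages define models with the same name, indexing by unique_id would be safer.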
Possible Solutions
A. Manifest + file mtime comparison — Recommended
For dbt projects:
1. Load target/manifest.json for the DAG
2. For each model, compare max(model_file_mtime, max(parent_profiled_at)) against profiled_at in inspect.json
3. If model file or any upstream parent was modified after the stored profile, mark as stale
4. Only profile stale tables
For non-dbt projects:
1. Compare table's profiled_at against a configurable max-age (default: 24h) or just re-profile all (no lineage available)
- Pros: Fast, no DB queries needed for freshness check, leverages dbt's existing DAG
- Cons: File mtime can be unreliable across git operations (mitigate: also check file content hash)
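The two freshness rules in Approach A can be sketched as plain comparisons. This is a minimal illustration, not the shipped implementation. The function names and the choice to treat a missing profile as stale are assumptions, and all timestamps are expected to be either all naive or all timezone-aware.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def is_stale_dbt(
    profiled_at: datetime | None,
    model_file_mtime: datetime,
    parent_profiled_ats: list[datetime],
) -> bool:
    """Stale if the model file or any parent profile is newer than our profile."""
    if profiled_at is None:
        return True  # never profiled
    latest_change = max([model_file_mtime, *parent_profiled_ats])
    return latest_change > profiled_at


def is_stale_by_age(
    profiled_at: datetime | None,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
) -> bool:
    """Non-dbt fallback: stale once the profile is older than max_age."""
    if profiled_at is None:
        return True
    return now - profiled_at > max_age
```

Passing now explicitly instead of calling datetime.now() inside the function keeps the check deterministic in tests.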
B. Hash model SQL content
Hash the compiled SQL for each model and store in inspect.json. Compare on next run.
- Pros: More reliable than mtime
- Cons: Requires dbt compile to have run, more complex
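A content-hash comparison might look like the sketch below. It hashes whatever file it is given, so it works for a compiled SQL file or a raw model file alike. The signatures are assumptions for illustration, and treating a missing stored hash as "changed" is one possible policy, not a confirmed behavior.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def hash_model_file(path: Path) -> str:
    """Return the SHA-256 hex digest of a model file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def hash_changed(path: Path, stored_hash: str | None) -> bool:
    """True when there is no stored hash or the file content differs from it."""
    return stored_hash is None or hash_model_file(path) != stored_hash
```

Hashing raw bytes means whitespace-only edits also count as changes. Normalizing the SQL first would avoid that at the cost of extra complexity.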
Plan
Approach A, with content hash as enhancement.
- Store staleness metadata in inspect.json: add model_file_hash and profiled_at to each table entry (profiled_at already exists)
- Load dbt manifest: parse target/manifest.json for model DAG, extract depends_on edges
- Build staleness graph: for each table, walk up the DAG. A table is stale if:
  - No profile exists in inspect.json
  - Model file changed (mtime or hash differs)
  - Any upstream parent is stale (transitive)
- Skip fresh tables: only call TableInspector.inspect_table() for stale entries
- --force flag: bypasses all staleness checks, re-profiles everything
- Print skip summary: "Profiled 5 tables, skipped 45 unchanged"
- Non-dbt fallback: no manifest available, fall back to max-age check or profile all
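The transitive rule in the staleness graph step can be sketched as a memoized walk up the parent edges. The sketch assumes the caller has already decided which tables are stale on their own (missing profile or changed file). find_stale and its parameters are hypothetical names, not the StalenessChecker API.

```python
from __future__ import annotations


def find_stale(
    parents: dict[str, list[str]],
    locally_stale: set[str],
) -> set[str]:
    """Return every table that is stale itself or has a stale ancestor."""
    memo: dict[str, bool] = {}

    def visit(name: str) -> bool:
        if name in memo:
            return memo[name]
        # Provisional value: dbt DAGs are acyclic, but this stops
        # infinite recursion if malformed input contains a cycle.
        memo[name] = False
        stale = name in locally_stale or any(
            visit(parent) for parent in parents.get(name, [])
        )
        memo[name] = stale
        return stale

    return {name for name in parents if visit(name)}
```

Memoization matters for diamond-shaped lineage: a shared ancestor is evaluated once, however many paths lead to it. For very deep DAGs an iterative traversal would avoid Python's recursion limit.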
Files to modify:
- dataface/cli/commands/inspect.py — add --force flag, staleness logic
- dataface/core/inspect/storage.py — store model_file_hash alongside profiles
- New: dataface/core/inspect/staleness.py — DAG walk + freshness comparison
Implementation Progress
- [x] New dataface/core/inspect/staleness.py: StalenessChecker + StalenessResult dataclass
  - Loads target/manifest.json for dbt DAG, builds name -> node index
  - Loads target/inspect.json for stored profiles with profiled_at and model_file_hash
  - dbt mode: compares SHA-256 of model SQL file against stored hash, walks depends_on edges for transitive staleness
  - Non-dbt fallback: max-age check (default 24h) against profiled_at
  - --force flag bypasses all checks
- [x] dataface/core/inspect/storage.py: save_inspection() accepts optional model_file_hash kwarg, persists it in inspect.json
- [x] dataface/cli/commands/inspect.py: inspect_command() runs staleness check before profiling; skips fresh tables with a message; computes and stores model_file_hash on save
- [x] dataface/cli/main.py: --force typer option wired to dft inspect table
- [x] dataface/core/inspect/__init__.py: exports StalenessChecker, StalenessResult
- [x] tests/core/test_inspect_staleness.py: 16 tests covering no-profile stale, max-age stale/fresh, force override, model hash match/mismatch, transitive upstream staleness, diamond DAG, legacy profile without hash, non-dbt fallback, hash_model_file helper
- [x] All 65 inspect tests pass, ruff clean, mypy clean
Key design decision: Used content hash (SHA-256) instead of mtime for model file comparison. More reliable across git operations, and the task plan recommended this as an enhancement over pure mtime.
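A small experiment illustrates why the hash is the more stable signal. Rewriting a file with identical bytes, which a branch switch or fresh clone can do, moves the mtime but leaves the digest alone. The file name and the one-hour offset below are arbitrary, and os.utime merely stands in for whatever git does to the timestamp.

```python
import hashlib
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "orders.sql"
    model.write_text("select 1 as id\n")

    mtime_before = model.stat().st_mtime
    hash_before = hashlib.sha256(model.read_bytes()).hexdigest()

    # Simulate a checkout that rewrites the file: same bytes, newer mtime.
    os.utime(model, (mtime_before + 3600, mtime_before + 3600))

    mtime_changed = model.stat().st_mtime != mtime_before
    digest_changed = hashlib.sha256(model.read_bytes()).hexdigest() != hash_before
```

Here an mtime-based check would flag the model as stale, while the hash-based check would skip it.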
Review Feedback
- [ ] Review cleared