Dataface Tasks

Incremental dft inspect with lineage-aware change detection

ID: CONTEXT_CATALOG_NIMBLE-INCREMENTAL_DFT_INSPECT_WITH_LINEAGE_AWARE_CHANGE_DETECTION
Status: completed
Priority: p1
Milestone: m1-ft-analytics-analyst-pilot
Owner: data-ai-engineer-architect
Completed by: dave
Completed: 2026-03-22

Problem

On a large dbt project, dft inspect profiles every table on every run, even when nothing has changed. For warehouses with 50+ models this is slow and wasteful. It should detect which tables are stale and re-profile only those.

Context

  • Each profile in inspect.json already has a profiled_at timestamp
  • dbt target/manifest.json contains full DAG with parent/child relationships per model
  • dbt model files have filesystem modification dates
  • If a parent model changes, all downstream children are potentially stale even if their own SQL didn't change
  • Non-dbt projects have no lineage info — fall back to simpler staleness checks
  • Depends on: dft inspect complete catalog builder task (batch profiling must exist first)

Possible Solutions

A. Compare timestamps via the dbt DAG

For dbt projects:

  1. Load target/manifest.json for the DAG
  2. For each model, compare max(model_file_mtime, max(parent_profiled_at)) against profiled_at in inspect.json
  3. If the model file or any upstream parent was modified after the stored profile, mark the model as stale
  4. Only profile the stale tables

For non-dbt projects: compare the table's profiled_at against a configurable max-age (default: 24h), or simply re-profile everything (no lineage information is available).

  • Pros: Fast, no DB queries needed for freshness check, leverages dbt's existing DAG
  • Cons: File mtime can be unreliable across git operations (mitigate: also check file content hash)
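The freshness comparison in option A can be sketched as a small pure function. Everything here is illustrative, not the shipped dft API: is_stale, the shape of the profiles mapping (dbt unique_id -> profile dict with an epoch-seconds profiled_at), and the injected mtime_of callable are all assumptions.

```python
from typing import Callable, Mapping


def is_stale(
    model_id: str,
    manifest: Mapping,
    profiles: Mapping,
    mtime_of: Callable[[str], float],
) -> bool:
    """Option-A staleness test for one model (hypothetical helper).

    manifest follows dbt's target/manifest.json layout: nodes keyed by
    unique_id, each with original_file_path and depends_on.nodes.
    mtime_of is injected so the check stays testable without real files.
    """
    profile = profiles.get(model_id)
    if profile is None:
        return True  # never profiled: always stale
    profiled_at = profile["profiled_at"]
    node = manifest["nodes"][model_id]
    # the model's own SQL file changed after the last profile
    if mtime_of(node["original_file_path"]) > profiled_at:
        return True
    # a parent re-profiled after this model suggests upstream data changed
    return any(
        profiles[parent]["profiled_at"] > profiled_at
        for parent in node["depends_on"]["nodes"]
        if parent in profiles
    )
```

Injecting mtime_of rather than calling Path.stat() directly keeps the comparison logic independent of the filesystem, which matters given the mtime-reliability caveat above.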

B. Hash model SQL content

Hash the compiled SQL for each model and store in inspect.json. Compare on next run.

  • Pros: More reliable than mtime
  • Cons: Requires dbt compile to have run, more complex
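A minimal sketch of option B, assuming the compiled SQL is available as a string. The names (sql_fingerprint, changed_since_profile) and the whitespace normalization are illustrative choices, not the project's actual code:

```python
import hashlib
from typing import Optional


def sql_fingerprint(compiled_sql: str) -> str:
    """SHA-256 over whitespace-normalized SQL (hypothetical helper).

    Normalizing whitespace keeps purely cosmetic reformatting from
    triggering a re-profile.
    """
    normalized = " ".join(compiled_sql.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_since_profile(compiled_sql: str, stored_hash: Optional[str]) -> bool:
    # a missing stored hash (e.g. a legacy profile) is treated as changed
    return stored_hash is None or sql_fingerprint(compiled_sql) != stored_hash
```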

Plan

Approach A, with the content hash as an enhancement.

  1. Store staleness metadata in inspect.json — add model_file_hash and profiled_at to each table entry (profiled_at already exists)
  2. Load dbt manifest — parse target/manifest.json for model DAG, extract depends_on edges
  3. Build staleness graph — for each table, walk up the DAG. A table is stale if:
     • No profile exists in inspect.json
     • The model file changed (mtime or hash differs)
     • Any upstream parent is stale (transitive)
  4. Skip fresh tables — only call TableInspector.inspect_table() for stale entries
  5. --force flag — bypasses all staleness checks, re-profiles everything
  6. Print skip summary — "Profiled 5 tables, skipped 45 unchanged"
  7. Non-dbt fallback — no manifest available, fall back to max-age check or profile all
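The transitive rule in step 3 amounts to a memoized walk up the parent DAG. A sketch under the assumption that the DAG is given as a child -> parents mapping and direct staleness has already been decided per table (stale_set is a hypothetical helper, not the StalenessChecker API):

```python
def stale_set(dag: dict, directly_stale: set) -> set:
    """Propagate staleness through the DAG (sketch of Plan step 3).

    dag maps each model name to the list of its parents. The result
    contains every directly stale model plus every model with a stale
    ancestor, so downstream tables are re-profiled when a parent changes.
    """
    memo: dict = {}

    def check(node: str) -> bool:
        if node in memo:
            return memo[node]
        memo[node] = False  # placeholder; a true DAG has no cycles anyway
        result = node in directly_stale or any(
            check(parent) for parent in dag.get(node, [])
        )
        memo[node] = result
        return result

    return {name for name in dag if check(name)}
```

Memoization keeps the walk linear in the number of edges, which matters for diamond-shaped DAGs where a shared ancestor would otherwise be re-checked once per path.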

Files to modify:

  • dataface/cli/commands/inspect.py — add --force flag and staleness logic
  • dataface/core/inspect/storage.py — store model_file_hash alongside profiles
  • New: dataface/core/inspect/staleness.py — DAG walk + freshness comparison

Implementation Progress

  • [x] New dataface/core/inspect/staleness.py — StalenessChecker + StalenessResult dataclass
  • Loads target/manifest.json for dbt DAG, builds name -> node index
  • Loads target/inspect.json for stored profiles with profiled_at and model_file_hash
  • dbt mode: compares SHA-256 of model SQL file against stored hash, walks depends_on edges for transitive staleness
  • Non-dbt fallback: max-age check (default 24h) against profiled_at
  • --force flag bypasses all checks
  • [x] dataface/core/inspect/storage.py — save_inspection() accepts an optional model_file_hash kwarg and persists it in inspect.json
  • [x] dataface/cli/commands/inspect.py — inspect_command() runs the staleness check before profiling; skips fresh tables with a message; computes and stores model_file_hash on save
  • [x] dataface/cli/main.py — --force typer option wired to dft inspect table
  • [x] dataface/core/inspect/__init__.py — exports StalenessChecker, StalenessResult
  • [x] tests/core/test_inspect_staleness.py — 16 tests covering: no-profile stale, max-age stale/fresh, force override, model hash match/mismatch, transitive upstream staleness, diamond DAG, legacy profile without hash, non-dbt fallback, hash_model_file helper
  • [x] All 65 inspect tests pass, ruff clean, mypy clean

Key design decision: Used content hash (SHA-256) instead of mtime for model file comparison. More reliable across git operations, and the task plan recommended this as an enhancement over pure mtime.
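The tests above reference a hash_model_file helper. A plausible shape for it, hashing the file's bytes in fixed-size chunks (the exact signature in staleness.py is an assumption):

```python
import hashlib
from pathlib import Path


def hash_model_file(path: Path) -> str:
    """SHA-256 hex digest of a model file's raw bytes (sketch).

    Chunked reads keep memory flat even for large compiled SQL files,
    and hashing bytes (not mtime) survives git checkouts that rewrite
    timestamps without changing content.
    """
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```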

Review Feedback

  • [ ] Review cleared