Build question-aware schema search and isolation CLI over inspect.json and dbt metadata
Problem
Build a local file-and-CLI retrieval layer that composes inspect.json, dbt schema metadata, and lightweight docs into a searchable corpus, supports search/show/bundle commands, and returns a narrow working set for an AI question instead of a full schema dump.
Context
Today the project has rich metadata but no question-aware retrieval stage:
- `target/inspect.json` stores profiled table metadata
- `catalog()` can browse or deep-profile tables
- `format_schema_context()` can render compact text
- text-to-SQL paths still mostly consume broad schema context rather than a narrow working set
This task should create the local retrieval substrate that sits between "all known metadata" and "the prompt the generator actually sees."
Important constraints:
- keep it file/CLI-first for M2
- use deterministic ranking first
- reuse existing `inspect.json` and dbt metadata
- produce a reusable bundle artifact, not just ad hoc terminal output
- do not spend time optimizing speed or building fancy indexes unless the simple version clearly fails
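A minimal sketch of how the derived corpus might be assembled from already-loaded metadata. Everything here is an assumption for illustration: the shape of the `inspect.json` data, the record fields (`id`, `kind`, `name`, `description`, `columns`), and the function name `build_corpus` are hypothetical. The dbt side only assumes a list of models with `name`, `description`, and `columns`, as in a dbt schema YAML file.

```python
def build_corpus(inspect_data, dbt_models):
    """Flatten profiled tables and dbt model docs into flat, searchable records.

    Assumed (hypothetical) input shapes:
      inspect_data: {"tables": {table_name: {"columns": {col_name: {...}}}}}
      dbt_models:   [{"name": ..., "description": ...,
                      "columns": [{"name": ..., "description": ...}]}]
    """
    docs = {m["name"]: m for m in dbt_models}
    records = []
    # Sort everywhere so the artifact is byte-for-byte reproducible.
    for table_name, table in sorted(inspect_data.get("tables", {}).items()):
        model = docs.get(table_name, {})
        col_docs = {c["name"]: c.get("description", "") for c in model.get("columns", [])}
        column_names = sorted(table.get("columns", {}))
        records.append({
            "id": f"table:{table_name}",
            "kind": "table",
            "name": table_name,
            "description": model.get("description", ""),
            "columns": column_names,
        })
        for col_name in column_names:
            records.append({
                "id": f"column:{table_name}.{col_name}",
                "kind": "column",
                "name": col_name,
                "table": table_name,
                "description": col_docs.get(col_name, ""),
            })
    return records
```

The resulting list can be written out with `json.dump` as the corpus artifact, which keeps it inspectable with ordinary text tools.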
Possible Solutions
- Recommended: derived local corpus + deterministic search + bundle isolation
Build a derived corpus from `inspect.json`, dbt metadata, and lightweight local docs. Add `build`, `search`, `show`, and `bundle` commands. Use field-weighted lexical ranking for search, then isolate a small working set for generation.
Why this is recommended:
- matches the M2 simplicity requirement
- separates retrieval from isolation cleanly
- is easy to inspect and debug
- can later back an MCP tool without rewriting the core logic
- Make agents call `catalog()` repeatedly and let the model browse manually.
Trade-off: minimal engineering, but retrieval stays implicit and prompt usage stays inconsistent.
- Start with embeddings/vector search.
Trade-off: may help later, but adds infrastructure and opacity before we know deterministic retrieval is insufficient.
- Keep optimizing the existing full schema formatter instead of building retrieval.
Trade-off: easier in the short term, but it keeps retrieval and isolation implicit and does not give agents a reusable way to narrow context.
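To make the recommended option concrete, here is a minimal sketch of deterministic field-weighted lexical ranking over in-memory records. The record fields (`id`, `name`, `description`, `columns`), the weights, and the function names are illustrative assumptions, not an existing API in this project.

```python
import re

# Hypothetical weights: a hit in a table name counts more than one in a description.
FIELD_WEIGHTS = {"name": 3.0, "columns": 2.0, "description": 1.0}


def tokenize(text):
    """Lowercase and split on non-alphanumerics, so 'order_id' yields 'order' and 'id'."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def score_record(question_tokens, record):
    """Add the field weight once for each distinct question token found in that field."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        value = record.get(field, "")
        if isinstance(value, list):
            value = " ".join(value)
        score += weight * len(question_tokens & set(tokenize(value)))
    return score


def search(question, records, limit=5):
    """Return (score, id) pairs, best first; ties break on id so output is stable."""
    question_tokens = set(tokenize(question))
    scored = [(score_record(question_tokens, r), r["id"]) for r in records]
    scored = [pair for pair in scored if pair[0] > 0]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:limit]
```

The scorer does exact token matching only (no stemming or synonyms), which keeps every score explainable by hand. That fits the goal of proving usefulness before tuning.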
Plan
- Define the derived corpus record schema for tables, columns, relationships, and docs.
- Build a local corpus artifact from `inspect.json` and dbt metadata.
- Add CLI commands:
  - `dft context build`
  - `dft context search "<question>"`
  - `dft context show <id>`
  - `dft context bundle "<question>"`
- Implement deterministic field-weighted ranking. This can be a simple Python scoring function over in-memory JSON records.
- Implement bundle generation that:
  - keeps top tables
  - trims to top columns per table
  - includes short descriptions and relationship hints
  - emits both JSON and compact `bundle_text`
- Add focused tests for corpus build, ranking, and bundle output.
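The bundle step in the plan could look roughly like the sketch below. It rests on stated assumptions: the input shape (scored table dicts with per-column scores), the `max_tables` and `max_columns` defaults, the function names, and the `bundle_text` layout are all hypothetical. They only illustrate trimming ranked results to a narrow working set.

```python
import json


def build_bundle(question, ranked_tables, max_tables=3, max_columns=5):
    """Keep the top tables, trim each to its top columns, and render compact text.

    ranked_tables is assumed to be a list of dicts with keys: name, score,
    description, columns (list of {"name", "score", "description"}), and
    relationships (list of short hint strings).
    """
    tables = sorted(ranked_tables, key=lambda t: (-t["score"], t["name"]))[:max_tables]
    bundle_tables = []
    lines = [f"Question: {question}"]
    for table in tables:
        columns = sorted(
            table.get("columns", []), key=lambda c: (-c["score"], c["name"])
        )[:max_columns]
        hints = table.get("relationships", [])
        bundle_tables.append({
            "name": table["name"],
            "description": table.get("description", ""),
            "columns": [
                {"name": c["name"], "description": c.get("description", "")}
                for c in columns
            ],
            "relationships": hints,
        })
        lines.append(f"table {table['name']}: {table.get('description', '')}")
        for c in columns:
            lines.append(f"  - {c['name']}: {c.get('description', '')}")
        for hint in hints:
            lines.append(f"  rel: {hint}")
    return {"question": question, "tables": bundle_tables, "bundle_text": "\n".join(lines)}


def bundle_to_json(bundle):
    """Serialize the bundle with stable key order so artifacts diff cleanly."""
    return json.dumps(bundle, indent=2, sort_keys=True)
```

Returning one dict that carries both the structured tables and `bundle_text` lets the same artifact feed a prompt directly or be saved to disk for later inspection.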
Likely files
- new retrieval module under `dataface/ai/`
- CLI command wiring under `dataface/cli/commands/`
- existing inspect/dbt metadata readers
- focused tests for search and bundle behavior
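For the CLI wiring, a standalone sketch of the four subcommands using the standard-library `argparse` is shown below. This is an assumption for illustration only: the project's actual CLI framework and the way commands are registered under `dataface/cli/commands/` are not specified here, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser for the `dft context` subcommands (standalone sketch)."""
    parser = argparse.ArgumentParser(prog="dft context")
    sub = parser.add_subparsers(dest="command", required=True)

    # build: no arguments, regenerates the derived corpus artifact.
    sub.add_parser("build", help="Build the derived corpus artifact")

    search = sub.add_parser("search", help="Rank corpus records for a question")
    search.add_argument("question")

    show = sub.add_parser("show", help="Show one corpus record by id")
    show.add_argument("id")

    bundle = sub.add_parser("bundle", help="Emit a narrow working set for a question")
    bundle.add_argument("question")
    return parser
```

Calling `build_parser().parse_args(["search", "revenue by region"])` yields a namespace with `command` set to `"search"` and `question` set to the quoted string, which a dispatcher can then route to the retrieval module.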
Explicit anti-goals for this task
- no vector DB
- no special indexing service
- no attempt to make search "fast" before proving it is useful
- no attempt to make the scorer perfect before agents are actually using the bundle
Implementation Progress
Not started.
QA Exploration
- [x] QA exploration completed (or N/A for non-UI tasks)
N/A - local CLI/context task.
Review Feedback
- [ ] Review cleared