Build question-aware schema search and isolation CLI over inspect.json and dbt metadata

ID	CONTEXT_CATALOG_NIMBLE-BUILD_QUESTION_AWARE_SCHEMA_SEARCH_AND_ISOLATION_CLI_OVER_INSPECT_JSON_AND_DBT_METADATA
Status	not_started
Priority	p1
Milestone	m2-internal-adoption-design-partners
Owner	data-ai-engineer-architect
Initiative	question-aware-schema-retrieval-and-narrowing

Problem

Build a local file-and-CLI retrieval layer that composes inspect.json, dbt schema metadata, and lightweight docs into a searchable corpus, supports search/show/bundle commands, and returns a narrow working set for an AI question instead of a full schema dump.

Context

Today the project has rich metadata but no question-aware retrieval stage:

target/inspect.json stores profiled table metadata
catalog() can browse or deep-profile tables
format_schema_context() can render compact text
text-to-SQL paths still mostly consume broad schema context rather than a narrow working set

This task should create the local retrieval substrate that sits between "all known metadata" and "the prompt the generator actually sees."

Important constraints:

keep it file/CLI-first for M2
use deterministic ranking first
reuse existing inspect.json and dbt metadata
produce a reusable bundle artifact, not just ad hoc terminal output
do not spend time optimizing speed or building fancy indexes unless the simple version clearly fails

Possible Solutions

Recommended: derived local corpus + deterministic search + bundle isolation

Build a derived corpus from inspect.json, dbt metadata, and lightweight local docs. Add build, search, show, and bundle commands. Use field-weighted lexical ranking for search, then isolate a small working set for generation.

Why this is recommended:

matches the M2 simplicity requirement
separates retrieval from isolation cleanly
is easy to inspect and debug
can later back an MCP tool without rewriting the core logic
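
As a rough illustration of the derived corpus in the recommended option, the sketch below flattens table and column metadata into uniform records. Everything here is an assumption for illustration: the CorpusRecord fields, the build_corpus and dump_corpus helpers, the shape assumed for inspect.json ({"tables": {name: {"columns": {...}}}}), and the flat dbt description mapping are all hypothetical. The real inspect.json and dbt readers may look quite different.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class CorpusRecord:
    """One searchable unit in the derived corpus (hypothetical shape)."""
    id: str                # e.g. "table:orders" or "column:orders.amount"
    kind: str              # "table" | "column" | "relationship" | "doc"
    name: str
    table: str = ""        # owning table for columns; empty otherwise
    description: str = ""


def build_corpus(inspect_data: dict, dbt_descriptions: dict) -> list[CorpusRecord]:
    """Flatten profiled tables and dbt descriptions into corpus records.

    Assumes inspect_data looks like {"tables": {name: {"columns": {col: {...}}}}}
    and dbt_descriptions maps "table" or "table.column" to a description.
    """
    records = []
    # Sort tables and columns so the artifact is byte-stable across builds.
    for table, meta in sorted(inspect_data.get("tables", {}).items()):
        records.append(CorpusRecord(
            id=f"table:{table}", kind="table", name=table,
            description=dbt_descriptions.get(table, ""),
        ))
        for column in sorted(meta.get("columns", {})):
            key = f"{table}.{column}"
            records.append(CorpusRecord(
                id=f"column:{key}", kind="column", name=column, table=table,
                description=dbt_descriptions.get(key, ""),
            ))
    return records


def dump_corpus(records: list[CorpusRecord]) -> str:
    """Serialize the corpus as JSON so it can be written to a local file."""
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)
```

Keeping every record in one flat, typed shape is what makes the later search and show steps easy to inspect: a record id is enough to look anything up, and the artifact diffs cleanly between builds.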

Make agents call catalog() repeatedly and let the model browse manually.

Trade-off: minimal engineering, but retrieval stays implicit and prompt usage stays inconsistent.

Start with embeddings/vector search.

Trade-off: may help later, but adds infrastructure and opacity before we know deterministic retrieval is insufficient.

Keep optimizing the existing full schema formatter instead of building retrieval.

Trade-off: easier in the short term, but it keeps retrieval and isolation implicit and does not give agents a reusable way to narrow context.

Plan

Define the derived corpus record schema for tables, columns, relationships, and docs.
Build a local corpus artifact from inspect.json and dbt metadata.
Add CLI commands:
- dft context build
- dft context search "<question>"
- dft context show <id>
- dft context bundle "<question>"
Implement deterministic field-weighted ranking. This can be a simple Python scoring function over in-memory JSON records.
Implement bundle generation that:
- keeps top tables
- trims to top columns per table
- includes short descriptions and relationship hints
- emits both JSON and compact bundle_text
Add focused tests for corpus build, ranking, and bundle output.
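
The ranking and bundle steps in the plan could start as small as the sketch below. It is a minimal illustration under stated assumptions, not a settled design: the FIELD_WEIGHTS values, the record fields (id, kind, name, table, description), and the score, search, and bundle helpers are all hypothetical, and relationship hints are left out to keep it short.

```python
import re

# Hypothetical field weights: a name hit counts more than a description hit.
FIELD_WEIGHTS = {"name": 3.0, "table": 2.0, "description": 1.0}


def tokenize(text: str) -> set[str]:
    """Lowercase and split on non-alphanumerics, so order_total -> {order, total}."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(question: str, record: dict) -> float:
    """Sum, per field, weight * number of question tokens found in that field."""
    q = tokenize(question)
    return sum(
        weight * len(q & tokenize(record.get(field, "")))
        for field, weight in FIELD_WEIGHTS.items()
    )


def search(question: str, records: list[dict]) -> list[tuple[float, dict]]:
    """Return matching records, best first; ties break on id for determinism."""
    scored = [(score(question, r), r) for r in records]
    hits = [(s, r) for s, r in scored if s > 0]
    return sorted(hits, key=lambda pair: (-pair[0], pair[1]["id"]))


def bundle(question: str, records: list[dict],
           max_tables: int = 2, max_columns: int = 3) -> dict:
    """Keep top tables, trim to top columns per table, and render bundle_text."""
    descriptions = {r["name"]: r.get("description", "")
                    for r in records if r["kind"] == "table"}
    table_scores: dict[str, float] = {}
    columns: dict[str, list[str]] = {}
    for s, r in search(question, records):
        if r["kind"] not in ("table", "column"):
            continue  # docs and relationships are ignored in this sketch
        table = r["table"] if r["kind"] == "column" else r["name"]
        # A table's score is the sum of its own hit and its columns' hits.
        table_scores[table] = table_scores.get(table, 0.0) + s
        if r["kind"] == "column":
            columns.setdefault(table, []).append(r["name"])  # already best-first
    top = sorted(table_scores, key=lambda t: (-table_scores[t], t))[:max_tables]
    tables = [
        {"table": t, "description": descriptions.get(t, ""),
         "columns": columns.get(t, [])[:max_columns]}
        for t in top
    ]
    lines = [f"{t['table']}({', '.join(t['columns'])}): {t['description']}"
             for t in tables]
    return {"question": question, "tables": tables, "bundle_text": "\n".join(lines)}
```

Because the scorer is plain set intersection with fixed weights and an id tie-break, the same question over the same corpus always yields the same bundle, which keeps the focused tests in the last plan step straightforward to write.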

Likely files

new retrieval module under dataface/ai/
CLI command wiring under dataface/cli/commands/
existing inspect/dbt metadata readers
focused tests for search and bundle behavior
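
For the CLI wiring, one possible command surface is sketched below. The task does not say which CLI framework the project uses, so the standard-library argparse is a stand-in here, and the build_parser helper and the help strings are hypothetical. Only the four subcommand names and their arguments come from the plan.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser for: dft context {build,search,show,bundle} (illustrative only)."""
    parser = argparse.ArgumentParser(prog="dft")
    groups = parser.add_subparsers(dest="group", required=True)
    context = groups.add_parser("context", help="question-aware schema retrieval")
    commands = context.add_subparsers(dest="command", required=True)

    # dft context build
    commands.add_parser("build", help="derive the local corpus")

    # dft context search "<question>"
    search = commands.add_parser("search", help="rank corpus records for a question")
    search.add_argument("question")

    # dft context show <id>
    show = commands.add_parser("show", help="print one corpus record")
    show.add_argument("id")

    # dft context bundle "<question>"
    bundle = commands.add_parser("bundle", help="emit a narrow working set")
    bundle.add_argument("question")
    return parser
```

For example, build_parser().parse_args(["context", "search", "revenue by month"]) yields a namespace with command set to "search" and question set to "revenue by month", which a thin dispatcher could route to the retrieval module.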

Explicit anti-goals for this task

no vector DB
no special indexing service
no attempt to make search "fast" before proving it is useful
no attempt to make the scorer perfect before agents are actually using the bundle

Implementation Progress

Not started.

QA Exploration

[x] QA exploration completed (or N/A for non-UI tasks)

N/A - local CLI/context task.

Review Feedback

[ ] Review cleared