Question-aware schema retrieval and narrowing

Objective

Build a simple file-and-CLI-first retrieval layer over inspect.json and dbt metadata that can search, rank, and isolate only the schema context needed for a question. Keep M2 focused on local artifacts and commands, but design the retrieval engine so it can later back a runtime search_context tool for agents without changing the core data model.

This initiative explicitly prioritizes narrowing quality and implementation simplicity over speed, indexing sophistication, or retrieval perfection. If a plain Python search function over JSON artifacts does the job, that is good enough for M2.

Why this is M2

M1 got us the raw context substrate:

target/inspect.json
compact schema formatting
catalog() browse/profile flows
basic dbt description merging

What is still missing is the question-aware layer between "all available metadata" and "the exact working set the SQL generator should see." That gap shows up mainly through internal adoption and design-partner usage, so M2 is the right milestone:

current schemas are still small enough that full-context dumping works often enough to unblock M1
repeated internal usage will expose where whole-schema prompting is noisy, redundant, or misleading
M2 is where we should make retrieval and narrowing explicit, measurable, and reusable

Scope

In scope for M2

build a local searchable corpus from inspect.json, dbt metadata, and lightweight docs
support high-recall search over that corpus
support a separate isolation step that produces a small question-scoped bundle
expose this through CLI and file artifacts first
make the bundle consumable by text-to-SQL eval and generation paths
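
As a rough illustration of the first in-scope item, the sketch below flattens schema metadata into one searchable document per table and per column, serialized as JSONL. The shape of SAMPLE_INSPECT (a "tables" list with "name", "description", and "columns") is an assumption made for this example; the real inspect.json layout may differ, and dbt descriptions and docs would need their own loaders. The function names iter_corpus_docs and to_jsonl are hypothetical.

```python
import json

# Hypothetical stand-in for inspect.json; the real artifact's shape may differ.
SAMPLE_INSPECT = {
    "tables": [
        {
            "name": "orders",
            "description": "One row per customer order.",
            "columns": [
                {"name": "order_id", "type": "integer", "description": "Primary key."},
                {"name": "ordered_at", "type": "timestamp", "description": ""},
            ],
        }
    ]
}


def iter_corpus_docs(inspect):
    """Yield one flat document per table and one per column."""
    for table in inspect.get("tables", []):
        table_name = table["name"]
        yield {
            "id": table_name,
            "kind": "table",
            "table": table_name,
            "text": " ".join(p for p in (table_name, table.get("description", "")) if p),
        }
        for col in table.get("columns", []):
            # Skip empty fields so the searchable text has no stray whitespace.
            parts = (col["name"], col.get("type", ""), col.get("description", ""))
            yield {
                "id": f"{table_name}.{col['name']}",
                "kind": "column",
                "table": table_name,
                "text": " ".join(p for p in parts if p),
            }


def to_jsonl(docs):
    """Serialize documents as JSONL, one JSON object per line."""
    return "\n".join(json.dumps(doc, sort_keys=True) for doc in docs)


corpus = list(iter_corpus_docs(SAMPLE_INSPECT))
print(to_jsonl(corpus))
```

Keeping every document flat, with a stable id and a single text field, means the same records can be written to disk for the CLI and later served by a runtime tool without reshaping them.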

Out of scope for M2

search-speed optimization work
sophisticated indexing infrastructure
retrieval tuning for large-scale corpora
a heavy embedding/vector system as the default path
broad external context connectors beyond the current dbt + inspect-based sources
open-ended runtime exploration of live data values
a mandatory MCP search_context tool in product surfaces
full semantic/business grounding across governed metrics

Deliverables

[ ] Searchable local corpus built from existing context artifacts.
[ ] Clear command surface for search, inspect, and bundle generation.
[ ] Isolation contract that hands generation a narrow working set instead of a full schema blob.
[ ] Downstream eval/generation integration for A/B comparison against full-context prompting.
[ ] Post-M2 tool path defined as a thin wrapper over the same retrieval engine.
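
One possible shape for the isolation contract is sketched below: scored search hits go in, and a small bundle limited to a few tables and their matched columns comes out. The class names ContextBundle and TableSlice, the isolate function, the (score, doc_id) hit format, and the max_tables cutoff are all assumptions for illustration, not an agreed interface.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class TableSlice:
    table: str
    columns: list = field(default_factory=list)


@dataclass
class ContextBundle:
    question: str
    tables: list = field(default_factory=list)


def isolate(question, hits, max_tables=2):
    """Group (score, doc_id) hits by table and keep the top-scoring tables.

    doc_id is assumed to be "table" or "table.column".
    """
    totals = {}
    columns = {}
    for score, doc_id in hits:
        table, _, column = doc_id.partition(".")
        totals[table] = totals.get(table, 0) + score
        columns.setdefault(table, [])
        if column:
            columns[table].append(column)
    # Highest total score first; ties broken by table name for determinism.
    ranked = sorted(totals, key=lambda t: (-totals[t], t))[:max_tables]
    return ContextBundle(question, [TableSlice(t, sorted(columns[t])) for t in ranked])


# Illustrative hits only; scores are placeholders, not real search output.
hits = [
    (3, "orders"),
    (2, "orders.ordered_at"),
    (1, "customers.email"),
    (1, "payments.amount"),
]
bundle = isolate("How many orders were placed last week?", hits)
print(asdict(bundle))
```

Because the bundle is a plain dataclass, asdict(bundle) can be written to a file for the CLI path or returned as-is by a later tool wrapper.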

Tasks

Build question-aware schema search and isolation CLI over inspect.json and dbt metadata — P1 — Create the local corpus, search commands, and bundle generation flow
Wire question-scoped context bundles into text-to-SQL eval backends — P1 — Compare retrieved-and-isolated context against current full-schema prompting
Compare text-to-SQL evals with question-aware retrieval vs full-context prompting — P1 — Run the direct A/B comparison between narrowed context and the current full-context baseline
Iterate on question-aware retrieval with interface and result experiments — P2 — Run a small post-A/B experiment matrix to improve search output and bundle usefulness
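
For the A/B comparison task, a summary along the following lines may be enough to start with: per-arm accuracy plus the list of questions that flipped in each direction. The record format, the arm names full_context and narrowed, and the compare_arms function are hypothetical, and the sample records are placeholders that only exercise the code; they are not eval results.

```python
def compare_arms(results):
    """Summarize pass/fail outcomes for the two prompting arms.

    Each record is assumed to look like:
    {"id": "q1", "full_context": True, "narrowed": False}
    """
    if not results:
        raise ValueError("need at least one eval record")
    n = len(results)
    return {
        "n": n,
        "full_context_acc": sum(bool(r["full_context"]) for r in results) / n,
        "narrowed_acc": sum(bool(r["narrowed"]) for r in results) / n,
        # Questions where the outcome differs between arms are the ones worth reading.
        "fixed_by_narrowing": [
            r["id"] for r in results if r["narrowed"] and not r["full_context"]
        ],
        "broken_by_narrowing": [
            r["id"] for r in results if r["full_context"] and not r["narrowed"]
        ],
    }


# Placeholder records for illustration only.
placeholder_results = [
    {"id": "q1", "full_context": True, "narrowed": True},
    {"id": "q2", "full_context": False, "narrowed": True},
    {"id": "q3", "full_context": True, "narrowed": False},
    {"id": "q4", "full_context": False, "narrowed": False},
]
summary = compare_arms(placeholder_results)
print(summary)
```

Listing the flipped questions alongside the accuracy numbers keeps attention on where narrowing dropped needed context, which is the input the follow-up P2 experiments need.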

Design Thesis

The important split is:

build the corpus
retrieve broadly enough for recall
isolate aggressively enough for prompt usability
generate from the isolated bundle

That means this initiative is not "make schema prompts better." It is "stop making the generator do implicit retrieval inside one giant prompt."

For M2, the retriever can be intentionally boring:

simple Python over JSONL/JSON
straightforward lexical matching
CLI-first invocation
transparent heuristics the team can inspect and tweak
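
A retriever in this spirit could be as small as the sketch below: tokenize the question, count overlapping tokens per document, and sort. The tokenize and search functions, the document shape (an id plus a text field), and the sample documents are assumptions for illustration. There is deliberately no stemming or synonym handling, so "orders" does not match "order"; that is the kind of visible, tweakable heuristic the list above describes.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics, so "order_id" gives ["order", "id"]."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def search(question, docs, limit=5):
    """Rank docs by how often the question's distinct tokens occur in their text."""
    query_tokens = set(tokenize(question))
    scored = []
    for doc in docs:
        counts = Counter(tokenize(doc["text"]))
        score = sum(counts[tok] for tok in query_tokens)
        if score > 0:
            scored.append((score, doc["id"]))
    # Highest score first; ties broken by id so output is stable across runs.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:limit]


# Hypothetical corpus documents for illustration.
DOCS = [
    {"id": "orders", "text": "orders One row per customer order"},
    {"id": "orders.ordered_at", "text": "ordered_at timestamp when the order was placed"},
    {"id": "customers.email", "text": "email text customer contact email"},
]

hits = search("How many orders per customer?", DOCS)
print(hits)  # [(3, 'orders'), (1, 'customers.email')]
```

Note that orders.ordered_at is missed here because exact-token matching cannot connect "orders" to "order". Gaps like this are easy to see and patch in a plain function, for example by adding a small normalization step.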

Do not over-invest in performance or elegance before proving that narrowing helps the agent.

Relationship to other work

Builds on context-catalog-nimble profiling and MCP catalog work.
Feeds mcp-analyst-agent text-to-SQL eval and generation work.
Should stay compatible with the existing benchmark initiative and future non-one-shot generation tasks.