Question-aware schema retrieval and narrowing
Objective
Build a simple file-and-CLI-first retrieval layer over inspect.json and dbt metadata that can search, rank, and isolate only the schema context needed for a question. Keep M2 focused on local artifacts and commands, but design the retrieval engine so it can later back a runtime search_context tool for agents without changing the core data model.
This initiative explicitly prioritizes narrowing quality and implementation simplicity over speed, indexing sophistication, or retrieval perfection. If a plain Python search function over JSON artifacts does the job, that is good enough for M2.
Why this is M2
M1 got us the raw context substrate:
- target/inspect.json
- compact schema formatting
- catalog() browse/profile flows
- basic dbt description merging
What is still missing is the question-aware layer between "all available metadata" and "the exact working set the SQL generator should see." That is an internal-adoption and design-partner problem, so M2 is the right milestone:
- current schemas are still small enough that full-context dumping works often enough to unblock M1
- repeated internal usage will expose where whole-schema prompting is noisy, redundant, or misleading
- M2 is where we should make retrieval and narrowing explicit, measurable, and reusable
Scope
In scope for M2
- build a local searchable corpus from inspect.json, dbt metadata, and lightweight docs
- support high-recall search over that corpus
- support a separate isolation step that produces a small question-scoped bundle
- expose this through CLI and file artifacts first
- make the bundle consumable by text-to-SQL eval and generation paths
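One possible shape for the corpus-building step, sketched under assumptions: the inspect.json layout used here (a `tables` list whose entries carry `columns` with `name` and `type`) and the `dbt_descriptions` mapping are hypothetical stand-ins, since this document does not specify the real artifact formats. The function names are likewise illustrative.

```python
import json

def build_corpus(inspect: dict, dbt_descriptions: dict[str, str]) -> list[dict]:
    """Flatten schema metadata into one searchable record per table and column.

    Both input shapes are assumptions made for this sketch.
    """
    records = []
    for table in inspect.get("tables", []):
        t_name = table["name"]
        records.append({
            "id": t_name,
            "kind": "table",
            "table": t_name,
            "text": " ".join(filter(None, [t_name, dbt_descriptions.get(t_name)])),
        })
        for col in table.get("columns", []):
            col_id = f"{t_name}.{col['name']}"
            records.append({
                "id": col_id,
                "kind": "column",
                "table": t_name,
                "text": " ".join(
                    filter(None, [col_id, col.get("type"), dbt_descriptions.get(col_id)])
                ),
            })
    return records

def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSONL so the corpus stays a plain, diffable file artifact."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)

# Illustrative inputs only; not real project data.
inspect = {"tables": [{"name": "orders", "columns": [
    {"name": "id", "type": "integer"},
    {"name": "total_amount", "type": "numeric"},
]}]}
dbt_descriptions = {
    "orders": "One row per customer order",
    "orders.total_amount": "Order total in USD",
}

corpus = build_corpus(inspect, dbt_descriptions)
corpus_jsonl = to_jsonl(corpus)
```

Keeping one record per table and per column means the later search and isolation steps can work at column granularity while still being able to group results by table.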
Out of scope for M2
- search-speed optimization work
- sophisticated indexing infrastructure
- retrieval tuning for large-scale corpora
- a heavy embedding/vector system as the default path
- broad external context connectors beyond the current dbt + inspect-based sources
- open-ended runtime exploration of live data values
- a mandatory MCP search_context tool in product surfaces
- full semantic/business grounding across governed metrics
Deliverables
- [ ] Searchable local corpus built from existing context artifacts.
- [ ] Clear command surface for search, inspect, and bundle generation.
- [ ] Isolation contract that hands generation a narrow working set instead of a full schema blob.
- [ ] Downstream eval/generation integration for A/B comparison against full-context prompting.
- [ ] Post-M2 tool path defined as a thin wrapper over the same retrieval engine.
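To make the "thin wrapper over the same retrieval engine" deliverable concrete, here is a minimal sketch in which a CLI entry point and a tool-style handler both delegate to one engine function. Everything named here (`retrieve`, `cli_main`, the `context-search` program name, the payload keys) is an assumption for illustration; the matching logic is a deliberately naive placeholder.

```python
import argparse
import json

def retrieve(question: str, corpus: list[dict], limit: int = 5) -> dict:
    """Hypothetical engine entry point: naive substring match, JSON-serializable result."""
    words = [w for w in question.lower().split() if len(w) > 2]
    hits = [r["id"] for r in corpus if any(w in r["text"].lower() for w in words)]
    return {"question": question, "hits": hits[:limit]}

def cli_main(argv: list[str], corpus: list[dict]) -> str:
    """CLI surface: parse arguments, call the engine, return JSON text."""
    parser = argparse.ArgumentParser(prog="context-search")  # program name is illustrative
    parser.add_argument("question")
    parser.add_argument("--limit", type=int, default=5)
    args = parser.parse_args(argv)
    return json.dumps(retrieve(args.question, corpus, args.limit), indent=2)

def search_context(payload: dict, corpus: list[dict]) -> dict:
    """Possible post-M2 tool handler: same engine, different transport."""
    return retrieve(payload["question"], corpus, payload.get("limit", 5))

# Illustrative corpus records only.
corpus = [
    {"id": "orders.total_amount", "text": "orders total_amount order total in USD"},
    {"id": "users.email", "text": "users email contact address"},
]
cli_out = json.loads(cli_main(["total order amount", "--limit", "3"], corpus))
tool_out = search_context({"question": "total order amount", "limit": 3}, corpus)
```

Because both surfaces return the engine's result unchanged, the CLI output and the tool output are identical for the same inputs, which is the property that lets the tool be added later without changing the core data model.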
Tasks
- Build question-aware schema search and isolation CLI over inspect.json and dbt metadata — P1 — Create the local corpus, search commands, and bundle generation flow
- Wire question-scoped context bundles into text-to-SQL eval backends — P1 — Compare retrieved-and-isolated context against current full-schema prompting
- Compare text-to-SQL evals with question-aware retrieval vs full-context prompting — P1 — Run the direct A/B comparison between narrowed context and the current full-context baseline
- Iterate on question-aware retrieval with interface and result experiments — P2 — Run a small post-A/B experiment matrix to improve search output and bundle usefulness
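The A/B tasks above could be summarized with a paired, per-question comparison along these lines. This is only a sketch: the result format (question id mapped to a correct/incorrect flag) is an assumption about what the eval backends emit, and the sample values are placeholders, not measured results.

```python
def compare_arms(full: dict[str, bool], narrowed: dict[str, bool]) -> dict:
    """Paired comparison over the questions that both arms answered."""
    shared = sorted(full.keys() & narrowed.keys())
    if not shared:
        raise ValueError("no overlapping questions between arms")
    return {
        "n": len(shared),
        "full_accuracy": sum(full[q] for q in shared) / len(shared),
        "narrowed_accuracy": sum(narrowed[q] for q in shared) / len(shared),
        # Per-question flips are often more informative than the aggregate delta.
        "fixed_by_narrowing": [q for q in shared if narrowed[q] and not full[q]],
        "broken_by_narrowing": [q for q in shared if full[q] and not narrowed[q]],
    }

# Placeholder outcomes for illustration only; these are not measured results.
full_context = {"q1": True, "q2": False, "q3": True, "q4": False}
question_scoped = {"q1": True, "q2": True, "q3": False, "q4": False}
report = compare_arms(full_context, question_scoped)
```

Listing the questions that narrowing fixed or broke gives the follow-up experiment matrix something specific to investigate, even when the two accuracies are equal.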
Design Thesis
The important split is:
- build the corpus
- retrieve broadly enough for recall
- isolate aggressively enough for prompt usability
- generate from the isolated bundle
That means this initiative is not "make schema prompts better." It is "stop making the generator do implicit retrieval inside one giant prompt."
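The gap between "retrieve broadly" and "isolate aggressively" can be sketched as a separate function that takes a noisy, ranked hit list and returns a capped table-to-columns bundle. The hit shape (`table`, `column`, `score`), the function name `isolate`, and the cap values are all assumptions chosen for illustration.

```python
from collections import defaultdict

def isolate(hits: list[dict], max_tables: int = 2, max_columns: int = 3) -> dict[str, list[str]]:
    """Collapse a broad, ranked hit list into a small table -> columns bundle.

    Tables are ranked by the summed score of their hits; caps are illustrative defaults.
    """
    table_score: dict[str, float] = defaultdict(float)
    columns: dict[str, list[str]] = defaultdict(list)
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        table_score[hit["table"]] += hit["score"]
        if hit["column"] and len(columns[hit["table"]]) < max_columns:
            columns[hit["table"]].append(hit["column"])
    keep = sorted(table_score, key=table_score.get, reverse=True)[:max_tables]
    return {t: columns[t] for t in keep}

# Hypothetical high-recall retrieval output: deliberately noisy.
hits = [
    {"table": "orders", "column": "total_amount", "score": 3.0},
    {"table": "orders", "column": "created_at", "score": 2.0},
    {"table": "customers", "column": "id", "score": 1.5},
    {"table": "audit_log", "column": "payload", "score": 0.5},
    {"table": "orders", "column": "id", "score": 1.0},
    {"table": "orders", "column": "status", "score": 0.5},
]
bundle = isolate(hits)
```

In this example the low-scoring `audit_log` table and the fourth `orders` column are dropped, so the generator would see two tables and four columns instead of everything the search returned.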
For M2, the retriever can be intentionally boring:
- simple Python over JSONL/JSON
- straightforward lexical matching
- CLI-first invocation
- transparent heuristics the team can inspect and tweak
Do not over-rotate on performance or elegance before proving that narrowing helps the agent.
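A "boring" lexical scorer with transparent heuristics might look like the sketch below: token overlap per field, multiplied by hand-set weights, with the matched tokens returned alongside the score so a ranking can be explained and tweaked. The field names and weight values are assumptions, not tuned settings.

```python
import re

# Hand-set, inspectable weights (illustrative values, not tuned on real data).
FIELD_WEIGHTS = {"name": 3.0, "description": 1.0}

def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; underscores and dots act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(question: str, record: dict) -> tuple[float, dict[str, list[str]]]:
    """Return a total score plus the matched tokens per field."""
    q = tokens(question)
    total, why = 0.0, {}
    for field, weight in FIELD_WEIGHTS.items():
        matched = sorted(q & tokens(record.get(field, "")))
        if matched:
            total += weight * len(matched)
            why[field] = matched
    return total, why

# Illustrative record only.
record = {"name": "orders.total_amount", "description": "Order total in USD"}
total, why = score("What is the total order amount?", record)
```

Returning the per-field matches costs almost nothing and makes it easy to see why a record ranked where it did before deciding whether a weight needs changing.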
Relationship to other work
- Builds on context-catalog-nimble profiling and MCP catalog work.
- Feeds mcp-analyst-agent text-to-SQL eval and generation work.
- Should stay compatible with the existing benchmark initiative and future non-one-shot generation tasks.