Spec

Design Goals

Keep M2 retrieval simple, local, and file-native.
Separate retrieval, isolation, and generation into explicit stages.
Make the output small enough for prompt use but rich enough for SQL generation.
Reuse existing inspect.json and dbt metadata before inventing new metadata pipelines.
Design one retrieval engine that can later power a tool surface without changing the core artifact model.
Prioritize narrowing quality over speed, indexing sophistication, or retrieval perfection.

Non-Goals (M2)

Search-speed optimization or benchmarking work.
Fancy indexing infrastructure beyond what simple Python/CLI code needs.
Mandatory embeddings/vector retrieval.
Large online retrieval service or database.
Full semantic layer or business glossary resolver.
Open-ended live data exploration during retrieval.
Making a search_context MCP tool mandatory for production.

M2 System Shape

1. Corpus build

Build a derived corpus from local artifacts:

target/inspect.json
dbt schema metadata already available in-project
lightweight local docs / curated definition files when present

Suggested output:

target/context/corpus.jsonl
target/context/manifest.json

The corpus is a derived artifact, not a source of truth.
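A minimal sketch of the write step, assuming records have already been extracted from inspect.json and dbt metadata into plain dicts. The function name `write_corpus` and the manifest keys (`record_count`, `counts_by_kind`, `sha256`) are illustrative assumptions, not part of this spec. Sorting records and serializing with sorted keys is one way to keep the build reproducible.

```python
import hashlib
import json
from pathlib import Path


def write_corpus(records: list[dict], out_dir: Path) -> dict:
    """Write corpus.jsonl and manifest.json under out_dir; return the manifest."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # Sort by (kind, id) and serialize with sorted keys so that identical
    # inputs always produce byte-identical output.
    ordered = sorted(records, key=lambda r: (r["kind"], r["id"]))
    body = "".join(
        json.dumps(r, sort_keys=True, ensure_ascii=False) + "\n" for r in ordered
    )
    (out_dir / "corpus.jsonl").write_text(body, encoding="utf-8")

    counts: dict[str, int] = {}
    for record in ordered:
        counts[record["kind"]] = counts.get(record["kind"], 0) + 1

    # Hypothetical manifest layout: enough to detect drift between builds.
    manifest = {
        "record_count": len(ordered),
        "counts_by_kind": counts,
        "sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
    (out_dir / "manifest.json").write_text(
        json.dumps(manifest, sort_keys=True, indent=2) + "\n", encoding="utf-8"
    )
    return manifest
```

Because the manifest carries a content hash, two builds from the same inputs can be compared without diffing the corpus itself.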

2. Search

Search takes a natural-language question and returns ranked matches from the corpus.

Search responsibilities:

high recall
deterministic ranking
explainable result scores

M2 implementation guidance:

simple Python over JSON/JSONL is acceptable
loading the corpus into memory is acceptable
a naive field-weighted scorer is acceptable
do not optimize prematurely

Search does not produce prompt-ready context directly.
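The in-memory approach described above could look roughly like this. `load_corpus` is a hypothetical helper name; the only assumption about the file is that it holds one JSON object per line.

```python
import json
from pathlib import Path


def load_corpus(path: Path) -> list[dict]:
    """Read every JSONL record into memory, skipping blank lines."""
    records = []
    with path.open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Point at the offending line so a bad build is easy to debug.
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
    return records
```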

3. Isolation

Isolation turns ranked hits into a small question-scoped bundle.

Isolation responsibilities:

collapse many hits to a small working set
trim irrelevant columns and docs
preserve relationship hints across the retained tables
produce a deterministic output contract for generation/evals
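One possible shape for the isolation step, as a sketch only. It assumes each ranked hit has already been flattened to a dict with `kind`, `table`, and (for column hits) `column`; that hit shape, the function name `isolate`, and the `max_tables` / `max_columns` caps are assumptions made for illustration.

```python
def isolate(
    hits: list[dict],
    relationships: list[dict],
    max_tables: int = 3,
    max_columns: int = 5,
) -> dict:
    """Collapse ranked hits (best first) into a small working set."""
    # Keep the first max_tables distinct tables in rank order.
    tables: list[str] = []
    for hit in hits:
        table = hit.get("table")
        if table and table not in tables and len(tables) < max_tables:
            tables.append(table)

    # Keep only columns that were hit, capped per table, in rank order.
    columns: dict[str, list[str]] = {t: [] for t in tables}
    for hit in hits:
        if hit.get("kind") == "column" and hit.get("table") in columns:
            kept = columns[hit["table"]]
            if hit["column"] not in kept and len(kept) < max_columns:
                kept.append(hit["column"])

    # Preserve relationship hints only when both ends survived.
    kept_relationships = [
        r for r in relationships
        if r["left_table"] in columns and r["right_table"] in columns
    ]
    return {
        "selected_tables": [{"table": t, "columns": columns[t]} for t in tables],
        "relationships": kept_relationships,
    }
```

Iterating in rank order and never sorting by anything unstable keeps the output deterministic for a given hit list.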

4. Generation consumer

Generation should receive either:

the isolated bundle
or, for compatibility, the existing full schema context fallback

The generator should not own retrieval logic.

Corpus Record Model

Required fields

Every corpus record should include:

kind
id
title
search_text
summary
source
payload

M2 record kinds

table
column
relationship
doc

Table payload

Recommended fields:

schema_name
table_name
description
row_count
primary_date_column
grain
source_paths

Column payload

Recommended fields:

schema_name
table_name
column_name
database_type
role
semantic_type
description
distinct_count
enum_values
null_percentage
min_value
max_value

Relationship payload

Recommended fields:

left_table
right_table
join_keys
multiplicity
fanout_risk

Doc payload

Recommended fields:

title
body
source_path
related_tables (optional)
related_columns (optional)

Ranking Model (M2)

M2 ranking should be deterministic and field-weighted.

Recommended score inputs:

exact table-name token match
exact column-name token match
prefix / substring name match
description/doc token overlap
semantic type / role token overlap
relationship proximity boost when multiple top hits connect

Avoid opaque ranking for M2. Result order should be inspectable and debuggable.
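A naive scorer along these lines might look as follows. The weight values, the function names, and the three-way split into `name_exact`, `name_partial`, and `text_overlap` are assumptions for illustration; the relationship proximity boost is left out to keep the sketch short. Each hit carries its per-component scores so the final order can be inspected.

```python
import re

# Hypothetical weights; the spec names the score inputs but not their values.
WEIGHTS = {"name_exact": 5.0, "name_partial": 2.0, "text_overlap": 1.0}


def tokenize(text: str) -> set[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score_record(question: str, record: dict) -> dict:
    q_tokens = tokenize(question)
    name_tokens = tokenize(record["title"])
    text_tokens = tokenize(record["search_text"])

    exact = q_tokens & name_tokens
    # Substring match against name tokens, ignoring very short question tokens.
    partial = {
        q for q in q_tokens - exact
        if len(q) >= 3 and any(q in n for n in name_tokens)
    }
    # Count each question token once, under its strongest match only.
    overlap = (q_tokens & text_tokens) - exact - partial

    components = {
        "name_exact": WEIGHTS["name_exact"] * len(exact),
        "name_partial": WEIGHTS["name_partial"] * len(partial),
        "text_overlap": WEIGHTS["text_overlap"] * len(overlap),
    }
    return {"id": record["id"], "score": sum(components.values()), "components": components}


def rank(question: str, records: list[dict]) -> list[dict]:
    """Return scored hits, best first; ties break on id for determinism."""
    hits = [score_record(question, r) for r in records]
    return sorted((h for h in hits if h["score"] > 0), key=lambda h: (-h["score"], h["id"]))
```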

Isolation Contract

Bundle output

Suggested path:

target/context/bundles/<slug>.json

Bundle shape

Required top-level fields:

question
generated_at
search_results
selected_tables
relationships
bundle_text
metadata
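As a sketch, the bundle and its output path could be assembled like this. The slug rule, the ISO 8601 UTC timestamp format, and the helper names `slugify`, `bundle_path`, and `make_bundle` are assumptions; the spec fixes only the top-level field names and the suggested directory.

```python
import re
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

REQUIRED_FIELDS = (
    "question", "generated_at", "search_results",
    "selected_tables", "relationships", "bundle_text", "metadata",
)


def slugify(question: str, max_len: int = 60) -> str:
    """Hypothetical slug rule: lowercase, non-alphanumerics collapsed to '-'."""
    slug = re.sub(r"[^a-z0-9]+", "-", question.lower()).strip("-")
    return slug[:max_len].rstrip("-") or "question"


def bundle_path(question: str) -> Path:
    return Path("target/context/bundles") / f"{slugify(question)}.json"


def make_bundle(
    question: str,
    search_results: list[dict],
    selected_tables: list[dict],
    relationships: list[dict],
    bundle_text: str,
    metadata: Optional[dict] = None,
) -> dict:
    return {
        "question": question,
        "generated_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "search_results": search_results,
        "selected_tables": selected_tables,
        "relationships": relationships,
        "bundle_text": bundle_text,
        "metadata": metadata or {},
    }
```

Note that `generated_at` makes the bundle file differ between runs, so determinism checks would need to ignore that field or inject a fixed clock.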

`selected_tables`

Each selected table should include:

table summary
selected columns only
compact descriptions only
relationship references to other selected tables

`bundle_text`

The bundle should also provide a compact text representation that can be passed directly to the current generator, so the generator does not have to understand a new schema format.

That means M2 can narrow the context without having to rewrite every generation consumer immediately.
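One way to render such a text form is shown below. The line layout (`TABLE`, indented column lines, `JOIN`) and the keys read from each table, column, and relationship dict are assumptions chosen for the example, not a format this spec defines.

```python
def render_bundle_text(selected_tables: list[dict], relationships: list[dict]) -> str:
    """Render the isolated working set as compact, prompt-ready text."""
    lines: list[str] = []
    for table in selected_tables:
        lines.append(f"TABLE {table['table_name']}: {table['summary']}")
        for col in table["columns"]:
            lines.append(
                f"  - {col['column_name']} ({col['database_type']}): {col['description']}"
            )
    for rel in relationships:
        keys = ", ".join(rel["join_keys"])
        lines.append(
            f"JOIN {rel['left_table']} -> {rel['right_table']} "
            f"ON {keys} [{rel['multiplicity']}]"
        )
    return "\n".join(lines)
```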

CLI Surface

`dft context build`

Build or refresh the local derived corpus from available artifacts.

`dft context search "<question>"`

Return ranked corpus hits in text or JSON.

`dft context show <id>`

Return one record in detail.

`dft context bundle "<question>"`

Return or persist the isolated working set intended for generation.

Output modes

All commands should support:

human-readable text
machine-readable JSON
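The command surface above could be wired with the standard library's `argparse`, as in this sketch. The `--format` flag name and its `text` default are assumptions; the spec requires both output modes but does not say how they are selected.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="dft")
    groups = parser.add_subparsers(dest="group", required=True)
    context = groups.add_parser("context")
    actions = context.add_subparsers(dest="action", required=True)

    # Hypothetical shared flag, defined once and inherited by every command.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--format", choices=("text", "json"), default="text")

    actions.add_parser("build", parents=[common])
    search = actions.add_parser("search", parents=[common])
    search.add_argument("question")
    show = actions.add_parser("show", parents=[common])
    show.add_argument("id")
    bundle = actions.add_parser("bundle", parents=[common])
    bundle.add_argument("question")
    return parser
```

Defining the output flag on a parent parser keeps the four commands consistent without repeating the option.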

Integration Rules

M2 default

The retrieval engine is first consumed by eval and local generation flows, not by every agent surface immediately.

Fallback

If the corpus is unavailable or the command is not configured, existing full-schema behavior remains valid.
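A consumer-side sketch of this fallback, assuming the bundle is stored as JSON with a `bundle_text` field and that the caller already holds the full-schema text. The function name `resolve_context` and the returned source labels are hypothetical.

```python
import json
from pathlib import Path


def resolve_context(bundle_file: Path, full_schema_text: str) -> tuple[str, str]:
    """Return (source, text): the bundle text if usable, else the full schema."""
    try:
        bundle = json.loads(bundle_file.read_text(encoding="utf-8"))
        text = bundle["bundle_text"]
    except (OSError, json.JSONDecodeError, KeyError, TypeError):
        # Missing file, unreadable JSON, or unexpected shape: fall back.
        return "full_schema", full_schema_text
    if not isinstance(text, str) or not text.strip():
        return "full_schema", full_schema_text
    return "bundle", text
```

Returning the source label alongside the text lets evals record which path each prompt actually took.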

Future tool compatibility

Future MCP tool use should call the same core search/isolation library and return the same records or bundle schema.

Validation and Measurement

The initiative is successful when:

corpus build is deterministic and reproducible
search returns useful ranked results on realistic questions
bundle size is materially smaller than full schema dumps
downstream text-to-SQL generation can consume the bundle without custom one-off glue
evals can compare bundle-driven prompting versus full-context prompting
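The size criterion can be checked with a simple ratio. This sketch uses character counts as a stand-in for prompt size, which is an assumption; a token count from the target model's tokenizer would be a closer measure. The function name is hypothetical.

```python
def size_reduction(bundle_text: str, full_schema_text: str) -> float:
    """Fraction by which the bundle is smaller than the full schema dump."""
    if not full_schema_text:
        raise ValueError("full schema text must be non-empty")
    return 1 - len(bundle_text) / len(full_schema_text)
```

A value of 0.75 means the bundle is a quarter of the size of the full dump. A negative value means the bundle is larger and narrowing did not help.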

Acceptance Criteria

A local corpus can be built from inspect.json plus available dbt metadata.
Search and bundle commands exist and work in text and JSON modes.
Isolation is a separate stage from search.
The bundle contains only a narrow working set rather than the full schema.
Existing generation/eval code can consume the bundle through a stable interface.
The architecture leaves room for a later search_context tool without re-implementing retrieval.

tasks/workstreams/context-catalog-nimble/initiatives/question-aware-schema-retrieval-and-narrowing/spec.md
