Spec
Design Goals
- Keep M2 retrieval simple, local, and file-native.
- Separate retrieval, isolation, and generation into explicit stages.
- Make the output small enough for prompt use but rich enough for SQL generation.
- Reuse existing inspect.json and dbt metadata instead of inventing new metadata pipelines first.
- Design one retrieval engine that can later power a tool surface without changing the core artifact model.
- Prioritize narrowing quality over speed, indexing sophistication, or retrieval perfection.
Non-Goals (M2)
- Search-speed optimization or benchmarking work.
- Fancy indexing infrastructure beyond what simple Python/CLI code needs.
- Mandatory embeddings/vector retrieval.
- Large online retrieval service or database.
- Full semantic layer or business glossary resolver.
- Open-ended live data exploration during retrieval.
- A production-mandatory search_context MCP tool.
M2 System Shape
1. Corpus build
Build a derived corpus from local artifacts:
- target/inspect.json
- dbt schema metadata already available in-project
- lightweight local docs / curated definition files when present
Suggested output:
target/context/corpus.jsonl
target/context/manifest.json
The corpus is a derived artifact, not a source of truth.
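A minimal sketch of the write step, assuming records have already been assembled as dictionaries. The function name `write_corpus` and the manifest fields (`record_count`, `sha256`) are illustrative assumptions, not part of this spec; how records are extracted from inspect.json is left out because its structure is not described here.

```python
import hashlib
import json
from pathlib import Path


def write_corpus(records: list[dict], out_dir: Path) -> dict:
    """Write corpus.jsonl and manifest.json; return the manifest (hypothetical shape)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Sort by (kind, id) and serialize with sorted keys so the same inputs
    # always produce byte-identical output.
    ordered = sorted(records, key=lambda r: (r["kind"], r["id"]))
    lines = [json.dumps(r, sort_keys=True, ensure_ascii=False) for r in ordered]
    body = "".join(line + "\n" for line in lines)
    (out_dir / "corpus.jsonl").write_text(body, encoding="utf-8")

    # Assumed manifest fields: a record count and a content hash.
    manifest = {
        "record_count": len(ordered),
        "sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
    (out_dir / "manifest.json").write_text(
        json.dumps(manifest, sort_keys=True, indent=2) + "\n", encoding="utf-8"
    )
    return manifest
```

Because the output can always be regenerated from the source artifacts, deleting target/context/ should be safe at any time.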
2. Search
Search takes a natural-language question and returns ranked matches from the corpus.
Search responsibilities:
- high recall
- deterministic ranking
- explainable result scores
M2 implementation guidance:
- simple Python over JSON/JSONL is acceptable
- loading the corpus into memory is acceptable
- a naive field-weighted scorer is acceptable
- do not optimize prematurely
Search does not produce prompt-ready context directly.
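The guidance above could be sketched as follows: load the whole corpus into memory, score by plain token overlap, and break ties on id so that the order is deterministic. The function names and the hit shape (`id`, `kind`, `score`, `matched`) are assumptions for illustration only.

```python
import json
import re
from pathlib import Path


def load_corpus(path: Path) -> list[dict]:
    """Load every JSONL record into memory."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def tokenize(text: str) -> set[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(question: str, corpus: list[dict], limit: int = 20) -> list[dict]:
    """Rank records by token overlap between the question and search_text."""
    q = tokenize(question)
    hits = []
    for rec in corpus:
        matched = sorted(q & tokenize(rec.get("search_text", "")))
        if matched:
            # "matched" keeps the score explainable: it lists the tokens that counted.
            hits.append({"id": rec["id"], "kind": rec["kind"],
                         "score": len(matched), "matched": matched})
    # Ties break on id, so the result never depends on corpus order.
    hits.sort(key=lambda h: (-h["score"], h["id"]))
    return hits[:limit]
```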
3. Isolation
Isolation turns ranked hits into a small question-scoped bundle.
Isolation responsibilities:
- collapse many hits to a small working set
- trim irrelevant columns and docs
- preserve relationship hints across the retained tables
- produce a deterministic output contract for generation/evals
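One way to sketch these responsibilities, assuming hits arrive in rank order and the corpus is available as a dictionary keyed by record id. The function name `isolate` and the `max_tables` / `max_columns` budgets are hypothetical; the spec does not fix any limits.

```python
def isolate(hits: list[dict], corpus_by_id: dict[str, dict],
            max_tables: int = 3, max_columns: int = 8) -> dict:
    """Collapse ranked hits into a small, question-scoped working set."""
    columns_by_table: dict[str, list[str]] = {}
    for hit in hits:  # hits are assumed to be in rank order already
        rec = corpus_by_id[hit["id"]]
        if rec["kind"] not in ("table", "column"):
            continue
        table = rec["payload"]["table_name"]
        if table not in columns_by_table:
            if len(columns_by_table) >= max_tables:
                continue  # table budget exhausted; drop lower-ranked tables
            columns_by_table[table] = []
        if rec["kind"] == "column":
            cols = columns_by_table[table]
            name = rec["payload"]["column_name"]
            if name not in cols and len(cols) < max_columns:
                cols.append(name)

    # Keep only relationships whose two sides both survived the trim.
    relationships = [
        rec["payload"]
        for _, rec in sorted(corpus_by_id.items())
        if rec["kind"] == "relationship"
        and rec["payload"]["left_table"] in columns_by_table
        and rec["payload"]["right_table"] in columns_by_table
    ]
    return {
        "selected_tables": [
            {"table_name": t, "columns": c} for t, c in columns_by_table.items()
        ],
        "relationships": relationships,
    }
```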
4. Generation consumer
Generation should receive either:
- the isolated bundle
- or, for compatibility, the existing full schema context fallback
The generator should not own retrieval logic.
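A small sketch of that boundary: the generator asks for context text and never learns how it was produced. The function name `context_for_generation` and its parameters are assumptions; `bundle_text` is the field described under Isolation Contract.

```python
import json
from pathlib import Path
from typing import Optional


def context_for_generation(bundle_path: Optional[Path], full_schema_text: str) -> str:
    """Return bundle_text when a bundle exists, else the full schema context."""
    if bundle_path is not None and bundle_path.is_file():
        bundle = json.loads(bundle_path.read_text(encoding="utf-8"))
        text = bundle.get("bundle_text")
        if text:
            return text
    # Compatibility path: the existing full-schema behavior stays valid.
    return full_schema_text
```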
Corpus Record Model
Required fields
Every corpus record should include:
kind
id
title
search_text
summary
source
payload
M2 record kinds
table
column
relationship
doc
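The required fields and record kinds can be checked with a few lines of Python. This is a sketch only: `validate_record` is a hypothetical helper, and the rule that `payload` must be a JSON object is an assumption based on the payload sections that follow.

```python
REQUIRED_FIELDS = ("kind", "id", "title", "search_text", "summary", "source", "payload")
RECORD_KINDS = ("table", "column", "relationship", "doc")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    kind = record.get("kind")
    if kind is not None and kind not in RECORD_KINDS:
        problems.append(f"unknown kind: {kind}")
    # Assumption: payload is always an object, never a scalar or a list.
    if "payload" in record and not isinstance(record["payload"], dict):
        problems.append("payload must be an object")
    return problems
```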
Table payload
Recommended fields:
schema_name
table_name
description
row_count
primary_date_column
grain
source_paths
Column payload
Recommended fields:
schema_name
table_name
column_name
database_type
role
semantic_type
description
distinct_count
enum_values
null_percentage
min_value
max_value
Relationship payload
Recommended fields:
left_table
right_table
join_keys
multiplicity
fanout_risk
Doc payload
Recommended fields:
title
body
source_path
related_tables (optional)
related_columns (optional)
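To show how a payload fits inside the required record fields, here is a sketch that wraps a column payload. The id format (`column:<schema>.<table>.<column>`) and the choice of which payload fields feed `search_text` are assumptions made for this example, not rules set by the spec.

```python
def column_record(payload: dict, source: str) -> dict:
    """Wrap a column payload in the required record envelope (hypothetical helper)."""
    qualified = f"{payload['schema_name']}.{payload['table_name']}.{payload['column_name']}"
    # search_text gathers the fields a question is most likely to mention.
    parts = [
        payload["table_name"],
        payload["column_name"],
        payload.get("description"),
        payload.get("semantic_type"),
        payload.get("role"),
    ]
    parts += [str(v) for v in payload.get("enum_values") or []]
    return {
        "kind": "column",
        "id": f"column:{qualified}",
        "title": qualified,
        "search_text": " ".join(p for p in parts if p),
        "summary": payload.get("description") or qualified,
        "source": source,
        "payload": payload,
    }
```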
Ranking Model (M2)
M2 ranking should be deterministic and field-weighted.
Recommended score inputs:
- exact table-name token match
- exact column-name token match
- prefix / substring name match
- description/doc token overlap
- semantic type / role token overlap
- relationship proximity boost when multiple top hits connect
Avoid opaque ranking for M2. Result order should be inspectable and debuggable.
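A sketch of an inspectable scorer under stated assumptions: the weights are placeholder values, table-name and column-name matches share one `exact_name` bucket for brevity, and the relationship proximity boost is left out because it needs the full hit list rather than a single record.

```python
import re

# Hypothetical weights; the spec does not prescribe values.
WEIGHTS = {"exact_name": 5.0, "partial_name": 2.0, "description": 1.0, "semantic": 1.0}


def tokens(text) -> set:
    """Lowercase alphanumeric tokens; None is treated as empty text."""
    return set(re.findall(r"[a-z0-9]+", (text or "").lower()))


def score_record(question: str, record: dict) -> dict:
    """Score one record and return the per-field breakdown alongside the total."""
    payload = record["payload"]
    q = tokens(question)
    names = tokens(payload.get("table_name")) | tokens(payload.get("column_name"))
    exact = q & names
    # Prefix/substring match: a question token of 3+ characters inside a name token.
    partial = {t for t in q - exact if len(t) >= 3 and any(t in n for n in names)}
    counts = {
        "exact_name": len(exact),
        "partial_name": len(partial),
        "description": len(q & tokens(payload.get("description"))),
        "semantic": len(q & (tokens(payload.get("semantic_type")) | tokens(payload.get("role")))),
    }
    breakdown = {key: WEIGHTS[key] * n for key, n in counts.items()}
    return {"id": record["id"], "score": sum(breakdown.values()), "breakdown": breakdown}
```

Returning the breakdown with every score makes it possible to answer "why did this record rank here?" by reading the result rather than rerunning the scorer.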
Isolation Contract
Bundle output
Suggested path:
target/context/bundles/<slug>.json
Bundle shape
Required top-level fields:
question
generated_at
search_results
selected_tables
relationships
bundle_text
metadata
selected_tables
Each selected table should include:
- table summary
- selected columns only
- compact descriptions only
- relationship references to other selected tables
bundle_text
The bundle should also provide a compact text representation that can be passed directly to the current generator, so the generator does not have to understand a new schema format.
This lets M2 narrow the context without rewriting every generation consumer immediately.
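The bundle shape and its text form could be assembled roughly like this. Everything beyond the seven required top-level fields is an assumption: the line format of `bundle_text`, the `slugify` rule, the contents of `metadata`, and `join_keys` being a list of strings.

```python
import re
from datetime import datetime, timezone
from typing import Optional


def slugify(question: str) -> str:
    """Hypothetical slug rule for target/context/bundles/<slug>.json."""
    return re.sub(r"[^a-z0-9]+", "-", question.lower()).strip("-")[:60]


def render_bundle_text(selected_tables: list[dict], relationships: list[dict]) -> str:
    """Render a compact, prompt-ready text view of the working set."""
    lines = []
    for table in selected_tables:
        lines.append(f"table {table['table_name']}: {table.get('summary', '')}".rstrip())
        for col in table["columns"]:
            lines.append(
                f"  {col['column_name']} ({col.get('database_type', 'unknown')}): "
                f"{col.get('description', '')}".rstrip()
            )
    for rel in relationships:
        keys = ", ".join(rel["join_keys"])
        lines.append(f"join {rel['left_table']} -> {rel['right_table']} on {keys}")
    return "\n".join(lines)


def build_bundle(question: str, search_results: list[dict], selected_tables: list[dict],
                 relationships: list[dict], generated_at: Optional[str] = None) -> dict:
    """Assemble the required top-level fields of a bundle."""
    return {
        "question": question,
        "generated_at": generated_at or datetime.now(timezone.utc).isoformat(),
        "search_results": search_results,
        "selected_tables": selected_tables,
        "relationships": relationships,
        "bundle_text": render_bundle_text(selected_tables, relationships),
        # Assumed metadata contents; the spec leaves this field open.
        "metadata": {"slug": slugify(question), "table_count": len(selected_tables)},
    }
```

Passing `generated_at` explicitly keeps the bundle byte-stable in tests and evals, where a wall-clock timestamp would otherwise make every run differ.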
CLI Surface
dft context build
Build or refresh the local derived corpus from available artifacts.
dft context search "<question>"
Return ranked corpus hits in text or JSON.
dft context show <id>
Return one record in detail.
dft context bundle "<question>"
Return or persist the isolated working set intended for generation.
Output modes
All commands should support:
- human-readable text
- machine-readable JSON
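The command surface above could be wired with argparse along these lines. This is a sketch, not the actual dft implementation: the `--format` flag name and its `text` default are assumptions, since the spec only requires that both output modes exist.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four `dft context` subcommands."""
    parser = argparse.ArgumentParser(prog="dft")
    groups = parser.add_subparsers(dest="group", required=True)
    context = groups.add_parser("context", help="local retrieval corpus commands")
    commands = context.add_subparsers(dest="command", required=True)

    # Shared option so every command supports both output modes.
    # The flag name is hypothetical.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--format", choices=("text", "json"), default="text")

    commands.add_parser("build", parents=[common])
    search = commands.add_parser("search", parents=[common])
    search.add_argument("question")
    show = commands.add_parser("show", parents=[common])
    show.add_argument("id")
    bundle = commands.add_parser("bundle", parents=[common])
    bundle.add_argument("question")
    return parser
```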
Integration Rules
M2 default
The retrieval engine is first consumed by eval and local generation flows, not by every agent surface immediately.
Fallback
If the corpus is unavailable or the command is not configured, existing full-schema behavior remains valid.
Future MCP tool use should call the same core search/isolation library and return the same records or bundle schema.
Validation and Measurement
The initiative is successful when:
- corpus build is deterministic and reproducible
- search returns useful ranked results on realistic questions
- bundle size is materially smaller than full schema dumps
- downstream text-to-SQL generation can consume the bundle without custom one-off glue
- evals can compare bundle-driven prompting versus full-context prompting
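The bundle-size criterion can be turned into a number with a very small helper. Character count is used here as a rough proxy for prompt size, which is an assumption; a token count from the target model's tokenizer would be more faithful.

```python
def size_reduction(bundle_text: str, full_schema_text: str) -> float:
    """Fraction by which the bundle is smaller than the full schema dump."""
    if not full_schema_text:
        raise ValueError("full_schema_text must not be empty")
    return 1 - len(bundle_text) / len(full_schema_text)
```

For example, a 25-character bundle against a 100-character schema dump gives a reduction of 0.75.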
Acceptance Criteria
- A local corpus can be built from inspect.json plus available dbt metadata.
- Search and bundle commands exist and work in text and JSON modes.
- Isolation is a separate stage from search.
- The bundle contains only a narrow working set rather than the full schema.
- Existing generation/eval code can consume the bundle through a stable interface.
- The architecture leaves room for a later search_context tool without re-implementing retrieval.