tasks/workstreams/context-catalog-nimble/initiatives/question-aware-schema-retrieval-and-narrowing/research.md

Research

Why this initiative exists

The current text-to-SQL stack still treats schema context mostly as a prebuilt blob:

That is a clean baseline, but it pushes retrieval, isolation, and generation into one step. The SOTA notes correctly argue that these should be separate stages.

What we already have

Existing metadata substrate

target/inspect.json already gives us a strong local artifact to build on:

Relevant implementation details:

Existing dbt-aware context

We already have partial dbt-aware enrichment:

That means M2 does not need to invent new raw metadata sources first. It can focus on shaping and retrieving the sources we already have.

What the public systems suggest

LinkAlign: separate retrieval from isolation

The most relevant lesson from LinkAlign is not "use embeddings." It is:

  1. retrieve broadly enough to preserve recall
  2. isolate aggressively before generation

This maps directly to our problem. Even if we keep the first retrieval implementation simple and lexical, we should still preserve the stage split.
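The two-stage split can be kept even when both stages are trivial. A minimal sketch, assuming a corpus of plain dict records with a `text` field (all function names, fields, and the `k`/`budget` defaults are assumptions, not existing code):

```python
# Two explicit stages, per the LinkAlign lesson: broad retrieval first,
# aggressive isolation second. Both are lexical and local for M2.

def retrieve(question: str, corpus: list[dict], k: int = 50) -> list[dict]:
    """Stage 1: recall-oriented. Keep any record sharing a token with the question."""
    q_tokens = set(question.lower().split())
    scored = [
        (len(q_tokens & set(r["text"].lower().split())), r)
        for r in corpus
    ]
    scored = [(s, r) for s, r in scored if s > 0]
    scored.sort(key=lambda sr: sr[0], reverse=True)
    return [r for _, r in scored[:k]]

def isolate(candidates: list[dict], budget: int = 5) -> list[dict]:
    """Stage 2: precision-oriented. Trim hard before generation."""
    return candidates[:budget]
```

Even this toy version preserves the property that matters: retrieval can stay loose because isolation trims before anything reaches the prompt.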

Databao: treat local artifacts as a searchable domain

Databao is especially relevant for us because it treats dbt artifacts as first-class context and exposes search_context as a runtime tool. The most portable ideas for Dataface are:

The important part for M2 is not hybrid retrieval itself. The important part is having a searchable local domain at all.

ReFoRCE: compress before generation

ReFoRCE reinforces a simpler point that applies even before we implement value exploration:

That means our M2 bundle should be intentionally smaller than full get_schema_context() output, even if the underlying source metadata remains rich.

Best M2 design under our constraints

Constraint 1: keep it simple

The user requirement here is explicit: prefer a simple file/CLI approach over a heavy service or indexing stack.

That rules out making embeddings, vector DBs, online retrieval infrastructure, or search-speed optimization the default M2 path.

Constraint 2: current schemas are not enormous

For the schemas we are actively working with right now, full-context dumps are often still acceptable. So the M2 system does not need to solve "web scale" retrieval. It needs to:

It does not need to be fast internally. It just needs to narrow context well enough that the agent sees less noise.

Constraint 3: future tool use still matters

Even though M2 should be file/CLI first, the output contract should be designed so that a future tool can simply call into the same engine and return the same records or bundle.

That argues for:

Stage A: build a local corpus

Build a derived corpus from local artifacts, likely under target/context/ or similar. Each record should represent one searchable unit, not one giant schema blob.
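One hypothetical shape for a single corpus record, to make "one searchable unit" concrete (every field name here is an assumption, not a settled contract):

```python
# A searchable unit is small and self-describing. Fields are illustrative
# guesses at what a record under target/context/ might carry.
record = {
    "id": "table:analytics.orders",   # stable id so a later `show <id>` can resolve it
    "kind": "table",                  # e.g. table / column / relationship
    "name": "orders",
    "text": "One row per order.",     # the search surface for lexical ranking
    "source": "target/inspect.json",  # provenance back to the raw artifact
}
```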

Good record kinds for M2:

Each record should carry:

Stage B: search / ranking

M2 search should be deterministic and local. A field-weighted lexical ranker, or even a plain Python scoring function over JSON records, is probably enough for a first pass.
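A field-weighted lexical scorer of that kind could look like the following sketch (the weights, field names, and cutoff are assumptions to be tuned, not recommendations):

```python
# Field-weighted lexical scoring over JSON records: a token match in the
# record name counts more than a match in descriptive text. Deterministic,
# local, no index required.
FIELD_WEIGHTS = {"name": 3.0, "text": 1.0}  # assumed weights

def score(question: str, record: dict) -> float:
    q_tokens = set(question.lower().split())
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = set(str(record.get(field, "")).lower().split())
        total += weight * len(q_tokens & tokens)
    return total

def search(question: str, corpus: list[dict], k: int = 10) -> list[dict]:
    ranked = sorted(corpus, key=lambda r: score(question, r), reverse=True)
    return [r for r in ranked[:k] if score(question, r) > 0]
```

Because `sorted` is stable and the scoring is pure token overlap, the same corpus and question always produce the same ranking, which is what the eval work needs.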

Recommended ranking signals:

Why not embeddings first:

Stage C: isolation / bundle generation

This is the most important part. Search results are not the prompt. The bundle is the prompt input.

The isolation step should:

The output should be a question-scoped bundle with:
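As one hypothetical sketch of such a bundle, assuming nothing beyond "small, serializable, question-scoped" (field names are invented for illustration):

```python
import json

# Hypothetical question-scoped bundle: the question it was built for plus
# the narrowed records. Small enough to log alongside an eval run.
bundle = {
    "question": "monthly revenue by plan",
    "records": [
        {"id": "table:orders", "kind": "table"},
        {"id": "column:orders.plan", "kind": "column"},
    ],
}

# Deterministic serialization so eval runs can log and diff bundles.
serialized = json.dumps(bundle, sort_keys=True, indent=2)
```

Sorting keys at serialization time is a cheap way to make "same question, same corpus, same bundle" checkable byte-for-byte.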

What information should be exposed

Always expose in the bundle

Expose only when relevant or in detailed modes

These are useful, but they should not be default prompt payload unless they help answer the specific question.

Avoid exposing by default

What commands agents and developers will want

M2 CLI surface

Recommended command set:

  1. dft context build: Build or refresh the derived local corpus from inspect.json, dbt metadata, and lightweight docs.

  2. dft context search "<question>": Return ranked matches in text or JSON. Good for broad recall and human inspection.

  3. dft context show <id>: Show one record in detail, for example a table, column, or relationship node.

  4. dft context bundle "<question>": Produce the narrow working set intended for SQL generation. This is the main M2 output.

  5. dft context bundle "<question>" --save: Persist the bundle to disk so evals or generators can consume it reproducibly.
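That command set maps cleanly onto a single argparse entrypoint. A sketch of the wiring only, with no handlers (the subcommand shape mirrors the list above; nothing here is existing code):

```python
import argparse

# Skeleton for the proposed `dft context` surface. Handlers are stubs;
# only the command and flag shape is sketched.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="dft")
    groups = parser.add_subparsers(dest="group", required=True)
    context = groups.add_parser("context").add_subparsers(dest="cmd", required=True)

    context.add_parser("build")
    search = context.add_parser("search")
    search.add_argument("question")
    show = context.add_parser("show")
    show.add_argument("id")
    bundle = context.add_parser("bundle")
    bundle.add_argument("question")
    bundle.add_argument("--save", action="store_true")
    return parser
```

Keeping all five commands under one parser also makes the future tool wrapper trivial: it can import `build_parser`'s underlying functions rather than shelling out.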

Useful optional flags:

Implementation note for M2:

Future tool shape after M2

The likely future tool surface is:

But these should be wrappers over the CLI/core library, not a second implementation.

How agents are likely to use this

Pattern 1: pre-bundle before generation

For most M2 flows, the outer orchestrator should call bundle first, then pass only that narrowed context to SQL generation.

This is the simplest path and probably the default one.

Pattern 2: search first when the question is ambiguous

If the question is vague or maps to multiple table families, an agent will want:

  1. search_context("monthly revenue by plan")
  2. inspect top results
  3. choose or regenerate a bundle
  4. then generate SQL

This is why search and bundle should be separate commands.

Pattern 3: fall back to full schema on small projects

Because some current projects are still small, we should allow a simple fallback:
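One hypothetical form of that fallback, where the size threshold and the `bundle_fn` callable are assumptions chosen for illustration:

```python
SMALL_PROJECT_TABLE_LIMIT = 20  # assumed threshold, not a settled number

def context_for(question: str, corpus: list[dict], bundle_fn):
    """Fall back to the full corpus when the project is small enough that
    narrowing buys little; otherwise build a question-scoped bundle."""
    tables = [r for r in corpus if r.get("kind") == "table"]
    if len(tables) <= SMALL_PROJECT_TABLE_LIMIT:
        return corpus              # full-context dump is still acceptable
    return bundle_fn(question, corpus)
```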

Derived corpus

Suggested path:

Why:

Question bundle

Suggested path:

The bundle should be deterministic enough that eval runs can log it and compare full-context versus narrowed-context generation fairly.

Evaluation implications

This initiative should be measured by more than subjective prompt size reduction.

Useful metrics:

This is where it should connect to the existing benchmark/evals work rather than inventing a separate measurement stack.
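One metric in that spirit, offered here as an assumption rather than something the notes specify, is schema-linking recall: the fraction of tables the gold SQL actually uses that survive into the narrowed bundle. A sketch:

```python
def bundle_recall(gold_tables: set[str], bundle_tables: set[str]) -> float:
    """Fraction of tables required by the gold SQL that made it into the
    narrowed bundle. 1.0 means narrowing lost nothing that mattered."""
    if not gold_tables:
        return 1.0
    return len(gold_tables & bundle_tables) / len(gold_tables)
```

Logged per eval question, this gives a direct way to compare full-context versus narrowed-context runs: narrowing is only a win if recall stays near 1.0 while prompt size drops.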

Recommendation

For M2, the best design is:

That gives us a real retrieval-and-narrowing layer without overcommitting to infrastructure that our current schema sizes do not yet require.