tasks/workstreams/context-catalog-nimble/initiatives/question-aware-schema-retrieval-and-narrowing/research.md

Research

Why this initiative exists

The current text-to-SQL stack still treats schema context mostly as a prebuilt blob:

That is a clean baseline, but it pushes retrieval, isolation, and generation into one step. The SOTA notes correctly argue that these should be separate stages.

What we already have

Existing metadata substrate

target/inspect.json already gives us a strong local artifact to build on:

Relevant implementation details:

Existing dbt-aware context

We already have partial dbt-aware enrichment:

That means M2 does not need to invent new raw metadata sources first. It can focus on shaping and retrieving the sources we already have.

What the public systems suggest

LinkAlign: separate retrieval from isolation

The most relevant lesson from LinkAlign is not "use embeddings." It is:

  1. retrieve broadly enough to preserve recall
  2. isolate aggressively before generation

This maps directly to our problem. Even if we keep the first retrieval implementation simple and lexical, we should still preserve the stage split.
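The two-stage split can be kept even when both stages are trivial. A minimal sketch, assuming a corpus of plain dict records with a `text` field (all function names, fields, and the `k`/`budget` defaults are assumptions, not existing code):

```python
# Two explicit stages, per the LinkAlign lesson: broad retrieval first,
# aggressive isolation second. Both are lexical and local for M2.

def retrieve(question: str, corpus: list[dict], k: int = 50) -> list[dict]:
    """Stage 1: recall-oriented. Keep any record sharing a token with the question."""
    q_tokens = set(question.lower().split())
    scored = [
        (len(q_tokens & set(r["text"].lower().split())), r)
        for r in corpus
    ]
    scored = [(s, r) for s, r in scored if s > 0]
    scored.sort(key=lambda sr: sr[0], reverse=True)
    return [r for _, r in scored[:k]]

def isolate(candidates: list[dict], budget: int = 5) -> list[dict]:
    """Stage 2: precision-oriented. Trim hard before generation."""
    return candidates[:budget]
```

Even this toy version preserves the property that matters: retrieval can stay loose because isolation trims before anything reaches the prompt.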

Databao: treat local artifacts as a searchable domain

Databao is especially relevant for us because it treats dbt artifacts as first-class context and exposes search_context as a runtime tool. The most portable ideas for Dataface are:

The important part for M2 is not hybrid retrieval itself. The important part is having a searchable local domain at all.

ReFoRCE: compress before generation

ReFoRCE reinforces a simpler point that applies even before we implement value exploration:

That means our M2 bundle should be intentionally smaller than full get_schema_context() output, even if the underlying source metadata remains rich.

Best M2 design under our constraints

Constraint 1: keep it simple

The user requirement here is explicit: prefer a simple file/CLI approach over a heavy service or indexing stack.

That rules out making embeddings, vector DBs, online retrieval infrastructure, or search-speed optimization the default M2 path.

Constraint 2: current schemas are not enormous

For the schemas we are actively working with right now, full-context dumps are often still acceptable. So the M2 system does not need to solve "web scale" retrieval. It needs to:

It does not need to be fast internally. It just needs to narrow context well enough that the agent sees less noise.

Constraint 3: future tool use still matters

Even though M2 should be file/CLI first, the output contract should be designed so that a future tool can simply call into the same engine and return the same records or bundle.

That argues for:

Stage A: build a local corpus

Build a derived corpus from local artifacts, likely under target/context/ or similar. Each record should represent one searchable unit, not one giant schema blob.
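One hypothetical shape for a single corpus record, to make "one searchable unit" concrete (every field name here is an assumption, not a settled contract):

```python
# A searchable unit is small and self-describing. Fields are illustrative
# guesses at what a record under target/context/ might carry.
record = {
    "id": "table:analytics.orders",   # stable id so a later `show <id>` can resolve it
    "kind": "table",                  # e.g. table / column / relationship
    "name": "orders",
    "text": "One row per order.",     # the search surface for lexical ranking
    "source": "target/inspect.json",  # provenance back to the raw artifact
}
```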

Good record kinds for M2:

Each record should carry:

Stage B: search / ranking

M2 search should be deterministic and local. A field-weighted lexical ranker, or even a plain Python scoring function over JSON records, is probably enough for a first pass.
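A field-weighted lexical scorer of that kind could look like the following sketch (the weights, field names, and cutoff are assumptions to be tuned, not recommendations):

```python
# Field-weighted lexical scoring over JSON records: a token match in the
# record name counts more than a match in descriptive text. Deterministic,
# local, no index required.
FIELD_WEIGHTS = {"name": 3.0, "text": 1.0}  # assumed weights

def score(question: str, record: dict) -> float:
    q_tokens = set(question.lower().split())
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = set(str(record.get(field, "")).lower().split())
        total += weight * len(q_tokens & tokens)
    return total

def search(question: str, corpus: list[dict], k: int = 10) -> list[dict]:
    ranked = sorted(corpus, key=lambda r: score(question, r), reverse=True)
    return [r for r in ranked[:k] if score(question, r) > 0]
```

Because `sorted` is stable and the scoring is pure token overlap, the same corpus and question always produce the same ranking, which is what the eval work needs.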

Recommended ranking signals:

Why not embeddings first:

Stage C: isolation / bundle generation

This is the most important part. Search results are not the prompt. The bundle is the prompt input.

The isolation step should:

The output should be a question-scoped bundle with:
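As one hypothetical sketch of such a bundle, assuming nothing beyond "small, serializable, question-scoped" (field names are invented for illustration):

```python
import json

# Hypothetical question-scoped bundle: the question it was built for plus
# the narrowed records. Small enough to log alongside an eval run.
bundle = {
    "question": "monthly revenue by plan",
    "records": [
        {"id": "table:orders", "kind": "table"},
        {"id": "column:orders.plan", "kind": "column"},
    ],
}

# Deterministic serialization so eval runs can log and diff bundles.
serialized = json.dumps(bundle, sort_keys=True, indent=2)
```

Sorting keys at serialization time is a cheap way to make "same question, same corpus, same bundle" checkable byte-for-byte.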

What information should be exposed

Always expose in the bundle

Expose only when relevant or in detailed modes

These are useful, but they should not be default prompt payload unless they help answer the specific question.

Avoid exposing by default

What commands agents and developers will want

M2 CLI surface

Recommended command set:

  1. dft context build: Build or refresh the derived local corpus from inspect.json, dbt metadata, and lightweight docs.

  2. dft context search "<question>": Return ranked matches in text or JSON. Good for broad recall and human inspection.

  3. dft context show <id>: Show one record in detail, for example a table, column, or relationship node.

  4. dft context bundle "<question>": Produce the narrow working set intended for SQL generation. This is the main M2 output.

  5. dft context bundle "<question>" --save: Persist the bundle to disk so evals or generators can consume it reproducibly.
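That command set maps cleanly onto a single argparse entrypoint. A sketch of the wiring only, with no handlers (the subcommand shape mirrors the list above; nothing here is existing code):

```python
import argparse

# Skeleton for the proposed `dft context` surface. Handlers are stubs;
# only the command and flag shape is sketched.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="dft")
    groups = parser.add_subparsers(dest="group", required=True)
    context = groups.add_parser("context").add_subparsers(dest="cmd", required=True)

    context.add_parser("build")
    search = context.add_parser("search")
    search.add_argument("question")
    show = context.add_parser("show")
    show.add_argument("id")
    bundle = context.add_parser("bundle")
    bundle.add_argument("question")
    bundle.add_argument("--save", action="store_true")
    return parser
```

Keeping all five commands under one parser also makes the future tool wrapper trivial: it can import `build_parser`'s underlying functions rather than shelling out.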

Useful optional flags:

Implementation note for M2:

Future tool shape after M2

The likely future tool surface is:

But these should be wrappers over the CLI/core library, not a second implementation.

How agents are likely to use this

Pattern 1: pre-bundle before generation

For most M2 flows, the outer orchestrator should call bundle first, then pass only that narrowed context to SQL generation.

This is the simplest path and probably the default one.

Pattern 2: search first when the question is ambiguous

If the question is vague or maps to multiple table families, an agent will want:

  1. search_context("monthly revenue by plan")
  2. inspect top results
  3. choose or regenerate a bundle
  4. then generate SQL

This is why search and bundle should be separate commands.

Pattern 3: fall back to full schema on small projects

Because some current projects are still small, we should allow a simple fallback:
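One hypothetical form of that fallback, where the size threshold and the `bundle_fn` callable are assumptions chosen for illustration:

```python
SMALL_PROJECT_TABLE_LIMIT = 20  # assumed threshold, not a settled number

def context_for(question: str, corpus: list[dict], bundle_fn):
    """Fall back to the full corpus when the project is small enough that
    narrowing buys little; otherwise build a question-scoped bundle."""
    tables = [r for r in corpus if r.get("kind") == "table"]
    if len(tables) <= SMALL_PROJECT_TABLE_LIMIT:
        return corpus              # full-context dump is still acceptable
    return bundle_fn(question, corpus)
```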

Derived corpus

Suggested path:

Why:

Question bundle

Suggested path:

The bundle should be deterministic enough that eval runs can log it and compare full-context versus narrowed-context generation fairly.

Evaluation implications

This initiative should be measured by more than subjective prompt size reduction.

Useful metrics:

This is where it should connect to the existing benchmark/evals work rather than inventing a separate measurement stack.
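One metric in that spirit, offered here as an assumption rather than something the notes specify, is schema-linking recall: the fraction of tables the gold SQL actually uses that survive into the narrowed bundle. A sketch:

```python
def bundle_recall(gold_tables: set[str], bundle_tables: set[str]) -> float:
    """Fraction of tables required by the gold SQL that made it into the
    narrowed bundle. 1.0 means narrowing lost nothing that mattered."""
    if not gold_tables:
        return 1.0
    return len(gold_tables & bundle_tables) / len(gold_tables)
```

Logged per eval question, this gives a direct way to compare full-context versus narrowed-context runs: narrowing is only a win if recall stays near 1.0 while prompt size drops.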

Recommendation

For M2, the best design is:

That gives us a real retrieval-and-narrowing layer without overcommitting to infrastructure that our current schema sizes do not yet require.