The current text-to-SQL stack still treats schema context mostly as a prebuilt blob:
- dataface/ai/schema_context.py:get_schema_context() formats all available cached/live schema into one compact text document
- dataface/ai/generate_sql.py hands that blob plus the question to the generator
- dataface/ai/agent.py also appends the whole schema context to the system prompt when available

That is a clean baseline, but it pushes retrieval, isolation, and generation into one step. The SOTA notes correctly argue that these should be separate stages.
target/inspect.json already gives us a strong local artifact to build on:
- target/inspect.json

Relevant implementation details:
- dataface/core/inspect/storage.py stores all profiles in one file and explicitly calls out queryability via DuckDB
- dataface/ai/mcp/tools.py:catalog() already merges cached profiles with dbt descriptions and can return either schema browse or deep table profile output
- dataface/ai/schema_context.py already knows how to format a compact text summary from this metadata

We already have partial dbt-aware enrichment:
- apps/evals/sql/context.py that can load dbt package metadata
- inspect.json or dbt sources without forcing one single schema blob contract

That means M2 does not need to invent new raw metadata sources first. It can focus on shaping and retrieving the sources we already have.
The most relevant lesson from LinkAlign is not "use embeddings." It is:
This maps directly to our problem. Even if we keep the first retrieval implementation simple and lexical, we should still preserve the stage split.
Databao is especially relevant for us because it treats dbt artifacts as first-class context and exposes search_context as a runtime tool. The most portable ideas for Dataface are:
The important part for M2 is not hybrid retrieval itself. The important part is having a searchable local domain at all.
ReFoRCE reinforces a simpler point that applies even before we implement value exploration:
That means our M2 bundle should be intentionally smaller than full get_schema_context() output, even if the underlying source metadata remains rich.
The user requirement here is explicit: prefer a simple file/CLI approach over a heavy service or indexing stack.
That rules out making embeddings, vector DBs, online retrieval infrastructure, or search-speed optimization the default M2 path.
For the schemas we are actively working with right now, full-context dumps are often still acceptable. So the M2 system does not need to solve "web scale" retrieval. It needs to:
It does not need to be fast internally. It just needs to narrow context well enough that the agent sees less noise.
Even though M2 should be file/CLI first, the output contract should be designed so that a future tool can simply call into the same engine and return the same records or bundle.
That argues for:
Build a derived corpus from local artifacts, likely under target/context/ or similar. Each record should represent one searchable unit, not one giant schema blob.
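As a sketch, the build step could walk inspect-style table profiles and emit one JSON record per searchable unit. The shape of the input (a mapping of table name to column list), the function name, and the manifest contents are illustrative assumptions, not a fixed contract:

```python
import json
from pathlib import Path

def build_corpus(profiles, out_dir="target/context"):
    """Sketch of a corpus build: one JSON record per searchable unit.

    `profiles` is assumed to be a dict of table name -> list of column
    names, standing in for whatever inspect.json actually provides.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    records = []
    # Sort tables so the corpus is byte-stable across rebuilds.
    for table, columns in sorted(profiles.items()):
        records.append({"kind": "table", "id": table, "title": table})
        for col in columns:
            records.append({"kind": "column", "id": f"{table}.{col}", "title": col})
    # JSON Lines: one record per line, easy to grep and stream.
    with open(out / "corpus.jsonl", "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    # A small manifest makes staleness checks cheap.
    with open(out / "manifest.json", "w") as f:
        json.dump({"record_count": len(records)}, f)
    return len(records)
```

The per-line layout is what makes each unit independently searchable, in contrast to the single schema blob the current stack produces.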
Good record kinds for M2:
- table
- column
- relationship
- doc
- metric or definition only if we already have a cheap local source

Each record should carry:
- kind
- id
- title

M2 search should be deterministic and local. A field-weighted lexical ranker or even a plain Python scoring function over JSON records is probably enough for a first pass.
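A minimal sketch of such a scoring function, assuming records carry kind, id, title, and an optional text field; the exact fields and weights here are illustrative, not a contract:

```python
import re

# Illustrative field weights: a hit in a record's title counts more
# than a hit in its free text. These values are assumptions.
FIELD_WEIGHTS = {"title": 3.0, "id": 2.0, "text": 1.0}

def tokenize(s):
    return re.findall(r"[a-z0-9_]+", s.lower())

def score(record, question):
    """Field-weighted lexical overlap between the question and one record."""
    q_tokens = set(tokenize(question))
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        hits = q_tokens & set(tokenize(record.get(field, "")))
        total += weight * len(hits)
    return total

def search(corpus, question, top_k=5):
    """Deterministic ranking: score descending, then break ties by id."""
    ranked = sorted(corpus, key=lambda r: (-score(r, question), r["id"]))
    return ranked[:top_k]
```

Because ties are broken by record id, two runs over the same corpus always return the same ordering, which is what makes eval comparisons reproducible.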
Recommended ranking signals:
Why not embeddings first:
This is the most important part. Search results are not the prompt. The bundle is the prompt input.
The isolation step should:
The output should be a question-scoped bundle with:
These are useful, but they should not be default prompt payload unless they help answer the specific question.
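As an illustration of what the isolation step's caps could look like, a hypothetical make_bundle might take ranked search hits and trim them to a question-scoped working set. The field names and default limits are assumptions:

```python
def make_bundle(question, ranked_records, max_tables=3, max_columns_per_table=8):
    """Sketch of isolation: turn ranked hits into a question-scoped bundle.

    The caps keep the bundle intentionally smaller than a full schema
    dump, even when the underlying metadata is rich.
    """
    tables = []
    for rec in ranked_records:
        # Only promote table records into the bundle; other kinds
        # (column, doc, relationship) would be attached selectively.
        if rec.get("kind") != "table":
            continue
        if len(tables) >= max_tables:
            break
        tables.append({
            "id": rec["id"],
            "title": rec.get("title", rec["id"]),
            # Cap columns so one wide table cannot flood the prompt.
            "columns": rec.get("columns", [])[:max_columns_per_table],
        })
    return {"question": question, "tables": tables}
```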
Recommended command set:
dft context build
Build or refresh the derived local corpus from inspect.json, dbt metadata, and lightweight docs.
dft context search "<question>"
Return ranked matches in text or JSON. Good for broad recall and human inspection.
dft context show <id>
Show one record in detail, for example a table, column, or relationship node.
dft context bundle "<question>"
Produce the narrow working set intended for SQL generation. This is the main M2 output.
dft context bundle "<question>" --save
Persist the bundle to disk so evals or generators can consume it reproducibly.
Useful optional flags:
- --json
- --top-k
- --max-tables
- --max-columns-per-table
- --include-relationships
- --source

Implementation note for M2:
The likely future tool surface is:
- search_context(question, ...)
- show_context(id)
- bundle_context(question, ...)

But these should be wrappers over the CLI/core library, not a second implementation.
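A sketch of that rule: the tool handlers stay thin and delegate to one shared core object. Here context_core is a hypothetical module-level API that the CLI would also call; the tool layer only adapts arguments and returns results:

```python
def make_tools(context_core):
    """Build tool handlers as thin wrappers over the shared core.

    `context_core` is assumed to expose search/show/bundle, the same
    entry points the CLI commands would use. No retrieval logic lives
    here, so the CLI and the tool surface cannot drift apart.
    """
    def search_context(question, top_k=10):
        return context_core.search(question, top_k=top_k)

    def show_context(record_id):
        return context_core.show(record_id)

    def bundle_context(question, **limits):
        return context_core.bundle(question, **limits)

    return {
        "search_context": search_context,
        "show_context": show_context,
        "bundle_context": bundle_context,
    }
```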
For most M2 flows, the outer orchestrator should call bundle first, then pass only that narrowed context to SQL generation.
This is the simplest path and probably the default one.
If the question is vague or maps to multiple table families, an agent will want:
search_context("monthly revenue by plan")

This is why search and bundle should be separate commands.
Because some current projects are still small, we should allow a simple fallback:
Suggested path:
- target/context/corpus.jsonl
- target/context/manifest.json

Why:
Suggested path:
- target/context/bundles/<slug>.json

The bundle should be deterministic enough that eval runs can log it and compare full-context versus narrowed-context generation fairly.
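One way to make the on-disk path deterministic is a slug derived from the question plus a short content hash, so similar questions do not collide. This scheme, and the function names, are illustrative assumptions:

```python
import hashlib
import json
import re
from pathlib import Path

def bundle_path(question, base="target/context/bundles"):
    """Derive a stable path for a question's bundle.

    The slug keeps filenames readable; the hash suffix keeps distinct
    questions with similar wording from colliding.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", question.lower()).strip("-")[:40]
    digest = hashlib.sha256(question.encode()).hexdigest()[:8]
    return Path(base) / f"{slug}-{digest}.json"

def save_bundle(question, bundle, base="target/context/bundles"):
    path = bundle_path(question, base)
    path.parent.mkdir(parents=True, exist_ok=True)
    # sort_keys makes the serialized bundle byte-stable across runs,
    # so eval logs can diff full-context vs narrowed-context fairly.
    path.write_text(json.dumps(bundle, sort_keys=True, indent=2))
    return path
```

The same question always maps to the same file, which is what lets evals and generators consume a saved bundle reproducibly.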
This initiative should be measured by more than subjective prompt size reduction.
Useful metrics:
This is where it should connect to the existing benchmark/evals work rather than inventing a separate measurement stack.
For M2, the best design is:
- inspect.json + dbt metadata
- build/search/show/bundle as file/CLI commands

That gives us a real retrieval-and-narrowing layer without overcommitting to infrastructure that our current schema sizes do not yet require.