Dataface Tasks

Build question-aware schema search and isolation CLI over inspect.json and dbt metadata

ID: CONTEXT_CATALOG_NIMBLE-BUILD_QUESTION_AWARE_SCHEMA_SEARCH_AND_ISOLATION_CLI_OVER_INSPECT_JSON_AND_DBT_METADATA
Status: not_started
Priority: p1
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: question-aware-schema-retrieval-and-narrowing

Problem

Build a local file-and-CLI retrieval layer that composes inspect.json, dbt schema metadata, and lightweight docs into a searchable corpus, supports search/show/bundle commands, and returns a narrow working set for an AI question instead of a full schema dump.

Context

Today the project has rich metadata but no question-aware retrieval stage:

  • target/inspect.json stores profiled table metadata
  • catalog() can browse or deep-profile tables
  • format_schema_context() can render compact text
  • text-to-SQL paths still mostly consume broad schema context rather than a narrow working set

This task should create the local retrieval substrate that sits between "all known metadata" and "the prompt the generator actually sees."

Important constraints:

  • keep it file/CLI-first for M2
  • use deterministic ranking first
  • reuse existing inspect.json and dbt metadata
  • produce a reusable bundle artifact, not just ad hoc terminal output
  • do not spend time optimizing speed or building fancy indexes unless the simple version clearly fails

Possible Solutions

  1. Recommended: derived local corpus + deterministic search + bundle isolation. Build a derived corpus from inspect.json, dbt metadata, and lightweight local docs. Add build, search, show, and bundle commands. Use field-weighted lexical ranking for search, then isolate a small working set for generation.

Why this is recommended:

  • matches the M2 simplicity requirement
  • separates retrieval from isolation cleanly
  • is easy to inspect and debug
  • can later back an MCP tool without rewriting the core logic

  2. Make agents call catalog() repeatedly and let the model browse manually.

Trade-off: minimal engineering, but retrieval stays implicit and prompt usage stays inconsistent.

  3. Start with embeddings/vector search.

Trade-off: may help later, but adds infrastructure and opacity before we know deterministic retrieval is insufficient.

  4. Keep optimizing the existing full schema formatter instead of building retrieval.

Trade-off: easier in the short term, but it keeps retrieval and isolation implicit and does not give agents a reusable way to narrow context.
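
As a rough illustration of the field-weighted lexical ranking in the recommended option, the sketch below scores in-memory records by counting question terms per field and multiplying by a per-field weight. The record keys (id, name, description, columns), the weight values, and the function names are all assumptions made for this example; the task does not define them yet.

```python
import re

# Hypothetical field weights; the task does not specify any values.
FIELD_WEIGHTS = {"name": 3.0, "description": 1.5, "columns": 1.0}


def tokenize(text: str) -> list[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_record(question: str, record: dict) -> float:
    """Sum weight * (number of distinct question terms found) over fields."""
    terms = set(tokenize(question))
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        value = record.get(field, "")
        if isinstance(value, list):  # e.g. a list of column names
            value = " ".join(value)
        score += weight * len(terms & set(tokenize(value)))
    return score


def search(question: str, records: list[dict], limit: int = 5) -> list[tuple[float, dict]]:
    """Return (score, record) pairs, best first, dropping zero-score records."""
    scored = [(score_record(question, r), r) for r in records]
    scored = [(s, r) for s, r in scored if s > 0]
    # Sort by score descending, then by id, so ties resolve deterministically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]["id"]))
    return scored[:limit]
```

Because the scorer is a plain function over dicts, a surprising ranking can be debugged by printing the per-field overlaps, which is the inspectability argument made for this option.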

Plan

  1. Define the derived corpus record schema for tables, columns, relationships, and docs.
  2. Build a local corpus artifact from inspect.json and dbt metadata.
  3. Add CLI commands:
       - dft context build
       - dft context search "<question>"
       - dft context show <id>
       - dft context bundle "<question>"
  4. Implement deterministic field-weighted ranking. This can be a simple Python scoring function over in-memory JSON records.
  5. Implement bundle generation that:
       - keeps top tables
       - trims to top columns per table
       - includes short descriptions and relationship hints
       - emits both JSON and compact bundle_text
  6. Add focused tests for corpus build, ranking, and bundle output.
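
The bundle step in the plan could look something like the sketch below. It assumes the ranked tables arrive best-first as dicts with name, description, columns (each with a name and a score), and relationships; those keys, the default limits, and the function names build_bundle and dump_bundle are hypothetical. Only the bundle_text field name comes from the plan.

```python
import json


def build_bundle(question: str, ranked_tables: list[dict],
                 max_tables: int = 3, max_columns: int = 5) -> dict:
    """Keep the top tables, trim each to its top columns, and render text."""
    tables = []
    for table in ranked_tables[:max_tables]:
        # Highest-scoring columns first; column name breaks ties deterministically.
        columns = sorted(
            table.get("columns", []),
            key=lambda c: (-c.get("score", 0.0), c["name"]),
        )[:max_columns]
        tables.append({
            "name": table["name"],
            "description": table.get("description", ""),
            "columns": [c["name"] for c in columns],
            "relationships": table.get("relationships", []),
        })

    lines = []
    for t in tables:
        header = t["name"] + (f": {t['description']}" if t["description"] else "")
        lines.append(header)
        lines.append("  columns: " + ", ".join(t["columns"]))
        for rel in t["relationships"]:
            lines.append("  rel: " + rel)

    return {"question": question, "tables": tables, "bundle_text": "\n".join(lines)}


def dump_bundle(bundle: dict) -> str:
    """Serialize the bundle so it can be written to disk as a reusable artifact."""
    return json.dumps(bundle, indent=2, sort_keys=True)
```

Returning one dict that carries both the structured tables and the rendered bundle_text keeps the JSON artifact and the prompt text in sync, since both derive from the same trimmed list.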

Likely files

  • new retrieval module under dataface/ai/
  • CLI command wiring under dataface/cli/commands/
  • existing inspect/dbt metadata readers
  • focused tests for search and bundle behavior
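
For the CLI wiring, the command shape from the plan can be sketched with the standard library's argparse. This is only an illustration of the subcommand layout: the framework actually used under dataface/cli/commands/ is not stated here, and build_parser and the record_id destination name are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser for the planned `dft context <command>` layout (illustrative only)."""
    parser = argparse.ArgumentParser(prog="dft")
    groups = parser.add_subparsers(dest="group", required=True)

    context = groups.add_parser("context", help="question-aware schema retrieval")
    commands = context.add_subparsers(dest="command", required=True)

    commands.add_parser("build", help="build the derived corpus artifact")

    search = commands.add_parser("search", help="rank corpus records for a question")
    search.add_argument("question")

    show = commands.add_parser("show", help="print one corpus record")
    show.add_argument("record_id", metavar="id")

    bundle = commands.add_parser("bundle", help="emit a narrowed bundle for a question")
    bundle.add_argument("question")

    return parser
```

Parsing `["context", "bundle", "revenue by month"]` with this parser yields a namespace whose command is "bundle" and whose question is the quoted string, which a dispatcher could route to the retrieval module.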

Explicit anti-goals for this task

  • no vector DB
  • no special indexing service
  • no attempt to make search "fast" before proving it is useful
  • no attempt to make the scorer perfect before agents are actually using the bundle

Implementation Progress

Not started.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A - local CLI/context task.

Review Feedback

  • [ ] Review cleared