Decision: Build the question-aware retrieval layer as a local corpus + CLI flow first, not as an always-on service or mandatory MCP tool.
Rationale: Keeps the first version simple, debuggable, and aligned with current project scale.
Consequence: Runtime tool use is deferred, but the core retrieval contract must still be reusable later.
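A minimal sketch of the local CLI flow this decision implies: a thin argparse wrapper over a plain retrieval function, so the same core stays reusable when a service or tool wraps it later. All names here (`retrieve`, `--question`, `--top-k`) are illustrative assumptions, not the real interface.

```python
import argparse
import json

def retrieve(question: str, top_k: int = 5) -> list[dict]:
    # Placeholder core; the real version would search the local corpus.
    # Kept as a plain function so a future service/tool can call it directly.
    return [{"question": question, "rank": i + 1} for i in range(top_k)]

def main(argv=None):
    parser = argparse.ArgumentParser(description="Question-aware retrieval (local CLI flow)")
    parser.add_argument("--question", required=True)
    parser.add_argument("--top-k", type=int, default=5)
    args = parser.parse_args(argv)
    # Print JSON so downstream scripts can consume the output directly.
    print(json.dumps(retrieve(args.question, args.top_k), indent=2))

if __name__ == "__main__":
    main()
```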
ADR-002: Separate retrieval from isolation
Status: Proposed
Decision: Search results are not passed directly to generation. The system must have an explicit isolation step that produces a question-scoped bundle.
Rationale: This is the core lesson from the gap analysis and from systems like LinkAlign.
Consequence: We need both a ranked-search contract and a bundle contract.
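The two contracts could be sketched as dataclasses, with the isolation step as an explicit function between search and generation. Field names and the filtering rule are assumptions for illustration, not the final schema.

```python
from dataclasses import dataclass, field

@dataclass
class RankedHit:
    # One result from the ranked-search contract.
    doc_id: str
    score: float
    kind: str  # e.g. "table", "column", "doc"

@dataclass
class ContextBundle:
    # Question-scoped bundle produced by the isolation step.
    question: str
    records: list
    notes: list = field(default_factory=list)

def isolate(question: str, hits: list[RankedHit], min_score: float = 0.5) -> ContextBundle:
    # Explicit isolation step: raw search hits never flow straight to
    # generation; only hits above a threshold enter the bundle.
    kept = [h for h in hits if h.score >= min_score]
    return ContextBundle(question=question, records=kept)
```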
ADR-003: Reuse existing inspect/dbt artifacts instead of inventing new raw metadata sources
Status: Proposed
Decision: M2 builds on target/inspect.json, dbt schema metadata, and lightweight local docs before expanding to broader external context sources.
Rationale: We already have enough local metadata to improve narrowing meaningfully.
Consequence: M2 focuses on shaping and retrieving context, not on a new ingestion platform.
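Reusing existing artifacts might look like the sketch below: flatten table metadata from a local JSON file into simple records for the corpus. The path and the assumed record shape (a top-level `"tables"` list) are illustrative, not the documented format of target/inspect.json.

```python
import json
from pathlib import Path

def load_inspect_tables(path: Path) -> list[dict]:
    # Flatten whatever table metadata exists into simple records,
    # shaping existing context rather than ingesting new sources.
    data = json.loads(path.read_text())
    # Assumed shape: {"tables": [{"name": ..., "columns": [{"name": ...}, ...]}]}
    return [
        {"table": t.get("name"), "columns": [c.get("name") for c in t.get("columns", [])]}
        for t in data.get("tables", [])
    ]
```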
ADR-004: Use deterministic lexical ranking as the default M2 search path
Status: Proposed
Decision: Start with field-weighted lexical/deterministic ranking rather than embeddings or hybrid vector search as the default.
Rationale: Current schema sizes are manageable, exact name matching is highly valuable, and deterministic scoring is easier to debug and validate.
Consequence: Embeddings may be added later if recall is insufficient, but they are not the baseline contract.
ADR-005: Explicitly defer speed and indexing optimization
Status: Proposed
Decision: M2 will not optimize for search performance. A simple Python function over local JSON/JSONL artifacts is acceptable if it narrows context well for the agent.
Rationale: The current corpus sizes are small enough, and the main risk is overbuilding retrieval infrastructure before proving the narrowing contract is useful.
Consequence: If retrieval quality proves useful and corpus sizes grow, optimization can be a later follow-up rather than a design constraint now.
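The "simple Python function over local JSONL" acceptable under this decision could be as plain as a linear scan with no index or cache; the path and record shape are assumptions.

```python
import json
from pathlib import Path

def scan_corpus(path: Path, predicate) -> list[dict]:
    # Deliberately unoptimized: read every record, keep the matches.
    # At current corpus sizes this is fast enough; indexing is deferred.
    hits = []
    with path.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if predicate(record):
                hits.append(record)
    return hits
```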
ADR-006: Materialize a derived corpus on disk
Status: Proposed
Decision: Build a structured derived corpus, likely JSONL plus manifest files under target/context/.
Rationale: Keeps the retriever inspectable, rebuildable, and easy to consume from both Python and CLI paths.
Consequence: The corpus becomes a cache/build artifact, not a hand-authored source of truth.
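Materializing the corpus might look like the sketch below: JSONL records plus a small manifest, written into a directory that can be deleted and rebuilt at any time. File names and manifest fields are illustrative assumptions.

```python
import json
from pathlib import Path

def write_corpus(records: list, out_dir: Path) -> Path:
    # Write the derived corpus as JSONL plus a manifest. The output is a
    # rebuildable build artifact, never a hand-edited source of truth.
    out_dir.mkdir(parents=True, exist_ok=True)
    corpus_path = out_dir / "corpus.jsonl"
    with corpus_path.open("w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
    manifest = {"files": ["corpus.jsonl"], "record_count": len(records)}
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return corpus_path
```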
ADR-007: Bundle output must support both structured and text consumers
Status: Proposed
Decision: The question-scoped bundle should include structured records plus a compact text rendering.
Rationale: Current generation code already expects text context; a text form avoids forcing a simultaneous rewrite of every consumer.
Consequence: The bundle contract must preserve enough structure for future tooling while remaining easy to inject into prompts now.
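The dual-consumer requirement could be met by keeping structured records as the source and deriving the text form from them, as in this sketch; the record fields are assumptions.

```python
def render_bundle_text(records: list) -> str:
    # Compact text rendering for consumers that expect prompt text.
    # The structured records stay authoritative; text is derived, not stored.
    lines = []
    for r in records:
        cols = ", ".join(r.get("columns", []))
        lines.append(f"table {r['table']}: {cols}")
    return "\n".join(lines)
```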
ADR-008: Defer runtime search_context tool exposure until after the M2 contract stabilizes
Status: Proposed
Decision: Do not make a new agent tool the primary M2 deliverable. Instead, design the retrieval engine so a future tool can wrap it directly.
Rationale: Tool exposure before the retrieval contract is stable creates two moving parts at once.
Consequence: M2 should still define the likely future tool shapes and keep CLI/core logic reusable.
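One way to keep the core reusable is to define the likely tool shape as a protocol now, with the future tool as a thin wrapper; the names and response envelope below are hypothetical.

```python
from typing import Protocol

class ContextSearcher(Protocol):
    # Contract the future runtime tool would wrap directly.
    def search(self, question: str, top_k: int) -> list[dict]: ...

def search_context_tool(searcher: ContextSearcher, question: str, top_k: int = 5) -> dict:
    # Tool-facing wrapper: no new logic, just a tool-shaped envelope
    # around the same core function the CLI path uses.
    return {"question": question, "results": searcher.search(question, top_k)}
```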
ADR-009: Keep full-schema prompting as a fallback
Status: Proposed
Decision: Retrieved-and-isolated context becomes a preferred path, not an all-or-nothing hard dependency.
Rationale: Some current projects are small enough that full schema context is still acceptable, and we need a safe fallback while the retriever matures.
Consequence: Consumers need a clear fallback rule when corpus or bundle generation is unavailable.
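The fallback rule could be as simple as the sketch below: prefer the isolated bundle, fall back to full-schema text when the bundle is unavailable, and report which path was taken so consumers can log it. Names are illustrative.

```python
def choose_context(bundle_text, full_schema_text: str) -> tuple[str, str]:
    # Return (context, source) so consumers can log which path was used.
    if bundle_text:  # bundle generated successfully: preferred path
        return bundle_text, "bundle"
    return full_schema_text, "full_schema"  # safe fallback while the retriever matures
```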
ADR-010: Measure retrieval quality through downstream and retrieval-specific signals
Status: Proposed
Decision: Judge the system by both retrieval-specific metrics (top-k table/column hit rate, bundle inclusion) and downstream text-to-SQL quality.
Rationale: Retrieval that looks good in isolation but fails to help generation is not enough.
Consequence: This initiative should connect to existing eval work rather than inventing a disconnected success metric.
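A sketch of one retrieval-specific signal named above, top-k table hit rate against gold labels; the input shapes (per-question ranked table lists and a parallel gold list) are assumptions about the eval format.

```python
def top_k_hit_rate(predictions: list, gold: list, k: int = 5) -> float:
    # Fraction of questions whose gold table appears in the top-k results.
    # A retrieval-only signal, to be read alongside downstream SQL quality.
    hits = 0
    for ranked, gold_table in zip(predictions, gold):
        if gold_table in ranked[:k]:
            hits += 1
    return hits / len(gold) if gold else 0.0
```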