Dataface Tasks

Add persistent analyst memories and learned context

ID: MCP_ANALYST_AGENT-ADD_PERSISTENT_ANALYST_MEMORIES_AND_LEARNED_CONTEXT
Status: completed
Priority: p2
Milestone: m2-internal-adoption-design-partners
Owner: data-ai-engineer-architect
Initiative: ai-quality-experimentation-and-context-optimization
Completed by: dave
Completed: 2026-03-18

Problem

Design and implement a memories file that accumulates knowledge from analyst queries — table quirks, column semantics, common join patterns, gotchas, and domain-specific conventions. Currently the Dataface AI system prompt is fully static (schema context + skills + yaml cheatsheet). There is no mechanism to learn from past interactions. Evaluate different memory strategies (append-only log, structured knowledge base, embedding-indexed retrieval) and measure their impact on SQL quality using the eval system. The goal is that the agent gets better at a specific warehouse over time, not just on each conversation.

Context

Current state: fully static system prompt

The Dataface AI system prompt is assembled in dataface/ai/agent.py:build_agent_system_prompt():

  • Workflow skill (prompts/workflow.md)
  • Dashboard design skill (prompts/dashboard_design.md)
  • Dashboard review skill
  • YAML cheatsheet
  • Schema context from get_schema_context()
  • Tool guidance

There is zero per-project or per-warehouse learned context. Every conversation starts from scratch. The agent has no memory of:

  • Table quirks ("the created_at column in stg_salesforce__accounts is actually modified_at")
  • Column semantics ("revenue means ARR, not MRR, in this warehouse")
  • Common join patterns ("always join opportunities through accounts, not directly to contacts")
  • Gotchas ("the is_deleted soft-delete flag must be filtered in staging tables")
  • Domain conventions ("fiscal year starts in February")

Why this matters

An analyst who uses the same warehouse daily builds up this knowledge implicitly. An AI agent should too. The memories file is the mechanism — a persistent, growing knowledge base that gets injected into the system prompt alongside schema context.

Research: how production systems handle agent memory

The industry has converged on a multi-tier memory architecture modeled on human cognition:

Tier 1: Working memory (context window) — the current conversation. Already exists.

Tier 2: Semantic/procedural memory (knowledge base) — generalized facts, rules, and learned skills. Persists across sessions. This is what we need.

Tier 3: Episodic memory (event logs) — timestamped records of specific interactions. Useful for "what did we do last time?" questions. Optional for now.

Key systems to study before implementing:

  • Mem0 (mem0.ai) — pluggable memory layer that integrates with any agent. Extracts structured facts from conversations and stores them as embeddings plus an optional knowledge graph. 48K GitHub stars, $24M Series A. Key insight: extract and structure facts rather than storing raw conversations; newer information is timestamp-weighted over duplicate facts.

  • Letta/MemGPT — full agent runtime with three-tier memory: Core (in-context RAM), Recall (conversation history), Archival (cold storage). Agents actively self-edit their own memory blocks. 83% on LoCoMo benchmark. Key insight: the agent should manage its own memory, not have memory passively extracted.

  • Cursor's approach — uses .cursor/rules/ directory with structured .mdc files organized by domain. Project-level, not conversation-level. This is closest to what we'd want: a per-project knowledge base that's part of the repo.

  • Claude-mem — auto-captures tool usage, generates semantic summaries. 95% token savings across sessions. Key insight: selective forgetting matters — don't try to retain everything.

What the research says matters most:

  1. Extract structured facts, not raw transcripts
  2. Selective retrieval — inject only relevant memories, not everything
  3. Timestamp and freshness — newer knowledge overrides older
  4. The agent should be able to write/update its own memories
  5. Human memory works by selective forgetting — production systems should too
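The freshness principle (newer knowledge overrides older) can be sketched as a small dedup pass over the structured entries. This is an illustration only: keying duplicates on (type, table, column) is an assumption about what counts as "the same fact", and ISO date strings are compared lexicographically, which matches chronological order.

```python
# Toy sketch of "newer knowledge overrides older": when two entries describe
# the same key, keep the one with the later created date. The (type, table,
# column) key is an assumed definition of "duplicate", not a shipped rule.
def dedupe_by_freshness(memories: list[dict]) -> list[dict]:
    """Keep only the newest entry per (type, table, column) key."""
    newest: dict[tuple, dict] = {}
    for m in memories:
        key = (m.get("type"), m.get("table"), m.get("column"))
        # ISO dates ("2026-03-17") compare correctly as plain strings.
        if key not in newest or m.get("created", "") > newest[key].get("created", ""):
            newest[key] = m
    return list(newest.values())
```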

Strategy recommendation: structured YAML, graduating to retrieval

Start with the structured-knowledge-base strategy (structured YAML): a memories.yml file in the dft project (alongside dataface.yml), holding typed entries with tags:

```yaml
memories:
  - type: column_semantic
    table: bi_core.accounts
    column: revenue
    note: "This is ARR, not MRR"
    created: 2026-03-17

  - type: join_pattern
    tables: [bi_core.opportunities, bi_core.accounts]
    note: "Always join opportunities through accounts, not directly to contacts"
    created: 2026-03-17

  - type: gotcha
    table: staging.stg_salesforce__accounts
    note: "created_at is actually modified_at — use _fivetran_synced for true creation"
    created: 2026-03-17

  - type: domain
    note: "Fiscal year starts in February"
    created: 2026-03-17
```

Why YAML over markdown: it is barely harder to author, but it gives enough structure for filtering. It can still be dumped verbatim into the prompt as a starting approach, and it makes it trivial to auto-generate memories from inspector output.
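The "dump verbatim" starting approach can be sketched in a few lines. Assumptions are flagged in comments: the helper names (load_memories, format_memory_block) and the "## Learned Context" heading are illustrative, not the real dataface module API; the entry fields follow the memories.yml example above.

```python
# Hypothetical sketch of loading memories.yml and dumping it into the prompt.
# Helper names and the section heading are illustrative assumptions.
from pathlib import Path

import yaml  # PyYAML


def load_memories(path: Path) -> list[dict]:
    """Read memories.yml, dropping malformed entries and entries with no note."""
    if not path.exists():
        return []
    data = yaml.safe_load(path.read_text()) or {}
    entries = data.get("memories", [])
    return [m for m in entries if isinstance(m, dict) and m.get("note")]


def format_memory_block(entries: list[dict]) -> str:
    """Render entries as a compact prompt section, one line per memory."""
    lines = ["## Learned Context"]
    for m in entries:
        # Scope label: single table, joined table list, or project-wide.
        scope = m.get("table") or ", ".join(m.get("tables", [])) or "project"
        lines.append(f"- [{m.get('type', 'note')}] {scope}: {m['note']}")
    return "\n".join(lines)
```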

Graduate to retrieval when the file gets large (>100 entries or >4K tokens). At that point, embed the entries and retrieve top-k relevant ones per question. Mem0 or a simple local embedding index. But don't build this until you've proven memories help at all via eval experiments.

How memories get created

  • Manually by analysts ("remember that fiscal year starts in February" → agent writes to memories.yml)
  • Automatically from self-corrections (agent corrects a query → captures the correction as a memory)
  • From inspector/profile output (semantic types, distributions that are non-obvious)
  • The agent should have a tool to read/write memories — a remember MCP tool

Dependencies

  • Depends on the eval system being operational (runner + benchmark) to measure impact of different memory strategies
  • The memory injection point is build_agent_system_prompt() in dataface/ai/agent.py

Possible Solutions

  1. Recommended: structured YAML memory ledger with bounded prompt injection. Keep a per-project memories.yml alongside dataface.yml and store typed entries for column semantics, join patterns, gotchas, and domain rules. Include metadata such as scope, type, source, confidence, created, updated, and status so the store is reviewable and can support selective injection. build_agent_system_prompt() reads the store, filters by warehouse/source/table/question context, and injects only a bounded top slice into the prompt. This is the best first step because it is deterministic, diffable, easy to author manually, and easy to validate with evals before adding retrieval complexity.
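The "filters by warehouse/source/table/question context and injects only a bounded top slice" step of the recommended option can be sketched as a scope filter plus a cap. Field names (status, table, tables) follow the metadata listed above; the cap value and the in_scope rule are assumptions.

```python
# Sketch of bounded injection for the recommended option: keep active entries
# that are project-wide or mention an in-scope table, then cap the slice.
# The cap default is arbitrary, not a measured prompt budget.
def select_for_prompt(
    memories: list[dict], tables_in_scope: set[str], cap: int = 20
) -> list[dict]:
    """Return active entries that apply to the current table scope, capped."""
    def in_scope(m: dict) -> bool:
        if m.get("status", "active") != "active":
            return False  # stale entries are marked inactive, never injected
        tables = {m["table"]} if m.get("table") else set(m.get("tables", []))
        # No table scope means a project-wide memory: always eligible.
        return not tables or bool(tables & tables_in_scope)

    return [m for m in memories if in_scope(m)][:cap]
```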

  2. Structured store + retrieval layer once the file grows. Keep the YAML as the source of truth, then add tag- and embedding-based retrieval when the store gets large enough to exceed the prompt budget. This preserves the simple authoring model while letting us graduate to semantic lookup later. It is the right long-term shape, but it is too much machinery to build before we have proof that memory improves SQL quality.

  3. Append-only episodic log only. Capture every correction or analyst note verbatim and search it later. This is useful as an audit trail, but it is weak as a primary memory format because it is noisy, hard to curate, and expensive to inject into prompts. It should be an auxiliary record at most, not the main mechanism.

Plan

Selected approach

Use a structured YAML memory ledger as the first implementation, with bounded prompt injection and explicit freshness metadata. Treat it as semantic/procedural memory, not raw transcript storage. Do not build retrieval until the evals show that memory is helping and the prompt budget becomes a real constraint.

Phased implementation

  1. Phase 0: define the contract. Finalize the memory schema, the allowed entry types, and the file location for the project-level memory store. Confirm how a memory is scoped to a warehouse/source/table and how stale entries are marked inactive instead of deleted.
  2. Phase 1: read-only injection. Add a small helper that loads memories.yml, filters/normalizes entries, and appends a compact memory block in build_agent_system_prompt() in dataface/ai/agent.py. Keep this read-only and deterministic so we can test the effect before adding write paths.
  3. Phase 2: write path and capture rules. Add a future remember MCP tool so the agent can persist a new memory or update an existing one when it learns a stable fact. Gate this behind clear rules so we only capture durable facts, not transient conversation content.
  4. Phase 3: retrieval and pruning. When the store grows beyond the prompt budget, add ranking/retrieval over the structured store, then prune or down-rank stale entries. Preserve the YAML as the canonical source of truth even if retrieval gets more sophisticated.
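One possible shape for the Phase 0 contract is a typed entry record. The field names mirror the metadata listed under the recommended solution (scope, type, source, confidence, created, updated, status); the defaults and allowed values are assumptions to be finalized in Phase 0, not a settled schema.

```python
# Hypothetical Phase 0 memory contract as a dataclass. Defaults and allowed
# values are assumptions pending the actual schema decision.
from dataclasses import dataclass


@dataclass
class Memory:
    type: str                 # column_semantic | join_pattern | gotcha | domain
    note: str
    scope: str = "project"    # warehouse/source/table the entry applies to
    source: str = "analyst"   # analyst | self_correction | inspector
    confidence: str = "medium"
    created: str = ""         # ISO date
    updated: str = ""
    status: str = "active"    # stale entries flip to "inactive", not deleted
```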

Likely file/tool touchpoints

  • dataface/ai/agent.py build_agent_system_prompt() for memory injection
  • a new helper under dataface/ai/ for loading, filtering, and formatting memories
  • dataface/ai/tool_schemas.py and dataface/ai/tools.py for a future remember MCP tool and any companion read tool
  • dataface/ai/mcp/tools.py for the underlying persistence implementation once writes are allowed
  • a project-level memories.yml next to dataface.yml in the dft project
  • task/eval harnesses under apps/evals/ and tests/evals/ for memory ablations and regression checks

Implementation Progress

  • Reviewed the task goal and the prompt assembly path in dataface/ai/agent.py.
  • Confirmed the current agent system prompt is still static: workflow skill, dashboard skills, YAML cheatsheet, schema context, and tool guidance only.
  • Checked the tool plumbing in dataface/ai/tool_schemas.py and dataface/ai/tools.py so the future remember tool can slot into the existing OpenAI/MCP dispatch path cleanly.
  • Chosen a concrete first design: structured YAML memory ledger with bounded injection, not a raw transcript log.
  • Implemented Phase 1 read-only injection on this branch.
  • Added dataface/ai/memories.py to load memories.yml from the project base directory, filter inactive/empty entries, and format a bounded ## Learned Context block for the system prompt.
  • Updated build_agent_system_prompt() to append the learned-context block after schema context when a project-scoped memory file exists.
  • Added targeted tests for memory loading/formatting and prompt injection.
  • Shipped the Phase 1 milestone: a persistent project memory file with bounded prompt injection, which closes the core task of adding learned project context to the agent.
  • Follow-on improvements such as write-path automation, retrieval/ranking, and larger eval studies should be treated as separate optimization tasks rather than blockers for this initial delivery.

QA Exploration

  • [x] QA exploration completed (or N/A for non-UI tasks)

N/A — this is core AI infrastructure, not UI. Validation is via eval experiments measuring memory impact.

Review Feedback

  • [x] Review cleared