Dataface Tasks

AI_CONTEXT beta health and readiness scorecard

ID: M1-AICONTEXT-007
Status: completed
Priority: p0
Milestone: m1-ft-analytics-analyst-pilot
Owner: data-ai-engineer-architect

Problem

When an analyst asks an AI agent to write a query or build a dashboard, the quality of the result depends entirely on the context the agent received — column descriptions, semantic types, relationship mappings, grain info. Right now there's no way to know whether that context is actually good for a given warehouse. Did the profiler detect the right semantic types? Are the dbt descriptions making it through the merge pipeline? Are the relationship edges sensible or garbage? The only signal is "the agent wrote a bad query" — by which point you're debugging the wrong layer. We need a way to evaluate context quality directly, against the actual warehouse the pilot analysts use.

Context

  • The catalog() MCP tool returns enriched table metadata: columns with semantic types, roles, descriptions (via priority merge), grain candidates, and relationship edges.
  • format_schema_context() produces the text that gets injected into AI agent system prompts.
  • The description merge engine selects from dbt_schema_yml > database_comment > curated > inferred sources.
  • The relationship detector produces confidence-scored join candidates.
  • There is currently no eval framework — the M2 agent eval loop task exists but is P1/not_started.
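For concreteness, here is a hypothetical sketch of one table entry in the enriched metadata that catalog() returns. Every field name below is an assumption inferred from the bullets above, not the real dataface schema:

```python
# Illustrative shape of one catalog() table entry (field names are guesses).
orders = {
    "name": "analytics.orders",
    "description": "One row per order.",
    "description_source": "dbt_schema_yml",  # provenance from the priority merge
    "grain_candidates": ["order_id"],
    "columns": [
        {"name": "order_id", "semantic_type": "id", "role": "primary_key",
         "description": "Surrogate key."},
        {"name": "customer_email", "semantic_type": "email", "role": "attribute",
         "description": None},  # a description gap the audit should flag
    ],
    "relationships": [
        # confidence-scored join candidate from the relationship detector
        {"to_table": "analytics.customers", "on": ("customer_id", "id"),
         "confidence": 0.92, "is_recommended": True},
    ],
}
```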

Possible Solutions

1. Agent eval suite that tests context quality through query outcomes

Run a set of "golden" analyst questions against the pilot warehouse and measure whether the agent produces correct SQL. If the agent joins wrong tables, uses wrong columns, or misunderstands grain — that's a context quality failure. This is the most realistic signal because it measures what actually matters: does the context produce good agent behavior?

Trade-off: Expensive to set up (need golden questions + expected SQL/results), slow to run, conflates context quality with prompt quality and model capability. Hard to attribute failures.

2. Direct context audit against profiler output

Write a script that runs catalog() against the pilot BigQuery warehouse and produces a structured report:

  • Coverage: % of tables with descriptions, % of columns with semantic types, % of tables with grain detected, % of table pairs with relationship edges
  • Source quality: how many descriptions came from dbt_schema_yml vs inferred vs legacy (dbt-sourced = good, inferred-only on a dbt project = gap)
  • Relationship sanity: for known joins in the pilot schema, did the detector find them? Any false positives with is_recommended=True?
  • Obvious errors: columns with semantic_type=email that are actually IDs, tables with grain=daily that are actually event-level, etc.

This is essentially catalog() | audit() — a function that takes the existing MCP output and flags gaps. It's fast to build because all the data is already there. Output is a markdown report with pass/fail per table. Run it weekly (or on every inspect run) against the pilot warehouse.

The key metric is: what % of pilot tables have context good enough that an agent could write correct SQL against them? That's the go/no-go number.
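The coverage metrics above reduce to a few ratios over the catalog() output. A minimal sketch, assuming the illustrative field names (description, columns, semantic_type, grain_candidates, relationships) rather than the real schema:

```python
from __future__ import annotations


def coverage(tables: list[dict]) -> dict[str, float]:
    """Compute coverage percentages over catalog()-style table dicts.

    Field names are assumptions for illustration, not the actual
    dataface profiler output.
    """
    n = len(tables) or 1
    cols = [c for t in tables for c in t.get("columns", [])]
    n_cols = len(cols) or 1
    return {
        # % of tables with a non-empty description
        "table_descriptions": 100 * sum(bool(t.get("description")) for t in tables) / n,
        # % of columns with a detected semantic type
        "column_semantic_types": 100 * sum(bool(c.get("semantic_type")) for c in cols) / n_cols,
        # % of tables with at least one grain candidate
        "grain_detected": 100 * sum(bool(t.get("grain_candidates")) for t in tables) / n,
        # % of tables with at least one relationship edge
        "relationship_edges": 100 * sum(bool(t.get("relationships")) for t in tables) / n,
    }
```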

3. Manual spot-check with analyst sessions

Have pilot analysts use the MCP tools for a week, collect their feedback on where the agent got confused, and trace failures back to context gaps. No tooling needed.

Trade-off: Slow, not repeatable, depends on analyst availability. Good as a complement but not a primary signal.

Plan

  • Build audit_context() function that runs catalog against a connection and produces a structured quality report
  • Define pass/fail thresholds for pilot readiness (e.g., >80% tables with descriptions, >90% columns with semantic types, all known FK relationships detected)
  • Run against pilot BigQuery warehouse, fix the gaps it surfaces
  • Make it runnable as dft inspect audit so it can be re-run as context improves

Implementation Progress

Decisions

  • Chose Solution 2 (direct context audit against profiler output).
  • Location: dataface/ai/audit.py — audits profiler dicts without requiring a live DB connection.
  • CLI: dft inspect audit reads target/inspect.json and produces a scorecard.
  • Output formats: terminal (rich markdown), markdown file, JSON.
  • Default thresholds: 80% table descriptions, 70% column semantic types, 60% grain detection, 50% relationship coverage. All overridable via CLI flags.
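The default thresholds and their override behavior can be sketched as a small dataclass. This is an illustrative shape, not the actual dataface implementation; only the default percentages come from the decisions above:

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    """Pass/fail thresholds; each field maps to a --threshold-* CLI flag."""
    table_descriptions: float = 80.0
    column_semantic_types: float = 70.0
    grain_detection: float = 60.0
    relationship_coverage: float = 50.0

    def check(self, metrics: dict[str, float]) -> dict[str, bool]:
        # Compare each measured percentage against its threshold;
        # a missing metric counts as 0.0 and therefore fails.
        return {
            name: metrics.get(name, 0.0) >= value
            for name, value in vars(self).items()
        }
```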

Deliverables

  • dataface/ai/audit.py — TableAudit (per-table), AuditReport (aggregate), audit_context() entry point, to_markdown() renderer.
  • dataface/cli/commands/inspect.py — audit_command() wired to dft inspect audit.
  • dataface/cli/main.py — CLI registration with --input, --output, --format, --threshold-* options.
  • tests/ai/test_context_audit.py — 28 tests covering TableAudit, AuditReport, custom thresholds, markdown rendering, and golden fixture integration.

Metrics audited

  1. Coverage: % tables with descriptions, % columns with semantic types, % tables with grain, % tables with relationships.
  2. Source quality: description source breakdown (dbt_schema_yml / database_comment / curated / inferred / legacy).
  3. Relationship sanity: per-relationship warnings, low-confidence flagging, recommended vs total counts.
  4. Pass/fail: each metric checked against configurable thresholds; exit code 1 on failure.
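The exit-code rule in item 4 (and the Round 2 fix that made it apply to every output format) amounts to one check over the per-metric results. A sketch with a hypothetical function name:

```python
def audit_exit_code(results: dict[str, bool]) -> int:
    """Return 0 when every metric clears its threshold, 1 otherwise.

    Illustrative only; the real CLI would end with something like
    sys.exit(audit_exit_code(report.pass_fail)) regardless of whether
    the output format is terminal, markdown, or JSON.
    """
    return 0 if all(results.values()) else 1
```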

Test results

  • 30/30 audit tests pass.
  • Full suite: 2351 passed, 48 skipped, 3 xfailed.
  • Ruff, Black, Mypy all clean.

Review Feedback

  • Round 1: Fixed silent PASS on empty profiles, added format validation, removed redundant imports.
  • Round 2: Fixed exit code bug (json/markdown formats exiting 0 on failure), eliminated all # type: ignore comments, inlined render helper.
  • Round 3: Reduced audit_context from 69 to 35 lines by extracting _compute_metrics and _count_description_sources.
  • Round 4: Approved. Fixed remaining MEDIUM items (bare dict type annotation, redundant test import).

  • [x] Review cleared