Dataface Tasks

Build a DuckDB SQL metrics pipeline over tasks/ planning data for milestone dashboards

IDM1-DFAC-003
Status: completed
Priority: p1
Milestone: m1-ft-analytics-analyst-pilot
Owner: data-analysis-evangelist-ai-training
Completed by: dave
Completed: 2026-03-24

Problem

Milestone progress tracking in tasks is manual and not queryable — status counts are maintained by hand in Markdown macros, and there is no way to run SQL against planning data or generate reproducible visualizations. This means milestone dashboards are brittle, go stale easily, and cannot be validated programmatically. The dashboard factory workstream itself lacks the kind of data-driven progress views it is supposed to enable for others.

Context

  • Source of truth is Markdown frontmatter across tasks/milestones/*.md, tasks/workstreams/*/index.md, and tasks/workstreams/*/tasks/*.md (400 tasks across 11 workstreams, 7 milestones).
  • DuckDB and PyArrow are already project dependencies (duckdb>=0.9.0, pyarrow>=14.0.1).
  • dft render supports CSV-backed queries and HTML output — inline values type is unsupported in this flow, so metrics must be emitted as CSV files.
  • The existing tasks/frontmatter.py provides parse_frontmatter() for YAML extraction.
  • just tasks serve (tasks server) is completed and available — generated HTML can be linked/embedded there. MkDocs embedding remains optional.
  • Generated artifacts go under tasks/.generated/ (git-ignored).
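The actual tasks/frontmatter.py is not reproduced in this task, so the following is a minimal stand-in for parse_frontmatter() under the assumption that task files carry flat `key: value` YAML frontmatter (a sketch, not the real implementation):

```python
# Minimal stand-in for parse_frontmatter() from tasks/frontmatter.py.
# Assumes flat "key: value" frontmatter between "---" fences; the real
# helper presumably uses a full YAML parser.
def parse_frontmatter(text: str) -> dict:
    if not text.startswith("---"):
        return {}
    try:
        _, block, _ = text.split("---", 2)
    except ValueError:
        return {}
    meta = {}
    for line in block.strip().splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

doc = """---
id: IDM1-DFAC-003
status: completed
priority: p1
---
# Body
"""
print(parse_frontmatter(doc)["status"])  # completed
```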

Possible Solutions

  1. Python script with DuckDB SQL metrics (Recommended): Single build_plan_dataset.py script parses frontmatter → Parquet, runs DuckDB SQL for rollups → CSV, then dft render produces static HTML. Minimal moving parts, uses existing project dependencies, deterministic outputs.

  2. dbt model approach: Model planning data as dbt sources and run metrics through dbt. Heavier infrastructure, requires dbt project setup for internal data that isn’t warehouse-backed.

  3. Python-only aggregation (no DuckDB): Compute metrics in pure Python. Loses SQL queryability — the whole point is making planning data SQL-addressable.

Plan

  • [x] Write TDD tests for parsing and metrics (13 tests covering parse, export, DuckDB rollups)
  • [x] Implement tasks/tools/build_plan_dataset.py with frontmatter → Parquet export
  • [x] Add DuckDB SQL metrics: milestone rollups (total/completed/pct_complete/in_progress/blocked/open) and workstream-level breakdowns
  • [x] Export metrics as both Parquet and CSV for dft render consumption
  • [x] Create tasks/dashboards/milestone_header.yml dashboard template (progress bar + task status stacked bar)
  • [x] Wire dft render into build script to produce tasks/.generated/charts/milestones/milestone_header.html
  • [x] Add just plans-data Justfile recipe
  • [x] Add tasks/.gitignore for .generated/ directory
  • [x] Verify end-to-end: just plans-data produces all artifacts from live data

Implementation Progress

  • 2026-03-23: Full pipeline implemented and tested.

Files added/changed:

| File | Description |
|------|-------------|
| tasks/tools/build_plan_dataset.py | Frontmatter → Parquet export, DuckDB SQL metrics, dft render chart generation |
| tasks/dashboards/milestone_header.yml | Dataface dashboard YAML: progress bar + task status stacked bar |
| tasks/.gitignore | Ignores .generated/ directory |
| tests/core/test_build_plan_dataset.py | 13 TDD tests covering parsing, Parquet export, and DuckDB metrics |
| Justfile | Added just plans-data recipe |
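The Justfile entry itself is not reproduced in this task; its shape is presumably a thin wrapper over the build script, roughly:

```make
# Hypothetical sketch of the plans-data recipe (actual entry not shown here)
plans-data:
    python tasks/tools/build_plan_dataset.py
```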

Generated artifacts (from just plans-data):

tasks/.generated/
├── parquet/
│   ├── milestones.parquet
│   ├── workstreams.parquet
│   ├── tasks.parquet
│   ├── milestone_metrics.parquet
│   ├── milestone_metrics.csv
│   ├── workstream_metrics.parquet
│   └── workstream_metrics.csv
└── charts/milestones/
    └── milestone_header.html

Verified against live data: 400 tasks, 7 milestones, 11 workstreams. M1 pilot shows 76.9% complete (90/117 tasks).

Out of scope:

  • Historical trend snapshots/backfill
  • Rich multi-chart dashboards beyond the compact header summary
  • MkDocs macro integration (generated HTML can be linked/embedded separately)

Review Feedback

  • [ ] Review cleared

2026-03-22 Triage Decision

  • Status set to planned (active backlog, not obsolete): implementation artifacts (build_plan_dataset.py, Parquet pipeline, milestone header render integration) are not in repo yet, and no task PR exists. Keep in backlog rather than marking done/cancelled.