# Plan activity page performance improvements with incremental GitHub sync and caching

## Problem
Create an implementation plan to reduce the tasks activity page load time, currently around 10 seconds. Investigate bottlenecks in GitHub data fetch and rendering, and design a smart strategy that avoids full-history queries on every load. Candidate approach: persist normalized activity history locally, fetch only new deltas since the last cursor or timestamp, and serve page data from cached snapshots with background refresh and explicit staleness indicators. The plan must also answer whether the activity UI should paginate by time window (for example, one week at a time by default) and how cached activity should be stored on disk so it remains bounded, inspectable, and cheap to update. Define measurable targets, instrumentation, rollout steps, and fallback behavior when GitHub is unavailable.
## Context
- `/tasks/activity` is currently useful but slow enough to break normal operator flow.
- The activity page (`tasks/activity/index.md`) renders a day-by-day record of completed tasks and merged PRs via the `{{ daily_activity_rollup() }}` MkDocs macro, which calls `render_daily_activity()` in `tasks/activity.py`.
- The page appears to be doing too much work per build, likely combining GitHub fetch, normalization, and render time in one pass.
- A cache design that only speeds up fetch but still renders an unbounded full-history page may leave the page feeling slow or visually overloaded.
- The plan should prefer a bounded, inspectable local format over a single ever-growing opaque blob.
## Current data flow
- `collect_merged_prs()` (around line 365) runs `git log --first-parent --max-count=5000` to get merge commits, then for each PR calls `_changed_paths_for_commit()`, which spawns a `git diff-tree` subprocess. With a few hundred merged PRs this becomes a few hundred subprocess calls just for path detection.
- `_pull_request_title()` (around line 138) may spawn an additional `git show` subprocess per merge-commit PR to extract the real title from the commit body.
- `collect_completed_tasks()` (around line 513) reads every task file in `workstreams/*/tasks/*.md`, then for each completed task without `completed_at` frontmatter calls `_blame_completed_line()`, which spawns `git blame --line-porcelain` and may also call `_commit_committer_name()` with another `git show`.
- No caching means every MkDocs build, including `mkdocs serve` live reload, reruns the full pipeline from scratch.
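The per-PR subprocess fan-out described above can usually be collapsed into a single `git log` pass, since `--first-parent --name-only` emits each merge's changed paths inline. A minimal sketch, assuming a hypothetical `MergeRecord` shape (none of these helper names exist in `tasks/activity.py`):

```python
import subprocess
from dataclasses import dataclass, field

RECORD_SEP = "\x1e"  # ASCII record separator; will not appear in commit subjects
FIELD_SEP = "\x1f"   # ASCII unit separator between SHA and subject


@dataclass
class MergeRecord:
    sha: str
    subject: str
    paths: list[str] = field(default_factory=list)


def parse_name_only_log(out: str) -> list[MergeRecord]:
    """Split `git log --name-only` output into one record per merge commit."""
    records = []
    for chunk in out.split(RECORD_SEP):
        if not chunk.strip():
            continue
        header, *rest = chunk.splitlines()
        sha, _, subject = header.partition(FIELD_SEP)
        records.append(MergeRecord(sha, subject, [p for p in rest if p.strip()]))
    return records


def collect_merges_batched(repo_dir: str = ".", max_count: int = 5000) -> list[MergeRecord]:
    """One subprocess for all merge commits plus their first-parent changed paths."""
    out = subprocess.run(
        ["git", "log", "--first-parent", "--merges", f"--max-count={max_count}",
         "--name-only", f"--pretty=format:{RECORD_SEP}%H{FIELD_SEP}%s"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return parse_name_only_log(out)
```

Parsing is split out so it can be unit-tested without a repository; with this shape, `_changed_paths_for_commit()` and its per-PR `git diff-tree` calls disappear entirely.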
## Key files
- `tasks/activity.py` - collection and rendering logic
- `tasks/activity/index.md` - page template
- `tasks/macros.py` - macro registration for `daily_activity_rollup`
## Constraints
- MkDocs macro context runs during static site build, not a Django request.
- Git subprocess calls are the dominant cost; task file parsing is comparatively cheap.
- The page is read-only, so caching is operationally safe.
- The plan should explicitly compare storage shapes such as:
    - one normalized append-only activity log plus derived indexes
    - time-bucketed cache files such as weekly JSON
    - a small embedded database or SQLite table if materially cleaner
- The final design must remain correct after new PRs merge or tasks complete.
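To make the time-bucketed option concrete, here is one possible on-disk shape: one JSON file per ISO week under a local cache directory, inspectable with any text tool and cheap to rewrite one bucket at a time. The `CACHE_DIR` path and field names below are illustrative assumptions, not existing code:

```python
import json
from pathlib import Path

# Hypothetical cache location; would need a .gitignore entry if kept untracked.
CACHE_DIR = Path(".activity-cache")


def bucket_path(iso_year: int, iso_week: int) -> Path:
    """One file per ISO week keeps growth bounded and refreshes cheap."""
    return CACHE_DIR / f"{iso_year}-W{iso_week:02d}.json"


def write_bucket(iso_year: int, iso_week: int, entries: list[dict]) -> Path:
    """Rewrite a single week's bucket; sorted keys keep diffs stable."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = bucket_path(iso_year, iso_week)
    path.write_text(json.dumps({"entries": entries}, indent=2, sort_keys=True))
    return path
```

Because only the current week's bucket churns after a sync, older files stay byte-identical, which also makes staleness easy to reason about per file.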
## Possible Solutions
- Recommended: combine incremental GitHub sync with bounded weekly pagination in the UI. Keep a normalized local activity store, fetch deltas only, precompute week buckets or equivalent indexes, and render one recent week by default with explicit previous/next navigation. This reduces both request-time fetch work and page weight while keeping the model simple.
- Batch git operations without a bounded cache. This removes the worst subprocess hot spot, but warm loads still recompute everything and the page can still grow into an unbounded render.
- Keep the full unpaginated page but cache the whole payload. Faster data fetch, but page size and visual noise continue to grow with history.
- Use infinite scroll over a local cache. Potentially workable, but adds UI complexity and makes explicit time-window navigation less clear.
- Use a single large JSON or Markdown cache file. Simple initially, but likely degrades as the file grows and becomes harder to inspect or refresh incrementally.
## Plan
- Identify current hot spots separately:
    - Git history/query time
    - normalization and aggregation time
    - template/render time
- Use batched git reads to remove per-PR subprocess calls where possible:
    - update `collect_merged_prs()` to consume `git log --name-only`-style output instead of calling `_changed_paths_for_commit()` per PR
    - avoid extra `git show` title lookups when the log body already contains enough information
- Decide the canonical local activity store and justify it with bounded-growth rules:
    - define the on-disk format
    - define invalidation or refresh keys
    - keep storage inspectable and cheap to update
- Decide the activity page windowing model:
    - default recent week view
    - previous/next week navigation
    - optional broader views only if they remain cheap
- Define freshness semantics:
    - what is loaded synchronously on page build
    - what is refreshed incrementally
    - what staleness timestamp or badge the page shows
- Capture implementation touch points:
    - `tasks/activity.py`
    - `tasks/activity/index.md`
    - `tasks/macros.py`
    - `tests/tasks/test_activity.py`
    - `.gitignore` (only if a local cache path needs ignoring)
- Include rollout guidance so the page can switch to the faster path without losing historical correctness.
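The windowing decisions above reduce to a small amount of date arithmetic. A sketch using ISO weeks (helper names are hypothetical), which yields stable week keys for the default view, the previous/next links, and cache bucket names alike:

```python
import datetime as dt


def week_key(day: dt.date) -> tuple[int, int]:
    """Stable (ISO year, ISO week) key for a calendar date."""
    iso = day.isocalendar()
    return (iso.year, iso.week)


def shift_week(key: tuple[int, int], delta: int) -> tuple[int, int]:
    """Move delta weeks forward or backward, handling year boundaries."""
    year, week = key
    monday = dt.date.fromisocalendar(year, week, 1) + dt.timedelta(weeks=delta)
    return week_key(monday)


def week_bounds(key: tuple[int, int]) -> tuple[dt.date, dt.date]:
    """Inclusive Monday..Sunday date range for a week key."""
    year, week = key
    start = dt.date.fromisocalendar(year, week, 1)
    return start, start + dt.timedelta(days=6)
```

For example, `week_key(dt.date.today())` would drive the default view, and `shift_week(current, -1)` the "previous week" link; using ISO years avoids off-by-one bugs in the late-December/early-January weeks.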
## Implementation Progress
- 2026-03-24: Scope clarified to require a recommendation on time-window pagination (likely weekly) and on the on-disk cache or storage shape, not just generic incremental sync.
- 2026-03-24: Analysis completed for the current `tasks/activity.py` data flow. Confirmed the main hot spot is per-PR subprocess fan-out from `_changed_paths_for_commit()`, with secondary cost from blame-based task completion backfills and no caching layer for repeated MkDocs builds.
- 2026-03-24: Recommended direction is to combine batched git collection with bounded local caching and a weekly time-windowed activity view, preserving the richer framing from `main` while keeping the concrete findings from the original planning work.
## QA Exploration
N/A - this is a planning task, not an implementation or UI validation task.
- [x] QA exploration completed (or N/A for non-UI tasks)
## Review Feedback
- [ ] Review cleared