Dataface Tasks

Host Dataface on Fivetran GCP

ID: M1-INFRA-001
Status: completed
Priority: p0
Milestone: m1-ft-analytics-analyst-pilot
Owner: head-of-engineering
Completed by: dave
Completed: 2026-03-26

Problem

The deploy infrastructure for hosting Dataface on GCP Cloud Run is functionally in place, and the first Fivetran-hosted runtime has already been deployed to internal-dataface-eng. The remaining work is hardening and operational polish: static-site publication, Workload Identity Federation migration, monitoring/alerts, internal-access/domain setup, and landing the local deploy fixes into main.

Dataface needs a canonical Fivetran-hosted runtime for the internal analyst pilot (M1). Prior to this work, the only deploy path was Railway (Playground only). The pilot requires all three services (Playground, Cloud/Django, A lIe) running on Fivetran-controlled GCP infrastructure with proper secrets management, IAM, and rollback capability.

Context

  • A first hosted runtime already exists on Fivetran GCP, so the remaining work is not greenfield deploy setup but operational hardening and repeatable ownership.
  • This task now sits at the boundary between deployment reliability, secrets/config, monitoring, and the runbook needed to operate the hosted environment safely.
  • Because the task is blocked, the plan should make the remaining blockers explicit rather than pretending deployment itself is still undefined.

Possible Solutions

  1. Railway — Already used for Playground. Quick but no Fivetran ownership, limited secrets/IAM integration, no multi-service support. Rejected for pilot.
  2. GKE (Kubernetes) — Full-featured but heavy operational overhead for a 3-service internal pilot. Rejected as overkill at this stage.
  3. Cloud Run + Artifact Registry — Serverless, auto-scaling, simple deploy model, native GCP IAM and Secret Manager integration. Chosen for pilot.

Plan

Cloud Run with Artifact Registry, deployed through GitHub Actions: manual dispatch for runtime services, automatic deploy for static sites.

Architecture:

  • 3 Cloud Run services: dataface-playground (public), dataface-cloud (IAM-gated), dataface-alie (IAM-gated)
  • Static sites: Docs + Tasks on GCS buckets (separate workflow)
  • Artifact Registry: Docker images tagged with git short SHA
  • Secrets: GCP Secret Manager, injected at deploy time via --set-secrets
  • Database: Cloud SQL PostgreSQL for Django (via DATABASE_URL env var)
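The architecture above can be sketched as a small helper that assembles the gcloud run deploy arguments for one service. This is a minimal illustration, not the repo's Justfile recipe: ServiceSpec, image_uri, and deploy_command are hypothetical names, and the sketch assumes the image name inside the Artifact Registry repo matches the service name.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceSpec:
    """Hypothetical description of one Cloud Run service (not taken from the repo)."""

    name: str
    public: bool = False
    # Maps runtime env var name -> "SECRET_NAME:VERSION", the form --set-secrets accepts.
    secrets: dict[str, str] = field(default_factory=dict)


def image_uri(project: str, region: str, repo: str, service: str, short_sha: str) -> str:
    """Artifact Registry Docker image path, tagged with the git short SHA."""
    # Assumption: the image is named after the service.
    return f"{region}-docker.pkg.dev/{project}/{repo}/{service}:{short_sha}"


def deploy_command(
    spec: ServiceSpec, project: str, region: str, repo: str, short_sha: str
) -> list[str]:
    """Build the argv for `gcloud run deploy` without executing it."""
    cmd = [
        "gcloud", "run", "deploy", spec.name,
        "--image", image_uri(project, region, repo, spec.name, short_sha),
        "--project", project,
        "--region", region,
        # Public services allow anonymous calls; the rest stay IAM-gated.
        "--allow-unauthenticated" if spec.public else "--no-allow-unauthenticated",
    ]
    if spec.secrets:
        pairs = ",".join(f"{env}={ref}" for env, ref in sorted(spec.secrets.items()))
        cmd += ["--set-secrets", pairs]
    return cmd


cloud = ServiceSpec("dataface-cloud", secrets={"SECRET_KEY": "DJANGO_SECRET_KEY:latest"})
print(" ".join(deploy_command(cloud, "internal-dataface-eng", "us-west1", "dataface", "21504dd")))
```

Returning an argv list rather than running the command keeps the sketch easy to inspect and test; a caller could hand the list to subprocess.run once the values are confirmed.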

Implementation Progress

  • Confirm target runtime architecture and ownership boundaries.
  • Implement deploy path, secrets wiring, and environment promotion flow.
  • Publish runbooks for deploy, rollback, and incident triage.
  • Provision Fivetran GCP project and wire GitHub secrets.
  • First successful deploy of all three services to Fivetran GCP.
  • Prepare GitHub Actions auth for Workload Identity Federation.
  • Add uptime/error-budget monitoring and alerting.

  • Internal pilot environment can be deployed from main with documented steps.

  • Deployed and running on a Fivetran-owned GCP project.
  • Runtime health and failure signals are visible in monitoring/alerts.
  • Rollback path is tested and documented.

  • Coordinate with cloud-suite and integrations-platform for auth/network prerequisites.

  • Remaining blocker: internal domains / access model and any live GCP changes that require fresh operator auth in gcloud.

  • Related task: tasks/workstreams/integrations-platform/tasks/task-m1-fivetran-gcp-deploy-path.md

What's already built and merged to main:

Dockerfiles (all 3 services):

  • Dockerfile.cloud: Django via gunicorn; runs migrate + collectstatic on startup
  • Dockerfile.playground: FastAPI via uvicorn
  • Dockerfile.alie: FastAPI via uvicorn
  • All use the ghcr.io/astral-sh/uv:python3.13-bookworm-slim base image and run as a non-root user

Justfile deploy recipes:

  • just deploy-cloud-run-service <name> <dockerfile>: generic deploy (build, push to Artifact Registry, gcloud run deploy)
  • just deploy-playground / just deploy-cloud / just deploy-alie / just deploy-all-services
  • just build-static-sites / just deploy-static-sites: MkDocs → GCS buckets
  • just publish / just publish-setup: Artifact Registry package publishing

GitHub Actions workflows:

  • .github/workflows/deploy-cloud-run.yml: triggers on push to main or manual dispatch; matrix strategy deploys all 3 services; uses SA JSON key auth
  • .github/workflows/deploy-static-sites.yml: triggers on push to main (docs/plans paths) or manual dispatch; builds MkDocs sites, syncs to GCS

Startup scripts:

  • scripts/cloud_run_start_cloud.sh: Django entrypoint; validates SECRET_KEY and ALLOWED_HOSTS, runs migrate + collectstatic, starts gunicorn
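The entrypoint's order of operations (validate, migrate, collect static files, start the server) can be illustrated in Python. This is a rough sketch and not the contents of scripts/cloud_run_start_cloud.sh, which is a shell script: the function names are hypothetical, the WSGI module is passed in by the caller because the real module path is not stated here, and the port default assumes Cloud Run's usual PORT value of 8080.

```python
from collections.abc import Mapping

# The two variables the entrypoint is described as validating.
REQUIRED_VARS = ("SECRET_KEY", "ALLOWED_HOSTS")


def missing_startup_vars(env: Mapping[str, str]) -> list[str]:
    """Return required variables that are unset or blank, in declaration order."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def startup_commands(wsgi_module: str, port: str = "8080") -> list[list[str]]:
    """Ordered commands to run after validation passes.

    wsgi_module is supplied by the caller (for example "myproject.wsgi:application",
    a placeholder, not the real Dataface module).
    """
    return [
        ["python", "manage.py", "migrate", "--noinput"],
        ["python", "manage.py", "collectstatic", "--noinput"],
        ["gunicorn", wsgi_module, "--bind", f"0.0.0.0:{port}"],
    ]


def plan_startup(env: Mapping[str, str], wsgi_module: str) -> list[list[str]]:
    """Fail fast on missing configuration, otherwise return the command plan."""
    missing = missing_startup_vars(env)
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))
    return startup_commands(wsgi_module, env.get("PORT", "8080"))
```

Failing before migrate runs means a misconfigured revision never becomes healthy, so Cloud Run keeps serving the previous revision instead of a half-configured one.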

Documentation:

  • apps/docs/CLOUD_RUN_DEPLOYMENT.md: full deployment guide (one-time setup, manual deploy, automated deploy, prerequisites, runtime entrypoints, hardening roadmap)
  • apps/docs/GCP_HOSTING_RUNBOOK.md: operational runbook (topology, services, env vars, GitHub secrets inventory, recommended Django env/secrets)
  • docs/RAILWAY_DEPLOYMENT.md: legacy Railway docs (Playground only)

Environment config:

  • .env.example: documents all GCP-related vars (GCP_PROJECT_ID, GCP_REGION, ARTIFACT_REPO, service name overrides, Cloud SQL connection, bucket names)
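As a hedged illustration of how these variables might be resolved, the sketch below applies defaults when an override is absent. It assumes the service-name override variables in .env.example use the same names as the GitHub secrets listed in the go-live checklist, and it takes the us-west1 region and dataface repo defaults from that checklist; resolve_deploy_config is a hypothetical helper, not a function in the repo.

```python
from collections.abc import Mapping

# Defaults taken from the architecture and go-live checklist in this task.
DEFAULT_REGION = "us-west1"
DEFAULT_ARTIFACT_REPO = "dataface"
DEFAULT_SERVICE_NAMES = {
    "CLOUD_RUN_PLAYGROUND_SERVICE_NAME": "dataface-playground",
    "CLOUD_RUN_CLOUD_SERVICE_NAME": "dataface-cloud",
    "CLOUD_RUN_ALIE_SERVICE_NAME": "dataface-alie",
}


def _value(env: Mapping[str, str], name: str, default: str = "") -> str:
    """Treat unset and blank values the same way, falling back to the default."""
    return env.get(name, "").strip() or default


def resolve_deploy_config(env: Mapping[str, str]) -> dict:
    """Resolve project, region, repo, and service names from an env mapping."""
    project = _value(env, "GCP_PROJECT_ID")
    if not project:
        # The project has no safe default; refuse to guess.
        raise ValueError("GCP_PROJECT_ID is required")
    return {
        "project": project,
        "region": _value(env, "GCP_REGION", DEFAULT_REGION),
        "artifact_repo": _value(env, "ARTIFACT_REPO", DEFAULT_ARTIFACT_REPO),
        "services": {
            var: _value(env, var, default)
            for var, default in DEFAULT_SERVICE_NAMES.items()
        },
    }
```

A caller would typically pass os.environ; accepting any mapping keeps the function testable without touching the real environment.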

What remains (unblocked items first):

  1. Publish static sites — push docs, tasks, and ASQL docs into the provisioned buckets and verify the hosting path.
  2. Land deploy fixes — commit the local Docker/workflow/runtime changes that made the hosted services healthy.
  3. Migrate to Workload Identity Federation — create the provider/binding in GCP and switch GitHub secrets from GCP_SA_KEY_JSON to WIF secrets.
  4. Monitoring — uptime checks, error-budget alerts, Cloud Run metrics dashboards.
  5. Rollback testing — verify gcloud run services update-traffic for instant rollback.
  6. Internal access/domains — finish the Fivetran-standard subdomain + access model once infra confirms the DNS/auth path.
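For the rollback item above, a minimal sketch of the traffic switch: pick the revision just older than the one currently serving, then build a gcloud run services update-traffic command that sends 100% of traffic to it. The helper names and the revision names in the usage lines are hypothetical, and the revision list is assumed to be supplied newest first (for example from gcloud run revisions list).

```python
def previous_revision(revisions_newest_first: list[str], serving: str) -> str:
    """Return the revision immediately older than the one currently serving."""
    idx = revisions_newest_first.index(serving)  # raises ValueError if unknown
    if idx + 1 >= len(revisions_newest_first):
        raise ValueError("no older revision to roll back to")
    return revisions_newest_first[idx + 1]


def rollback_command(service: str, revision: str, project: str, region: str) -> list[str]:
    """Build the argv that routes all traffic to the given revision."""
    return [
        "gcloud", "run", "services", "update-traffic", service,
        "--to-revisions", f"{revision}=100",
        "--project", project,
        "--region", region,
    ]


# Hypothetical revision names, newest first.
revisions = ["dataface-cloud-00003-ccc", "dataface-cloud-00002-bbb", "dataface-cloud-00001-aaa"]
target = previous_revision(revisions, serving="dataface-cloud-00003-ccc")
print(" ".join(rollback_command("dataface-cloud", target, "internal-dataface-eng", "us-west1")))
```

Because the switch only changes traffic routing and does not rebuild an image, it is a reasonable candidate for the "instant rollback" the task wants to verify; the test itself still has to be run against the live services.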

Go-live checklist

  • [x] Confirm target project: internal-dataface-eng
  • [ ] Confirm deploy region (default in repo is us-west1 unless infra chooses otherwise)
  • [x] Enable required APIs:
      • Cloud Run API
      • Cloud Build API
      • Artifact Registry API
      • Secret Manager API
      • Cloud SQL Admin API
  • [x] Create deploy service account for CI/manual deploys
  • [x] Grant deploy service account roles:
      • Cloud Run Admin
      • Service Account User
      • Cloud Build Editor
      • Artifact Registry Writer
  • [ ] Decide auth model for deploy:
      • Short-term: GitHub Actions JSON key (GCP_SA_KEY_JSON)
      • Follow-up hardening: Workload Identity Federation
  • [x] Create Artifact Registry repo (dataface, unless renamed)
  • [x] Provision Cloud SQL Postgres for dataface-cloud
  • [ ] Decide Cloud SQL connectivity model and connection string delivery
  • [~] Create Secret Manager secrets:
      • [x] DJANGO_SECRET_KEY
      • [ ] OPENAI_API_KEY if AI features are enabled
      • [x] database credential secret(s) or fully assembled DATABASE_URL
  • [x] Ensure the runtime service account used by Cloud Run can read required secrets
  • [x] Create GCS buckets:
      • docs bucket
      • tasks bucket
      • ASQL docs bucket
  • [x] Grant deploy identity bucket write access for static-site deploys
  • [~] Configure GitHub repository secrets:
      • GCP_PROJECT_ID
      • GCP_REGION
      • ARTIFACT_REPO
      • GCP_SA_KEY_JSON if using JSON-key auth
      • GCP_WORKLOAD_IDENTITY_PROVIDER if using WIF
      • GCP_SERVICE_ACCOUNT_EMAIL if using WIF
      • DOCS_BUCKET
      • TASKS_BUCKET
      • ASQL_DOCS_BUCKET
      • CLOUD_RUN_PLAYGROUND_SERVICE_NAME if overriding default
      • CLOUD_RUN_CLOUD_SERVICE_NAME if overriding default
      • CLOUD_RUN_ALIE_SERVICE_NAME if overriding default
      • CLOUD_RUN_ASQL_PLAYGROUND_SERVICE_NAME if overriding default
  • [x] Configure Cloud Run runtime env/secrets for Playground as needed
  • [x] Configure Cloud Run runtime env/secrets for A lIe as needed
  • [x] Configure Cloud Run runtime env/secrets for Django Cloud service:
      • DEBUG=False
      • ALLOWED_HOSTS=<cloud-run-host-or-custom-domain>
      • DATABASE_URL=<postgres-connection-string>
      • REPOS_ROOT=/tmp/dataface-repos
      • SECRET_KEY from Secret Manager
      • OPENAI_API_KEY from Secret Manager if AI features are enabled
  • [x] Verify whether Cloud Run services should remain:
      • Playground: public (--allow-unauthenticated)
      • Cloud: IAM-gated by default
      • A lIe: IAM-gated by default
  • [x] Run first deploy:
      • just deploy-playground
      • just deploy-cloud
      • just deploy-alie
      • just deploy-asql-playground
      • or GitHub workflow dispatch for selected services
  • [x] Smoke test deployed URLs and auth expectations
  • [x] Verify Django migrations and static collection complete on startup
  • [ ] Verify rollback path with Cloud Run revision traffic switch
  • [ ] Add monitoring:
      • uptime checks
      • error alerting
      • service dashboards
  • [ ] Track WIF migration as post-first-deploy hardening
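
The GitHub secrets portion of the checklist depends on the chosen auth model, which a small preflight check can make explicit. The sketch below is an assumption-laden illustration rather than part of either workflow: missing_github_secrets and the "json-key" / "wif" mode labels are hypothetical, while the secret names come from the checklist above.

```python
from collections.abc import Mapping

# Secrets every deploy needs, regardless of auth model.
COMMON_SECRETS = ("GCP_PROJECT_ID", "GCP_REGION", "ARTIFACT_REPO")

# Auth-specific secrets; the mode labels are illustrative, not workflow inputs.
AUTH_SECRETS = {
    "json-key": ("GCP_SA_KEY_JSON",),
    "wif": ("GCP_WORKLOAD_IDENTITY_PROVIDER", "GCP_SERVICE_ACCOUNT_EMAIL"),
}

# Only needed by the static-site workflow.
STATIC_SITE_SECRETS = ("DOCS_BUCKET", "TASKS_BUCKET", "ASQL_DOCS_BUCKET")


def missing_github_secrets(
    env: Mapping[str, str], auth_mode: str, static_sites: bool = False
) -> list[str]:
    """List required secrets that are unset or blank for the chosen auth mode."""
    if auth_mode not in AUTH_SECRETS:
        raise ValueError(f"unknown auth mode: {auth_mode!r}")
    required = COMMON_SECRETS + AUTH_SECRETS[auth_mode]
    if static_sites:
        required += STATIC_SITE_SECRETS
    return [name for name in required if not env.get(name, "").strip()]
```

Running a check like this before the WIF cutover would show which of the two WIF secrets still need to be created while GCP_SA_KEY_JSON remains in place.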

Key commits (chronological):

  • 21504dd7 — switch Cloud Run deploy to image-based flow
  • e84112df — add publish script + just publish for Artifact Registry
  • efe5b6fe — multi-agent rollup: Cloud Run deploy, cbox, docs, examples
  • 5a3e5fe0 — fix round 2 review: container hardening, auth control, deploy guards
  • 999ea8e0 — fix round 3 review: docs accuracy, stale docstrings
  • 0ee1ea09 — multi-agent rollup: Cloud Run deploy, cbox, docs, examples (#483)

Review Feedback

  • 2026-03-07: Investigation by agent. All deploy infrastructure confirmed present in main. No GCP project has been provisioned yet — blocked on receiving a Fivetran project ID. The duplicate task at integrations-platform/tasks/task-m1-fivetran-gcp-deploy-path.md covers the same ground and should be consolidated or cross-referenced.

Pending — blocked on GCP project provisioning.

  • [ ] Review cleared

2026-03-22 Triage Decision

  • Status set to blocked (not obsolete): deploy/tooling/runbook groundwork is merged, but operational cutover is blocked on external infra ownership inputs (Fivetran GCP project provisioning and secrets wiring).