CI/CD Pipeline Architecture

Overview

The ML Pipelines project uses a fully automated CI/CD pipeline powered by GitHub Actions and Databricks Asset Bundles. The pipeline implements a 4-tier deployment strategy (sandbox, dev, staging, prod) with progressive promotion gates ensuring production safety while maintaining developer velocity.

Architecture Principles

  1. GitHub OIDC Authentication: Passwordless authentication using OpenID Connect federation

  2. Service Principal Isolation: Each environment has its own dedicated service principal

  3. Progressive Deployment: Automatic deployment through environments with validation gates

  4. Infrastructure as Code: All resources defined in YAML and deployed via Databricks Asset Bundles

  5. Binary Promotion: Models trained in staging are promoted (not retrained) to production

Deployment Workflow

For complete deployment workflow details and step-by-step instructions, see Deployment Guide.

The deployment follows a 4-tier strategy (Sandbox → Dev → Staging → Prod) with progressive promotion gates. See ADR-001: Four-Tier Deployment Architecture for the architectural rationale.

Workflow Summary

  1. Sandbox: Developer local testing (make deploy → personal catalog)

  2. Dev: Automatic on PR merge (CI/CD via GitHub OIDC)

  3. Staging: Automatic after dev success (with approval gate)

  4. Production: Automatic after staging success (with approval gate)

Each stage uses environment-specific service principals for authentication and catalog isolation.

GitHub Workflows

1. PR Validation Workflow

File: .github/workflows/ml_pipelines_pr_validate.yml

Purpose: Validate bundle configuration on every pull request to catch errors early.

Trigger:
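The workflow file defines the exact trigger; a typical shape for a PR-validation workflow (branch names and any path filters are assumptions — the workflow file is authoritative) is:

```yaml
# Sketch only -- the authoritative trigger lives in
# .github/workflows/ml_pipelines_pr_validate.yml
on:
  pull_request:
    branches: [main]
```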

Steps:

  1. Checkout code from PR branch

  2. Install Databricks CLI (latest version)

  3. Set up Python 3.13 environment

  4. Install make and uv package manager

  5. Run make validate-dev to validate bundle

  6. Post success/failure comment on PR

Authentication:

  • Method: GitHub OIDC

  • Service Principal: ml-pipelines-dev (03ff99cd-a352-40bb-9d33-414c9ad9e7aa)

  • Workspace: Dev workspace (dbc-a72d6af9-df3d)

Permissions:
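For OIDC token minting plus PR comments, the workflow needs roughly these GitHub token permissions (a sketch; the workflow file is authoritative):

```yaml
permissions:
  id-token: write       # mint the OIDC token for Databricks federation
  contents: read        # check out the repository
  pull-requests: write  # post the success/failure comment on the PR
```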

Success Criteria: Bundle validation passes without errors


2. Dev Deployment Workflow

File: .github/workflows/ml_pipelines_dev_deploy.yml

Purpose: Deploy code changes to the shared dev environment on every merge to main.

Trigger:
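A typical trigger for a deploy-on-merge workflow (a sketch; path filters, if any, are defined in the workflow file):

```yaml
# Sketch only -- see .github/workflows/ml_pipelines_dev_deploy.yml
on:
  push:
    branches: [main]
```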

Steps:

  1. Checkout code with submodules

  2. Install Databricks CLI

  3. Set up Python 3.13 environment

  4. Install make and uv package manager

  5. Run make deploy-dev which:

    • Validates bundle configuration

    • Builds Python wheel (uv build --wheel)

    • Deploys to dev workspace

    • Updates all jobs and pipelines

Authentication:

  • Method: GitHub OIDC

  • Service Principal: ml-pipelines-dev (03ff99cd-a352-40bb-9d33-414c9ad9e7aa)

  • Workspace: Dev workspace (dbc-a72d6af9-df3d)

Environment Variables:
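With GitHub OIDC, the Databricks CLI is typically configured entirely through environment variables; a plausible set for this workflow (the host URL is inferred from the workspace ID, so treat it as an assumption):

```yaml
env:
  DATABRICKS_AUTH_TYPE: github-oidc   # federate GitHub's OIDC token
  DATABRICKS_HOST: https://dbc-a72d6af9-df3d.cloud.databricks.com  # assumed URL
  DATABRICKS_CLIENT_ID: 03ff99cd-a352-40bb-9d33-414c9ad9e7aa       # ml-pipelines-dev
```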

Success Criteria: Deployment completes successfully; all resources updated

What Gets Deployed:

  • Updated Python wheel to dev catalog

  • DLT pipeline configurations

  • Job definitions

  • Model registration jobs

  • External volumes (if changed)


3. Staging Deployment Workflow

File: .github/workflows/ml_pipelines_staging_deploy.yml

Purpose: Deploy to staging environment and train models on production data for final validation.

Trigger:

Condition: Only runs if dev deployment succeeded
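Together, the trigger and condition are typically expressed with a `workflow_run` event (the workflow display name below is assumed for illustration):

```yaml
on:
  workflow_run:
    workflows: ["ML Pipelines Dev Deploy"]   # name assumed
    types: [completed]

jobs:
  deploy-staging:
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    environment: staging   # enforces the manual approval gate
```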

Steps:

  1. Checkout code with submodules

  2. Install Databricks CLI

  3. Set up Python 3.13 environment

  4. Install make and uv package manager

  5. Run make deploy-staging which:

    • Validates bundle configuration

    • Builds Python wheel

    • Deploys to staging workspace

    • Trains models on prod data (read from prod catalog)

    • Saves models to staging.models schema

Authentication:

  • Method: GitHub OIDC

  • Service Principal: ml-pipelines-staging (93bda7cf-b009-49d8-8e8d-046677c8597e)

  • Workspace: Staging workspace (dbc-fab2a42a-8d11)

Environment: staging (requires manual approval in GitHub)

Key Principle: Staging trains models on production data so that the exact model binary validated in staging is the one that ships to production. Training jobs read from the prod catalog, but raw production data and content never leave prod.

Success Criteria:

  • Deployment completes successfully

  • Models trained and registered in staging.models

  • Validation tests pass


4. Production Deployment Workflow

File: .github/workflows/ml_pipelines_prod_deploy.yml

Purpose: Promote validated models and deploy pipelines to production.

Trigger:

Condition: Only runs if staging deployment succeeded
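As with staging, the trigger and condition are typically expressed with a `workflow_run` event (the workflow display name below is assumed for illustration):

```yaml
on:
  workflow_run:
    workflows: ["ML Pipelines Staging Deploy"]   # name assumed
    types: [completed]

jobs:
  deploy-prod:
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    environment: production   # enforces the manual approval gate
```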

Steps:

  1. Checkout code with submodules

  2. Install Databricks CLI

  3. Set up Python 3.13 environment

  4. Install make and uv package manager

  5. Run make deploy-prod which:

    • Validates bundle configuration

    • Builds Python wheel

    • Promotes model binaries from staging.models to prod.models (NO RETRAINING)

    • Deploys pipelines to production workspace

    • Enables production job schedules (UNPAUSED)

Authentication:

  • Method: GitHub OIDC

  • Service Principal: ml-pipelines-prod (2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f)

  • Workspace: Production workspace (dbc-028d9e53-7ce6)

Environment: production (requires manual approval in GitHub)

Key Principle: Models are PROMOTED (binary copied), not retrained in production. This ensures the exact model tested in staging runs in production.

Success Criteria:

  • Deployment completes successfully

  • Model binaries promoted to prod.models

  • Pipelines deployed and enabled

  • No retraining occurs


GitHub OIDC Authentication

For complete details on GitHub OIDC setup, service principal configuration, and authentication workflows, see:

Key Points

  • Passwordless Authentication: GitHub OIDC eliminates long-lived secrets

  • Environment Isolation: Each environment has dedicated service principal

  • Automatic Token Management: Short-lived tokens that expire after workflow completion

  • Full Audit Trail: All actions logged to service principal identity

Service Principal Summary

| Environment | Service Principal ID                 | Workspace         |
|-------------|--------------------------------------|-------------------|
| Dev         | 03ff99cd-a352-40bb-9d33-414c9ad9e7aa | dbc-a72d6af9-df3d |
| Staging     | 93bda7cf-b009-49d8-8e8d-046677c8597e | dbc-fab2a42a-8d11 |
| Prod        | 2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f | dbc-028d9e53-7ce6 |

For detailed permissions per service principal, see Service Principals Guide.


Deployment Gates and Approvals

Dev Environment

Approval: None required (automatic on merge to main after PR completion)

Validation:

  • Bundle configuration validation

  • Python wheel build success

  • No syntax errors

Risk Level: Low (shared development environment)

Protection Rules (configured in GitHub):

  • Requires manual review & approval from lead developer to merge PR


Staging Environment

Approval: Required manual approval via GitHub environment protection rules

Validation:

  • Dev deployment succeeded

  • Bundle configuration validation

  • Model training on prod data succeeds

  • Integration tests pass

Risk Level: Medium (uses production data, pre-production testing)

Protection Rules (configured in GitHub):

  • Requires approval from lead developer to deploy in staging workspace/environment


Production Environment

Approval: Required manual approval via GitHub environment protection rules

Validation:

  • Staging deployment succeeded

  • Models validated in staging

  • Bundle configuration validation

  • All tests passed

Risk Level: High (production environment)

Protection Rules (configured in GitHub):

  • Require manual approval from lead developer [@taylorlaing]

  • Wait timer: Not currently in use; a cool-off period (e.g., 1 hour) can be added if needed

  • Branch protection: Only from main branch


Rollback Strategies

Application Code Rollback

Scenario: Bug introduced in latest deployment

Steps:

  1. Identify the last known good commit

  2. Revert the problematic commit:

  3. Open new PR with reverted code

  4. Review, approve, and merge PR into main branch

  5. CI/CD automatically deploys the reverted version

  6. Monitor deployment to ensure fix
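Steps 1-2 can be sketched as follows (the commit SHA and branch name are placeholders):

```bash
git log --oneline -10                   # 1. identify the last known good commit
git checkout -b revert/fix-regression
git revert <bad-commit-sha>             # 2. create a clean revert commit
git push origin revert/fix-regression   # then open the PR (step 3)
```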

Timeline: 5-10 minutes


Model Rollback

Scenario: Model performance degraded in production

Steps:

  1. Identify previous model version:

  2. Update model serving endpoint to previous version:

  3. Verify endpoint health and test predictions
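One possible shape for steps 1-2 using the Databricks CLI (the model and endpoint names are hypothetical, and exact command flags may differ by CLI version):

```bash
databricks model-versions list prod.models.<model_name>   # 1. find the previous version
databricks serving-endpoints update-config <endpoint-name> --json '{
  "served_entities": [{
    "entity_name": "prod.models.<model_name>",
    "entity_version": "<previous-version>"
  }]
}'                                                        # 2. pin the endpoint to it
```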

Timeline: 2-5 minutes (endpoint update time)


Pipeline Rollback

Scenario: DLT pipeline causing data quality issues

Steps:

  1. Stop the running pipeline:

  2. If schema issues, restore from Delta Lake time travel:

  3. Revert pipeline code and redeploy:

  4. Open new PR with reverted code

  5. Review, approve, and merge PR into main branch

  6. Restart pipeline with fixed version
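A sketch of steps 1-3 (the pipeline ID, table, and version are placeholders):

```bash
databricks pipelines stop <pipeline-id>    # 1. stop the running update
# 2. If a table was corrupted, Delta time travel can restore it, e.g.:
#    RESTORE TABLE prod.<schema>.<table> TO VERSION AS OF <good-version>
git revert <bad-commit-sha>                # 3. revert the pipeline code locally
```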

Timeline: 10-15 minutes


Full Environment Rollback

Scenario: Critical production outage

Steps:

  1. Stop all production pipelines:

  2. Pause all production jobs:

  3. Revert to last stable commit:

  4. Open new PR with reverted code

  5. Review, approve, and merge PR into main branch

  6. Monitor deployment through all environments

  7. Gradually resume services after validation

Timeline: 20-30 minutes


Security Model

Least Privilege Access

Principle: Each service principal has only the permissions needed for its environment.

| Permission Type        | Dev            | Staging     | Prod                  |
|------------------------|----------------|-------------|-----------------------|
| Own catalog write      | dev            | staging     | prod                  |
| Read other catalogs    | staging (read) | prod (read) | staging.models (read) |
| Create jobs            |                |             |                       |
| Create pipelines       |                |             |                       |
| Create endpoints       |                |             |                       |
| Delete production data |                |             |                       |


Audit Trail

All CI/CD actions are logged with:

  • GitHub Actions logs: Complete workflow execution history

  • Databricks audit logs: All API calls logged to service principal

  • MLflow tracking: Model training and registration events

  • Git history: Complete code change history

Viewing Audit Logs:
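For example (the workflow name is assumed, and the Databricks query assumes audit logs are delivered to the `system.access.audit` system table):

```bash
gh run list --workflow ml_pipelines_prod_deploy.yml --limit 20   # GitHub Actions history
# Databricks side, via a SQL warehouse:
#   SELECT event_time, service_name, action_name
#   FROM system.access.audit
#   ORDER BY event_time DESC LIMIT 100;
```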


Secret Management

GitHub Secrets (if needed):

  • GH_PAT: Personal access token for submodule access (only secret used)

Databricks Secrets: Not used in CI/CD (OIDC replaces secrets)

Best Practices:

  • Never commit credentials to code

  • Use GitHub OIDC for all authentication

  • Rotate GH_PAT every 90 days

  • Use Databricks secret scopes for runtime credentials


Monitoring and Observability

Workflow Monitoring

GitHub Actions Dashboard:

  • View all workflow runs: https://github.com/refresh-os/ml-pipelines/actions

  • Filter by status: Failed, In Progress, Success

  • Download logs for failed runs

Slack Notifications (if configured):
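If notifications are wired up, one common pattern is a failure-only step using the official Slack action (a sketch; assumes a SLACK_WEBHOOK_URL secret exists):

```yaml
- name: Notify Slack on failure
  if: failure()
  uses: slackapi/slack-github-action@v2.0.0
  with:
    webhook: ${{ secrets.SLACK_WEBHOOK_URL }}
    webhook-type: incoming-webhook
    payload: |
      text: "Deployment failed: ${{ github.workflow }} run ${{ github.run_id }}"
```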


Deployment Metrics

Key Metrics to Track:

  • Deployment frequency (multiple per week)

  • Deployment success rate (target: >95%)

  • Mean time to deploy (target: <30 minutes)

  • Mean time to recovery (target: <30 minutes)

  • Change failure rate (target: <5%)

Tracking:
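Deployment frequency and success rate can be pulled straight from the GitHub CLI (the workflow name is assumed; requires `gh` and `jq`):

```bash
# Success rate (%) over the last 50 production deploys
gh run list --workflow ml_pipelines_prod_deploy.yml --limit 50 --json conclusion |
  jq '[.[].conclusion] | 100 * (map(select(. == "success")) | length) / length'
```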


Troubleshooting CI/CD Issues

Workflow Not Triggering

Symptoms: PR merged but dev workflow didn't start

Causes:

  • Path filters excluding changed files

  • Workflow syntax error

  • GitHub Actions disabled

Solution: See Troubleshooting Guide


Bundle Validation Failed

Symptoms: Validation step fails in workflow

Causes:

  • YAML syntax errors in databricks.yml

  • Invalid variable references

  • Missing required fields

Solution: See Troubleshooting Guide


OIDC Authentication Failed

Symptoms: Authentication error in workflow

Causes:

  • Service principal OIDC not configured

  • GitHub environment mismatch

  • Client ID incorrect

Solution: See Troubleshooting Guide


Staging Deployment Not Triggering

Symptoms: Dev succeeds but staging doesn't start

Causes:

  • workflow_run condition not met

  • Environment approval pending

  • Workflow disabled

Solution: See Troubleshooting Guide

