CI/CD Pipeline Architecture

Overview

The ML Pipelines project uses a fully automated CI/CD pipeline powered by GitHub Actions and Databricks Asset Bundles. The pipeline implements a 4-tier deployment strategy (sandbox, dev, staging, prod) with progressive promotion gates ensuring production safety while maintaining developer velocity.

Architecture Principles

  1. GitHub OIDC Authentication: Passwordless authentication using OpenID Connect federation

  2. Service Principal Isolation: Each environment has its own dedicated service principal

  3. Progressive Deployment: Automatic deployment through environments with validation gates

  4. Infrastructure as Code: All resources defined in YAML and deployed via Databricks Asset Bundles

  5. Binary Promotion: Models trained in staging are promoted (not retrained) to production

Deployment Workflow

For complete deployment workflow details and step-by-step instructions, see Deployment Guide.

The deployment follows a 4-tier strategy (Sandbox → Dev → Staging → Prod) with progressive promotion gates. See ADR-001: Four-Tier Deployment Architecture for the architectural rationale.

Workflow Summary

  1. Sandbox: Developer local testing (make deploy → personal catalog)

  2. Dev: Automatic on PR merge (CI/CD via GitHub OIDC)

  3. Staging: Automatic after dev success (with approval gate)

  4. Production: Automatic after staging success (with approval gate)

Each stage uses environment-specific service principals for authentication and catalog isolation.

GitHub Workflows

1. PR Validation Workflow

File: .github/workflows/ml_pipelines_pr_validate.yml

Purpose: Validate bundle configuration on every pull request to catch errors early.

Trigger:
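The workflow file defines the exact trigger; a typical shape for a PR-validation workflow (branch names and any path filters are assumptions — the workflow file is authoritative) is:

```yaml
# Sketch only -- the authoritative trigger lives in
# .github/workflows/ml_pipelines_pr_validate.yml
on:
  pull_request:
    branches: [main]
```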

Steps:

  1. Checkout code from PR branch

  2. Install Databricks CLI (latest version)

  3. Set up Python 3.13 environment

  4. Install make and uv package manager

  5. Run make validate-dev to validate bundle

  6. Post success/failure comment on PR

Authentication:

  • Method: GitHub OIDC

  • Service Principal: ml-pipelines-dev (03ff99cd-a352-40bb-9d33-414c9ad9e7aa)

  • Workspace: Dev workspace (dbc-a72d6af9-df3d)

Permissions:
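For OIDC token minting plus PR comments, the workflow needs roughly these GitHub token permissions (a sketch; the workflow file is authoritative):

```yaml
permissions:
  id-token: write       # mint the OIDC token for Databricks federation
  contents: read        # check out the repository
  pull-requests: write  # post the success/failure comment on the PR
```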

Success Criteria: Bundle validation passes without errors


2. Dev Deployment Workflow

File: .github/workflows/ml_pipelines_dev_deploy.yml

Purpose: Deploy code changes to the shared dev environment on every merge to main.

Trigger:
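A typical trigger for a deploy-on-merge workflow (a sketch; path filters, if any, are defined in the workflow file):

```yaml
# Sketch only -- see .github/workflows/ml_pipelines_dev_deploy.yml
on:
  push:
    branches: [main]
```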

Steps:

  1. Checkout code with submodules

  2. Install Databricks CLI

  3. Set up Python 3.13 environment

  4. Install make and uv package manager

  5. Run make deploy-dev which:

    • Validates bundle configuration

    • Builds Python wheel (uv build --wheel)

    • Deploys to dev workspace

    • Updates all jobs and pipelines

Authentication:

  • Method: GitHub OIDC

  • Service Principal: ml-pipelines-dev (03ff99cd-a352-40bb-9d33-414c9ad9e7aa)

  • Workspace: Dev workspace (dbc-a72d6af9-df3d)

Environment Variables:
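With GitHub OIDC, the Databricks CLI is typically configured entirely through environment variables; a plausible set for this workflow (the host URL is inferred from the workspace ID, so treat it as an assumption):

```yaml
env:
  DATABRICKS_AUTH_TYPE: github-oidc   # federate GitHub's OIDC token
  DATABRICKS_HOST: https://dbc-a72d6af9-df3d.cloud.databricks.com  # assumed URL
  DATABRICKS_CLIENT_ID: 03ff99cd-a352-40bb-9d33-414c9ad9e7aa       # ml-pipelines-dev
```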

Success Criteria: Deployment completes successfully; all resources updated

What Gets Deployed:

  • Updated Python wheel to dev catalog

  • DLT pipeline configurations

  • Job definitions

  • Model registration jobs

  • External volumes (if changed)


3. Staging Deployment Workflow

File: .github/workflows/ml_pipelines_staging_deploy.yml

Purpose: Deploy to staging environment and train models on production data for final validation.

Trigger:

Condition: Only runs if dev deployment succeeded
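Together, the trigger and condition are typically expressed with a `workflow_run` event (the workflow display name below is assumed for illustration):

```yaml
on:
  workflow_run:
    workflows: ["ML Pipelines Dev Deploy"]   # name assumed
    types: [completed]

jobs:
  deploy-staging:
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    environment: staging   # enforces the manual approval gate
```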

Steps:

  1. Checkout code with submodules

  2. Install Databricks CLI

  3. Set up Python 3.13 environment

  4. Install make and uv package manager

  5. Run make deploy-staging which:

    • Validates bundle configuration

    • Builds Python wheel

    • Deploys to staging workspace

    • Trains models on prod data (read from prod catalog)

    • Saves models to staging.models schema

Authentication:

  • Method: GitHub OIDC

  • Service Principal: ml-pipelines-staging (93bda7cf-b009-49d8-8e8d-046677c8597e)

  • Workspace: Staging workspace (dbc-fab2a42a-8d11)

Environment: staging (requires manual approval in GitHub)

Key Principle: Staging trains models on production data so that the exact model binary validated in staging is the one that ships to production. Training jobs read from the prod catalog, but raw production data and content never leave prod.

Success Criteria:

  • Deployment completes successfully

  • Models trained and registered in staging.models

  • Validation tests pass


4. Production Deployment Workflow

File: .github/workflows/ml_pipelines_prod_deploy.yml

Purpose: Promote validated models and deploy pipelines to production.

Trigger:

Condition: Only runs if staging deployment succeeded
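As with staging, the trigger and condition are typically expressed with a `workflow_run` event (the workflow display name below is assumed for illustration):

```yaml
on:
  workflow_run:
    workflows: ["ML Pipelines Staging Deploy"]   # name assumed
    types: [completed]

jobs:
  deploy-prod:
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    environment: production   # enforces the manual approval gate
```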

Steps:

  1. Checkout code with submodules

  2. Install Databricks CLI

  3. Set up Python 3.13 environment

  4. Install make and uv package manager

  5. Run make deploy-prod which:

    • Validates bundle configuration

    • Builds Python wheel

    • Promotes model binaries from staging.models to prod.models (NO RETRAINING)

    • Deploys pipelines to production workspace

    • Enables production job schedules (UNPAUSED)

Authentication:

  • Method: GitHub OIDC

  • Service Principal: ml-pipelines-prod (2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f)

  • Workspace: Production workspace (dbc-028d9e53-7ce6)

Environment: production (requires manual approval in GitHub)

Key Principle: Models are PROMOTED (binary copied), not retrained in production. This ensures the exact model tested in staging runs in production.

Success Criteria:

  • Deployment completes successfully

  • Model binaries promoted to prod.models

  • Pipelines deployed and enabled

  • No retraining occurs


GitHub OIDC Authentication

For complete details on GitHub OIDC setup, service principal configuration, and authentication workflows, see:

Key Points

  • Passwordless Authentication: GitHub OIDC eliminates long-lived secrets

  • Environment Isolation: Each environment has dedicated service principal

  • Automatic Token Management: Short-lived tokens that expire after workflow completion

  • Full Audit Trail: All actions logged to service principal identity

Service Principal Summary

| Environment | Service Principal ID                 | Workspace         |
|-------------|--------------------------------------|-------------------|
| Dev         | 03ff99cd-a352-40bb-9d33-414c9ad9e7aa | dbc-a72d6af9-df3d |
| Staging     | 93bda7cf-b009-49d8-8e8d-046677c8597e | dbc-fab2a42a-8d11 |
| Prod        | 2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f | dbc-028d9e53-7ce6 |

For detailed permissions per service principal, see Service Principals Guide.


Deployment Gates and Approvals

Dev Environment

Approval: None required (automatic on merge to main after PR completion)

Validation:

  • Bundle configuration validation

  • Python wheel build success

  • No syntax errors

Risk Level: Low (shared development environment)

Protection Rules (configured in GitHub):

  • Requires manual review & approval from lead developer to merge PR


Staging Environment

Approval: Required manual approval via GitHub environment protection rules

Validation:

  • Dev deployment succeeded

  • Bundle configuration validation

  • Model training on prod data succeeds

  • Integration tests pass

Risk Level: Medium (uses production data, pre-production testing)

Protection Rules (configured in GitHub):

  • Requires approval from lead developer to deploy in staging workspace/environment


Production Environment

Approval: Required manual approval via GitHub environment protection rules

Validation:

  • Staging deployment succeeded

  • Models validated in staging

  • Bundle configuration validation

  • All tests passed

Risk Level: High (production environment)

Protection Rules (configured in GitHub):

  • Require manual approval from lead developer [@taylorlaing]

  • Wait timer: Not currently in use; a cool-off period (e.g., 1 hour) can be added if needed

  • Branch protection: Only from main branch


Rollback Strategies

Application Code Rollback

Scenario: Bug introduced in latest deployment

Steps:

  1. Identify the last known good commit

  2. Revert the problematic commit:

  3. Open new PR with reverted code

  4. Review, approve, and merge PR into main branch

  5. CI/CD automatically deploys the reverted version

  6. Monitor deployment to ensure fix
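Steps 1-2 can be sketched as follows (the commit SHA and branch name are placeholders):

```bash
git log --oneline -10                   # 1. identify the last known good commit
git checkout -b revert/fix-regression
git revert <bad-commit-sha>             # 2. create a clean revert commit
git push origin revert/fix-regression   # then open the PR (step 3)
```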

Timeline: 5-10 minutes


Model Rollback

Scenario: Model performance degraded in production

Steps:

  1. Identify previous model version:

  2. Update model serving endpoint to previous version:

  3. Verify endpoint health and test predictions
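One possible shape for steps 1-2 using the Databricks CLI (the model and endpoint names are hypothetical, and exact command flags may differ by CLI version):

```bash
databricks model-versions list prod.models.<model_name>   # 1. find the previous version
databricks serving-endpoints update-config <endpoint-name> --json '{
  "served_entities": [{
    "entity_name": "prod.models.<model_name>",
    "entity_version": "<previous-version>"
  }]
}'                                                        # 2. pin the endpoint to it
```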

Timeline: 2-5 minutes (endpoint update time)


Pipeline Rollback

Scenario: DLT pipeline causing data quality issues

Steps:

  1. Stop the running pipeline:

  2. If schema issues, restore from Delta Lake time travel:

  3. Revert pipeline code and redeploy:

  4. Open new PR with reverted code

  5. Review, approve, and merge PR into main branch

  6. Restart pipeline with fixed version
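A sketch of steps 1-3 (the pipeline ID, table, and version are placeholders):

```bash
databricks pipelines stop <pipeline-id>    # 1. stop the running update
# 2. If a table was corrupted, Delta time travel can restore it, e.g.:
#    RESTORE TABLE prod.<schema>.<table> TO VERSION AS OF <good-version>
git revert <bad-commit-sha>                # 3. revert the pipeline code locally
```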

Timeline: 10-15 minutes


Full Environment Rollback

Scenario: Critical production outage

Steps:

  1. Stop all production pipelines:

  2. Pause all production jobs:

  3. Revert to last stable commit:

  4. Open new PR with reverted code

  5. Review, approve, and merge PR into main branch

  6. Monitor deployment through all environments

  7. Gradually resume services after validation

Timeline: 20-30 minutes


Security Model

Least Privilege Access

Principle: Each service principal has only the permissions needed for its environment.

| Permission Type        | Dev            | Staging     | Prod                  |
|------------------------|----------------|-------------|-----------------------|
| Own catalog write      | dev            | staging     | prod                  |
| Read other catalogs    | staging (read) | prod (read) | staging.models (read) |
| Create jobs            |                |             |                       |
| Create pipelines       |                |             |                       |
| Create endpoints       |                |             |                       |
| Delete production data |                |             |                       |


Audit Trail

All CI/CD actions are logged with:

  • GitHub Actions logs: Complete workflow execution history

  • Databricks audit logs: All API calls logged to service principal

  • MLflow tracking: Model training and registration events

  • Git history: Complete code change history

Viewing Audit Logs:
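For example (the workflow name is assumed, and the Databricks query assumes audit logs are delivered to the `system.access.audit` system table):

```bash
gh run list --workflow ml_pipelines_prod_deploy.yml --limit 20   # GitHub Actions history
# Databricks side, via a SQL warehouse:
#   SELECT event_time, service_name, action_name
#   FROM system.access.audit
#   ORDER BY event_time DESC LIMIT 100;
```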


Secret Management

GitHub Secrets (if needed):

  • GH_PAT: Personal access token for submodule access (only secret used)

Databricks Secrets: Not used in CI/CD (OIDC replaces secrets)

Best Practices:

  • Never commit credentials to code

  • Use GitHub OIDC for all authentication

  • Rotate GH_PAT every 90 days

  • Use Databricks secret scopes for runtime credentials


Monitoring and Observability

Workflow Monitoring

GitHub Actions Dashboard:

  • View all workflow runs: https://github.com/refresh-os/ml-pipelines/actions

  • Filter by status: Failed, In Progress, Success

  • Download logs for failed runs

Slack Notifications (if configured):
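If notifications are wired up, one common pattern is a failure-only step using the official Slack action (a sketch; assumes a SLACK_WEBHOOK_URL secret exists):

```yaml
- name: Notify Slack on failure
  if: failure()
  uses: slackapi/slack-github-action@v2.0.0
  with:
    webhook: ${{ secrets.SLACK_WEBHOOK_URL }}
    webhook-type: incoming-webhook
    payload: |
      text: "Deployment failed: ${{ github.workflow }} run ${{ github.run_id }}"
```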


Deployment Metrics

Key Metrics to Track:

  • Deployment frequency (multiple per week)

  • Deployment success rate (target: >95%)

  • Mean time to deploy (target: <30 minutes)

  • Mean time to recovery (target: <30 minutes)

  • Change failure rate (target: <5%)

Tracking:
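Deployment frequency and success rate can be pulled straight from the GitHub CLI (the workflow name is assumed; requires `gh` and `jq`):

```bash
# Success rate (%) over the last 50 production deploys
gh run list --workflow ml_pipelines_prod_deploy.yml --limit 50 --json conclusion |
  jq '[.[].conclusion] | 100 * (map(select(. == "success")) | length) / length'
```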


Troubleshooting CI/CD Issues

Workflow Not Triggering

Symptoms: PR merged but dev workflow didn't start

Causes:

  • Path filters excluding changed files

  • Workflow syntax error

  • GitHub Actions disabled

Solution: See Troubleshooting Guide


Bundle Validation Failed

Symptoms: Validation step fails in workflow

Causes:

  • YAML syntax errors in databricks.yml

  • Invalid variable references

  • Missing required fields

Solution: See Troubleshooting Guide


OIDC Authentication Failed

Symptoms: Authentication error in workflow

Causes:

  • Service principal OIDC not configured

  • GitHub environment mismatch

  • Client ID incorrect

Solution: See Troubleshooting Guide


Staging Deployment Not Triggering

Symptoms: Dev succeeds but staging doesn't start

Causes:

  • workflow_run condition not met

  • Environment approval pending

  • Workflow disabled

Solution: See Troubleshooting Guide

