CI/CD Pipeline Architecture
Overview
The ML Pipelines project uses a fully automated CI/CD pipeline powered by GitHub Actions and Databricks Asset Bundles. The pipeline implements a 4-tier deployment strategy (sandbox, dev, staging, prod) with progressive promotion gates ensuring production safety while maintaining developer velocity.
Architecture Principles
GitHub OIDC Authentication: Passwordless authentication using OpenID Connect federation
Service Principal Isolation: Each environment has its own dedicated service principal
Progressive Deployment: Automatic deployment through environments with validation gates
Infrastructure as Code: All resources defined in YAML and deployed via Databricks Asset Bundles
Binary Promotion: Models trained in staging are promoted (not retrained) to production
Deployment Workflow
For complete deployment workflow details and step-by-step instructions, see Deployment Guide.
The deployment follows a 4-tier strategy (Sandbox → Dev → Staging → Prod) with progressive promotion gates. See ADR-001: Four-Tier Deployment Architecture for the architectural rationale.
Workflow Summary
Sandbox: Developer local testing (`make deploy` → personal catalog)
Dev: Automatic on PR merge (CI/CD via GitHub OIDC)
Staging: Automatic after dev success (with approval gate)
Production: Automatic after staging success (with approval gate)
Each stage uses environment-specific service principals for authentication and catalog isolation.
GitHub Workflows
1. PR Validation Workflow
File: .github/workflows/ml_pipelines_pr_validate.yml
Purpose: Validate bundle configuration on every pull request to catch errors early.
Trigger: Every pull request (subject to path filters)
Steps:
Checkout code from PR branch
Install Databricks CLI (latest version)
Set up Python 3.13 environment
Install make and uv package manager
Run `make validate-dev` to validate the bundle
Post success/failure comment on PR
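The validation step can be sketched as follows. `make validate-dev` is assumed to wrap the Databricks CLI's bundle validation (the target name comes from this doc; the actual Makefile may differ). The dry-run helper only prints the command:

```shell
# Sketch of the PR validation step; `run` prints commands instead of executing them.
run() { echo "+ $*"; }                  # swap `echo` for real execution
run databricks bundle validate -t dev   # check databricks.yml against the dev target
```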
Authentication:
Method: GitHub OIDC
Service Principal: `ml-pipelines-dev` (03ff99cd-a352-40bb-9d33-414c9ad9e7aa)
Workspace: Dev workspace (dbc-a72d6af9-df3d)
Permissions:
Success Criteria: Bundle validation passes without errors
2. Dev Deployment Workflow
File: .github/workflows/ml_pipelines_dev_deploy.yml
Purpose: Deploy code changes to the shared dev environment on every merge to main.
Trigger: Push to main (PR merge), subject to path filters
Steps:
Checkout code with submodules
Install Databricks CLI
Set up Python 3.13 environment
Install make and uv package manager
Run `make deploy-dev`, which:
Validates bundle configuration
Builds Python wheel (uv build --wheel)
Deploys to dev workspace
Updates all jobs and pipelines
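The steps above suggest the following command sequence behind `make deploy-dev` (a sketch inferred from this doc; the real Makefile may differ). The dry-run helper prints each command instead of executing it:

```shell
# Assumed shape of `make deploy-dev`; `run` echoes instead of executing.
run() { echo "+ $*"; }                  # swap `echo` for real execution
run databricks bundle validate -t dev   # validate bundle configuration
run uv build --wheel                    # build the Python wheel artifact
run databricks bundle deploy -t dev     # push jobs, pipelines, and wheel to dev
```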
Authentication:
Method: GitHub OIDC
Service Principal: `ml-pipelines-dev` (03ff99cd-a352-40bb-9d33-414c9ad9e7aa)
Workspace: Dev workspace (dbc-a72d6af9-df3d)
Environment Variables:
Success Criteria: Deployment completes successfully; all resources updated
What Gets Deployed:
Updated Python wheel to dev catalog
DLT pipeline configurations
Job definitions
Model registration jobs
External volumes (if changed)
3. Staging Deployment Workflow
File: .github/workflows/ml_pipelines_staging_deploy.yml
Purpose: Deploy to staging environment and train models on production data for final validation.
Trigger: `workflow_run` on completion of the Dev Deployment workflow
Condition: Only runs if dev deployment succeeded
Steps:
Checkout code with submodules
Install Databricks CLI
Set up Python 3.13 environment
Install make and uv package manager
Run `make deploy-staging`, which:
Validates bundle configuration
Builds Python wheel
Deploys to staging workspace
Trains models on prod data (read from prod catalog)
Saves models to staging.models schema
Authentication:
Method: GitHub OIDC
Service Principal: `ml-pipelines-staging` (93bda7cf-b009-49d8-8e8d-046677c8597e)
Workspace: Staging workspace (dbc-fab2a42a-8d11)
Environment: staging (requires manual approval in GitHub)
Key Principle: Staging trains models on production data so that the same model binary validated in staging is the one promoted to production. Training reads from the prod catalog, but raw production data and content never leave prod.
Success Criteria:
Deployment completes successfully
Models trained and registered in staging.models
Validation tests pass
4. Production Deployment Workflow
File: .github/workflows/ml_pipelines_prod_deploy.yml
Purpose: Promote validated models and deploy pipelines to production.
Trigger: `workflow_run` on completion of the Staging Deployment workflow
Condition: Only runs if staging deployment succeeded
Steps:
Checkout code with submodules
Install Databricks CLI
Set up Python 3.13 environment
Install make and uv package manager
Run `make deploy-prod`, which:
Validates bundle configuration
Builds Python wheel
Promotes model binaries from staging.models to prod.models (NO RETRAINING)
Deploys pipelines to production workspace
Enables production job schedules (UNPAUSED)
Authentication:
Method: GitHub OIDC
Service Principal: `ml-pipelines-prod` (2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f)
Workspace: Production workspace (dbc-028d9e53-7ce6)
Environment: production (requires manual approval in GitHub)
Key Principle: Models are PROMOTED (binary copied), not retrained in production. This ensures the exact model tested in staging runs in production.
Success Criteria:
Deployment completes successfully
Model binaries promoted to prod.models
Pipelines deployed and enabled
No retraining occurs
GitHub OIDC Authentication
For complete details on GitHub OIDC setup, service principal configuration, and authentication workflows, see:
Service Principals Guide - Comprehensive setup instructions
ADR-003: Service Principal Per Environment - Architecture decision rationale
Key Points
Passwordless Authentication: GitHub OIDC eliminates long-lived secrets
Environment Isolation: Each environment has dedicated service principal
Automatic Token Management: Short-lived tokens that expire after workflow completion
Full Audit Trail: All actions logged to service principal identity
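A GitHub Actions job gets an OIDC token only if the workflow asks for one. A minimal fragment of the settings involved (job name and environment are illustrative; the project's real workflows may differ):

```yaml
# Sketch of the job-level settings GitHub OIDC requires.
permissions:
  id-token: write    # allow the runner to request a short-lived OIDC token
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: dev # must match the environment federated to the service principal
    steps:
      - uses: actions/checkout@v4
```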
Service Principal Summary
| Environment | Client ID | Workspace |
| --- | --- | --- |
| Dev | 03ff99cd-a352-40bb-9d33-414c9ad9e7aa | dbc-a72d6af9-df3d |
| Staging | 93bda7cf-b009-49d8-8e8d-046677c8597e | dbc-fab2a42a-8d11 |
| Prod | 2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f | dbc-028d9e53-7ce6 |
For detailed permissions per service principal, see Service Principals Guide.
Deployment Gates and Approvals
Dev Environment
Approval: None required (automatic on merge to main after PR completion)
Validation:
Bundle configuration validation
Python wheel build success
No syntax errors
Risk Level: Low (shared development environment)
Protection Rules (configured in GitHub):
Requires manual review & approval from lead developer to merge PR
Staging Environment
Approval: Required manual approval via GitHub environment protection rules
Validation:
Dev deployment succeeded
Bundle configuration validation
Model training on prod data succeeds
Integration tests pass
Risk Level: Medium (uses production data, pre-production testing)
Protection Rules (configured in GitHub):
Requires approval from lead developer to deploy in staging workspace/environment
Production Environment
Approval: Required manual approval via GitHub environment protection rules
Validation:
Staging deployment succeeded
Models validated in staging
Bundle configuration validation
All tests passed
Risk Level: High (production environment)
Protection Rules (configured in GitHub):
Require manual approval from lead developer [@taylorlaing]
Wait timer: Optional - not currently in use (e.g., 1 hour cool-off period)
Branch protection: Only from main branch
Rollback Strategies
Application Code Rollback
Scenario: Bug introduced in latest deployment
Steps:
Identify the last known good commit
Revert the problematic commit:
Open new PR with reverted code
Review, approve, and merge PR into main branch
CI/CD automatically deploys the reverted version
Monitor the deployment to confirm the fix
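The revert in step 2 might look like this (commit SHA and branch name are illustrative; the dry-run helper prints commands instead of executing them):

```shell
# Sketch of a revert prepared for a rollback PR.
run() { echo "+ $*"; }                     # swap `echo` for real execution
BAD_SHA=abc1234                            # illustrative: the problematic commit
run git switch -c "revert-$BAD_SHA"        # branch for the rollback PR
run git revert --no-edit "$BAD_SHA"        # create the inverse commit
run git push -u origin "revert-$BAD_SHA"   # push, then open a PR from this branch
```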
Timeline: 5-10 minutes
Model Rollback
Scenario: Model performance degraded in production
Steps:
Identify previous model version:
Update model serving endpoint to previous version:
Verify endpoint health and test predictions
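Pointing a serving endpoint back at a previous model version could be scripted roughly as below. Endpoint name, model name, and version are illustrative; the payload shape follows the serving endpoints API's `served_entities`, and the dry-run helper prints rather than executes:

```shell
# Hedged sketch of a model rollback via the serving endpoint config.
run() { echo "+ $*"; }   # swap `echo` for real execution
run databricks serving-endpoints update-config my-endpoint --json '{
  "served_entities": [
    {"entity_name": "prod.models.churn_model", "entity_version": "3",
     "workload_size": "Small", "scale_to_zero_enabled": true}
  ]
}'
```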
Timeline: 2-5 minutes (endpoint update time)
Pipeline Rollback
Scenario: DLT pipeline causing data quality issues
Steps:
Stop the running pipeline:
If schema issues, restore from Delta Lake time travel:
Revert pipeline code and redeploy:
Open new PR with reverted code
Review, approve, and merge PR into main branch
Restart pipeline with fixed version
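The stop-and-restore steps above can be sketched as follows (pipeline ID, catalog, and table names are illustrative; `RESTORE TABLE` is Delta Lake's time-travel restore, run as SQL in the workspace):

```shell
# Sketch of halting a DLT pipeline; `run` echoes instead of executing.
run() { echo "+ $*"; }                         # swap `echo` for real execution
run databricks pipelines stop 1234-abcd-5678   # halt the running pipeline update
# If schemas/data were damaged, restore via Delta time travel (SQL), e.g.:
#   RESTORE TABLE dev_catalog.features.user_features TO VERSION AS OF 42
```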
Timeline: 10-15 minutes
Full Environment Rollback
Scenario: Critical production outage
Steps:
Stop all production pipelines:
Pause all production jobs:
Revert to last stable commit:
Open new PR with reverted code
Review, approve, and merge PR into main branch
Monitor deployment through all environments
Gradually resume services after validation
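The emergency-stop steps might be scripted like this (pipeline IDs are illustrative; enumerate the real ones first, e.g. with `databricks jobs list`). The dry-run helper prints instead of executing:

```shell
# Sketch of stopping all production DLT pipelines.
run() { echo "+ $*"; }                     # swap `echo` for real execution
for pid in 1111-aaaa 2222-bbbb; do         # illustrative pipeline IDs
  run databricks pipelines stop "$pid"     # halt each running pipeline
done
# Pausing jobs means setting schedule.pause_status to PAUSED on each job,
# either via the Jobs API/CLI or by redeploying the bundle with paused schedules.
```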
Timeline: 20-30 minutes
Security Model
Least Privilege Access
Principle: Each service principal has only the permissions needed for its environment.
| Permission | Dev SP | Staging SP | Prod SP |
| --- | --- | --- | --- |
| Own catalog write | dev | staging | prod |
| Read other catalogs | staging (read) | prod (read) | staging.models (read) |

The remaining rows of the permission matrix (create jobs, create pipelines, create endpoints, delete production data) are detailed per environment in the Service Principals Guide.
Audit Trail
All CI/CD actions are logged with:
GitHub Actions logs: Complete workflow execution history
Databricks audit logs: All API calls logged to service principal
MLflow tracking: Model training and registration events
Git history: Complete code change history
Viewing Audit Logs:
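If Databricks system tables are enabled on the account, the audit trail for a service principal can be queried directly. An illustrative query (the application ID below is the dev service principal's from this doc; audit rows record SP activity under `user_identity.email`):

```sql
-- Recent actions taken by the dev service principal (requires system tables access)
SELECT event_time, service_name, action_name
FROM system.access.audit
WHERE user_identity.email = '03ff99cd-a352-40bb-9d33-414c9ad9e7aa'
ORDER BY event_time DESC
LIMIT 100;
```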
Secret Management
GitHub Secrets (if needed):
GH_PAT: Personal access token for submodule access (only secret used)
Databricks Secrets: Not used in CI/CD (OIDC replaces secrets)
Best Practices:
Never commit credentials to code
Use GitHub OIDC for all authentication
Rotate `GH_PAT` every 90 days
Use Databricks secret scopes for runtime credentials
Monitoring and Observability
Workflow Monitoring
GitHub Actions Dashboard:
View all workflow runs: https://github.com/refresh-os/ml-pipelines/actions
Filter by status: Failed, In Progress, Success
Download logs for failed runs
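The same checks work from the terminal with the GitHub CLI (the workflow file name is from this doc; the run ID is illustrative). The dry-run helper prints instead of executing:

```shell
# Sketch of workflow monitoring via the GitHub CLI.
run() { echo "+ $*"; }   # swap `echo` for real execution
run gh run list --workflow ml_pipelines_dev_deploy.yml --status failure  # failed runs
run gh run view 123456789 --log                                          # inspect one run's logs
```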
Slack Notifications (if configured):
Deployment Metrics
Key Metrics to Track:
Deployment frequency (multiple per week)
Deployment success rate (target: >95%)
Mean time to deploy (target: <30 minutes)
Mean time to recovery (target: <30 minutes)
Change failure rate (target: <5%)
Tracking:
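One lightweight way to compute these metrics is the GitHub CLI plus `jq`. A hedged sketch of the success-rate calculation, where the sample JSON stands in for real `gh run list --json conclusion` output:

```shell
# Success rate over a window of workflow runs (sample data replaces `gh run list`).
runs='[{"conclusion":"success"},{"conclusion":"failure"},{"conclusion":"success"},{"conclusion":"success"}]'
rate=$(echo "$runs" | jq '([.[] | select(.conclusion=="success")] | length) * 100 / length')
echo "success rate: ${rate}%"   # 75% for the sample data
```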
Troubleshooting CI/CD Issues
Workflow Not Triggering
Symptoms: PR merged but dev workflow didn't start
Causes:
Path filters excluding changed files
Workflow syntax error
GitHub Actions disabled
Solution: See Troubleshooting Guide
Bundle Validation Failed
Symptoms: Validation step fails in workflow
Causes:
YAML syntax errors in databricks.yml
Invalid variable references
Missing required fields
Solution: See Troubleshooting Guide
OIDC Authentication Failed
Symptoms: Authentication error in workflow
Causes:
Service principal OIDC not configured
GitHub environment mismatch
Client ID incorrect
Solution: See Troubleshooting Guide
Staging Deployment Not Triggering
Symptoms: Dev succeeds but staging doesn't start
Causes:
workflow_run condition not met
Environment approval pending
Workflow disabled
Solution: See Troubleshooting Guide
Related Documentation
Model Promotion Guide - ML model lifecycle and promotion workflow
Service Principals Guide - Authentication setup
Deployment Guide - Manual deployment procedures
Troubleshooting Guide - CI/CD troubleshooting
GitHub Workflows README - Detailed workflow documentation