Deployment Guide

Overview

This guide covers deploying ML pipelines across all environments: sandbox (local), dev, staging, and production. The deployment architecture uses GitHub Actions for CI/CD with environment-specific service principals.

Deployment Architecture

4-Tier Environment Strategy

The platform uses a 4-tier deployment architecture (Sandbox → Dev → Staging → Prod) with progressive promotion gates. For the architectural rationale and detailed decision, see ADR-001: Four-Tier Deployment Architecture.

Deployment Flow: Sandbox (local) → Dev (auto on merge) → Staging (auto with approval) → Production (auto with approval)

| Environment | Catalog             | Deployed By        | Service Principal    | Purpose               |
|-------------|---------------------|--------------------|----------------------|-----------------------|
| Sandbox     | {username}_sandbox  | Developer (manual) | User credentials     | Rapid local iteration |
| Dev         | dev                 | GitHub Actions     | ml-pipelines-dev     | Integration testing   |
| Staging     | staging             | GitHub Actions     | ml-pipelines-staging | Pre-prod validation   |
| Production  | prod                | GitHub Actions     | ml-pipelines-prod    | Live workloads        |

Prerequisites and Setup

1. Local Development Setup

Required Tools:

# Databricks CLI
databricks --version  # Should be >= 0.200.0

# UV package manager
uv --version

# AWS CLI (for S3 volume access)
aws --version

# Git
git --version

Configure Databricks Profiles:

File: ~/.databrickscfg
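A minimal profile layout might look like the following; the workspace URLs and profile names are illustrative, so substitute your own:

```ini
; One profile per workspace you access directly.
; Hosts below are placeholders -- use your actual workspace URLs.
[sandbox]
host      = https://dev-workspace.cloud.databricks.com
auth_type = databricks-cli

[staging]
host      = https://staging-workspace.cloud.databricks.com
auth_type = databricks-cli
```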

Authenticate:
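With profiles in place, log in via OAuth (the `sandbox` profile name is the assumption from the config sketch, not a requirement):

```bash
# Interactive OAuth login for a profile
databricks auth login --profile sandbox

# Sanity-check the resulting credentials
databricks current-user me --profile sandbox
```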

2. Service Principal Setup

See Service Principals Guide for detailed setup instructions.

Service Principals Required:

  • ml-pipelines-dev-service-principal (UUID: 03ff99cd-a352-40bb-9d33-414c9ad9e7aa)

  • ml-pipelines-staging-service-principal (UUID: 93bda7cf-b009-49d8-8e8d-046677c8597e)

  • ml-pipelines-prod-service-principal (UUID: 2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f)

Verify Service Principals:
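One way to confirm the principals exist in the workspace (profile name is illustrative):

```bash
# List workspace service principals and filter for the ml-pipelines ones
databricks service-principals list --profile sandbox | grep ml-pipelines
```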

3. GitHub Secrets

Required Secrets:

  • GH_PAT - GitHub Personal Access Token (for checking out code with submodules)

GitHub Environments:

  • development - For dev deployments

  • staging - For staging deployments

  • production - For production deployments

Sandbox Deployment (Developer Local)

Deploy to Your Sandbox
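Assuming the bundle's development-mode target is named `dev` (consistent with the `[dev <username>]` resource prefix), a sandbox deploy is typically:

```bash
# Validate first, then deploy under your own user credentials
databricks bundle validate
databricks bundle deploy --target dev --profile sandbox
```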

What Happens:

  1. Validates databricks.yml configuration

  2. Determines your username (e.g., taylor)

  3. Creates catalog taylor_sandbox if it doesn't exist

  4. Creates S3 volume directories for your catalog

  5. Builds Python wheel package with uv build --wheel

  6. Deploys all pipelines and jobs to dev workspace

  7. Resources are named [dev taylor] resource_name_sandbox

Verify Sandbox Deployment:
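A quick check (target, profile, and username are illustrative):

```bash
# Summarize the resources the bundle deployed
databricks bundle summary --target dev --profile sandbox

# Confirm the sandbox catalog was created
databricks catalogs get taylor_sandbox --profile sandbox
```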

Copy Sample Data (optional):

Sandbox Cleanup
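When you are done, tear the sandbox down (flags assumed from the current Databricks CLI; username is illustrative):

```bash
# Remove all bundle-managed resources from your sandbox
databricks bundle destroy --target dev --profile sandbox

# Optionally drop the sandbox catalog and everything in it
databricks catalogs delete taylor_sandbox --force --profile sandbox
```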

Dev Deployment (Automatic on PR Merge)

Workflow Trigger

File: .github/workflows/ml_pipelines_dev_deploy.yml

Triggers:

  • Push to main branch

  • Manual workflow dispatch (optional)

Process:

  1. PR opened → Bundle validation runs

  2. PR merged to main → Dev deployment triggers automatically

  3. Service principal ml-pipelines-dev deploys to dev workspace

Manual Dev Deployment

Note: Local dev deployment requires service principal credentials, which are typically only available in CI/CD. For local testing, use sandbox instead.

Verify Dev Deployment
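A sketch of the checks, assuming a `dev` profile pointing at the dev workspace:

```bash
# Jobs deployed by the dev service principal
databricks jobs list --profile dev

# Pipelines in the dev workspace
databricks pipelines list-pipelines --profile dev
```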

Troubleshooting Dev Deployment

Check GitHub Actions logs:

  1. Go to repository → Actions tab

  2. Click "Deploy to Development" workflow

  3. Review step logs for errors

Common Issues:

  • Bundle validation failed: Fix databricks.yml syntax errors

  • Authentication failed: Verify service principal OIDC configuration

  • Catalog not found: Check catalog permissions for service principal

  • Wheel build failed: Check pyproject.toml dependencies

Staging Deployment (Automatic After Dev)

Workflow Trigger

File: .github/workflows/ml_pipelines_staging_deploy.yml

Triggers:

  • Automatically after successful dev deployment

  • Manual workflow dispatch (for hotfixes)

Condition: Only runs if dev deployment succeeded

Deployment Process

What Happens:

  1. Dev deployment completes successfully

  2. Staging workflow automatically triggered

  3. Service principal ml-pipelines-staging deploys to staging workspace

  4. All pipelines/jobs deployed to staging catalog

Verify Staging Deployment
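Besides confirming the resources exist, verify the schedules stayed paused (profile and job ID are illustrative):

```bash
# Jobs should exist in staging...
databricks jobs list --profile staging

# ...and their schedules should report PAUSED
databricks jobs get 1234 --profile staging | grep -i pause_status
```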

Staging-Specific Considerations

Paused Schedules: Staging jobs are paused by default to avoid unnecessary runs:
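In a Databricks Asset Bundle this is usually set per target; a sketch, with the job key and cron expression as placeholders (schedule field names per the Jobs API `CronSchedule`):

```yaml
targets:
  staging:
    resources:
      jobs:
        daily_scoring_job:        # job key is illustrative
          schedule:
            quartz_cron_expression: "0 0 2 * * ?"
            timezone_id: "UTC"
            pause_status: PAUSED  # no scheduled runs in staging
```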

Manual Trigger (if needed):
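A one-off run can be kicked off from the CLI (job ID is a placeholder):

```bash
databricks jobs run-now 1234 --profile staging
```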

Production Deployment (Automatic After Staging)

Workflow Trigger

File: .github/workflows/ml_pipelines_prod_deploy.yml

Triggers:

  • Automatically after successful staging deployment

  • Manual workflow dispatch (for emergency deployments)

Condition: Only runs if staging deployment succeeded

Deployment Process

What Happens:

  1. Staging deployment completes successfully

  2. Production workflow automatically triggered

  3. Service principal ml-pipelines-prod deploys to production workspace

  4. All pipelines/jobs deployed to production catalog

  5. Schedules are UNPAUSED for production workloads

Production Safety Checks

Unpaused Schedules: Only production runs on schedule automatically:

Concurrency Limits:
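Both behaviors can be sketched in the bundle's prod target (job key, cron, and limit are illustrative):

```yaml
targets:
  prod:
    resources:
      jobs:
        daily_scoring_job:               # job key is illustrative
          max_concurrent_runs: 1         # never overlap production runs
          schedule:
            quartz_cron_expression: "0 0 2 * * ?"
            timezone_id: "UTC"
            pause_status: UNPAUSED       # prod is the only tier that runs on schedule
```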

Verify Production Deployment

Production Monitoring

Check Pipeline Health:
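A starting point from the CLI (profile is illustrative; see `databricks jobs list-runs -h` for filtering flags):

```bash
# Recent job run history in prod
databricks jobs list-runs --profile prod

# Current state of DLT pipelines
databricks pipelines list-pipelines --profile prod
```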

Verification Steps for Each Environment

Post-Deployment Checklist

Sandbox Verification

Dev Verification

Staging Verification

Production Verification

Rollback Procedures

Rollback Strategy

  1. Identify the problem (failed deployment, data corruption, model issues)

  2. Assess impact (which environment is affected)

  3. Choose rollback method (git revert, pipeline reset, data restore)

  4. Execute rollback (steps below)

  5. Verify (run verification checks)

For configuration or code issues:
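The usual path is a revert on main, which re-triggers the full dev → staging → prod chain (commit SHA is a placeholder):

```bash
# Identify the bad commit, then revert it
git log --oneline -5
git revert <bad-commit-sha>
git push origin main   # CI redeploys the reverted state
```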

Pipeline-Specific Rollback

Reset a pipeline to previous state:
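If a pipeline's tables are in a bad state, a full refresh recomputes them from source. Subcommand and flag are assumed from the current Databricks CLI (verify with `databricks pipelines -h`); pipeline ID is a placeholder:

```bash
databricks pipelines start-update <pipeline-id> --full-refresh --profile prod
```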

Data Rollback (Time Travel)

Restore data to previous version:
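Delta time travel handles this; run from a notebook or the SQL editor (catalog, schema, table, and version number are illustrative):

```sql
-- Find the last good version
DESCRIBE HISTORY prod.analytics.predictions;

-- Restore the table to that version
RESTORE TABLE prod.analytics.predictions TO VERSION AS OF 42;
```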

Model Rollback

Revert to previous model version:

Emergency Rollback (Full Environment)

For catastrophic failures:
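A sketch of a full redeploy from the last known-good tag (tag name is a placeholder). Note that prod service principal credentials are normally available only in CI, so prefer re-running the production workflow against that tag:

```bash
git checkout <last-good-tag>
databricks bundle validate
databricks bundle deploy --target prod --profile prod
```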

Common Deployment Issues

1. Bundle Validation Failures

Error: Error: databricks.yml validation failed

Causes:

  • YAML syntax errors

  • Invalid variable references

  • Missing required fields

Solutions:
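Reproduce and narrow down locally:

```bash
# Run the same validation CI runs
databricks bundle validate

# If the error message is unhelpful, lint the YAML directly
# (yamllint is an extra tool, not part of the required stack above)
yamllint databricks.yml
```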

2. Authentication Failures

Error: Error: OIDC authentication failed

Causes:

  • Service principal not configured

  • Incorrect Client ID

  • OIDC trust not set up

Solutions:
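Helpful first checks (`auth describe` is available in recent CLI versions):

```bash
# Show which auth method and profile the CLI resolved
databricks auth describe

# Confirm which identity the token maps to
databricks current-user me
```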

3. Permission Errors

Error: Error: Permission denied on catalog

Causes:

  • Service principal lacks catalog permissions

  • Catalog doesn't exist

  • Wrong workspace

Solutions:
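Inspect and, as a catalog admin, repair grants. The CLI shape is assumed from the Unity Catalog grants API; principal name and privilege list are illustrative:

```bash
# What does the service principal currently have on the catalog?
databricks grants get catalog dev

# Add the missing privileges
databricks grants update catalog dev --json '{
  "changes": [{"principal": "ml-pipelines-dev",
               "add": ["USE_CATALOG", "USE_SCHEMA", "CREATE_TABLE"]}]
}'
```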

4. Pipeline Deployment Failures

Error: Error: Pipeline deployment failed

Causes:

  • Invalid SQL syntax

  • Missing catalog references

  • Schema conflicts

Solutions:
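Re-validate, then read the pipeline's own event log, which usually contains the real error (subcommand assumed from the current CLI; pipeline ID is a placeholder):

```bash
databricks bundle validate
databricks pipelines list-pipeline-events <pipeline-id>
```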

GitHub Actions Workflow Explanation

Workflow Structure

1. PR Validation (ml_pipelines_pr_validate.yml):

2. Dev Deployment (ml_pipelines_dev_deploy.yml):

3. Staging Deployment (ml_pipelines_staging_deploy.yml):

4. Production Deployment (ml_pipelines_prod_deploy.yml):

Workflow Dependencies
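Chaining like this is typically expressed with `workflow_run` triggers; a sketch for the staging workflow (the upstream name must match the dev workflow's `name:` exactly, and the job body is a placeholder):

```yaml
name: Deploy to Staging
on:
  workflow_run:
    workflows: ["Deploy to Development"]   # upstream workflow name, as assumed
    types: [completed]

jobs:
  deploy:
    # Only proceed when the dev deployment succeeded
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - run: echo "staging deploy steps here"
```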

Service Principal Authentication

For complete service principal setup and GitHub OIDC configuration, see:

Key Benefits:

  • Passwordless Databricks authentication via OIDC (no long-lived Databricks tokens stored in GitHub)

  • Environment-specific service principals for isolation

  • Automatic token rotation and audit trail

  • Scoped to specific repositories and environments

Best Practices

1. Always Test in Sandbox First

2. Monitor Deployment Progress

3. Use Feature Flags for Risky Changes

4. Gradual Rollout

  • Deploy to dev → Validate

  • Deploy to staging → Validate with prod data

  • Deploy to prod → Monitor closely

5. Keep Environments in Sync

  • Same Databricks Runtime version

  • Same library versions

  • Same configuration (except environment-specific)

This Repository

Cross-Repository Documentation

  • infra-core - Terraform infrastructure and service principal configuration

    • Manages: Databricks workspaces, Unity Catalog, VPC, cross-region replication

  • api-core - API deployment and integration

    • Consumes: ML pipeline outputs via Databricks SQL endpoints

  • app-web - Frontend deployment

    • Displays: Insights and analytics from ML pipelines

Emergency Contacts

For deployment issues:

  • Taylor Laing ([email protected]) - Admin access

  • Check #ml-pipelines Slack channel

  • Review GitHub Actions logs

  • Check Databricks workspace notifications
