CI/CD Implementation Plan for ML Pipelines

Databricks Asset Bundles with Unity Catalog Data Isolation

Date: 2025-09-30
Status: Implementation Ready
Architecture Pattern: Developer Sandbox → Shared Dev → Staging → Production


Table of Contents


Executive Summary

Problem Statement

Current setup lacks proper developer isolation and has workspace configuration issues:

  • Dev and Prod share the same workspace (compliance/security risk)

  • No developer sandbox isolation (conflicts between Taylor & William)

  • Hardcoded hosts in databricks.yml (not using profiles)

  • GitHub workflows use incorrect deployment commands

  • No clear data isolation strategy between developers

Proposed Solution

Implement a 4-tier deployment strategy with Unity Catalog-based data isolation:

  1. Sandbox (Local) - Developer-specific catalogs (taylor_sandbox, william_sandbox)

  2. Dev (CI/CD) - Shared integration catalog (dev)

  3. Staging (CI/CD) - Pre-production catalog (staging)

  4. Production (CI/CD) - Production catalog (prod)

Benefits

  • Zero conflicts - Each developer has an isolated sandbox catalog

  • No waiting - Parallel development without blocking

  • Security - Proper workspace isolation (dev/staging/prod separate)

  • Compliance - Clear data lineage and promotion path

  • Cost efficient - Sandbox catalogs auto-clean; serverless pipelines

  • Industry standard - Follows Databricks best practices


Current State Analysis

Workspace Configuration CONFIRMED

| Environment | Workspace ID | Current Host | Status |
| --- | --- | --- | --- |
| Dev | dbc-a72d6af9-df3d | https://dbc-a72d6af9-df3d.cloud.databricks.com | Correct |
| Staging | dbc-fab2a42a-8d11 | https://dbc-fab2a42a-8d11.cloud.databricks.com | Correct |
| Prod | dbc-028d9e53-7ce6 | https://dbc-028d9e53-7ce6.cloud.databricks.com | Confirmed (needs config update) |

Note: databricks.yml currently has prod pointing to dev workspace - will be fixed in Phase 2

Unity Catalog Structure

Databricks Profiles

Located in ~/.databrickscfg:

  • ref-dev → Dev workspace

  • ref-staging → Staging workspace

  • ref-prod → Prod workspace
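For reference, a sketch of what these three profiles might look like in ~/.databrickscfg (the auth settings are assumed; OAuth via the CLI is one common option):

```ini
[ref-dev]
host      = https://dbc-a72d6af9-df3d.cloud.databricks.com
auth_type = databricks-cli

[ref-staging]
host      = https://dbc-fab2a42a-8d11.cloud.databricks.com
auth_type = databricks-cli

[ref-prod]
host      = https://dbc-028d9e53-7ce6.cloud.databricks.com
auth_type = databricks-cli
```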

Service Principals (Environment-Specific)

Current: Single github-service-principal (f920d175-cf7c-43a2-a6c0-e9ccb42c02d2)

New Architecture (see SERVICE_PRINCIPAL_SETUP.md):

  • ml-pipelines-dev-service-principal - For dev deployments

  • ml-pipelines-staging-service-principal - For staging deployments

  • ml-pipelines-prod-service-principal - For prod deployments

Benefit: Environment isolation, better audit trail, SOC2 compliance

Current Pipelines

Both developers deploying independently:

  • Taylor: [dev taylor] bronze-ingestion-dev, [dev taylor] sentiment-analysis-dev, etc.

  • William: ref_dev_william_bronze-ingestion-dev, ref_dev_william_emoji-analysis-dev, etc.

Issue: Inconsistent naming, no clear separation of sandbox vs shared dev


Target Architecture

Deployment Flow

Data Isolation Strategy

Unity Catalog Hierarchy

MLOps Model Promotion Strategy

Shared Metastore Architecture

Key Insight: All workspaces (dev, staging, prod) share the same Unity Catalog metastore (4c140405-61ff-4e4e-9684-07078639218e). This enables efficient model promotion without retraining.

Model Training & Promotion Flow

Model Promotion Implementation

Python Code for Promotion (staging/prod jobs):
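A hedged sketch of what a promotion job might run, assuming MLflow's Unity Catalog registry and the `copy_model_version` client API (MLflow ≥ 2.8); catalog, schema, and model names here are illustrative:

```python
# Sketch: promote a model trained in dev by copying the registered
# version's metadata into the target catalog. Because all catalogs share
# one metastore, both entries reference the same underlying artifacts.


def source_model_uri(catalog: str, schema: str, model: str, version: int) -> str:
    """Build the models:/ URI for a Unity Catalog registered model version."""
    return f"models:/{catalog}.{schema}.{model}/{version}"


def promote_model(model: str, version: int, target_catalog: str,
                  source_catalog: str = "dev", schema: str = "models") -> None:
    """Copy a model version across catalogs without retraining."""
    from mlflow import MlflowClient  # imported lazily; requires mlflow installed

    client = MlflowClient(registry_uri="databricks-uc")
    client.copy_model_version(
        src_model_uri=source_model_uri(source_catalog, schema, model, version),
        dst_name=f"{target_catalog}.{schema}.{model}",
    )
```

A staging job would then call something like `promote_model("sentiment", 3, "staging")` once validation passes.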

Benefits of This Approach

  • No Duplicate Training - Train once in dev, promote everywhere

  • No Artifact Duplication - All catalogs reference the same S3 files

  • Fast Promotion - Metadata copy takes seconds vs hours of retraining

  • Consistent Models - Exact same model binary across all environments

  • Cost Efficient - No redundant compute for retraining

  • Audit Trail - Clear lineage from dev → staging → prod

  • Easy Rollback - Promote previous versions if issues arise

Job Distribution by Environment

| Environment | Training Jobs | Promotion Jobs | Pipelines | Purpose |
| --- | --- | --- | --- | --- |
| Sandbox | No | No | Yes | Developer testing |
| Dev | Yes | No | Yes | Model training source |
| Staging | No | Yes | Yes | Model validation |
| Prod | No | Yes | Yes | Model serving |

Permissions Model

| Principal | Sandbox Catalogs | Dev Catalog | Staging Catalog | Prod Catalog |
| --- | --- | --- | --- | --- |
| Databricks - Dev group | CREATE CATALOG + FULL OWNERSHIP of own | SELECT (read) | SELECT (read) | None |
| Service Principal | None | ALL PRIVILEGES | ALL PRIVILEGES | ALL PRIVILEGES |
| Databricks - Account Admin group | ALL PRIVILEGES | ALL PRIVILEGES | ALL PRIVILEGES | ALL PRIVILEGES |


Implementation Phases

Phase 1: Infrastructure Setup (Terraform) CRITICAL FIRST

Objective: Create proper workspace separation and sandbox catalog permissions

1.1 Fix Production Workspace Separation

Location: /Users/taylorlaing/Development/refresh-os/infra-core/stacks/ml-databricks/

Changes Needed:

Action Items:

1.2 Grant Sandbox Catalog Creation Permissions

Location: /Users/taylorlaing/Development/refresh-os/infra-core/stacks/ml-databricks/main.tf

Add New Resource:
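A minimal sketch of the grant, assuming the `databricks_grants` resource of the Databricks Terraform provider and the shared metastore ID noted in this plan (the resource name is illustrative):

```hcl
# Sketch (resource name assumed): let sandbox developers create
# their own catalogs on the shared metastore.
resource "databricks_grants" "sandbox_catalog_creation" {
  metastore = "4c140405-61ff-4e4e-9684-07078639218e"

  grant {
    principal  = "Databricks - Dev"
    privileges = ["CREATE_CATALOG"]
  }
}
```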

Why Terraform?: Catalog permissions are foundational infrastructure that shouldn't change per deployment

1.3 Developer Groups & Service Principals CONFIRMED

Existing Groups (managed via Okta SCIM - no Terraform needed):

  • Databricks - Account Admin

  • Databricks - Dev (for sandbox developers)

  • Databricks - Staging

  • Databricks - Prod

  • Databricks - Prod - ReadOnly

Service Principals to Create:

  • ml-pipelines-dev-service-principal

  • ml-pipelines-staging-service-principal

  • ml-pipelines-prod-service-principal

See: SERVICE_PRINCIPAL_SETUP.md for creation steps

Terraform Data Sources (add to main.tf):
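A sketch of the data sources, assuming the provider's `databricks_service_principal` data source and lookup by display name:

```hcl
# Sketch: look up the environment service principals by display name
# so later grants can reference their application IDs.
data "databricks_service_principal" "ml_pipelines_dev" {
  display_name = "ml-pipelines-dev-service-principal"
}

data "databricks_service_principal" "ml_pipelines_staging" {
  display_name = "ml-pipelines-staging-service-principal"
}

data "databricks_service_principal" "ml_pipelines_prod" {
  display_name = "ml-pipelines-prod-service-principal"
}
```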

1.4 Grant Service Principal Catalog Permissions

Location: /Users/taylorlaing/Development/refresh-os/infra-core/stacks/ml-databricks/main.tf

Add After Unity Catalog Module:
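A hedged sketch of one such grant (the `var.environment` variable and resource name are assumed, and the principal may need to be the service principal's application ID rather than its display name):

```hcl
# Sketch: grant the dev service principal full access to the dev catalog.
# count gates the grant so it is only applied when the dev workspace
# stack is the deployment target.
resource "databricks_grants" "dev_catalog_service_principal" {
  count   = var.environment == "dev" ? 1 : 0
  catalog = "dev"

  grant {
    principal  = "ml-pipelines-dev-service-principal"
    privileges = ["ALL_PRIVILEGES"]
  }
}
```

Staging and prod would get analogous resources gated on their own environment values.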

Why: Service principals need permissions to write pipeline data to catalogs. Using count ensures only the target workspace gets grants during progressive deployment.


Phase 2: Databricks Bundle Configuration (databricks.yml)

Objective: Implement 4-tier deployment targets with proper data isolation

2.1 Update Bundle Variables

File: /Users/taylorlaing/Development/refresh-os/ml-pipelines/databricks.yml

Current:

Updated:
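A hedged sketch of what the updated variables block might look like (descriptions and defaults assumed, bucket names taken from Appendix A):

```yaml
variables:
  catalog_name:
    description: Unity Catalog catalog this target reads and writes
    default: dev
  environment:
    description: Deployment environment (sandbox, dev, staging, prod)
    default: dev
  s3_bucket:
    description: Workspace bucket for pipeline storage
    default: ref-ml-core-dev-workspace-bucket
```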

Changes:

  • Rename catalog → catalog_name for clarity

  • Remove service_principal_id variable (use names directly in run_as blocks)

2.2 Add Sandbox Target (NEW)

Add to databricks.yml:
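A sketch of the sandbox target under these assumptions (Asset Bundle `mode: development` and the `${workspace.current_user.short_name}` substitution; exact keys may differ from the final config):

```yaml
targets:
  sandbox:
    mode: development
    default: true
    workspace:
      profile: ref-dev
    variables:
      catalog_name: ${workspace.current_user.short_name}_sandbox
      environment: sandbox
```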

Key Feature: ${workspace.current_user.short_name}_sandbox dynamically creates taylor_sandbox or william_sandbox

2.3 Update Dev Target (Shared Integration)

Replace existing dev target:
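A hedged sketch of the dev target; note that `run_as.service_principal_name` typically expects the principal's application ID, so the display name shown here is a placeholder:

```yaml
  dev:
    mode: production
    workspace:
      profile: ref-dev
    run_as:
      service_principal_name: ml-pipelines-dev-service-principal
    variables:
      catalog_name: dev
      environment: dev
```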

Change Summary:

  • Uses profile: ref-dev instead of hardcoded host

  • Uses environment-specific service principal: ml-pipelines-dev-service-principal

  • Uses shared dev catalog

2.4 Update Staging Target

Replace existing staging target:

Changes:

  • Use profile instead of hardcoded host

  • Use staging-specific service principal

2.5 Update Production Target

Replace existing prod target:
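A hedged sketch of the corrected prod target (same caveat as dev: `run_as` may need the application ID rather than the display name):

```yaml
  prod:
    mode: production
    workspace:
      profile: ref-prod   # dbc-028d9e53-7ce6, no longer the dev host
    run_as:
      service_principal_name: ml-pipelines-prod-service-principal
    variables:
      catalog_name: prod
      environment: prod
```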

Critical Changes:

  • Profile points to correct prod workspace (dbc-028d9e53-7ce6)

  • Use prod-specific service principal

2.6 Update Pipeline Resource References

File: All pipeline YAML files in resources/pipelines/**/*.yml

Current Example (resources/pipelines/bronze/data_ingestion/run_bronze_data_ingestion.pipeline.yml):

Updated:
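A sketch of what an updated pipeline resource might look like (pipeline key, schema, and serverless flag assumed; the point is the `${var.catalog_name}` reference):

```yaml
resources:
  pipelines:
    bronze_data_ingestion:
      name: bronze_data_ingestion_${var.environment}
      catalog: ${var.catalog_name}
      target: bronze
      serverless: true
```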

Action Items:


Phase 3: GitHub Actions Workflows

Objective: Fix deployment commands and add proper target specifications

3.1 Update Dev Deployment Workflow

File: .github/workflows/ml_pipelines_dev_deploy.yml

Current:

Updated:
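A hedged sketch of the corrected deploy step (secret names are assumptions to be replaced with the repo's actual secrets):

```yaml
- name: Deploy bundle (dev)
  env:
    DATABRICKS_HOST: https://dbc-a72d6af9-df3d.cloud.databricks.com
    DATABRICKS_CLIENT_ID: ${{ secrets.ML_PIPELINES_DEV_CLIENT_ID }}
    DATABRICKS_CLIENT_SECRET: ${{ secrets.ML_PIPELINES_DEV_CLIENT_SECRET }}
  run: databricks bundle deploy -t dev
```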

Changes:

  • Fix command: databricks deploy --environment → databricks bundle deploy -t

  • Update Client ID to new dev service principal (Taylor will provide after creation)

3.2 Update Staging Deployment Workflow

File: .github/workflows/ml_pipelines_staging_deploy.yml

Updated:

Changes: Same as dev - fix command and use staging-specific service principal

3.3 Update Production Deployment Workflow

File: .github/workflows/ml_pipelines_prod_deploy.yml

Updated:

Critical Changes:

  1. Fix deployment command

  2. Correct prod host to dbc-028d9e53-7ce6 (was pointing to dev)

  3. Standardize Python version to 3.13

  4. Use prod-specific service principal


Phase 4: Code & Notebook Updates

Objective: Ensure all code references use new variable naming

4.1 Update Pipeline Notebooks

Location: All .py notebooks in resources/pipelines/

Pattern to Find:

Should Be:
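A minimal sketch of how a notebook might resolve the catalog after the rename, tolerating the legacy key during migration (the helper and default are assumptions; in a pipeline this would typically wrap `spark.conf.get`):

```python
# Sketch: prefer the new catalog_name key, fall back to the legacy
# catalog key, then to a default, so notebooks work during migration.
def resolve_catalog(conf: dict, default: str = "dev") -> str:
    """Return the target catalog from pipeline configuration."""
    return conf.get("catalog_name") or conf.get("catalog") or default
```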

Action Items:

4.2 Update MLflow Model Registration

Location: resources/jobs/model_registration/**/*.yml

Current Pattern:

Add Configuration:

Why: Model registration should use correct catalog (taylor_sandbox.models vs dev.models)


Detailed Changes Required

Summary Checklist

Terraform Changes (/Users/taylorlaing/Development/refresh-os/infra-core/stacks/ml-databricks/)

databricks.yml Changes

Pipeline YAML Changes (resources/**/*.yml)

GitHub Workflows (.github/workflows/*.yml)

Notebook Code Changes


Open Questions & Decisions Needed

🔴 Critical Decisions

Q1: Production Workspace Confirmation

Question: Is dbc-028d9e53-7ce6 the correct production workspace?
Current State: The prod workflow points to dbc-028d9e53-7ce6, but databricks.yml points to dbc-a72d6af9-df3d (the dev workspace).
Action: Please verify which workspace should be production.
Options:

  • A) dbc-028d9e53-7ce6 (from workflow file) Recommended

  • B) Create new separate prod workspace

  • C) Keep sharing dev workspace (not recommended for compliance)

Q2: Developer Group Existence

Question: Does the "Databricks - Dev" group already exist with Taylor and William as members?
Verification Command: databricks groups list --profile ref-admin
If No: Add Terraform resources to create the groups.
If Yes: Confirm the group name and membership.

Q3: Sandbox Catalog Lifecycle

Question: How should sandbox catalogs be cleaned up?
Options:

  • A) Manual cleanup by developers (simplest)

  • B) Weekly automated cleanup via scheduled job

  • C) Auto-delete after 30 days of inactivity

  • D) No cleanup (low cost anyway)

Recommendation: Start with manual (A), add automation later if needed

Q4: Existing Pipeline Migration

Question: What should happen to existing pipelines when we deploy?
Current State:

  • Taylor: [dev taylor] bronze-ingestion-dev (12 pipelines)

  • William: ref_dev_william_bronze-ingestion-dev (4 pipelines)

Options:

  • A) Delete all, redeploy with new naming

  • B) Keep existing, new deployments use new structure

  • C) Rename existing to match new pattern

Recommendation: Option B for safety, then manual cleanup


Non-Critical Questions

Q5: Staging Auto-Promotion

Question: Should staging auto-deploy after a successful dev deployment?
Current: Staging deploys on dev workflow completion.
Options:

  • A) Keep auto-promotion (faster, less control)

  • B) Require manual approval for staging (safer)

Recommendation: B for better control

Q6: Sandbox Resource Limits

Question: Should we limit sandbox catalog sizes?
Concern: Prevent accidental large data writes to sandbox catalogs.
Options:

  • A) No limits (trust developers)

  • B) Quota limits via Databricks API

  • C) Monitoring alerts only

Recommendation: C (alerts) initially

Q7: Dev Catalog Write Access

Question: Should developers have write access to the shared dev catalog?
Current Proposal: Only the service principal writes; developers read.
Alternative: Allow developers to write to dev for hotfixes.

Recommendation: Service principal only (forces proper PR workflow)


Testing & Validation

Phase 1: Local Sandbox Testing

Test 1.1: Developer Deploys Sandbox

Tester: Taylor or William
Commands:
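A likely command sequence from a local checkout (validate first, then deploy; the profile comes from ~/.databrickscfg):

```shell
databricks bundle validate -t sandbox
databricks bundle deploy -t sandbox
```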

Expected Outcome:

  • Catalog taylor_sandbox or william_sandbox created

  • Schemas: taylor_sandbox.bronze, taylor_sandbox.silver, taylor_sandbox.gold

  • Pipeline: [dev taylor] bronze_data_ingestion_sandbox or similar

Test 1.2: Parallel Sandbox Deployments

Testers: Both Taylor and William simultaneously
Commands: Both run databricks bundle deploy -t sandbox at the same time
Expected Outcome: No conflicts; each developer gets their own catalog

Test 1.3: Sandbox Data Isolation

Tester: Taylor
Commands:

Expected Outcome: Complete isolation between sandbox catalogs


Phase 2: CI/CD Integration Testing

Test 2.1: Dev Deployment (CI/CD)

Trigger: Push to main branch
Verification:

Expected Outcome:

  • Pipeline names prefixed with dev_ (e.g., dev_bronze_ingestion_dev)

  • Data written to dev.bronze.messages, not sandbox catalogs

  • Service principal is owner of resources

Test 2.2: Staging Deployment

Trigger: Manual approval after dev deployment
Verification:

Expected Outcome:

  • Resources in staging workspace

  • Data in staging catalog

Test 2.3: Production Deployment

Trigger: Manual production deployment
Verification:

Expected Outcome:

  • Resources in CORRECT prod workspace (dbc-028d9e53-7ce6)

  • Data in prod catalog

  • No dev/staging resources visible


Phase 3: Permissions Testing

Test 3.1: Developer Permissions

Tester: William (non-admin developer)
Tests:

Expected Outcome: Developers can only read dev, write to own sandbox

Test 3.2: Service Principal Permissions

Tester: Via GitHub Actions
Tests: CI/CD workflows should successfully deploy and run pipelines in dev/staging/prod
Expected Outcome: Each environment's service principal has full access to its environment


Rollback Strategy

Scenario 1: Sandbox Deployment Breaks

Symptoms: Developers can't deploy to sandbox
Rollback:

Impact: Low - Only affects new sandbox feature, existing dev still works

Scenario 2: CI/CD Deployment Fails

Symptoms: GitHub Actions workflow fails on dev deployment
Rollback:

Impact: Medium - Blocks CI/CD but developers can still use sandbox

Scenario 3: Production Workspace Change Fails

Symptoms: Prod deployment fails after the workspace host change
Rollback:

Impact: High - Prod deployment blocked, but existing prod resources unaffected

Scenario 4: Permissions Misconfiguration

Symptoms: Developers can't create sandbox catalogs
Rollback:

Impact: Medium - Sandbox feature unavailable, dev/staging/prod unaffected


Implementation Timeline

Week 1: Infrastructure Foundation

Days 1-2: Terraform changes (workspace separation, permissions)
Days 3-4: Validate Terraform in the dev environment
Day 5: Deploy Terraform to staging/prod

Week 2: Bundle Configuration

Days 1-2: Update databricks.yml with new targets
Days 3-4: Update pipeline YAMLs and workflows
Day 5: Local testing with the sandbox target

Week 3: Testing & Migration

Days 1-2: Developer sandbox testing (Taylor & William)
Days 3-4: CI/CD integration testing
Day 5: Production migration and validation

Week 4: Cleanup & Documentation

Days 1-2: Clean up old pipelines and resources
Days 3-4: Update team documentation
Day 5: Post-implementation review


Next Steps

  1. Answer Critical Questions (see Open Questions)

  2. Review This Plan - Flag any concerns or missed items

  3. Start Phase 1 - Terraform infrastructure changes

  4. Iterate - Implement phases sequentially with validation


Appendix A: Variable Reference

databricks.yml Variables

| Variable | Sandbox | Dev | Staging | Prod |
| --- | --- | --- | --- | --- |
| catalog_name | ${user}_sandbox | dev | staging | prod |
| environment | sandbox | dev | staging | prod |
| s3_bucket | ref-ml-core-dev-workspace-bucket | ref-ml-core-dev-workspace-bucket | ref-ml-core-staging-workspace-bucket | ref-ml-core-prod-workspace-bucket |
| service_principal_id | N/A | f920d175-cf7c-43a2-a6c0-e9ccb42c02d2 | f920d175-cf7c-43a2-a6c0-e9ccb42c02d2 | f920d175-cf7c-43a2-a6c0-e9ccb42c02d2 |

Workspace Reference

| Environment | Workspace ID | Host | Profile |
| --- | --- | --- | --- |
| Dev | dbc-a72d6af9-df3d | https://dbc-a72d6af9-df3d.cloud.databricks.com | ref-dev |
| Staging | dbc-fab2a42a-8d11 | https://dbc-fab2a42a-8d11.cloud.databricks.com | ref-staging |
| Prod | dbc-028d9e53-7ce6 | https://dbc-028d9e53-7ce6.cloud.databricks.com | ref-prod |


Appendix B: Commands Reference

Bundle Management
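Common bundle commands used throughout this plan (target names match the databricks.yml targets):

```shell
databricks bundle validate -t dev      # check configuration before deploying
databricks bundle deploy -t dev        # deploy resources to the dev target
databricks bundle summary -t dev       # show deployed resources for a target
databricks bundle destroy -t sandbox   # tear down a personal sandbox deployment
```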

Catalog Management
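A sketch of Unity Catalog inspection and cleanup commands (catalog names illustrative; verify subcommand names against your installed CLI version):

```shell
databricks catalogs list --profile ref-dev
databricks schemas list taylor_sandbox --profile ref-dev
databricks catalogs delete taylor_sandbox --force --profile ref-dev
```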

Pipeline Management
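A sketch of pipeline commands (the pipeline ID placeholder is illustrative; verify subcommand names against your installed CLI version):

```shell
databricks pipelines list-pipelines --profile ref-dev
databricks pipelines start-update <pipeline-id> --profile ref-dev
```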


Document Version: 1.0
Last Updated: 2025-09-30
Author: Claude Code
Status: Ready for Review
