CI/CD Implementation Plan for ML Pipelines

Databricks Asset Bundles with Unity Catalog Data Isolation

Date: 2025-09-30
Status: Implementation Ready
Architecture Pattern: Developer Sandbox → Shared Dev → Staging → Production


Table of Contents


Executive Summary

Problem Statement

Current setup lacks proper developer isolation and has workspace configuration issues:

  • Dev and Prod share the same workspace (compliance/security risk)

  • No developer sandbox isolation (conflicts between Taylor & William)

  • Hardcoded hosts in databricks.yml (not using profiles)

  • GitHub workflows use incorrect deployment commands

  • No clear data isolation strategy between developers

Proposed Solution

Implement a 4-tier deployment strategy with Unity Catalog-based data isolation:

  1. Sandbox (Local) - Developer-specific catalogs (taylor_sandbox, william_sandbox)

  2. Dev (CI/CD) - Shared integration catalog (dev)

  3. Staging (CI/CD) - Pre-production catalog (staging)

  4. Production (CI/CD) - Production catalog (prod)

Benefits

  • Zero conflicts - Each developer has an isolated sandbox catalog

  • No waiting - Parallel development without blocking

  • Security - Proper workspace isolation (dev/staging/prod separate)

  • Compliance - Clear data lineage and promotion path

  • Cost efficient - Sandbox catalogs auto-clean; serverless pipelines

  • Industry standard - Follows Databricks best practices


Current State Analysis

Workspace Configuration CONFIRMED

| Environment | Workspace ID | Current Host | Status |
| --- | --- | --- | --- |
| Dev | dbc-a72d6af9-df3d | https://dbc-a72d6af9-df3d.cloud.databricks.com | Correct |
| Staging | dbc-fab2a42a-8d11 | https://dbc-fab2a42a-8d11.cloud.databricks.com | Correct |
| Prod | dbc-028d9e53-7ce6 | https://dbc-028d9e53-7ce6.cloud.databricks.com | Confirmed (needs config update) |

Note: databricks.yml currently has prod pointing to dev workspace - will be fixed in Phase 2

Unity Catalog Structure

Databricks Profiles

Located in ~/.databrickscfg:

  • ref-dev → Dev workspace

  • ref-staging → Staging workspace

  • ref-prod → Prod workspace
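For reference, a sketch of what these three profiles might look like in ~/.databrickscfg (the auth settings are assumed; OAuth via the CLI is one common option):

```ini
[ref-dev]
host      = https://dbc-a72d6af9-df3d.cloud.databricks.com
auth_type = databricks-cli

[ref-staging]
host      = https://dbc-fab2a42a-8d11.cloud.databricks.com
auth_type = databricks-cli

[ref-prod]
host      = https://dbc-028d9e53-7ce6.cloud.databricks.com
auth_type = databricks-cli
```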

Service Principals (Environment-Specific)

Current: Single github-service-principal (f920d175-cf7c-43a2-a6c0-e9ccb42c02d2)

New Architecture (see SERVICE_PRINCIPAL_SETUP.md):

  • ml-pipelines-dev-service-principal - For dev deployments

  • ml-pipelines-staging-service-principal - For staging deployments

  • ml-pipelines-prod-service-principal - For prod deployments

Benefit: Environment isolation, better audit trail, SOC2 compliance

Current Pipelines

Both developers deploying independently:

  • Taylor: [dev taylor] bronze-ingestion-dev, [dev taylor] sentiment-analysis-dev, etc.

  • William: ref_dev_william_bronze-ingestion-dev, ref_dev_william_emoji-analysis-dev, etc.

Issue: Inconsistent naming, no clear separation of sandbox vs shared dev


Target Architecture

Deployment Flow

Data Isolation Strategy

Unity Catalog Hierarchy

MLOps Model Promotion Strategy

Shared Metastore Architecture

Key Insight: All workspaces (dev, staging, prod) share the same Unity Catalog metastore (4c140405-61ff-4e4e-9684-07078639218e). This enables efficient model promotion without retraining.

Model Training & Promotion Flow

Model Promotion Implementation

Python Code for Promotion (staging/prod jobs):
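A hedged sketch of what a promotion job might run, assuming MLflow's Unity Catalog registry and the `copy_model_version` client API (MLflow ≥ 2.8); catalog, schema, and model names here are illustrative:

```python
# Sketch: promote a model trained in dev by copying the registered
# version's metadata into the target catalog. Because all catalogs share
# one metastore, both entries reference the same underlying artifacts.


def source_model_uri(catalog: str, schema: str, model: str, version: int) -> str:
    """Build the models:/ URI for a Unity Catalog registered model version."""
    return f"models:/{catalog}.{schema}.{model}/{version}"


def promote_model(model: str, version: int, target_catalog: str,
                  source_catalog: str = "dev", schema: str = "models") -> None:
    """Copy a model version across catalogs without retraining."""
    from mlflow import MlflowClient  # imported lazily; requires mlflow installed

    client = MlflowClient(registry_uri="databricks-uc")
    client.copy_model_version(
        src_model_uri=source_model_uri(source_catalog, schema, model, version),
        dst_name=f"{target_catalog}.{schema}.{model}",
    )
```

A staging job would then call something like `promote_model("sentiment", 3, "staging")` once validation passes.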

Benefits of This Approach

  • No Duplicate Training - Train once in dev, promote everywhere

  • No Artifact Duplication - All catalogs reference the same S3 files

  • Fast Promotion - Metadata copy takes seconds vs hours of retraining

  • Consistent Models - Exact same model binary across all environments

  • Cost Efficient - No redundant compute for retraining

  • Audit Trail - Clear lineage from dev → staging → prod

  • Easy Rollback - Promote previous versions if issues arise

Job Distribution by Environment

| Environment | Training Jobs | Promotion Jobs | Pipelines | Purpose |
| --- | --- | --- | --- | --- |
| Sandbox | No | No | Yes | Developer testing |
| Dev | Yes | No | Yes | Model training source |
| Staging | No | Yes | Yes | Model validation |
| Prod | No | Yes | Yes | Model serving |

Permissions Model

| Principal | Sandbox Catalogs | Dev Catalog | Staging Catalog | Prod Catalog |
| --- | --- | --- | --- | --- |
| Databricks - Dev group | CREATE CATALOG + FULL OWNERSHIP of own | SELECT (read) | SELECT (read) | None |
| Service Principal | None | ALL PRIVILEGES | ALL PRIVILEGES | ALL PRIVILEGES |
| Databricks - Account Admin group | ALL PRIVILEGES | ALL PRIVILEGES | ALL PRIVILEGES | ALL PRIVILEGES |


Implementation Phases

Phase 1: Infrastructure Setup (Terraform) CRITICAL FIRST

Objective: Create proper workspace separation and sandbox catalog permissions

1.1 Fix Production Workspace Separation

Location: /Users/taylorlaing/Development/refresh-os/infra-core/stacks/ml-databricks/

Changes Needed:

Action Items:

1.2 Grant Sandbox Catalog Creation Permissions

Location: /Users/taylorlaing/Development/refresh-os/infra-core/stacks/ml-databricks/main.tf

Add New Resource:
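A minimal sketch of the grant, assuming the `databricks_grants` resource of the Databricks Terraform provider and the shared metastore ID noted in this plan (the resource name is illustrative):

```hcl
# Sketch (resource name assumed): let sandbox developers create
# their own catalogs on the shared metastore.
resource "databricks_grants" "sandbox_catalog_creation" {
  metastore = "4c140405-61ff-4e4e-9684-07078639218e"

  grant {
    principal  = "Databricks - Dev"
    privileges = ["CREATE_CATALOG"]
  }
}
```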

Why Terraform?: Catalog permissions are foundational infrastructure that shouldn't change per deployment

1.3 Developer Groups & Service Principals CONFIRMED

Existing Groups (managed via Okta SCIM - no Terraform needed):

  • Databricks - Account Admin

  • Databricks - Dev (for sandbox developers)

  • Databricks - Staging

  • Databricks - Prod

  • Databricks - Prod - ReadOnly

Service Principals to Create:

  • ml-pipelines-dev-service-principal

  • ml-pipelines-staging-service-principal

  • ml-pipelines-prod-service-principal

See: SERVICE_PRINCIPAL_SETUP.md for creation steps

Terraform Data Sources (add to main.tf):
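A sketch of the data sources, assuming the provider's `databricks_service_principal` data source and lookup by display name:

```hcl
# Sketch: look up the environment service principals by display name
# so later grants can reference their application IDs.
data "databricks_service_principal" "ml_pipelines_dev" {
  display_name = "ml-pipelines-dev-service-principal"
}

data "databricks_service_principal" "ml_pipelines_staging" {
  display_name = "ml-pipelines-staging-service-principal"
}

data "databricks_service_principal" "ml_pipelines_prod" {
  display_name = "ml-pipelines-prod-service-principal"
}
```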

1.4 Grant Service Principal Catalog Permissions

Location: /Users/taylorlaing/Development/refresh-os/infra-core/stacks/ml-databricks/main.tf

Add After Unity Catalog Module:
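A hedged sketch of one such grant (the `var.environment` variable and resource name are assumed, and the principal may need to be the service principal's application ID rather than its display name):

```hcl
# Sketch: grant the dev service principal full access to the dev catalog.
# count gates the grant so it is only applied when the dev workspace
# stack is the deployment target.
resource "databricks_grants" "dev_catalog_service_principal" {
  count   = var.environment == "dev" ? 1 : 0
  catalog = "dev"

  grant {
    principal  = "ml-pipelines-dev-service-principal"
    privileges = ["ALL_PRIVILEGES"]
  }
}
```

Staging and prod would get analogous resources gated on their own environment values.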

Why: Service principals need permissions to write pipeline data to catalogs. Using count ensures only the target workspace gets grants during progressive deployment.


Phase 2: Databricks Bundle Configuration (databricks.yml)

Objective: Implement 4-tier deployment targets with proper data isolation

2.1 Update Bundle Variables

File: /Users/taylorlaing/Development/refresh-os/ml-pipelines/databricks.yml

Current:

Updated:
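A hedged sketch of what the updated variables block might look like (descriptions and defaults assumed, bucket names taken from Appendix A):

```yaml
variables:
  catalog_name:
    description: Unity Catalog catalog this target reads and writes
    default: dev
  environment:
    description: Deployment environment (sandbox, dev, staging, prod)
    default: dev
  s3_bucket:
    description: Workspace bucket for pipeline storage
    default: ref-ml-core-dev-workspace-bucket
```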

Changes:

  • Rename catalog → catalog_name for clarity

  • Remove service_principal_id variable (use names directly in run_as blocks)

2.2 Add Sandbox Target (NEW)

Add to databricks.yml:
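A sketch of the sandbox target under these assumptions (Asset Bundle `mode: development` and the `${workspace.current_user.short_name}` substitution; exact keys may differ from the final config):

```yaml
targets:
  sandbox:
    mode: development
    default: true
    workspace:
      profile: ref-dev
    variables:
      catalog_name: ${workspace.current_user.short_name}_sandbox
      environment: sandbox
```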

Key Feature: ${workspace.current_user.short_name}_sandbox dynamically creates taylor_sandbox or william_sandbox

2.3 Update Dev Target (Shared Integration)

Replace existing dev target:
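A hedged sketch of the dev target; note that `run_as.service_principal_name` typically expects the principal's application ID, so the display name shown here is a placeholder:

```yaml
  dev:
    mode: production
    workspace:
      profile: ref-dev
    run_as:
      service_principal_name: ml-pipelines-dev-service-principal
    variables:
      catalog_name: dev
      environment: dev
```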

Change Summary:

  • Uses profile: ref-dev instead of hardcoded host

  • Uses environment-specific service principal: ml-pipelines-dev-service-principal

  • Uses shared dev catalog

2.4 Update Staging Target

Replace existing staging target:

Changes:

  • Use profile instead of hardcoded host

  • Use staging-specific service principal

2.5 Update Production Target

Replace existing prod target:
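A hedged sketch of the corrected prod target (same caveat as dev: `run_as` may need the application ID rather than the display name):

```yaml
  prod:
    mode: production
    workspace:
      profile: ref-prod   # dbc-028d9e53-7ce6, no longer the dev host
    run_as:
      service_principal_name: ml-pipelines-prod-service-principal
    variables:
      catalog_name: prod
      environment: prod
```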

Critical Changes:

  • Profile points to correct prod workspace (dbc-028d9e53-7ce6)

  • Use prod-specific service principal

2.6 Update Pipeline Resource References

File: All pipeline YAML files in resources/pipelines/**/*.yml

Current Example (resources/pipelines/bronze/data_ingestion/run_bronze_data_ingestion.pipeline.yml):

Updated:
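A sketch of what an updated pipeline resource might look like (pipeline key, schema, and serverless flag assumed; the point is the `${var.catalog_name}` reference):

```yaml
resources:
  pipelines:
    bronze_data_ingestion:
      name: bronze_data_ingestion_${var.environment}
      catalog: ${var.catalog_name}
      target: bronze
      serverless: true
```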

Action Items:


Phase 3: GitHub Actions Workflows

Objective: Fix deployment commands and add proper target specifications

3.1 Update Dev Deployment Workflow

File: .github/workflows/ml_pipelines_dev_deploy.yml

Current:

Updated:
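A hedged sketch of the corrected deploy step (secret names are assumptions to be replaced with the repo's actual secrets):

```yaml
- name: Deploy bundle (dev)
  env:
    DATABRICKS_HOST: https://dbc-a72d6af9-df3d.cloud.databricks.com
    DATABRICKS_CLIENT_ID: ${{ secrets.ML_PIPELINES_DEV_CLIENT_ID }}
    DATABRICKS_CLIENT_SECRET: ${{ secrets.ML_PIPELINES_DEV_CLIENT_SECRET }}
  run: databricks bundle deploy -t dev
```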

Changes:

  • Fix command: databricks deploy --environment → databricks bundle deploy -t

  • Update Client ID to new dev service principal (Taylor will provide after creation)

3.2 Update Staging Deployment Workflow

File: .github/workflows/ml_pipelines_staging_deploy.yml

Updated:

Changes: Same as dev - fix command and use staging-specific service principal

3.3 Update Production Deployment Workflow

File: .github/workflows/ml_pipelines_prod_deploy.yml

Updated:

Critical Changes:

  1. Fix deployment command

  2. Correct prod host to dbc-028d9e53-7ce6 (was pointing to dev)

  3. Standardize Python version to 3.13

  4. Use prod-specific service principal


Phase 4: Code & Notebook Updates

Objective: Ensure all code references use new variable naming

4.1 Update Pipeline Notebooks

Location: All .py notebooks in resources/pipelines/

Pattern to Find:

Should Be:
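A minimal sketch of how a notebook might resolve the catalog after the rename, tolerating the legacy key during migration (the helper and default are assumptions; in a pipeline this would typically wrap `spark.conf.get`):

```python
# Sketch: prefer the new catalog_name key, fall back to the legacy
# catalog key, then to a default, so notebooks work during migration.
def resolve_catalog(conf: dict, default: str = "dev") -> str:
    """Return the target catalog from pipeline configuration."""
    return conf.get("catalog_name") or conf.get("catalog") or default
```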

Action Items:

4.2 Update MLflow Model Registration

Location: resources/jobs/model_registration/**/*.yml

Current Pattern:

Add Configuration:

Why: Model registration should use correct catalog (taylor_sandbox.models vs dev.models)


Detailed Changes Required

Summary Checklist

Terraform Changes (/Users/taylorlaing/Development/refresh-os/infra-core/stacks/ml-databricks/)

databricks.yml Changes

Pipeline YAML Changes (resources/**/*.yml)

GitHub Workflows (.github/workflows/*.yml)

Notebook Code Changes


Open Questions & Decisions Needed

🔴 Critical Decisions

Q1: Production Workspace Confirmation

Question: Is dbc-028d9e53-7ce6 the correct production workspace?
Current State: The prod workflow points to dbc-028d9e53-7ce6, but databricks.yml points to dbc-a72d6af9-df3d (the dev workspace).
Action: Please verify which workspace should be production.
Options:

  • A) dbc-028d9e53-7ce6 (from workflow file) Recommended

  • B) Create new separate prod workspace

  • C) Keep sharing dev workspace (not recommended for compliance)

Q2: Developer Group Existence

Question: Does the "Databricks - Dev" group already exist with Taylor and William as members?
Verification Command: databricks groups list --profile ref-admin
If No: Add Terraform resources to create the groups.
If Yes: Confirm the group name and membership.

Q3: Sandbox Catalog Lifecycle

Question: How should sandbox catalogs be cleaned up?
Options:

  • A) Manual cleanup by developers (simplest)

  • B) Weekly automated cleanup via scheduled job

  • C) Auto-delete after 30 days of inactivity

  • D) No cleanup (low cost anyway)

Recommendation: Start with manual (A), add automation later if needed

Q4: Existing Pipeline Migration

Question: What should happen to existing pipelines when we deploy?
Current State:

  • Taylor: [dev taylor] bronze-ingestion-dev (12 pipelines)

  • William: ref_dev_william_bronze-ingestion-dev (4 pipelines)

Options:

  • A) Delete all, redeploy with new naming

  • B) Keep existing, new deployments use new structure

  • C) Rename existing to match new pattern

Recommendation: Option B for safety, then manual cleanup


Non-Critical Questions

Q5: Staging Auto-Promotion

Question: Should staging auto-deploy after a successful dev deployment?
Current: Staging deploys on dev workflow completion.
Options:

  • A) Keep auto-promotion (faster, less control)

  • B) Require manual approval for staging (safer)

Recommendation: B for better control

Q6: Sandbox Resource Limits

Question: Should we limit sandbox catalog sizes?
Concern: Prevent accidental large data writes to sandbox catalogs.
Options:

  • A) No limits (trust developers)

  • B) Quota limits via Databricks API

  • C) Monitoring alerts only

Recommendation: C (alerts) initially

Q7: Dev Catalog Write Access

Question: Should developers have write access to the shared dev catalog?
Current Proposal: Only the service principal writes; developers read.
Alternative: Allow developers to write to dev for hotfixes.

Recommendation: Service principal only (forces proper PR workflow)


Testing & Validation

Phase 1: Local Sandbox Testing

Test 1.1: Developer Deploys Sandbox

Tester: Taylor or William
Commands:
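A likely command sequence from a local checkout (validate first, then deploy; the profile comes from ~/.databrickscfg):

```shell
databricks bundle validate -t sandbox
databricks bundle deploy -t sandbox
```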

Expected Outcome:

  • Catalog taylor_sandbox or william_sandbox created

  • Schemas: taylor_sandbox.bronze, taylor_sandbox.silver, taylor_sandbox.gold

  • Pipeline: [dev taylor] bronze_data_ingestion_sandbox or similar

Test 1.2: Parallel Sandbox Deployments

Testers: Both Taylor and William simultaneously
Commands: Both run databricks bundle deploy -t sandbox at the same time
Expected Outcome: No conflicts; each developer gets their own catalog

Test 1.3: Sandbox Data Isolation

Tester: Taylor
Commands:

Expected Outcome: Complete isolation between sandbox catalogs


Phase 2: CI/CD Integration Testing

Test 2.1: Dev Deployment (CI/CD)

Trigger: Push to main branch
Verification:

Expected Outcome:

  • Pipeline names prefixed with dev_ (e.g., dev_bronze_ingestion_dev)

  • Data written to dev.bronze.messages, not sandbox catalogs

  • Service principal is owner of resources

Test 2.2: Staging Deployment

Trigger: Manual approval after dev deployment
Verification:

Expected Outcome:

  • Resources in staging workspace

  • Data in staging catalog

Test 2.3: Production Deployment

Trigger: Manual production deployment
Verification:

Expected Outcome:

  • Resources in CORRECT prod workspace (dbc-028d9e53-7ce6)

  • Data in prod catalog

  • No dev/staging resources visible


Phase 3: Permissions Testing

Test 3.1: Developer Permissions

Tester: William (non-admin developer)
Tests:

Expected Outcome: Developers can only read dev, write to own sandbox

Test 3.2: Service Principal Permissions

Tester: Via GitHub Actions
Tests: CI/CD workflows should successfully deploy and run pipelines in dev/staging/prod
Expected Outcome: Each environment's service principal has full access to its environment


Rollback Strategy

Scenario 1: Sandbox Deployment Breaks

Symptoms: Developers can't deploy to sandbox
Rollback:

Impact: Low - Only affects new sandbox feature, existing dev still works

Scenario 2: CI/CD Deployment Fails

Symptoms: GitHub Actions workflow fails on dev deployment
Rollback:

Impact: Medium - Blocks CI/CD but developers can still use sandbox

Scenario 3: Production Workspace Change Fails

Symptoms: Prod deployment fails after the workspace host change
Rollback:

Impact: High - Prod deployment blocked, but existing prod resources unaffected

Scenario 4: Permissions Misconfiguration

Symptoms: Developers can't create sandbox catalogs
Rollback:

Impact: Medium - Sandbox feature unavailable, dev/staging/prod unaffected


Implementation Timeline

Week 1: Infrastructure Foundation

Days 1-2: Terraform changes (workspace separation, permissions)
Days 3-4: Validate Terraform in the dev environment
Day 5: Deploy Terraform to staging/prod

Week 2: Bundle Configuration

Days 1-2: Update databricks.yml with new targets
Days 3-4: Update pipeline YAMLs and workflows
Day 5: Local testing with the sandbox target

Week 3: Testing & Migration

Days 1-2: Developer sandbox testing (Taylor & William)
Days 3-4: CI/CD integration testing
Day 5: Production migration and validation

Week 4: Cleanup & Documentation

Days 1-2: Clean up old pipelines and resources
Days 3-4: Update team documentation
Day 5: Post-implementation review


Next Steps

  1. Answer Critical Questions (see Open Questions)

  2. Review This Plan - Flag any concerns or missed items

  3. Start Phase 1 - Terraform infrastructure changes

  4. Iterate - Implement phases sequentially with validation


Appendix A: Variable Reference

databricks.yml Variables

| Variable | Sandbox | Dev | Staging | Prod |
| --- | --- | --- | --- | --- |
| catalog_name | ${user}_sandbox | dev | staging | prod |
| environment | sandbox | dev | staging | prod |
| s3_bucket | ref-ml-core-dev-workspace-bucket | ref-ml-core-dev-workspace-bucket | ref-ml-core-staging-workspace-bucket | ref-ml-core-prod-workspace-bucket |
| service_principal_id | N/A | f920d175-cf7c-43a2-a6c0-e9ccb42c02d2 | f920d175-cf7c-43a2-a6c0-e9ccb42c02d2 | f920d175-cf7c-43a2-a6c0-e9ccb42c02d2 |

Workspace Reference

| Environment | Workspace ID | Host | Profile |
| --- | --- | --- | --- |
| Dev | dbc-a72d6af9-df3d | https://dbc-a72d6af9-df3d.cloud.databricks.com | ref-dev |
| Staging | dbc-fab2a42a-8d11 | https://dbc-fab2a42a-8d11.cloud.databricks.com | ref-staging |
| Prod | dbc-028d9e53-7ce6 | https://dbc-028d9e53-7ce6.cloud.databricks.com | ref-prod |


Appendix B: Commands Reference

Bundle Management
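Common bundle commands used throughout this plan (target names match the databricks.yml targets):

```shell
databricks bundle validate -t dev      # check configuration before deploying
databricks bundle deploy -t dev        # deploy resources to the dev target
databricks bundle summary -t dev       # show deployed resources for a target
databricks bundle destroy -t sandbox   # tear down a personal sandbox deployment
```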

Catalog Management
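A sketch of Unity Catalog inspection and cleanup commands (catalog names illustrative; verify subcommand names against your installed CLI version):

```shell
databricks catalogs list --profile ref-dev
databricks schemas list taylor_sandbox --profile ref-dev
databricks catalogs delete taylor_sandbox --force --profile ref-dev
```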

Pipeline Management
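A sketch of pipeline commands (the pipeline ID placeholder is illustrative; verify subcommand names against your installed CLI version):

```shell
databricks pipelines list-pipelines --profile ref-dev
databricks pipelines start-update <pipeline-id> --profile ref-dev
```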


Document Version: 1.0
Last Updated: 2025-09-30
Author: Claude Code
Status: Ready for Review
