Model Promotion Architecture

Overview

The ML Pipelines project implements a rigorous MLOps workflow that ensures models are thoroughly tested before reaching production. The key principle is binary promotion: models are trained once in staging on production data, then the exact same model binary is promoted (copied) to production without retraining.

Model Lifecycle

┌─────────────────────────────────────────────────────────────────┐
│ Stage 1: Experimentation (Sandbox)                              │
│   Environment: {username}_sandbox catalog                       │
│   Data: dev.bronze.*, dev.silver.* (read-only)                  │
│   Purpose: Rapid prototyping and model development              │
└─────────────────────────────────────────────────────────────────┘

                    [Code review & PR merge]

┌─────────────────────────────────────────────────────────────────┐
│ Stage 2: Development (Dev)                                      │
│   Environment: dev catalog                                      │
│   Data: dev.bronze.*, dev.silver.*, dev.gold.*                  │
│   Purpose: Shared testing and integration                       │
│   Models: Saved to dev.models schema                            │
└─────────────────────────────────────────────────────────────────┘

            [Model shows promise in shared environment]

┌─────────────────────────────────────────────────────────────────┐
│ Stage 3: Staging Training (Staging)                             │
│   Environment: staging catalog                                  │
│   Data: prod.bronze.*, prod.silver.*, prod.gold.* (READ-ONLY)   │
│   Purpose: Train final model on production data                 │
│   Models: Saved to staging.models schema                        │
│   Key: Model trained on PROD data for realistic validation      │
└─────────────────────────────────────────────────────────────────┘

                    [Validation & approval]

┌─────────────────────────────────────────────────────────────────┐
│ Stage 4: Production Promotion (Prod)                            │
│   Environment: prod catalog                                     │
│   Data: prod.bronze.*, prod.silver.*, prod.gold.*               │
│   Purpose: Serve production predictions                         │
│   Models: PROMOTED from staging.models (NO RETRAINING)          │
│   Key: Binary promotion ensures tested model runs in prod       │
└─────────────────────────────────────────────────────────────────┘

Key Principles

1. Train Once, Promote Binary

Why: Ensures the exact model tested in staging is what runs in production.

How:

  • Staging: Train model on prod.* data → Save to staging.models

  • Production: Copy model binary from staging.models → prod.models

Benefits:

  • No surprises: Same binary tested in staging runs in prod

  • Cost efficient: Train once, not twice

  • Reproducibility: Exact model version is promoted

  • Faster deployments: No retraining time in prod


2. Staging Uses Production Data

Why: Validates model performance on real production data before promotion.

How:

  • Staging workspace reads from prod catalog (read-only)

  • Model training jobs in staging use prod.bronze.*, prod.silver.* tables

  • Staging workspace has separate staging catalog for experiments

Benefits:

  • Realistic validation: Test against actual production distribution

  • Catch data drift: Identify issues before prod deployment

  • Confidence: Know model will perform as expected


3. Version Control and Lineage

Why: Track model history, reproduce results, and enable rollbacks.

How:

  • MLflow model registry with Unity Catalog integration

  • Every model version logged with metadata

  • Model artifacts stored in Unity Catalog volumes

  • Git commit SHA tagged in model version

Benefits:

  • Audit trail: Complete history of model changes

  • Rollback capability: Easily revert to previous version

  • Reproducibility: Recreate any model version

  • Governance: Track who trained what and when


Model Registration Process

Step 1: Model Development (Sandbox)

Location: {username}_sandbox.models.*

Process:

  1. Developer experiments with model architecture in notebook

  2. Iterates on features, hyperparameters, and training data

  3. Tests model locally in sandbox environment

  4. Saves experimental models to {username}_sandbox.models

Example:
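A minimal sketch of what sandbox experimentation looks like, assuming a Databricks notebook with scikit-learn and MLflow on Unity Catalog; the naming helper and `log_experiment` function are illustrative, not the project's actual code:

```python
def sandbox_model_name(username: str, model: str) -> str:
    """Three-level Unity Catalog name for a sandbox model (illustrative convention)."""
    return f"{username}_sandbox.models.{model}"

def log_experiment(username: str, model, X, y):
    """Train and register an experimental model (sketch; runs on Databricks)."""
    import mlflow  # Databricks ML runtime dependency

    mlflow.set_registry_uri("databricks-uc")  # register into Unity Catalog
    fitted = model.fit(X, y)
    with mlflow.start_run():
        mlflow.sklearn.log_model(
            fitted,
            artifact_path="model",
            registered_model_name=sandbox_model_name(username, "sentiment_analysis"),
        )
```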

Goal: Validate model concept works before committing code.


Step 2: Code Integration (Dev)

Location: dev.models.*

Process:

  1. Developer moves model code from notebook to /src/models/internal/[model_name]/

  2. Creates model registration job in /resources/jobs/model_registration/

  3. Submits pull request with changes

  4. PR merged → CI/CD deploys to dev

  5. Model registration job runs in dev environment

  6. Model saved to dev.models schema

Registration Job Structure:

Job Configuration (register_sentiment_analysis.job.yml):
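The file itself is not reproduced here; a sketch of the shape such a job definition typically takes, assuming Databricks Asset Bundles syntax (paths, cluster settings, and task keys are illustrative):

```yaml
# Sketch only; not the project's actual register_sentiment_analysis.job.yml
resources:
  jobs:
    register_sentiment_analysis:
      name: register_sentiment_analysis
      tasks:
        - task_key: register_model
          spark_python_task:
            python_file: ../../../src/models/internal/sentiment_analysis/register.py
          job_cluster_key: ml_cluster
      job_clusters:
        - job_cluster_key: ml_cluster
          new_cluster:
            spark_version: 15.4.x-cpu-ml-scala2.12
            num_workers: 1
```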

Registration Script (simplified):
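A simplified sketch of what the registration script does, assuming PySpark and MLflow; the table name and the `train` helper are illustrative:

```python
def source_catalog(target_catalog: str) -> str:
    """Staging trains on prod data; other environments read their own catalog."""
    return "prod" if target_catalog == "staging" else target_catalog

def register(target_catalog: str, git_sha: str):
    """Train on the environment's source data and register the model (sketch)."""
    import mlflow
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    mlflow.set_registry_uri("databricks-uc")

    df = spark.table(f"{source_catalog(target_catalog)}.silver.reviews")  # illustrative table
    model = train(df)  # hypothetical training helper

    with mlflow.start_run():
        mlflow.set_tag("git_sha", git_sha)
        mlflow.sklearn.log_model(
            model,
            artifact_path="model",
            registered_model_name=f"{target_catalog}.models.sentiment_analysis",
        )
```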

Goal: Integrate model code into repository and validate in shared environment.


Step 3: Staging Training (Staging)

Location: staging.models.*

Process:

  1. CI/CD deploys code to staging workspace

  2. Model registration job runs in staging environment

  3. Job reads training data from prod.bronze.*, prod.silver.* (read-only)

  4. Model trained on production data

  5. Model saved to staging.models schema with metadata:

    • Git commit SHA

    • Training data timestamp

    • Model version number

    • Performance metrics

  6. Model serving endpoint created/updated in staging workspace

  7. Validation tests run against staging endpoint

Staging Service Principal Permissions:
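The exact grants live in the platform setup; a sketch of the shape they take, assuming Databricks SQL grant syntax (the principal name is illustrative):

```python
def staging_sp_grants(principal: str) -> list[str]:
    """SQL grants the staging service principal needs (sketch)."""
    return [
        # Read-only access to production data for training
        f"GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG prod TO `{principal}`",
        # Full access to its own catalog for writing models and experiments
        f"GRANT ALL PRIVILEGES ON CATALOG staging TO `{principal}`",
    ]

# In a setup notebook each statement would be executed with spark.sql(stmt).
```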

Model Metadata (logged to MLflow):
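A sketch of how that metadata might be attached to the training run; the tag keys are illustrative:

```python
def training_metadata(git_sha: str, data_timestamp: str, metrics: dict) -> dict:
    """Tags recorded alongside each registered model version (illustrative keys)."""
    return {
        "git_sha": git_sha,
        "training_data_timestamp": data_timestamp,
        "trained_on": "prod",
        **{f"metric_{name}": str(value) for name, value in metrics.items()},
    }

def log_metadata(run_metrics: dict, tags: dict):
    """Attach metrics and tags to the active MLflow run (sketch)."""
    import mlflow

    mlflow.log_metrics(run_metrics)
    mlflow.set_tags(tags)
```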

Goal: Train final model on production data and validate performance.


Step 4: Production Promotion (Prod)

Location: prod.models.*

Process:

  1. Staging deployment completes successfully

  2. Validation tests pass in staging

  3. Manual approval granted for production deployment

  4. CI/CD deploys to production workspace

  5. Model binary promoted from staging.models to prod.models (NO RETRAINING)

  6. Model serving endpoint created/updated in production workspace

  7. Production jobs updated to use new model version

Promotion Methods:

Method 1: MLflow Model Registry Transition (recommended):
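With Unity Catalog, legacy stage transitions are replaced by copying a version between registered models; a sketch using `MlflowClient.copy_model_version` (model names follow the catalogs described above):

```python
def staging_model_uri(version: int) -> str:
    """URI of a specific staging model version (illustrative model name)."""
    return f"models:/staging.models.sentiment_analysis/{version}"

def promote(version: int) -> int:
    """Copy the validated staging version into prod.models (no retraining; sketch)."""
    import mlflow

    client = mlflow.tracking.MlflowClient(registry_uri="databricks-uc")
    promoted = client.copy_model_version(
        src_model_uri=staging_model_uri(version),
        dst_name="prod.models.sentiment_analysis",
    )
    # Mark the new version as the serving candidate
    client.set_registered_model_alias(
        "prod.models.sentiment_analysis", "champion", promoted.version
    )
    return promoted.version
```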

Method 2: Unity Catalog Model Sharing (if same metastore):
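When staging and prod share a metastore, a prod endpoint can reference the staging model directly rather than copying it, provided the prod service principal has EXECUTE on that model; a sketch of the served-entity configuration (names illustrative):

```python
def shared_model_entity(version: str) -> dict:
    """Served-entity config pointing prod serving at the staging model (sketch)."""
    return {
        "entity_name": "staging.models.sentiment_analysis",  # referenced, not copied
        "entity_version": version,
        "workload_size": "Small",
        "scale_to_zero_enabled": True,
    }
```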

Goal: Deploy tested model binary to production without retraining.


Model Serving and Endpoints

Endpoint Lifecycle

Endpoint Configuration

Example Endpoint Creation:
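A sketch using the Databricks Python SDK; the naming convention and sizing are illustrative:

```python
def endpoint_name(model_name: str) -> str:
    """Derive an endpoint name from the model's short name (illustrative convention)."""
    return model_name.split(".")[-1].replace("_", "-")

def create_endpoint(model_name: str, version: str):
    """Create a serving endpoint for a registered model (sketch)."""
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

    w = WorkspaceClient()
    w.serving_endpoints.create(
        name=endpoint_name(model_name),
        config=EndpointCoreConfigInput(
            served_entities=[
                ServedEntityInput(
                    entity_name=model_name,
                    entity_version=version,
                    workload_size="Small",
                    scale_to_zero_enabled=True,
                )
            ]
        ),
    )
```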

Using Models in DLT Pipelines

ai_query() Function:
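`ai_query()` calls a serving endpoint from SQL; a sketch of how a pipeline might apply it per row (endpoint, table, and column names are illustrative):

```python
def scoring_expr(endpoint: str, request_col: str) -> str:
    """SQL expression that scores a column against a serving endpoint."""
    return f"ai_query('{endpoint}', {request_col})"

# Inside a DLT table definition (sketch):
#   @dlt.table(name="gold_review_sentiment")
#   def gold_review_sentiment():
#       return dlt.read("silver_reviews").withColumn(
#           "sentiment", F.expr(scoring_expr("sentiment-analysis", "review_text"))
#       )
```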

Configuration:


A/B Testing and Canary Deployments

A/B Testing Setup

Scenario: Test new model version against current production model.

Configuration:
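A sketch of the weighted traffic split, using the served-model names and versions referenced in the promotion decision (champion on version 5, challenger on version 6):

```python
def ab_config(champion_version: str, challenger_version: str, challenger_pct: int) -> dict:
    """Endpoint config serving two versions with a weighted traffic split (sketch)."""
    return {
        "served_entities": [
            {"name": "champion_model",
             "entity_name": "prod.models.sentiment_analysis",
             "entity_version": champion_version},
            {"name": "challenger_model",
             "entity_name": "prod.models.sentiment_analysis",
             "entity_version": challenger_version},
        ],
        "traffic_config": {"routes": [
            {"served_model_name": "champion_model",
             "traffic_percentage": 100 - challenger_pct},
            {"served_model_name": "challenger_model",
             "traffic_percentage": challenger_pct},
        ]},
    }
```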

Monitoring:
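With request logging enabled, the two served models can be compared side by side; a sketch of the query (table and column names are illustrative):

```python
def comparison_query(inference_table: str) -> str:
    """Per-model request volume, latency, and error rate (illustrative columns)."""
    return f"""
        SELECT
          served_model_name,
          COUNT(*) AS requests,
          AVG(execution_time_ms) AS avg_latency_ms,
          AVG(CASE WHEN status_code != 200 THEN 1 ELSE 0 END) AS error_rate
        FROM {inference_table}
        GROUP BY served_model_name
    """
```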

Promotion Decision:

  • Monitor metrics for 24-48 hours

  • If challenger_model (version 6) performs better, increase traffic to 50%, then 100%

  • If challenger_model performs worse, revert to 100% champion_model (version 5)

  • Once validated, update champion alias to point to version 6


Canary Deployment

Scenario: Gradually roll out new model version to minimize risk.

Steps:

  1. 10% traffic: Deploy new version, route 10% traffic

  2. Monitor for 6-12 hours: Check error rates, latency, accuracy

  3. 50% traffic: If metrics good, increase to 50%

  4. Monitor for 12-24 hours: Continue validation

  5. 100% traffic: Complete rollout

  6. Update champion alias: Point champion alias to new version 6


Model Versioning

Version Numbering

Important: DO NOT use version suffixes in model names (e.g., sentiment_v1, sentiment_v2). MLflow automatically handles versioning.

Correct Approach:

  • Model name: sentiment_analysis (no version suffix)

  • MLflow assigns versions: 1, 2, 3, 4, ...

  • Use aliases to mark special versions: champion, challenger, staging, production

Wrong Approach (do not do this):
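A quick illustration of the naming rule, with a helper that flags the anti-pattern:

```python
import re

def has_version_suffix(model_name: str) -> bool:
    """True when a version is baked into the name, which is the pattern to avoid."""
    return re.search(r"_v\d+$", model_name) is not None

# WRONG: each "version" becomes a separate registered model with no shared history
#   registered_model_name = "prod.models.sentiment_v2"
# RIGHT: one registered model; MLflow assigns versions 1, 2, 3, ...
#   registered_model_name = "prod.models.sentiment_analysis"
```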

Aliases

Standard Aliases:

  • champion: Current production model (actively serving predictions)

  • challenger: Model being tested in production (A/B test or canary)

  • staging: Latest model validated in staging environment

  • archive: Previous production model (for quick rollback)

Example Usage:
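A sketch of setting and resolving aliases; the `models:/<name>@<alias>` URI form loads whichever version the alias currently points at:

```python
def alias_uri(model_name: str, alias: str) -> str:
    """Model URI that resolves through an alias rather than a fixed version."""
    return f"models:/{model_name}@{alias}"

def set_champion(version: int):
    """Point the champion alias at a new version (sketch)."""
    import mlflow

    client = mlflow.tracking.MlflowClient(registry_uri="databricks-uc")
    client.set_registered_model_alias(
        "prod.models.sentiment_analysis", "champion", version
    )

# Loading always follows the alias, so promotions need no code changes:
#   model = mlflow.pyfunc.load_model(
#       alias_uri("prod.models.sentiment_analysis", "champion"))
```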

Model Tags

Best Practices: Tag models with metadata for tracking.
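A sketch of applying tags to a model version; the client is an `MlflowClient` (or compatible), and the tag keys are illustrative:

```python
def apply_tags(client, name: str, version: int, tags: dict):
    """Attach tracking tags to a model version (client: MlflowClient or compatible)."""
    for key, value in tags.items():
        client.set_model_version_tag(name, version, key, value)

# Illustrative tag set:
#   apply_tags(client, "prod.models.sentiment_analysis", 6,
#              {"git_sha": "abc123", "validated_in": "staging",
#               "training_data_timestamp": "2024-01-15"})
```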


Rollback Procedures

Scenario 1: Model Performance Degradation

Symptoms: Production model accuracy drops, error rate increases

Steps:

  1. Identify previous good version:

  2. Update endpoint to use archive version:

  3. Update champion alias to archive version:
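The three steps above, sketched as one function; the client is an `MlflowClient` (or compatible), and the endpoint update itself is elided:

```python
def rollback_to_archive(client, name: str = "prod.models.sentiment_analysis") -> int:
    """Revert serving to the previously archived version (sketch).

    client: mlflow.tracking.MlflowClient or compatible.
    """
    # 1. Identify the previous good version via its alias
    archived = client.get_model_version_by_alias(name, "archive")
    # 2. Update the serving endpoint to this version (elided; SDK config update)
    # 3. Repoint champion so downstream loads follow the rollback
    client.set_registered_model_alias(name, "champion", archived.version)
    return archived.version
```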

Timeline: 2-5 minutes


Scenario 2: Model Serving Endpoint Failure

Symptoms: Endpoint returning errors, not responding

Steps:

  1. Check endpoint status:

  2. Restart endpoint:

  3. If restart doesn't work, recreate endpoint with previous version
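A sketch of steps 1 and 2 with the Databricks Python SDK; passing the workspace client in keeps the helpers testable, and the "restart" is an in-place config re-apply:

```python
def endpoint_state(w, name: str):
    """Fetch the endpoint's state for inspection (w: WorkspaceClient or compatible)."""
    return w.serving_endpoints.get(name).state

def redeploy_current_config(w, name: str):
    """Trigger a redeploy by re-applying the endpoint's served entities (sketch)."""
    current = w.serving_endpoints.get(name)
    w.serving_endpoints.update_config(
        name=name,
        served_entities=current.config.served_entities,
    )
```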

Timeline: 5-10 minutes


Scenario 3: Data Drift Detected

Symptoms: Model performance degrades gradually over time

Steps:

  1. Retrain model in staging on latest production data

  2. Validate new model version in staging

  3. Promote new model version to production

  4. Monitor performance improvement

Timeline: 1-2 hours (depends on training time)


Model Performance Monitoring

Metrics to Track

Accuracy Metrics:

  • Precision, Recall, F1-Score

  • Confusion matrix

Operational Metrics:

  • Prediction latency (p50, p95, p99)

  • Request throughput (requests/second)

  • Error rate (errors/total requests)

  • Model uptime

Data Quality Metrics:

  • Input data distribution

  • Feature drift detection

  • Null value rates

  • Out-of-range values

Monitoring Implementation

Log Model Predictions:
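Endpoint requests and responses can be captured automatically to a Delta table; a sketch of the auto-capture settings (catalog, schema, and prefix are illustrative):

```python
def inference_logging_config(catalog: str, schema: str, prefix: str) -> dict:
    """Auto-capture (inference table) settings for a serving endpoint (sketch)."""
    return {
        "auto_capture_config": {
            "catalog_name": catalog,
            "schema_name": schema,
            "table_name_prefix": prefix,
        }
    }

# Requests and responses then land in a payload table under the given
# catalog and schema, feeding downstream latency, error-rate, and drift queries.
```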

Drift Detection:
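One common drift statistic is the Population Stability Index; a self-contained sketch over pre-binned feature proportions:

```python
import math

def psi(expected_props, actual_props, eps=1e-6):
    """Population Stability Index between two matched histograms.

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift.
    """
    total = 0.0
    for expected, actual in zip(expected_props, actual_props):
        expected = max(expected, eps)  # guard against empty bins
        actual = max(actual, eps)
        total += (actual - expected) * math.log(actual / expected)
    return total

# Example: training-time distribution vs. last week's production inputs
# psi([0.2, 0.5, 0.3], [0.1, 0.4, 0.5])
```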

