Model Promotion Architecture

Overview

The ML Pipelines project implements a rigorous MLOps workflow that ensures models are thoroughly tested before reaching production. The key principle is binary promotion: models are trained once in staging on production data, then the exact same model binary is promoted (copied) to production without retraining.

Model Lifecycle

┌─────────────────────────────────────────────────────────────────┐
│ Stage 1: Experimentation (Sandbox)                              │
│   Environment: {username}_sandbox catalog                       │
│   Data: dev.bronze.*, dev.silver.* (read-only)                  │
│   Purpose: Rapid prototyping and model development              │
└─────────────────────────────────────────────────────────────────┘

                    [Code review & PR merge]

┌─────────────────────────────────────────────────────────────────┐
│ Stage 2: Development (Dev)                                      │
│   Environment: dev catalog                                      │
│   Data: dev.bronze.*, dev.silver.*, dev.gold.*                  │
│   Purpose: Shared testing and integration                       │
│   Models: Saved to dev.models schema                            │
└─────────────────────────────────────────────────────────────────┘

            [Model shows promise in shared environment]

┌─────────────────────────────────────────────────────────────────┐
│ Stage 3: Staging Training (Staging)                             │
│   Environment: staging catalog                                  │
│   Data: prod.bronze.*, prod.silver.*, prod.gold.* (READ-ONLY)   │
│   Purpose: Train final model on production data                 │
│   Models: Saved to staging.models schema                        │
│   Key: Model trained on PROD data for realistic validation      │
└─────────────────────────────────────────────────────────────────┘

                    [Validation & approval]

┌─────────────────────────────────────────────────────────────────┐
│ Stage 4: Production Promotion (Prod)                            │
│   Environment: prod catalog                                     │
│   Data: prod.bronze.*, prod.silver.*, prod.gold.*               │
│   Purpose: Serve production predictions                         │
│   Models: PROMOTED from staging.models (NO RETRAINING)          │
│   Key: Binary promotion ensures tested model runs in prod       │
└─────────────────────────────────────────────────────────────────┘

Key Principles

1. Train Once, Promote Binary

Why: Ensures the exact model tested in staging is what runs in production.

How:

  • Staging: Train model on prod.* data → Save to staging.models

  • Production: Copy model binary from staging.models → prod.models

Benefits:

  • No surprises: Same binary tested in staging runs in prod

  • Cost efficient: Train once, not twice

  • Reproducibility: Exact model version is promoted

  • Faster deployments: No retraining time in prod


2. Staging Uses Production Data

Why: Validates model performance on real production data before promotion.

How:

  • Staging workspace reads from prod catalog (read-only)

  • Model training jobs in staging use prod.bronze.*, prod.silver.* tables

  • Staging workspace has separate staging catalog for experiments

Benefits:

  • Realistic validation: Test against actual production distribution

  • Catch data drift: Identify issues before prod deployment

  • Confidence: Know model will perform as expected


3. Version Control and Lineage

Why: Track model history, reproduce results, and enable rollbacks.

How:

  • MLflow model registry with Unity Catalog integration

  • Every model version logged with metadata

  • Model artifacts stored in Unity Catalog volumes

  • Git commit SHA tagged in model version

Benefits:

  • Audit trail: Complete history of model changes

  • Rollback capability: Easily revert to previous version

  • Reproducibility: Recreate any model version

  • Governance: Track who trained what and when


Model Registration Process

Step 1: Model Development (Sandbox)

Location: {username}_sandbox.models.*

Process:

  1. Developer experiments with model architecture in notebook

  2. Iterates on features, hyperparameters, and training data

  3. Tests model locally in sandbox environment

  4. Saves experimental models to {username}_sandbox.models

Example:
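A minimal sketch of what sandbox experimentation looks like, assuming a Databricks notebook with scikit-learn and MLflow on Unity Catalog; the naming helper and `log_experiment` function are illustrative, not the project's actual code:

```python
def sandbox_model_name(username: str, model: str) -> str:
    """Three-level Unity Catalog name for a sandbox model (illustrative convention)."""
    return f"{username}_sandbox.models.{model}"

def log_experiment(username: str, model, X, y):
    """Train and register an experimental model (sketch; runs on Databricks)."""
    import mlflow  # Databricks ML runtime dependency

    mlflow.set_registry_uri("databricks-uc")  # register into Unity Catalog
    fitted = model.fit(X, y)
    with mlflow.start_run():
        mlflow.sklearn.log_model(
            fitted,
            artifact_path="model",
            registered_model_name=sandbox_model_name(username, "sentiment_analysis"),
        )
```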

Goal: Validate model concept works before committing code.


Step 2: Code Integration (Dev)

Location: dev.models.*

Process:

  1. Developer moves model code from notebook to /src/models/internal/[model_name]/

  2. Creates model registration job in /resources/jobs/model_registration/

  3. Submits pull request with changes

  4. PR merged → CI/CD deploys to dev

  5. Model registration job runs in dev environment

  6. Model saved to dev.models schema

Registration Job Structure:

Job Configuration (register_sentiment_analysis.job.yml):
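The file itself is not reproduced here; a sketch of the shape such a job definition typically takes, assuming Databricks Asset Bundles syntax (paths, cluster settings, and task keys are illustrative):

```yaml
# Sketch only; not the project's actual register_sentiment_analysis.job.yml
resources:
  jobs:
    register_sentiment_analysis:
      name: register_sentiment_analysis
      tasks:
        - task_key: register_model
          spark_python_task:
            python_file: ../../../src/models/internal/sentiment_analysis/register.py
          job_cluster_key: ml_cluster
      job_clusters:
        - job_cluster_key: ml_cluster
          new_cluster:
            spark_version: 15.4.x-cpu-ml-scala2.12
            num_workers: 1
```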

Registration Script (simplified):
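A simplified sketch of what the registration script does, assuming PySpark and MLflow; the table name and the `train` helper are illustrative:

```python
def source_catalog(target_catalog: str) -> str:
    """Staging trains on prod data; other environments read their own catalog."""
    return "prod" if target_catalog == "staging" else target_catalog

def register(target_catalog: str, git_sha: str):
    """Train on the environment's source data and register the model (sketch)."""
    import mlflow
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    mlflow.set_registry_uri("databricks-uc")

    df = spark.table(f"{source_catalog(target_catalog)}.silver.reviews")  # illustrative table
    model = train(df)  # hypothetical training helper

    with mlflow.start_run():
        mlflow.set_tag("git_sha", git_sha)
        mlflow.sklearn.log_model(
            model,
            artifact_path="model",
            registered_model_name=f"{target_catalog}.models.sentiment_analysis",
        )
```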

Goal: Integrate model code into repository and validate in shared environment.


Step 3: Staging Training (Staging)

Location: staging.models.*

Process:

  1. CI/CD deploys code to staging workspace

  2. Model registration job runs in staging environment

  3. Job reads training data from prod.bronze.*, prod.silver.* (read-only)

  4. Model trained on production data

  5. Model saved to staging.models schema with metadata:

    • Git commit SHA

    • Training data timestamp

    • Model version number

    • Performance metrics

  6. Model serving endpoint created/updated in staging workspace

  7. Validation tests run against staging endpoint

Staging Service Principal Permissions:
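The exact grants live in the platform setup; a sketch of the shape they take, assuming Databricks SQL grant syntax (the principal name is illustrative):

```python
def staging_sp_grants(principal: str) -> list[str]:
    """SQL grants the staging service principal needs (sketch)."""
    return [
        # Read-only access to production data for training
        f"GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG prod TO `{principal}`",
        # Full access to its own catalog for writing models and experiments
        f"GRANT ALL PRIVILEGES ON CATALOG staging TO `{principal}`",
    ]

# In a setup notebook each statement would be executed with spark.sql(stmt).
```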

Model Metadata (logged to MLflow):
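A sketch of how that metadata might be attached to the training run; the tag keys are illustrative:

```python
def training_metadata(git_sha: str, data_timestamp: str, metrics: dict) -> dict:
    """Tags recorded alongside each registered model version (illustrative keys)."""
    return {
        "git_sha": git_sha,
        "training_data_timestamp": data_timestamp,
        "trained_on": "prod",
        **{f"metric_{name}": str(value) for name, value in metrics.items()},
    }

def log_metadata(run_metrics: dict, tags: dict):
    """Attach metrics and tags to the active MLflow run (sketch)."""
    import mlflow

    mlflow.log_metrics(run_metrics)
    mlflow.set_tags(tags)
```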

Goal: Train final model on production data and validate performance.


Step 4: Production Promotion (Prod)

Location: prod.models.*

Process:

  1. Staging deployment completes successfully

  2. Validation tests pass in staging

  3. Manual approval granted for production deployment

  4. CI/CD deploys to production workspace

  5. Model binary promoted from staging.models to prod.models (NO RETRAINING)

  6. Model serving endpoint created/updated in production workspace

  7. Production jobs updated to use new model version

Promotion Methods:

Method 1: MLflow Model Registry Transition (recommended):
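With Unity Catalog, legacy stage transitions are replaced by copying a version between registered models; a sketch using `MlflowClient.copy_model_version` (model names follow the catalogs described above):

```python
def staging_model_uri(version: int) -> str:
    """URI of a specific staging model version (illustrative model name)."""
    return f"models:/staging.models.sentiment_analysis/{version}"

def promote(version: int) -> int:
    """Copy the validated staging version into prod.models (no retraining; sketch)."""
    import mlflow

    client = mlflow.tracking.MlflowClient(registry_uri="databricks-uc")
    promoted = client.copy_model_version(
        src_model_uri=staging_model_uri(version),
        dst_name="prod.models.sentiment_analysis",
    )
    # Mark the new version as the serving candidate
    client.set_registered_model_alias(
        "prod.models.sentiment_analysis", "champion", promoted.version
    )
    return promoted.version
```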

Method 2: Unity Catalog Model Sharing (if same metastore):
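When staging and prod share a metastore, a prod endpoint can reference the staging model directly rather than copying it, provided the prod service principal has EXECUTE on that model; a sketch of the served-entity configuration (names illustrative):

```python
def shared_model_entity(version: str) -> dict:
    """Served-entity config pointing prod serving at the staging model (sketch)."""
    return {
        "entity_name": "staging.models.sentiment_analysis",  # referenced, not copied
        "entity_version": version,
        "workload_size": "Small",
        "scale_to_zero_enabled": True,
    }
```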

Goal: Deploy tested model binary to production without retraining.


Model Serving and Endpoints

Endpoint Lifecycle

Endpoint Configuration

Example Endpoint Creation:
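A sketch using the Databricks Python SDK; the naming convention and sizing are illustrative:

```python
def endpoint_name(model_name: str) -> str:
    """Derive an endpoint name from the model's short name (illustrative convention)."""
    return model_name.split(".")[-1].replace("_", "-")

def create_endpoint(model_name: str, version: str):
    """Create a serving endpoint for a registered model (sketch)."""
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

    w = WorkspaceClient()
    w.serving_endpoints.create(
        name=endpoint_name(model_name),
        config=EndpointCoreConfigInput(
            served_entities=[
                ServedEntityInput(
                    entity_name=model_name,
                    entity_version=version,
                    workload_size="Small",
                    scale_to_zero_enabled=True,
                )
            ]
        ),
    )
```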

Using Models in DLT Pipelines

ai_query() Function:
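`ai_query()` calls a serving endpoint from SQL; a sketch of how a pipeline might apply it per row (endpoint, table, and column names are illustrative):

```python
def scoring_expr(endpoint: str, request_col: str) -> str:
    """SQL expression that scores a column against a serving endpoint."""
    return f"ai_query('{endpoint}', {request_col})"

# Inside a DLT table definition (sketch):
#   @dlt.table(name="gold_review_sentiment")
#   def gold_review_sentiment():
#       return dlt.read("silver_reviews").withColumn(
#           "sentiment", F.expr(scoring_expr("sentiment-analysis", "review_text"))
#       )
```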

Configuration:


A/B Testing and Canary Deployments

A/B Testing Setup

Scenario: Test new model version against current production model.

Configuration:
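A sketch of the weighted traffic split, using the served-model names and versions referenced in the promotion decision (champion on version 5, challenger on version 6):

```python
def ab_config(champion_version: str, challenger_version: str, challenger_pct: int) -> dict:
    """Endpoint config serving two versions with a weighted traffic split (sketch)."""
    return {
        "served_entities": [
            {"name": "champion_model",
             "entity_name": "prod.models.sentiment_analysis",
             "entity_version": champion_version},
            {"name": "challenger_model",
             "entity_name": "prod.models.sentiment_analysis",
             "entity_version": challenger_version},
        ],
        "traffic_config": {"routes": [
            {"served_model_name": "champion_model",
             "traffic_percentage": 100 - challenger_pct},
            {"served_model_name": "challenger_model",
             "traffic_percentage": challenger_pct},
        ]},
    }
```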

Monitoring:
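With request logging enabled, the two served models can be compared side by side; a sketch of the query (table and column names are illustrative):

```python
def comparison_query(inference_table: str) -> str:
    """Per-model request volume, latency, and error rate (illustrative columns)."""
    return f"""
        SELECT
          served_model_name,
          COUNT(*) AS requests,
          AVG(execution_time_ms) AS avg_latency_ms,
          AVG(CASE WHEN status_code != 200 THEN 1 ELSE 0 END) AS error_rate
        FROM {inference_table}
        GROUP BY served_model_name
    """
```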

Promotion Decision:

  • Monitor metrics for 24-48 hours

  • If challenger_model (version 6) performs better, increase traffic to 50%, then 100%

  • If challenger_model performs worse, revert to 100% champion_model (version 5)

  • Once validated, update champion alias to point to version 6


Canary Deployment

Scenario: Gradually roll out new model version to minimize risk.

Steps:

  1. 10% traffic: Deploy new version, route 10% traffic

  2. Monitor for 6-12 hours: Check error rates, latency, accuracy

  3. 50% traffic: If metrics good, increase to 50%

  4. Monitor for 12-24 hours: Continue validation

  5. 100% traffic: Complete rollout

  6. Update champion alias: Point champion alias to new version 6


Model Versioning

Version Numbering

Important: DO NOT use version suffixes in model names (e.g., sentiment_v1, sentiment_v2). MLflow automatically handles versioning.

Correct Approach:

  • Model name: sentiment_analysis (no version suffix)

  • MLflow assigns versions: 1, 2, 3, 4, ...

  • Use aliases to mark special versions: champion, challenger, staging, production

Wrong Approach (do not do this):
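A quick illustration of the naming rule, with a helper that flags the anti-pattern:

```python
import re

def has_version_suffix(model_name: str) -> bool:
    """True when a version is baked into the name, which is the pattern to avoid."""
    return re.search(r"_v\d+$", model_name) is not None

# WRONG: each "version" becomes a separate registered model with no shared history
#   registered_model_name = "prod.models.sentiment_v2"
# RIGHT: one registered model; MLflow assigns versions 1, 2, 3, ...
#   registered_model_name = "prod.models.sentiment_analysis"
```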

Aliases

Standard Aliases:

  • champion: Current production model (actively serving predictions)

  • challenger: Model being tested in production (A/B test or canary)

  • staging: Latest model validated in staging environment

  • archive: Previous production model (for quick rollback)

Example Usage:
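A sketch of setting and resolving aliases; the `models:/<name>@<alias>` URI form loads whichever version the alias currently points at:

```python
def alias_uri(model_name: str, alias: str) -> str:
    """Model URI that resolves through an alias rather than a fixed version."""
    return f"models:/{model_name}@{alias}"

def set_champion(version: int):
    """Point the champion alias at a new version (sketch)."""
    import mlflow

    client = mlflow.tracking.MlflowClient(registry_uri="databricks-uc")
    client.set_registered_model_alias(
        "prod.models.sentiment_analysis", "champion", version
    )

# Loading always follows the alias, so promotions need no code changes:
#   model = mlflow.pyfunc.load_model(
#       alias_uri("prod.models.sentiment_analysis", "champion"))
```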

Model Tags

Best Practices: Tag models with metadata for tracking.
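A sketch of applying tags to a model version; the client is an `MlflowClient` (or compatible), and the tag keys are illustrative:

```python
def apply_tags(client, name: str, version: int, tags: dict):
    """Attach tracking tags to a model version (client: MlflowClient or compatible)."""
    for key, value in tags.items():
        client.set_model_version_tag(name, version, key, value)

# Illustrative tag set:
#   apply_tags(client, "prod.models.sentiment_analysis", 6,
#              {"git_sha": "abc123", "validated_in": "staging",
#               "training_data_timestamp": "2024-01-15"})
```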


Rollback Procedures

Scenario 1: Model Performance Degradation

Symptoms: Production model accuracy drops, error rate increases

Steps:

  1. Identify previous good version:

  2. Update endpoint to use archive version:

  3. Update champion alias to archive version:
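The three steps above, sketched as one function; the client is an `MlflowClient` (or compatible), and the endpoint update itself is elided:

```python
def rollback_to_archive(client, name: str = "prod.models.sentiment_analysis") -> int:
    """Revert serving to the previously archived version (sketch).

    client: mlflow.tracking.MlflowClient or compatible.
    """
    # 1. Identify the previous good version via its alias
    archived = client.get_model_version_by_alias(name, "archive")
    # 2. Update the serving endpoint to this version (elided; SDK config update)
    # 3. Repoint champion so downstream loads follow the rollback
    client.set_registered_model_alias(name, "champion", archived.version)
    return archived.version
```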

Timeline: 2-5 minutes


Scenario 2: Model Serving Endpoint Failure

Symptoms: Endpoint returning errors, not responding

Steps:

  1. Check endpoint status:

  2. Restart endpoint:

  3. If restart doesn't work, recreate endpoint with previous version
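A sketch of steps 1 and 2 with the Databricks Python SDK; passing the workspace client in keeps the helpers testable, and the "restart" is an in-place config re-apply:

```python
def endpoint_state(w, name: str):
    """Fetch the endpoint's state for inspection (w: WorkspaceClient or compatible)."""
    return w.serving_endpoints.get(name).state

def redeploy_current_config(w, name: str):
    """Trigger a redeploy by re-applying the endpoint's served entities (sketch)."""
    current = w.serving_endpoints.get(name)
    w.serving_endpoints.update_config(
        name=name,
        served_entities=current.config.served_entities,
    )
```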

Timeline: 5-10 minutes


Scenario 3: Data Drift Detected

Symptoms: Model performance degrades gradually over time

Steps:

  1. Retrain model in staging on latest production data

  2. Validate new model version in staging

  3. Promote new model version to production

  4. Monitor performance improvement

Timeline: 1-2 hours (depends on training time)


Model Performance Monitoring

Metrics to Track

Accuracy Metrics:

  • Precision, Recall, F1-Score

  • Confusion matrix

Operational Metrics:

  • Prediction latency (p50, p95, p99)

  • Request throughput (requests/second)

  • Error rate (errors/total requests)

  • Model uptime

Data Quality Metrics:

  • Input data distribution

  • Feature drift detection

  • Null value rates

  • Out-of-range values

Monitoring Implementation

Log Model Predictions:
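Endpoint requests and responses can be captured automatically to a Delta table; a sketch of the auto-capture settings (catalog, schema, and prefix are illustrative):

```python
def inference_logging_config(catalog: str, schema: str, prefix: str) -> dict:
    """Auto-capture (inference table) settings for a serving endpoint (sketch)."""
    return {
        "auto_capture_config": {
            "catalog_name": catalog,
            "schema_name": schema,
            "table_name_prefix": prefix,
        }
    }

# Requests and responses then land in a payload table under the given
# catalog and schema, feeding downstream latency, error-rate, and drift queries.
```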

Drift Detection:
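One common drift statistic is the Population Stability Index; a self-contained sketch over pre-binned feature proportions:

```python
import math

def psi(expected_props, actual_props, eps=1e-6):
    """Population Stability Index between two matched histograms.

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift.
    """
    total = 0.0
    for expected, actual in zip(expected_props, actual_props):
        expected = max(expected, eps)  # guard against empty bins
        actual = max(actual, eps)
        total += (actual - expected) * math.log(actual / expected)
    return total

# Example: training-time distribution vs. last week's production inputs
# psi([0.2, 0.5, 0.3], [0.1, 0.4, 0.5])
```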

