Model Serving Issues Runbook

Purpose

This runbook provides step-by-step procedures for diagnosing and resolving issues with model serving endpoints and ai_query functions. Use it when models return errors, perform poorly, or are unavailable.

Quick Assessment Checklist

When a model serving issue is detected:

Common Issues and Symptoms

Issue 1: Model Endpoint Not Found

Symptoms:

Detection:

  • Pipeline fails immediately at ai_query call

  • Error message explicitly states endpoint not found

  • No traffic to endpoint in monitoring

Decision Tree:

Step 1: Create Model Endpoint

  1. Verify model is registered:

  2. If model doesn't exist, register it first:

  3. Wait for model registration (check job status):

  4. Create model serving endpoint:

  5. Or use MLflow API:

  6. Verify: Check endpoint status:
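Steps 1-6 above can be sketched against the Databricks serving-endpoints REST API. This is a minimal sketch: the workspace URL, token, and model/endpoint names are placeholders, and the payload shape assumes a single served entity on a small, scale-to-zero instance.

```python
import json
import urllib.request

HOST = "https://example-workspace.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "dapi-REDACTED"                                  # placeholder access token

def _call(method, path, payload=None):
    """Minimal helper for calling the Databricks REST API."""
    req = urllib.request.Request(
        f"{HOST}{path}",
        data=json.dumps(payload).encode() if payload else None,
        method=method,
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def endpoint_create_payload(endpoint_name, model_name, model_version):
    """Build the create-endpoint body (step 4); small CPU instance as a starting point."""
    return {
        "name": endpoint_name,
        "config": {
            "served_entities": [{
                "entity_name": model_name,          # registered model, e.g. "main.ml.sentiment_model"
                "entity_version": str(model_version),
                "workload_size": "Small",
                "scale_to_zero_enabled": True,
            }]
        },
    }

# Usage against a live workspace (steps 4 and 6):
#   _call("POST", "/api/2.0/serving-endpoints",
#         endpoint_create_payload("sentiment-endpoint", "main.ml.sentiment_model", 1))
#   _call("GET", "/api/2.0/serving-endpoints/sentiment-endpoint")["state"]
```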

Step 2: Recreate Deleted Endpoint

  1. Check if endpoint was recently deleted (audit logs):

  2. Retrieve endpoint configuration from previous deployment or documentation

  3. Recreate endpoint using Step 1 instructions

  4. Update the pipeline to use the recreated endpoint (a pipeline restart may be required)

Step 3: Fix Endpoint Reference

  1. List all available endpoints:

  2. Check pipeline SQL for endpoint name:

  3. If name is wrong, update transformation:

  4. Commit and deploy fix:
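Steps 2-3 above (finding and fixing a mistyped endpoint name in the pipeline SQL) can be sketched with a small helper. The regex assumes the pipeline passes the endpoint name as a single-quoted first argument to ai_query, and the SQL sample is illustrative:

```python
import re

def ai_query_endpoints(sql_text):
    """Return the endpoint names referenced by ai_query(...) calls in a SQL string (step 2)."""
    return re.findall(r"ai_query\(\s*'([^']+)'", sql_text)

def fix_endpoint_reference(sql_text, wrong, correct):
    """Rewrite a mistyped endpoint name in the pipeline SQL (step 3)."""
    return sql_text.replace(f"ai_query('{wrong}'", f"ai_query('{correct}'")
```

Run the first helper against your transformation files, compare the names it finds with the output of the endpoint listing in step 1, then commit the corrected SQL.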

When to Escalate: If the endpoint cannot be created due to permissions or resource limits, escalate to the workspace admin


Issue 2: Model Returns Null Results

Symptoms:

  • ai_query completes without error

  • All results are NULL

  • Pipeline succeeds but no predictions

  • Empty sentiment_result columns

Detection:

Decision Tree:

Step 1: Check Endpoint Readiness

  1. Get endpoint status:

  2. Check state field:

    • READY → Endpoint is ready, proceed to Step 2

    • UPDATING → Wait for update to complete (5-10 minutes)

    • UPDATE_FAILED → Check error message and recreate

  3. If UPDATING, monitor progress:

  4. If stuck in UPDATING for >15 minutes, restart endpoint:
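The readiness check above can be sketched as a small state classifier plus a polling loop. The `state` field names (`ready`, `config_update`) follow the shape returned by `GET /api/2.0/serving-endpoints/{name}`; fetching the state from the API is left to the caller:

```python
import time

def classify_state(state):
    """Map an endpoint 'state' object to the runbook's decision (step 2)."""
    if state.get("config_update") == "UPDATE_FAILED":
        return "recreate"   # check the error message and recreate the endpoint
    if state.get("config_update") == "IN_PROGRESS":
        return "wait"       # UPDATING: allow 5-10 minutes
    if state.get("ready") == "READY":
        return "proceed"    # move on to Step 2 (input validation)
    return "wait"

def wait_until_ready(get_state, timeout_s=15 * 60, poll_s=30):
    """Poll until READY; per step 4, give up after 15 minutes and restart the endpoint."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if classify_state(get_state()) == "proceed":
            return True
        time.sleep(poll_s)
    return False  # stuck in UPDATING -> restart the endpoint
```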

Step 2: Validate Input Format

  1. Test endpoint with sample input:

  2. Check response format:

    • If error: Fix input schema

    • If success: Input format is correct, check pipeline

  3. Verify ai_query call matches model signature:

  4. If input needs transformation:

Step 3: Check Model Signature

  1. Get model details:

  2. Verify signature matches ai_query usage:

  3. If signature mismatch, re-register model with correct signature:

  4. Update endpoint to use new model version:

When to Escalate: If the model consistently returns nulls despite correct input, escalate to ML engineering


Issue 3: Schema Parse Errors

Symptoms:

Root Cause:

  • Model returns inconsistent field names (dynamic keys)

  • Model returns raw data types (floats) instead of strings

  • Model output doesn't match registered signature

Decision Tree:

Step 1: Fix Model Dynamic Keys

  1. Identify dynamic keys in model output:

  2. Update model to return fixed schema:

  3. Re-register model:

  4. Update endpoint to new version
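Step 2 above (fixing dynamic keys) amounts to collapsing the variable key into a fixed schema inside the model's predict wrapper. The raw output shape here, with the label baked into the key name, is an illustrative assumption:

```python
def to_fixed_schema(raw):
    """Collapse a dynamic-key model output into a fixed {label, score} schema.
    Example raw output: {"sentiment_positive": 0.91} -- the label is hidden in
    the key, which breaks schema parsing downstream."""
    (key, score), = raw.items()            # expects exactly one dynamic key
    label = key.removeprefix("sentiment_")
    return {"label": str(label), "score": float(score)}
```

After wrapping the model so every prediction passes through this normalization, re-register it and update the endpoint as in steps 3-4.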

Step 2: Use Two-Stage Pipeline Pattern

  1. Stage 1: Get raw AI results (as string):

  2. Stage 2: Parse and cast fields:

  3. Add error handling table:

  4. Deploy updated pipeline
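The parsing stage of the two-stage pattern can be sketched in plain Python. In the real pipeline both stages are tables and the parsing is SQL, but the logic is the same: keep the raw ai_query result as a string, then parse and cast it, routing failures to an error table (step 3) instead of failing the whole schema. The `label`/`score` field names are illustrative:

```python
import json

def parse_stage(raw_rows):
    """Stage 2: raw ai_query strings in, typed rows out; unparseable rows
    are diverted to an error list rather than poisoning the target schema."""
    parsed, errors = [], []
    for row_id, raw in raw_rows:
        try:
            obj = json.loads(raw)
            parsed.append({"id": row_id,
                           "label": str(obj["label"]),
                           "score": float(obj["score"])})
        except (ValueError, KeyError, TypeError) as exc:
            errors.append({"id": row_id, "raw": raw, "error": str(exc)})
    return parsed, errors
```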

Step 3: Re-register Model with Golden Example

  1. Create golden example (perfect output):

  2. Use golden example for signature:

  3. Re-register and update endpoint
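A golden example is one known-perfect input/output pair, used both to pin the signature at registration time and to reject drifting outputs. The field names and values below are illustrative; the mlflow calls in the comments assume mlflow is installed:

```python
# Golden example: one perfect input/output pair (step 1). Align the field
# names and types with what your pipeline expects.
GOLDEN_INPUT = {"text": "The service was excellent"}
GOLDEN_OUTPUT = {"label": "positive", "score": 0.97}

def validate_against_golden(output):
    """Reject outputs whose field names or types drift from the golden example."""
    if set(output) != set(GOLDEN_OUTPUT):
        return False
    return all(isinstance(output[k], type(GOLDEN_OUTPUT[k])) for k in GOLDEN_OUTPUT)

# Using the golden example for the signature (step 2), assuming mlflow:
#   from mlflow.models import infer_signature
#   signature = infer_signature([GOLDEN_INPUT], [GOLDEN_OUTPUT])
#   mlflow.pyfunc.log_model(..., signature=signature)
```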

When to Escalate: If schema issues persist, escalate to ML engineering for model redesign


Issue 4: High Latency / Timeout

Symptoms:

  • ai_query requests taking >120 seconds

  • Timeout errors in pipeline logs

  • Model endpoint under heavy load

  • p95 latency spikes in monitoring

Detection:

Decision Tree:

Step 1: Scale Up Endpoint

  1. Check current endpoint size:

  2. Increase workload size:

  3. Wait for endpoint update (5-10 minutes)

  4. Verify: Monitor latency improvements
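Step 2 above (bumping the workload size) can be sketched as building the body for `PUT /api/2.0/serving-endpoints/{name}/config`. The helper keeps the currently served model and only changes `workload_size`; the size names follow the serving config's Small/Medium/Large tiers:

```python
WORKLOAD_SIZES = ["Small", "Medium", "Large"]

def scale_up_payload(current_config, new_size):
    """Build the update-config body: same served entity, larger workload size."""
    if new_size not in WORKLOAD_SIZES:
        raise ValueError(f"unknown workload size: {new_size}")
    entity = dict(current_config["served_entities"][0])  # copy; don't mutate input
    entity["workload_size"] = new_size
    return {"served_entities": [entity]}
```

Fetch `current_config` from the endpoint's GET response, PUT the returned body, then wait out the 5-10 minute update before re-checking latency.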

Step 2: Optimize Model Performance

  1. Profile model inference time:

  2. If slow, optimize model:

    • Reduce model size (quantization)

    • Use faster model architecture

    • Cache embeddings or features

    • Batch predictions efficiently

  3. Re-train and register optimized model

  4. Update endpoint to new version
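Step 1 above (profiling inference time) can be sketched with a simple timing harness around any locally loaded model's predict callable. Reporting p50/p95 mirrors the latency metrics in monitoring:

```python
import time

def profile_latency(predict, sample, runs=50):
    """Time repeated single-record predictions; return p50/p95 in seconds.
    `predict` is any callable, e.g. a locally loaded model's predict function."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append(time.perf_counter() - start)
    timings.sort()
    return {"p50": timings[len(timings) // 2],
            "p95": timings[int(len(timings) * 0.95)]}
```

If per-record latency is already high locally, scaling the endpoint won't help much; optimize the model itself (quantization, architecture, caching, batching).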

Step 3: Reduce Pipeline Concurrency

  1. Update pipeline configuration:

  2. Redeploy pipeline with new settings

  3. Monitor request rate to endpoint
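While the pipeline configuration change rolls out, the same effect can be approximated client-side with a simple rate cap on scoring requests. This is a generic throttle sketch, not a Databricks API:

```python
import time

class Throttle:
    """Cap outbound request rate while the endpoint recovers; the pipeline-level
    analogue is lowering worker count / autoscale limits in the settings (step 1)."""
    def __init__(self, max_per_sec):
        self.min_interval = 1.0 / max_per_sec
        self._last = 0.0

    def wait(self):
        """Block just long enough to keep under the configured rate."""
        now = time.monotonic()
        delay = self.min_interval - (now - self._last)
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()
```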

When to Escalate: If latency issues persist after scaling, escalate for infrastructure review


Issue 5: Model Performance Degradation

Symptoms:

  • Model accuracy dropped in production

  • Increased error rate

  • User complaints about predictions

  • Data drift detected

Detection:

Decision Tree:

Step 1: Rollback to Previous Model Version

  1. Identify current champion version:

  2. Find previous working version (check tags for performance metrics)

  3. Update endpoint to previous version:

  4. Update serving endpoint:

  5. Verify: Monitor accuracy returns to baseline

Timeline: 5-10 minutes
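Step 2 above (finding the previous working version) can be sketched as a tag-driven lookup; the `validated` tag name is an assumption, so match it to whatever your registry tags performance-checked versions with. The alias update in the comments assumes MLflow 2.3+:

```python
def previous_good_version(versions, current):
    """Pick the newest version older than the current champion that carries a
    validation tag. `versions`: list of {"version": int, "tags": {...}} dicts."""
    candidates = [v for v in versions
                  if v["version"] < current
                  and v.get("tags", {}).get("validated") == "true"]  # assumed tag name
    if not candidates:
        raise LookupError("no validated earlier version to roll back to")
    return max(candidates, key=lambda v: v["version"])["version"]

# Repointing the champion alias (steps 3-4), assuming mlflow >= 2.3:
#   from mlflow import MlflowClient
#   MlflowClient().set_registered_model_alias("main.ml.sentiment_model", "champion", rollback_to)
# then update the serving endpoint config to serve that version.
```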

Step 2: Retrain Model on Recent Data

  1. Trigger model training in staging:

  2. Wait for training completion (30-60 minutes typically)

  3. Validate new model in staging:

  4. If accuracy improved, promote to production:

    • Follow CI/CD promotion process

    • Or manually promote staging model to prod

  5. Monitor production performance for 24-48 hours

Timeline: 2-4 hours (including validation)

Step 3: A/B Test New Model Version

  1. Deploy new model version alongside current:

  2. Monitor both versions for 24 hours:

  3. If new version better, gradually increase traffic:

    • 10% → 25% → 50% → 100%

  4. Once at 100%, update champion alias

Timeline: 2-3 days (for proper validation)
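The side-by-side deployment and traffic split (steps 1 and 3) can be sketched as an endpoint config with two served entities and a `traffic_config`. The route naming convention (`<model>-<version>`) is an assumption to check against how your endpoint actually names its served models:

```python
def ab_config(model_name, champion_version, challenger_version, challenger_pct):
    """Serve two model versions side by side and split traffic between them."""
    if not 0 <= challenger_pct <= 100:
        raise ValueError("challenger_pct must be 0-100")

    def entity(version):
        return {"entity_name": model_name, "entity_version": str(version),
                "workload_size": "Small", "scale_to_zero_enabled": False}

    short = model_name.split(".")[-1]  # assumed served-model naming scheme
    return {
        "served_entities": [entity(champion_version), entity(challenger_version)],
        "traffic_config": {"routes": [
            {"served_model_name": f"{short}-{champion_version}",
             "traffic_percentage": 100 - challenger_pct},
            {"served_model_name": f"{short}-{challenger_version}",
             "traffic_percentage": challenger_pct},
        ]},
    }
```

Start with `challenger_pct=10` and walk it through the 10% → 25% → 50% → 100% ramp as the metrics hold up.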

When to Escalate: If retraining doesn't improve performance, escalate to data science team


Schema Compatibility Checks

Before deploying model updates, verify schema compatibility:

Check 1: Model Input Schema

Check 2: Model Output Schema

Check 3: Pipeline Compatibility
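Checks 1-3 can be sketched as a schema diff between the current and candidate model versions: the new version must not require inputs the pipeline doesn't supply, and must still produce every output field (with unchanged types) that the pipeline reads. The `{name: type}` schema shape is an assumption:

```python
def compatible(current_schema, new_schema):
    """Compare {"inputs": {name: type}, "outputs": {name: type}} schemas."""
    extra_required = set(new_schema["inputs"]) - set(current_schema["inputs"])
    missing = set(current_schema["outputs"]) - set(new_schema["outputs"])
    changed = {n for n in set(current_schema["outputs"]) & set(new_schema["outputs"])
               if current_schema["outputs"][n] != new_schema["outputs"][n]}
    return {"input_ok": not extra_required,
            "output_ok": not (missing or changed),
            "new_required_inputs": sorted(extra_required),
            "output_problems": sorted(missing | changed)}
```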


Performance Optimization

When to Retrain vs Rollback

Rollback if:

  • Recent model update caused immediate degradation

  • Previous version had good performance

  • Need quick fix (<15 minutes)

  • Data distribution hasn't changed

Retrain if:

  • Gradual performance degradation over weeks

  • Data drift detected

  • New data patterns emerged

  • Feature distribution changed

Model Serving Best Practices

  1. Use appropriate workload size:

    • Small: <100 req/min, development

    • Medium: 100-1000 req/min, production

    • Large: >1000 req/min, high-scale production

  2. Enable scale-to-zero for dev:

  3. Keep production warm:

  4. Set appropriate timeouts:
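Practices 1-3 can be summarized in one config helper: size by environment, and enable scale-to-zero only in dev so production stays warm and avoids cold-start latency. The sizes and the production choice of "Medium" are starting-point assumptions, not a sizing rule:

```python
def served_entity(model_name, version, *, production):
    """Served-entity config fragment: warm Medium for prod, scale-to-zero Small for dev."""
    return {
        "entity_name": model_name,
        "entity_version": str(version),
        "workload_size": "Medium" if production else "Small",
        "scale_to_zero_enabled": not production,   # practice 2 vs practice 3
    }

# Practice 4 (timeouts) lives on the client side -- e.g. pass an explicit
# timeout to whatever HTTP client issues the scoring request.
```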


Verification Checklist

After resolving model serving issues:

Final Verification:


Last updated