Model Serving Issues Runbook
Purpose
This runbook provides step-by-step procedures for diagnosing and resolving issues with model serving endpoints and ai_query functions. Use it when models are returning errors, performing poorly, or are unavailable.
Quick Assessment Checklist
When a model serving issue is detected:
Common Issues and Symptoms
Issue 1: Model Endpoint Not Found
Symptoms:
Detection:
Pipeline fails immediately at ai_query call
Error message explicitly states endpoint not found
No traffic to endpoint in monitoring
Decision Tree:
Step 1: Create Model Endpoint
Verify model is registered:
If model doesn't exist, register it first:
Wait for model registration (check job status):
Create model serving endpoint:
Or use MLflow API:
Verify: Check endpoint status:
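As a minimal sketch of the creation step, the helper below builds the request body for the Databricks serving-endpoints REST API (POST /api/2.0/serving-endpoints). The field names follow that API's documented payload shape, but the endpoint and model names here are placeholders, not values from this environment:

```python
import json

def build_endpoint_payload(endpoint_name, model_name, model_version,
                           workload_size="Small", scale_to_zero=True):
    """Build the request body for POST /api/2.0/serving-endpoints."""
    return {
        "name": endpoint_name,
        "config": {
            "served_entities": [{
                "entity_name": model_name,        # e.g. "main.ml.sentiment_model"
                "entity_version": str(model_version),
                "workload_size": workload_size,   # Small | Medium | Large
                "scale_to_zero_enabled": scale_to_zero,
            }]
        },
    }

payload = build_endpoint_payload("sentiment-endpoint", "main.ml.sentiment_model", 3)
print(json.dumps(payload, indent=2))
```

Send the payload with your workspace token, then poll the endpoint's status until its state reports READY.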
Step 2: Recreate Deleted Endpoint
Check if endpoint was recently deleted (audit logs):
Retrieve endpoint configuration from previous deployment or documentation
Recreate endpoint using Step 1 instructions
Update pipeline to use endpoint (may need to restart)
Step 3: Fix Endpoint Reference
List all available endpoints:
Check pipeline SQL for endpoint name:
If name is wrong, update transformation:
Commit and deploy fix:
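A quick way to catch a wrong endpoint reference is to extract every endpoint name the pipeline SQL passes to ai_query and diff it against the endpoints that actually exist. This sketch assumes the endpoint name is ai_query's first, quoted argument; the names in the example are illustrative:

```python
import re

def find_ai_query_endpoints(sql_text):
    """Extract the endpoint name passed as ai_query()'s first argument."""
    return re.findall(r"ai_query\(\s*['\"]([^'\"]+)['\"]", sql_text, re.IGNORECASE)

def missing_endpoints(sql_text, available):
    """Return referenced endpoints that do not exist in the workspace."""
    return sorted(set(find_ai_query_endpoints(sql_text)) - set(available))

sql = "SELECT ai_query('sentiment-endpont', review_text) AS sentiment_result FROM reviews"
print(missing_endpoints(sql, ["sentiment-endpoint", "summarizer-endpoint"]))
```

A non-empty result points directly at the typo to fix in the transformation.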
When to Escalate: If the endpoint cannot be created due to permissions or resource limits, escalate to an admin
Issue 2: Model Returns Null Results
Symptoms:
ai_query completes without error
All results are NULL
Pipeline succeeds but no predictions
Empty sentiment_result columns
Detection:
Decision Tree:
Step 1: Check Endpoint Readiness
Get endpoint status:
Check state field:
READY → Endpoint is ready, proceed to Step 2
UPDATING → Wait for update to complete (5-10 minutes)
UPDATE_FAILED → Check error message and recreate
If UPDATING, monitor progress:
If stuck in UPDATING for >15 minutes, restart endpoint:
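The state-to-action logic above can be encoded as a small decision function, useful in an on-call automation script. The state names match this runbook; the 15-minute threshold comes from the step above:

```python
def next_action(state, minutes_in_state=0):
    """Map a serving endpoint's state to the runbook action."""
    if state == "READY":
        return "proceed"      # endpoint is healthy; continue to input-format checks
    if state == "UPDATING":
        # Waiting is normal for 5-10 minutes; past 15 minutes, restart.
        return "restart" if minutes_in_state > 15 else "wait"
    if state == "UPDATE_FAILED":
        return "recreate"     # inspect the error message, then recreate the endpoint
    return "escalate"         # unknown state: hand off for investigation
```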
Step 2: Validate Input Format
Test endpoint with sample input:
Check response format:
If error: Fix input schema
If success: Input format is correct, check pipeline
Verify ai_query call matches model signature:
If input needs transformation:
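For the sample-input test, serving endpoints accept a dataframe_split body on the invocations route. The sketch below builds that body and sanity-checks it locally before sending; the column name "text" is a placeholder for whatever the model signature declares:

```python
def sample_request(columns, rows):
    """Build a dataframe_split body for POST /serving-endpoints/{name}/invocations."""
    return {"dataframe_split": {"columns": columns, "data": rows}}

def check_row_width(payload):
    """Return indices of data rows that don't have one value per declared column."""
    cols = payload["dataframe_split"]["columns"]
    return [i for i, row in enumerate(payload["dataframe_split"]["data"])
            if len(row) != len(cols)]

req = sample_request(["text"], [["great product"], ["slow shipping"]])
assert check_row_width(req) == []   # safe to send to the endpoint
```

If the endpoint rejects a payload that passes this check, the mismatch is in field names or types, not row shape.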
Step 3: Check Model Signature
Get model details:
Verify signature matches ai_query usage:
If signature mismatch, re-register model with correct signature:
Update endpoint to use new model version:
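A signature mismatch usually shows up as a disagreement between the columns the ai_query call supplies and the inputs the model declares. This hypothetical helper makes the diff explicit before you decide whether to re-register:

```python
def signature_mismatch(model_input_fields, query_columns):
    """Diff the model's declared inputs against the columns fed by ai_query.
    Both lists empty in the result means the signature and query agree."""
    return {
        "unknown_columns": sorted(set(query_columns) - set(model_input_fields)),
        "unfed_inputs": sorted(set(model_input_fields) - set(query_columns)),
    }

# Example: the query passes a column the model never declared.
print(signature_mismatch(["text"], ["text", "rating"]))
```

If "unknown_columns" is non-empty, fix the query; if "unfed_inputs" is non-empty, the model needs inputs the pipeline isn't providing, so re-register with a corrected signature.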
When to Escalate: If model consistently returns nulls despite correct input, escalate to ML engineering
Issue 3: Schema Parse Errors
Symptoms:
Root Cause:
Model returns inconsistent field names (dynamic keys)
Model returns raw data types (floats) instead of strings
Model output doesn't match registered signature
Decision Tree:
Step 1: Fix Model Dynamic Keys
Identify dynamic keys in model output:
Update model to return fixed schema:
Re-register model:
Update endpoint to new version
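The fixed-schema fix can be applied in the model's output path: coerce whatever the model emits into a known set of string-typed keys and drop dynamic extras. The key names "label" and "score" are illustrative, not this model's actual schema:

```python
def normalize_output(raw, fixed_keys=("label", "score")):
    """Coerce a model's dynamic output dict to a fixed schema:
    keep only the known keys, stringify values, null out anything missing."""
    out = {}
    for key in fixed_keys:
        value = raw.get(key)
        out[key] = None if value is None else str(value)
    return out

# Dynamic key "extra_debug" is dropped; the float score becomes a string.
print(normalize_output({"label": "positive", "score": 0.97, "extra_debug": 1}))
```

Returning strings for every field sidesteps the raw-data-type cause above; downstream casting happens in the pipeline, where failures can be routed to an error table.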
Step 2: Use Two-Stage Pipeline Pattern
Stage 1: Get raw AI results (as string):
Stage 2: Parse and cast fields:
Add error handling table:
Deploy updated pipeline
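Stage 2 of the pattern boils down to: parse the raw string, cast fields, and shunt failures to the error-handling table instead of failing the pipeline. A minimal sketch, assuming the model emits a JSON object with hypothetical "label" and "score" fields:

```python
import json

def parse_ai_result(raw):
    """Stage 2: parse the raw ai_query string and cast fields.
    On any failure, return a row destined for the error-handling table."""
    try:
        obj = json.loads(raw)
        return {"label": str(obj["label"]), "score": float(obj["score"]), "error": None}
    except (ValueError, KeyError, TypeError) as exc:
        return {"label": None, "score": None, "error": f"{type(exc).__name__}: {exc}"}

print(parse_ai_result('{"label": "positive", "score": "0.97"}'))
print(parse_ai_result("not json at all"))
```

Rows with a non-null "error" go to the error table; everything else flows downstream with properly typed columns.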
Step 3: Re-register Model with Golden Example
Create golden example (perfect output):
Use golden example for signature:
Re-register and update endpoint
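Once a golden example exists, it doubles as a conformance check: any live output must have exactly the golden example's keys with matching types. The golden values here are placeholders for your model's actual perfect output:

```python
GOLDEN = {"label": "positive", "score": 0.97}   # hypothetical golden example

def conforms_to_golden(output, golden=GOLDEN):
    """True if output has exactly the golden example's keys with matching types."""
    if output.keys() != golden.keys():
        return False
    return all(isinstance(output[k], type(golden[k])) for k in golden)

assert conforms_to_golden({"label": "negative", "score": 0.12})
assert not conforms_to_golden({"label": "negative"})            # missing key
```

Running this check against a sample of production outputs before re-registering confirms the signature derived from the golden example will actually hold.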
When to Escalate: If schema issues persist, escalate to ML engineering for model redesign
Issue 4: High Latency / Timeout
Symptoms:
ai_query requests taking >120 seconds
Timeout errors in pipeline logs
Model endpoint under heavy load
p95 latency spikes in monitoring
Detection:
Decision Tree:
Step 1: Scale Up Endpoint
Check current endpoint size:
Increase workload size:
Wait for endpoint update (5-10 minutes)
Verify: Monitor latency improvements
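Scaling up amounts to bumping workload_size one step in the served-entity config and submitting it back to the serving-endpoints API (PUT .../config). A sketch, assuming the Small/Medium/Large tiers described later in this runbook:

```python
_SIZES = ["Small", "Medium", "Large"]

def scale_up_config(served_entity, step=1):
    """Return a copy of the served-entity config with workload_size bumped
    one tier (capped at Large), ready to submit as an endpoint update."""
    idx = min(_SIZES.index(served_entity["workload_size"]) + step, len(_SIZES) - 1)
    return {**served_entity, "workload_size": _SIZES[idx]}

current = {"entity_name": "main.ml.sentiment_model", "workload_size": "Small"}
print(scale_up_config(current))
```

Keeping the rest of the config untouched avoids accidentally resetting scale-to-zero or the served version during the update.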
Step 2: Optimize Model Performance
Profile model inference time:
If slow, optimize model:
Reduce model size (quantization)
Use faster model architecture
Cache embeddings or features
Batch predictions efficiently
Re-train and register optimized model
Update endpoint to new version
Step 3: Reduce Pipeline Concurrency
Update pipeline configuration:
Redeploy pipeline with new settings
Monitor request rate to endpoint
When to Escalate: If latency issues persist after scaling, escalate for infrastructure review
Issue 5: Model Performance Degradation
Symptoms:
Model accuracy dropped in production
Increased error rate
User complaints about predictions
Data drift detected
Detection:
Decision Tree:
Step 1: Rollback to Previous Model Version
Identify current champion version:
Find previous working version (check tags for performance metrics)
Update endpoint to previous version:
Update serving endpoint:
Verify: Monitor accuracy returns to baseline
Timeline: 5-10 minutes
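Selecting the rollback target can be automated from the version list: take the newest version older than the current champion whose recorded accuracy clears a threshold. The 0.90 threshold and the "accuracy" tag name here are assumptions for illustration; once chosen, the version is applied by re-pointing the champion alias and updating the endpoint:

```python
def rollback_target(versions, current_version, min_accuracy=0.90):
    """Pick the newest version older than the current champion whose
    tagged accuracy meets the (assumed) minimum bar. None if no candidate."""
    candidates = [v for v in versions
                  if v["version"] < current_version
                  and float(v.get("accuracy", 0)) >= min_accuracy]
    return max(candidates, key=lambda v: v["version"]) if candidates else None

versions = [
    {"version": 1, "accuracy": "0.93"},
    {"version": 2, "accuracy": "0.88"},   # known-bad: skipped
    {"version": 3, "accuracy": "0.95"},   # current champion
]
print(rollback_target(versions, current_version=3))
```

Version 2 is skipped despite being most recent, because its recorded accuracy is below the bar; version 1 becomes the rollback target.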
Step 2: Retrain Model on Recent Data
Trigger model training in staging:
Wait for training completion (30-60 minutes typically)
Validate new model in staging:
If accuracy improved, promote to production:
Follow CI/CD promotion process
Or manually promote staging model to prod
Monitor production performance for 24-48 hours
Timeline: 2-4 hours (including validation)
Step 3: A/B Test New Model Version
Deploy new model version alongside current:
Monitor both versions for 24 hours:
If new version better, gradually increase traffic:
10% → 25% → 50% → 100%
Once at 100%, update champion alias
Timeline: 2-3 days (for proper validation)
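The gradual ramp above (10% → 25% → 50% → 100%) can be driven by a tiny helper that yields the challenger's next traffic share and the champion's remainder, ready to drop into the endpoint's traffic config:

```python
RAMP = [10, 25, 50, 100]   # challenger traffic percentages, in order

def next_traffic_split(current_challenger_pct):
    """Advance the challenger one step along the ramp; champion gets the rest."""
    for pct in RAMP:
        if pct > current_challenger_pct:
            return {"challenger": pct, "champion": 100 - pct}
    return {"challenger": 100, "champion": 0}

print(next_traffic_split(0))    # first step of the ramp
print(next_traffic_split(50))   # final step
```

Only advance a step after the 24-hour monitoring window shows the challenger matching or beating the champion; at 100%, update the champion alias.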
When to Escalate: If retraining doesn't improve performance, escalate to the data science team
Schema Compatibility Checks
Before deploying model updates, verify schema compatibility:
Check 1: Model Input Schema
Check 2: Model Output Schema
Check 3: Pipeline Compatibility
Performance Optimization
When to Retrain vs Rollback
Rollback if:
Recent model update caused immediate degradation
Previous version had good performance
Need quick fix (<15 minutes)
Data distribution hasn't changed
Retrain if:
Gradual performance degradation over weeks
Data drift detected
New data patterns emerged
Feature distribution changed
Model Serving Best Practices
Use appropriate workload size:
Small: <100 req/min, development
Medium: 100-1000 req/min, production
Large: >1000 req/min, high-scale production
Enable scale-to-zero for dev:
Keep production warm:
Set appropriate timeouts:
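The sizing guidance above reduces to a lookup that a deployment script can apply directly; per the scale-to-zero advice, dev endpoints get scale-to-zero while production stays warm:

```python
def recommended_workload_size(requests_per_minute, is_production):
    """Apply the runbook's sizing table; dev endpoints also get scale-to-zero."""
    if requests_per_minute > 1000:
        size = "Large"        # high-scale production
    elif requests_per_minute >= 100:
        size = "Medium"       # typical production
    else:
        size = "Small"        # development / low traffic
    return {"workload_size": size, "scale_to_zero_enabled": not is_production}

print(recommended_workload_size(50, is_production=False))
print(recommended_workload_size(2500, is_production=True))
```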
Verification Checklist
After resolving model serving issues:
Final Verification:
Related Documentation
Model Deployment Guide - ai_query best practices and patterns
Model Promotion Architecture - Model lifecycle and versioning
Troubleshooting Guide - Additional model serving errors
Pipeline Failure Runbook - Pipeline-specific issues
Monitoring Guide - Setting up model performance alerts