Model Serving Issues Runbook

Purpose

This runbook provides step-by-step procedures for diagnosing and resolving issues with model serving endpoints and ai_query functions. Use it when models return errors, perform poorly, or are unavailable.

Quick Assessment Checklist

When a model serving issue is detected:

Common Issues and Symptoms

Issue 1: Model Endpoint Not Found

Symptoms:

Detection:

  • Pipeline fails immediately at ai_query call

  • Error message explicitly states endpoint not found

  • No traffic to endpoint in monitoring

Decision Tree:

Step 1: Create Model Endpoint

  1. Verify model is registered:

  2. If model doesn't exist, register it first:

  3. Wait for model registration (check job status):

  4. Create model serving endpoint:

  5. Or use MLflow API:

  6. Verify: Check endpoint status:
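Steps 1-6 above can be sketched against the Databricks serving-endpoints REST API. This is a minimal sketch: the workspace URL, token, and model/endpoint names are placeholders, and the payload shape assumes a single served entity on a small, scale-to-zero instance.

```python
import json
import urllib.request

HOST = "https://example-workspace.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "dapi-REDACTED"                                  # placeholder access token

def _call(method, path, payload=None):
    """Minimal helper for calling the Databricks REST API."""
    req = urllib.request.Request(
        f"{HOST}{path}",
        data=json.dumps(payload).encode() if payload else None,
        method=method,
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def endpoint_create_payload(endpoint_name, model_name, model_version):
    """Build the create-endpoint body (step 4); small CPU instance as a starting point."""
    return {
        "name": endpoint_name,
        "config": {
            "served_entities": [{
                "entity_name": model_name,          # registered model, e.g. "main.ml.sentiment_model"
                "entity_version": str(model_version),
                "workload_size": "Small",
                "scale_to_zero_enabled": True,
            }]
        },
    }

# Usage against a live workspace (steps 4 and 6):
#   _call("POST", "/api/2.0/serving-endpoints",
#         endpoint_create_payload("sentiment-endpoint", "main.ml.sentiment_model", 1))
#   _call("GET", "/api/2.0/serving-endpoints/sentiment-endpoint")["state"]
```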

Step 2: Recreate Deleted Endpoint

  1. Check if endpoint was recently deleted (audit logs):

  2. Retrieve endpoint configuration from previous deployment or documentation

  3. Recreate endpoint using Step 1 instructions

  4. Update the pipeline to use the recreated endpoint (a pipeline restart may be required)

Step 3: Fix Endpoint Reference

  1. List all available endpoints:

  2. Check pipeline SQL for endpoint name:

  3. If name is wrong, update transformation:

  4. Commit and deploy fix:
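Steps 2-3 above (finding and fixing a mistyped endpoint name in the pipeline SQL) can be sketched with a small helper. The regex assumes the pipeline passes the endpoint name as a single-quoted first argument to ai_query, and the SQL sample is illustrative:

```python
import re

def ai_query_endpoints(sql_text):
    """Return the endpoint names referenced by ai_query(...) calls in a SQL string (step 2)."""
    return re.findall(r"ai_query\(\s*'([^']+)'", sql_text)

def fix_endpoint_reference(sql_text, wrong, correct):
    """Rewrite a mistyped endpoint name in the pipeline SQL (step 3)."""
    return sql_text.replace(f"ai_query('{wrong}'", f"ai_query('{correct}'")
```

Run the first helper against your transformation files, compare the names it finds with the output of the endpoint listing in step 1, then commit the corrected SQL.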

When to Escalate: If the endpoint cannot be created due to permissions or resource limits, escalate to the workspace admin


Issue 2: Model Returns Null Results

Symptoms:

  • ai_query completes without error

  • All results are NULL

  • Pipeline succeeds but no predictions

  • Empty sentiment_result columns

Detection:

Decision Tree:

Step 1: Check Endpoint Readiness

  1. Get endpoint status:

  2. Check state field:

    • READY → Endpoint is ready, proceed to Step 2

    • UPDATING → Wait for update to complete (5-10 minutes)

    • UPDATE_FAILED → Check error message and recreate

  3. If UPDATING, monitor progress:

  4. If stuck in UPDATING for >15 minutes, restart endpoint:
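The readiness check above can be sketched as a small state classifier plus a polling loop. The `state` field names (`ready`, `config_update`) follow the shape returned by `GET /api/2.0/serving-endpoints/{name}`; fetching the state from the API is left to the caller:

```python
import time

def classify_state(state):
    """Map an endpoint 'state' object to the runbook's decision (step 2)."""
    if state.get("config_update") == "UPDATE_FAILED":
        return "recreate"   # check the error message and recreate the endpoint
    if state.get("config_update") == "IN_PROGRESS":
        return "wait"       # UPDATING: allow 5-10 minutes
    if state.get("ready") == "READY":
        return "proceed"    # move on to Step 2 (input validation)
    return "wait"

def wait_until_ready(get_state, timeout_s=15 * 60, poll_s=30):
    """Poll until READY; per step 4, give up after 15 minutes and restart the endpoint."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if classify_state(get_state()) == "proceed":
            return True
        time.sleep(poll_s)
    return False  # stuck in UPDATING -> restart the endpoint
```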

Step 2: Validate Input Format

  1. Test endpoint with sample input:

  2. Check response format:

    • If error: Fix input schema

    • If success: Input format is correct, check pipeline

  3. Verify ai_query call matches model signature:

  4. If input needs transformation:

Step 3: Check Model Signature

  1. Get model details:

  2. Verify signature matches ai_query usage:

  3. If signature mismatch, re-register model with correct signature:

  4. Update endpoint to use new model version:

When to Escalate: If the model consistently returns nulls despite correct input, escalate to ML engineering


Issue 3: Schema Parse Errors

Symptoms:

Root Cause:

  • Model returns inconsistent field names (dynamic keys)

  • Model returns raw data types (floats) instead of strings

  • Model output doesn't match registered signature

Decision Tree:

Step 1: Fix Model Dynamic Keys

  1. Identify dynamic keys in model output:

  2. Update model to return fixed schema:

  3. Re-register model:

  4. Update endpoint to new version
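Step 2 above (fixing dynamic keys) amounts to collapsing the variable key into a fixed schema inside the model's predict wrapper. The raw output shape here, with the label baked into the key name, is an illustrative assumption:

```python
def to_fixed_schema(raw):
    """Collapse a dynamic-key model output into a fixed {label, score} schema.
    Example raw output: {"sentiment_positive": 0.91} -- the label is hidden in
    the key, which breaks schema parsing downstream."""
    (key, score), = raw.items()            # expects exactly one dynamic key
    label = key.removeprefix("sentiment_")
    return {"label": str(label), "score": float(score)}
```

After wrapping the model so every prediction passes through this normalization, re-register it and update the endpoint as in steps 3-4.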

Step 2: Use Two-Stage Pipeline Pattern

  1. Stage 1: Get raw AI results (as string):

  2. Stage 2: Parse and cast fields:

  3. Add error handling table:

  4. Deploy updated pipeline
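The parsing stage of the two-stage pattern can be sketched in plain Python. In the real pipeline both stages are tables and the parsing is SQL, but the logic is the same: keep the raw ai_query result as a string, then parse and cast it, routing failures to an error table (step 3) instead of failing the whole schema. The `label`/`score` field names are illustrative:

```python
import json

def parse_stage(raw_rows):
    """Stage 2: raw ai_query strings in, typed rows out; unparseable rows
    are diverted to an error list rather than poisoning the target schema."""
    parsed, errors = [], []
    for row_id, raw in raw_rows:
        try:
            obj = json.loads(raw)
            parsed.append({"id": row_id,
                           "label": str(obj["label"]),
                           "score": float(obj["score"])})
        except (ValueError, KeyError, TypeError) as exc:
            errors.append({"id": row_id, "raw": raw, "error": str(exc)})
    return parsed, errors
```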

Step 3: Re-register Model with Golden Example

  1. Create golden example (perfect output):

  2. Use golden example for signature:

  3. Re-register and update endpoint
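A golden example is one known-perfect input/output pair, used both to pin the signature at registration time and to reject drifting outputs. The field names and values below are illustrative; the mlflow calls in the comments assume mlflow is installed:

```python
# Golden example: one perfect input/output pair (step 1). Align the field
# names and types with what your pipeline expects.
GOLDEN_INPUT = {"text": "The service was excellent"}
GOLDEN_OUTPUT = {"label": "positive", "score": 0.97}

def validate_against_golden(output):
    """Reject outputs whose field names or types drift from the golden example."""
    if set(output) != set(GOLDEN_OUTPUT):
        return False
    return all(isinstance(output[k], type(GOLDEN_OUTPUT[k])) for k in GOLDEN_OUTPUT)

# Using the golden example for the signature (step 2), assuming mlflow:
#   from mlflow.models import infer_signature
#   signature = infer_signature([GOLDEN_INPUT], [GOLDEN_OUTPUT])
#   mlflow.pyfunc.log_model(..., signature=signature)
```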

When to Escalate: If schema issues persist, escalate to ML engineering for model redesign


Issue 4: High Latency / Timeout

Symptoms:

  • ai_query requests taking >120 seconds

  • Timeout errors in pipeline logs

  • Model endpoint under heavy load

  • p95 latency spikes in monitoring

Detection:

Decision Tree:

Step 1: Scale Up Endpoint

  1. Check current endpoint size:

  2. Increase workload size:

  3. Wait for endpoint update (5-10 minutes)

  4. Verify: Monitor latency improvements
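Step 2 above (bumping the workload size) can be sketched as building the body for `PUT /api/2.0/serving-endpoints/{name}/config`. The helper keeps the currently served model and only changes `workload_size`; the size names follow the serving config's Small/Medium/Large tiers:

```python
WORKLOAD_SIZES = ["Small", "Medium", "Large"]

def scale_up_payload(current_config, new_size):
    """Build the update-config body: same served entity, larger workload size."""
    if new_size not in WORKLOAD_SIZES:
        raise ValueError(f"unknown workload size: {new_size}")
    entity = dict(current_config["served_entities"][0])  # copy; don't mutate input
    entity["workload_size"] = new_size
    return {"served_entities": [entity]}
```

Fetch `current_config` from the endpoint's GET response, PUT the returned body, then wait out the 5-10 minute update before re-checking latency.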

Step 2: Optimize Model Performance

  1. Profile model inference time:

  2. If slow, optimize model:

    • Reduce model size (quantization)

    • Use faster model architecture

    • Cache embeddings or features

    • Batch predictions efficiently

  3. Re-train and register optimized model

  4. Update endpoint to new version
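Step 1 above (profiling inference time) can be sketched with a simple timing harness around any locally loaded model's predict callable. Reporting p50/p95 mirrors the latency metrics in monitoring:

```python
import time

def profile_latency(predict, sample, runs=50):
    """Time repeated single-record predictions; return p50/p95 in seconds.
    `predict` is any callable, e.g. a locally loaded model's predict function."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append(time.perf_counter() - start)
    timings.sort()
    return {"p50": timings[len(timings) // 2],
            "p95": timings[int(len(timings) * 0.95)]}
```

If per-record latency is already high locally, scaling the endpoint won't help much; optimize the model itself (quantization, architecture, caching, batching).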

Step 3: Reduce Pipeline Concurrency

  1. Update pipeline configuration:

  2. Redeploy pipeline with new settings

  3. Monitor request rate to endpoint
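While the pipeline configuration change rolls out, the same effect can be approximated client-side with a simple rate cap on scoring requests. This is a generic throttle sketch, not a Databricks API:

```python
import time

class Throttle:
    """Cap outbound request rate while the endpoint recovers; the pipeline-level
    analogue is lowering worker count / autoscale limits in the settings (step 1)."""
    def __init__(self, max_per_sec):
        self.min_interval = 1.0 / max_per_sec
        self._last = 0.0

    def wait(self):
        """Block just long enough to keep under the configured rate."""
        now = time.monotonic()
        delay = self.min_interval - (now - self._last)
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()
```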

When to Escalate: If latency issues persist after scaling, escalate for infrastructure review


Issue 5: Model Performance Degradation

Symptoms:

  • Model accuracy dropped in production

  • Increased error rate

  • User complaints about predictions

  • Data drift detected

Detection:

Decision Tree:

Step 1: Rollback to Previous Model Version

  1. Identify current champion version:

  2. Find previous working version (check tags for performance metrics)

  3. Update endpoint to previous version:

  4. Update serving endpoint:

  5. Verify: Monitor accuracy returns to baseline

Timeline: 5-10 minutes
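Step 2 above (finding the previous working version) can be sketched as a tag-driven lookup; the `validated` tag name is an assumption, so match it to whatever your registry tags performance-checked versions with. The alias update in the comments assumes MLflow 2.3+:

```python
def previous_good_version(versions, current):
    """Pick the newest version older than the current champion that carries a
    validation tag. `versions`: list of {"version": int, "tags": {...}} dicts."""
    candidates = [v for v in versions
                  if v["version"] < current
                  and v.get("tags", {}).get("validated") == "true"]  # assumed tag name
    if not candidates:
        raise LookupError("no validated earlier version to roll back to")
    return max(candidates, key=lambda v: v["version"])["version"]

# Repointing the champion alias (steps 3-4), assuming mlflow >= 2.3:
#   from mlflow import MlflowClient
#   MlflowClient().set_registered_model_alias("main.ml.sentiment_model", "champion", rollback_to)
# then update the serving endpoint config to serve that version.
```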

Step 2: Retrain Model on Recent Data

  1. Trigger model training in staging:

  2. Wait for training completion (30-60 minutes typically)

  3. Validate new model in staging:

  4. If accuracy improved, promote to production:

    • Follow CI/CD promotion process

    • Or manually promote staging model to prod

  5. Monitor production performance for 24-48 hours

Timeline: 2-4 hours (including validation)

Step 3: A/B Test New Model Version

  1. Deploy new model version alongside current:

  2. Monitor both versions for 24 hours:

  3. If new version better, gradually increase traffic:

    • 10% → 25% → 50% → 100%

  4. Once at 100%, update champion alias

Timeline: 2-3 days (for proper validation)
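The side-by-side deployment and traffic split (steps 1 and 3) can be sketched as an endpoint config with two served entities and a `traffic_config`. The route naming convention (`<model>-<version>`) is an assumption to check against how your endpoint actually names its served models:

```python
def ab_config(model_name, champion_version, challenger_version, challenger_pct):
    """Serve two model versions side by side and split traffic between them."""
    if not 0 <= challenger_pct <= 100:
        raise ValueError("challenger_pct must be 0-100")

    def entity(version):
        return {"entity_name": model_name, "entity_version": str(version),
                "workload_size": "Small", "scale_to_zero_enabled": False}

    short = model_name.split(".")[-1]  # assumed served-model naming scheme
    return {
        "served_entities": [entity(champion_version), entity(challenger_version)],
        "traffic_config": {"routes": [
            {"served_model_name": f"{short}-{champion_version}",
             "traffic_percentage": 100 - challenger_pct},
            {"served_model_name": f"{short}-{challenger_version}",
             "traffic_percentage": challenger_pct},
        ]},
    }
```

Start with `challenger_pct=10` and walk it through the 10% → 25% → 50% → 100% ramp as the metrics hold up.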

When to Escalate: If retraining doesn't improve performance, escalate to data science team


Schema Compatibility Checks

Before deploying model updates, verify schema compatibility:

Check 1: Model Input Schema

Check 2: Model Output Schema

Check 3: Pipeline Compatibility
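Checks 1-3 can be sketched as a schema diff between the current and candidate model versions: the new version must not require inputs the pipeline doesn't supply, and must still produce every output field (with unchanged types) that the pipeline reads. The `{name: type}` schema shape is an assumption:

```python
def compatible(current_schema, new_schema):
    """Compare {"inputs": {name: type}, "outputs": {name: type}} schemas."""
    extra_required = set(new_schema["inputs"]) - set(current_schema["inputs"])
    missing = set(current_schema["outputs"]) - set(new_schema["outputs"])
    changed = {n for n in set(current_schema["outputs"]) & set(new_schema["outputs"])
               if current_schema["outputs"][n] != new_schema["outputs"][n]}
    return {"input_ok": not extra_required,
            "output_ok": not (missing or changed),
            "new_required_inputs": sorted(extra_required),
            "output_problems": sorted(missing | changed)}
```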


Performance Optimization

When to Retrain vs Rollback

Rollback if:

  • Recent model update caused immediate degradation

  • Previous version had good performance

  • Need quick fix (<15 minutes)

  • Data distribution hasn't changed

Retrain if:

  • Gradual performance degradation over weeks

  • Data drift detected

  • New data patterns emerged

  • Feature distribution changed

Model Serving Best Practices

  1. Use appropriate workload size:

    • Small: <100 req/min, development

    • Medium: 100-1000 req/min, production

    • Large: >1000 req/min, high-scale production

  2. Enable scale-to-zero for dev:

  3. Keep production warm:

  4. Set appropriate timeouts:
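Practices 1-3 can be summarized in one config helper: size by environment, and enable scale-to-zero only in dev so production stays warm and avoids cold-start latency. The sizes and the production choice of "Medium" are starting-point assumptions, not a sizing rule:

```python
def served_entity(model_name, version, *, production):
    """Served-entity config fragment: warm Medium for prod, scale-to-zero Small for dev."""
    return {
        "entity_name": model_name,
        "entity_version": str(version),
        "workload_size": "Medium" if production else "Small",
        "scale_to_zero_enabled": not production,   # practice 2 vs practice 3
    }

# Practice 4 (timeouts) lives on the client side -- e.g. pass an explicit
# timeout to whatever HTTP client issues the scoring request.
```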


Verification Checklist

After resolving model serving issues:

Final Verification:


Last updated