Pipeline Failure Runbook
Purpose
This runbook provides step-by-step procedures for handling DLT (Delta Live Tables) pipeline failures in production and non-production environments. Follow these procedures to quickly identify, resolve, and prevent pipeline failures.
Quick Assessment Checklist
When a pipeline failure is detected, work through this checklist:
Is the pipeline failed, stuck in RUNNING, or completing with zero rows?
What errors does the event log show?
Which environment is affected (prod/staging/dev)?
Are upstream sources or downstream pipelines impacted?
Were there recent code, configuration, or model changes?
Common Failure Scenarios
Scenario 1: Pipeline Stuck in RUNNING State
Symptoms:
Pipeline shows "RUNNING" for hours without progress
No rows scanned or written
Event log shows no activity
Zero progress on data processing
Decision Tree:
Step 1: Schema Conflict Resolution
Check for schema conflicts in event log:
Look for the error: INCOMPATIBLE_VIEW_SCHEMA_CHANGE
Stop the pipeline:
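A sketch of these two steps, assuming the unified Databricks CLI; the pipeline ID is a placeholder:

```shell
# Hypothetical pipeline ID -- substitute your own.
PIPELINE_ID="<pipeline-id>"

# Scan recent pipeline events for the schema-conflict error.
databricks pipelines list-pipeline-events "$PIPELINE_ID" \
  | grep -i "INCOMPATIBLE_VIEW_SCHEMA_CHANGE"

# Stop the active update.
databricks pipelines stop "$PIPELINE_ID"
```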
Edit the problematic transformation to use CREATE OR REPLACE:
Redeploy the pipeline:
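The edit and redeploy might look like this sketch; the file path, view name, and columns are hypothetical:

```shell
mkdir -p transformations

# Replace the view definition so a schema change replaces the old view
# instead of conflicting with it. Names and columns are placeholders.
cat > transformations/enriched_events.sql <<'SQL'
CREATE OR REPLACE VIEW enriched_events AS
SELECT id, payload, processed_at
FROM bronze_events;
SQL

# Then redeploy, e.g.: databricks bundle deploy --target dev
```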
Start fresh pipeline update:
Verify: Monitor pipeline progress for 5-10 minutes
Step 2: Checkpoint Corruption
Stop the pipeline:
Reset the pipeline (clears checkpoint state):
Start fresh update:
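These steps as a sketch, assuming the Databricks CLI (pipeline ID is a placeholder); a full refresh clears checkpoint state and reprocesses all source data:

```shell
PIPELINE_ID="<pipeline-id>"

# Stop the current update.
databricks pipelines stop "$PIPELINE_ID"

# Start a fresh update; --full-refresh resets checkpoints and
# reprocesses everything from the beginning.
databricks pipelines start-update "$PIPELINE_ID" --full-refresh
```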
Verify: Check event log shows "Processing batch 1"
Step 3: Resource Exhaustion
Check cluster configuration in pipeline YAML:
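A sketch of the cluster section in a pipeline YAML; the file layout and worker counts are assumptions, so adjust them to your bundle:

```shell
mkdir -p resources

# Raise the autoscale ceiling; values here are illustrative.
cat > resources/pipeline_clusters.yml <<'YAML'
clusters:
  - label: default
    autoscale:
      min_workers: 2
      max_workers: 8
YAML
```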
Increase cluster size if necessary (requires redeployment)
Stop and restart pipeline with new configuration
When to Escalate: If pipeline still stuck after 30 minutes, escalate to Level 2 support
Scenario 2: AI Query Timeout Errors
Symptoms:
ai_query calls fail with timeout errors
Event log shows slow or failed requests to the model serving endpoint
Pipeline runs much longer than usual before failing
Decision Tree:
Step 1: Increase Timeout Settings
Stop the failing pipeline:
Edit pipeline YAML configuration:
Commit and deploy changes:
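A sketch of the timeout change; the configuration key is a placeholder (wire it to whatever setting your transformation actually reads), followed by the commit that triggers CI/CD:

```shell
mkdir -p resources

# Placeholder configuration key -- this is an assumption, not a
# built-in setting; your ai_query wrapper must read it.
cat > resources/pipeline_config.yml <<'YAML'
configuration:
  ai_query_timeout_seconds: "300"
YAML

# Then commit and push to trigger CI/CD, e.g.:
#   git add resources/pipeline_config.yml
#   git commit -m "Increase AI query timeout to 300s"
#   git push origin main
```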
Wait for CI/CD deployment (dev → staging → prod)
Verify: Monitor pipeline runs successfully with new settings
Step 2: Add Retry Logic to Pipeline
Modify transformation to handle failures gracefully:
Create recovery pipeline to retry failed records:
Deploy and restart pipeline
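As a conceptual sketch of the retry-with-backoff pattern (inside the pipeline you would wrap the ai_query call equivalently):

```shell
# Generic retry helper with linear backoff; the attempt count and the
# command to run are supplied by the caller.
retry() {
  local attempts=$1; shift
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "failed after $n attempts" >&2
      return 1
    fi
    sleep $((n * 2))  # back off a little more each time
    n=$((n + 1))
  done
}

# Example: retry a flaky command up to 3 times.
retry 3 true
```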
Step 3: Check Model Endpoint Health
Check endpoint status:
If endpoint is down or unhealthy:
Test endpoint directly:
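A sketch of both checks, assuming the Databricks CLI and the standard serving-endpoint invocation URL; the endpoint name, host variable, and payload are hypothetical:

```shell
ENDPOINT="my-model-endpoint"

# Check endpoint state and config.
databricks serving-endpoints get "$ENDPOINT"

# Send a minimal direct test request.
curl -s -X POST \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["health check"]}' \
  "$DATABRICKS_HOST/serving-endpoints/$ENDPOINT/invocations"
```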
If endpoint responds, restart pipeline
When to Escalate: If endpoint is consistently slow or failing, escalate for model optimization
Scenario 3: Zero Rows Scanned/Written
Symptoms:
Pipeline completes successfully
Event log shows 0 rows scanned
Tables exist but are empty
No errors in logs
Decision Tree:
Step 1: Check Source Data
Verify source table has data:
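One way to run the check from a terminal is the SQL Statement Execution API; the warehouse ID and table name are hypothetical:

```shell
# Count rows in the source table via the SQL Statement Execution API.
databricks api post /api/2.0/sql/statements --json '{
  "warehouse_id": "<warehouse-id>",
  "statement": "SELECT COUNT(*) AS row_count FROM main.bronze.source_events"
}'
```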
If source is empty, check upstream pipeline:
If the upstream pipeline failed, resolve it first (this runbook applies recursively)
Verify: Source table has records before proceeding
Step 2: Review Filter Conditions
Check transformation SQL for restrictive filters:
Temporarily disable filters to test:
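A sketch of the test query; the table, column, and date window are hypothetical. Widen or drop the suspect filter and compare row counts:

```shell
cat > /tmp/debug_filter.sql <<'SQL'
-- Original filter (suspected too restrictive):
--   WHERE event_date = current_date()

-- Test: widen the window and see whether rows appear.
SELECT COUNT(*) AS matching_rows
FROM bronze_events
WHERE event_date >= current_date() - INTERVAL 7 DAYS;
SQL
```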
If rows appear, adjust filter logic
Verify: Pipeline processes expected number of rows
Step 3: Verify Volume Paths
Check volume exists and is accessible:
If volume missing, check pipeline volume configuration:
Verify S3 bucket exists and has data:
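A sketch of both checks; the volume path and bucket name are hypothetical:

```shell
# List files under the Unity Catalog volume the pipeline reads.
databricks fs ls dbfs:/Volumes/main/raw/landing/

# Confirm the backing S3 bucket has objects (requires AWS CLI access).
aws s3 ls s3://my-raw-bucket/landing/
```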
If volume path wrong, update and redeploy
When to Escalate: If data exists but pipeline not reading it, escalate to data engineering team
Scenario 4: Data Quality Expectation Failures
Symptoms:
Event log shows data quality expectation failures
Records are dropped or flagged by expectations
Expectation failure rate is elevated or climbing
Decision Tree:
Step 1: Log and Monitor (Low Failure Rate)
Query failed records:
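A sketch of an event-log query for expectation results; the pipeline ID and JSON field paths are illustrative, so check them against your event log schema:

```shell
cat > /tmp/expectation_failures.sql <<'SQL'
SELECT
  timestamp,
  details:flow_progress.data_quality.expectations AS expectations
FROM event_log('<pipeline-id>')
WHERE event_type = 'flow_progress'
  AND details:flow_progress.data_quality IS NOT NULL
ORDER BY timestamp DESC
LIMIT 20;
SQL
```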
Analyze failure patterns:
If failure rate is acceptable (<5%), adjust expectation to log only:
Document decision and monitor trend
Step 2: Investigate Root Cause (High Failure Rate)
Check model output schema:
Test model endpoint with sample data:
If model returns invalid values, roll back to previous version:
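A sketch of the endpoint rollback, assuming the served entity is a Unity Catalog registered model; the endpoint name, model name, version, and sizing are hypothetical:

```shell
# Point the endpoint back at the previous model version.
databricks serving-endpoints update-config my-model-endpoint --json '{
  "served_entities": [{
    "entity_name": "main.models.classifier",
    "entity_version": "3",
    "workload_size": "Small",
    "scale_to_zero_enabled": true
  }]
}'
```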
Restart pipeline with rolled-back model
Step 3: Schema Change or Model Issue (Complete Failure)
Check if recent model update:
Roll back to last known good model version
If schema changed, update transformation:
Redeploy pipeline with schema fixes
When to Escalate: If data quality issues persist after model rollback, escalate to ML engineering team
Step-by-Step Resolution Procedures
Procedure A: Emergency Pipeline Stop
Use when: Pipeline is corrupting data or causing cascading failures
Steps:
Stop the problematic pipeline immediately:
Verify pipeline stopped:
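A sketch of the stop-and-verify steps, assuming the Databricks CLI (pipeline ID is a placeholder):

```shell
PIPELINE_ID="<pipeline-id>"

databricks pipelines stop "$PIPELINE_ID"

# Confirm no update is running (state should report IDLE).
databricks pipelines get "$PIPELINE_ID"
```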
Check downstream pipelines for impact:
Stop dependent pipelines if necessary
Document stop time and reason in incident log
Proceed to root cause analysis
Procedure B: Data Rollback and Recovery
Use when: Pipeline wrote corrupted data to tables
Steps:
Identify affected tables:
Check table history:
Identify last good version (before corruption):
Restore table to previous version:
Verify data integrity:
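The history/restore/verify steps as a Delta SQL sketch; the table name and version number are hypothetical:

```shell
cat > /tmp/rollback.sql <<'SQL'
-- Review recent writes and pick the last good version.
DESCRIBE HISTORY main.silver.orders LIMIT 20;

-- Restore to the version before the corruption.
RESTORE TABLE main.silver.orders TO VERSION AS OF 42;

-- Spot-check the restored data.
SELECT COUNT(*) FROM main.silver.orders;
SQL
```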
Document rollback in incident log
Fix pipeline issue before restarting
Procedure C: Progressive Pipeline Restart
Use when: Restarting pipeline after fixes
Steps:
Verify fix deployed:
Start pipeline in test mode (if available):
Monitor for 10 minutes in dev:
Check event log for errors
Verify rows scanned/written
Check data quality metrics
If dev successful, proceed to staging:
Monitor staging for 20 minutes
If staging successful, restart production:
Monitor production closely for first hour
Set alert for anomalies
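With Databricks Asset Bundles, the progressive restart above might look like this sketch; the target names and pipeline resource key are assumptions:

```shell
# Dev first; monitor before promoting.
databricks bundle deploy --target dev
databricks bundle run my_pipeline --target dev

# Promote to staging once dev is healthy.
databricks bundle deploy --target staging
databricks bundle run my_pipeline --target staging

# After staging looks healthy, restart production.
databricks bundle deploy --target prod
databricks bundle run my_pipeline --target prod
```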
Rollback Procedures
Rollback Method 1: Git Revert (Configuration Changes)
Use when: Pipeline failure due to code or configuration changes
Steps:
Identify problematic commit:
Revert the commit:
Push revert:
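A sketch of the revert flow; the commit hash is a placeholder:

```shell
git log --oneline -10          # identify the problematic commit
git revert <bad-commit-sha>    # create an inverse commit
git push origin main           # push to trigger CI/CD
```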
Wait for CI/CD to deploy:
Monitor GitHub Actions
Verify deployment to dev → staging → prod
Verify pipeline health after deployment
Timeline: 10-15 minutes (full pipeline)
Rollback Method 2: Pipeline Reset (Checkpoint Issues)
Use when: Checkpoint corruption or state issues
Steps:
Stop pipeline:
Reset pipeline (clears all state):
Restart from beginning:
Monitor full reprocessing
Timeline: Depends on data volume (hours for large datasets)
Warning: Pipeline will reprocess all data from beginning
Rollback Method 3: Table Restore (Data Corruption)
Use when: Pipeline wrote bad data to tables
See Procedure B above for detailed steps.
Timeline: 2-5 minutes per table
When to Escalate
Level 1: Self-Service (0-30 minutes)
Use: This runbook and troubleshooting guide
Actions: Follow scenarios and procedures above
Success Criteria: Pipeline restored within 30 minutes
Level 2: Team Support (30 minutes - 2 hours)
Escalate to: #ml-pipelines Slack channel
Provide:
Pipeline ID and name
Environment (prod/staging/dev)
Error messages from event log
Steps already attempted
Impact assessment
Contact: [email protected]
Level 3: Production Emergency (2+ hours or critical impact)
Escalate to: Taylor Laing ([email protected])
Use for:
Production data pipeline completely down
Data corruption affecting downstream systems
Multiple pipelines failing simultaneously
Customer-facing impact
Actions:
Call emergency contact immediately
Post in #incidents Slack channel
Document all actions taken
Prepare for post-mortem
Prevention Measures
Before Deployment
Test in sandbox first:
Run validation:
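A sketch of the validation step, assuming Databricks Asset Bundles with a dev/sandbox target:

```shell
# Validate the bundle configuration before deploying anywhere.
databricks bundle validate --target dev

# Then deploy to the sandbox/dev target and test there first.
databricks bundle deploy --target dev
```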
Review pipeline SQL:
Check for CREATE OR REPLACE on schema changes
Verify filter conditions are correct
Ensure ai_query has proper timeout settings
Check data expectations:
Set realistic thresholds (not too strict)
Use expect for warnings, expect_or_drop sparingly
During Deployment
Monitor CI/CD pipeline:
Watch GitHub Actions progress
Check deployment logs for warnings
Progressive validation:
Verify dev deployment before staging
Verify staging before production
Staged rollout:
Deploy during low-traffic periods
Monitor for 1 hour after production deployment
After Deployment
Set up monitoring alerts:
Review data quality metrics:
Check expectation failure rates
Monitor row counts and processing times
Document changes:
Update CHANGELOG.md
Note any configuration changes in runbook
Verification Steps
After resolving a pipeline failure, verify:
Pipeline update completes with no errors in the event log
Rows are being scanned and written (non-zero counts)
Data quality expectations pass at acceptable rates
Downstream tables are receiving fresh data
Final Check: Monitor the pipeline through at least one full update cycle before closing the incident
Related Documentation
Troubleshooting Guide - Detailed error messages and solutions
Deployment Guide - Deployment procedures
DLT Pipelines Guide - Pipeline development best practices
Monitoring Guide - Setting up alerts and dashboards
Model Serving Issues Runbook - Model endpoint troubleshooting
Last updated