Pipeline Failure Runbook

Purpose

This runbook provides step-by-step procedures for handling Delta Live Tables (DLT) pipeline failures in production and non-production environments. Follow these procedures to identify, resolve, and prevent pipeline failures quickly.

Quick Assessment Checklist

When a pipeline failure is detected, work through this checklist:

  • Identify the pipeline ID, name, and environment (prod/staging/dev)

  • Check the event log for error messages

  • Check rows scanned/written for signs of progress

  • Check upstream sources and downstream dependents for impact

  • Match the symptoms to a scenario below and note the start time in the incident log

Common Failure Scenarios

Scenario 1: Pipeline Stuck in RUNNING State

Symptoms:

  • Pipeline shows "RUNNING" for hours without progress

  • No rows scanned or written

  • Event log shows no activity

  • Zero progress on data processing

Decision Tree:

Step 1: Schema Conflict Resolution

  1. Check for schema conflicts in event log:

  2. Look for error: INCOMPATIBLE_VIEW_SCHEMA_CHANGE

  3. Stop the pipeline:

  4. Edit the problematic transformation to use CREATE OR REPLACE:

  5. Redeploy the pipeline:

  6. Start fresh pipeline update:

  7. Verify: Monitor pipeline progress for 5-10 minutes
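The event-log check in step 1 can be sketched as a simple filter over event entries. Here `events` is a hypothetical list of dicts with a `message` field; the real DLT event-log payload is richer, so adapt the field access to what your workspace returns.

```python
# Sketch: scan DLT event-log entries for the schema-conflict error code.
# `events` is a simplified stand-in for the real event-log payload.

SCHEMA_CONFLICT = "INCOMPATIBLE_VIEW_SCHEMA_CHANGE"

def find_schema_conflicts(events):
    """Return events whose message mentions the schema-conflict error code."""
    return [e for e in events if SCHEMA_CONFLICT in e.get("message", "")]

events = [
    {"message": "Update started"},
    {"message": "Failed: INCOMPATIBLE_VIEW_SCHEMA_CHANGE on view silver_orders"},
]
conflicts = find_schema_conflicts(events)  # one matching event
```

Any hit here points you to the view named in the message, which is the transformation to rewrite with CREATE OR REPLACE in step 4.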

Step 2: Checkpoint Corruption

  1. Stop the pipeline:

  2. Reset the pipeline (clears checkpoint state):

  3. Start fresh update:

  4. Verify: Check event log shows "Processing batch 1"
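A checkpoint reset is a full-refresh update: the pipeline discards its state and reprocesses from scratch. The sketch below builds the request shape for the DLT REST API's `/updates` endpoint with `full_refresh` as I recall it; confirm the path and body against the current API docs before relying on it.

```python
# Sketch: build the full-refresh (checkpoint-clearing) restart request.
# Endpoint path and body shape are assumptions about the DLT Pipelines API.

def full_refresh_request(host, pipeline_id):
    """Return the URL and JSON body for a full-refresh pipeline update."""
    url = f"{host}/api/2.0/pipelines/{pipeline_id}/updates"
    body = {"full_refresh": True}
    return url, body

url, body = full_refresh_request("https://example.cloud.databricks.com", "1234-abcd")
# POST `body` to `url` with a bearer token to start the fresh update.
```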

Step 3: Resource Exhaustion

  1. Check cluster configuration in pipeline YAML:

  2. Increase cluster size if necessary (requires redeployment)

  3. Stop and restart pipeline with new configuration
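The resize in steps 1-2 amounts to editing the pipeline's cluster settings before redeploying. The sketch below assumes the settings follow the DLT pipeline JSON shape (`clusters` list, `label`, `autoscale` with `min_workers`/`max_workers`); verify the keys against your actual pipeline YAML.

```python
# Sketch: bump the DLT pipeline's default cluster size before redeploying.
# The settings shape (clusters / label / autoscale) is an assumption about
# the DLT pipeline settings format -- check it against your pipeline YAML.

def scale_default_cluster(settings, min_workers, max_workers):
    """Return a copy of pipeline settings with the default cluster resized."""
    updated = dict(settings)
    updated["clusters"] = [
        {**c, "autoscale": {"min_workers": min_workers, "max_workers": max_workers}}
        if c.get("label") == "default" else c
        for c in settings.get("clusters", [])
    ]
    return updated

settings = {"clusters": [{"label": "default",
                          "autoscale": {"min_workers": 1, "max_workers": 2}}]}
bigger = scale_default_cluster(settings, 2, 8)
```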

When to Escalate: If the pipeline is still stuck after 30 minutes, escalate to Level 2 support


Scenario 2: AI Query Timeout Errors

Symptoms:

  • Pipeline fails with timeout errors on ai_query calls

  • Records fail intermittently, often under load

  • Model serving endpoint responds slowly or not at all

Decision Tree:

Step 1: Increase Timeout Settings

  1. Stop the failing pipeline:

  2. Edit pipeline YAML configuration:

  3. Commit and deploy changes:

  4. Wait for CI/CD deployment (dev → staging → prod)

  5. Verify: Monitor pipeline runs successfully with new settings

Step 2: Add Retry Logic to Pipeline

  1. Modify transformation to handle failures gracefully:

  2. Create recovery pipeline to retry failed records:

  3. Deploy and restart pipeline
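The retry logic in step 1 can be sketched as a wrapper with exponential backoff. `call` stands in for whatever invokes the model, and `TimeoutError` is the assumed failure mode; adjust both to the client your transformation actually uses.

```python
import time

# Sketch: retry a flaky ai_query-style call with exponential backoff.
# `call` and TimeoutError are illustrative stand-ins for your real client.

def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Invoke `call`, retrying on TimeoutError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of retries: surface to the recovery pipeline
            sleep(base_delay * (2 ** attempt))

# Usage: fails twice, succeeds on the third attempt.
outcomes = iter([TimeoutError(), TimeoutError(), "ok"])
def flaky():
    result = next(outcomes)
    if isinstance(result, Exception):
        raise result
    return result

assert with_retries(flaky, sleep=lambda _: None) == "ok"
```

Records that exhaust their retries should land in the recovery pipeline from step 2 rather than failing the whole update.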

Step 3: Check Model Endpoint Health

  1. Check endpoint status:

  2. If endpoint is down or unhealthy:

  3. Test endpoint directly:

  4. If endpoint responds, restart pipeline
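The health check in step 1 reduces to inspecting the endpoint's status payload. The `state.ready` field with a `READY` value mirrors what I believe the Databricks serving-endpoints API returns; treat the exact field names as an assumption and verify against the API response in your workspace.

```python
# Sketch: decide whether a serving endpoint looks healthy from its status
# payload. The `state.ready` field name/value is an assumption about the
# serving-endpoints API response shape.

def endpoint_healthy(status):
    """True when the endpoint reports itself READY."""
    return status.get("state", {}).get("ready") == "READY"

assert endpoint_healthy({"state": {"ready": "READY"}})
assert not endpoint_healthy({"state": {"ready": "NOT_READY"}})
```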

When to Escalate: If the endpoint is consistently slow or failing, escalate for model optimization


Scenario 3: Zero Rows Scanned/Written

Symptoms:

  • Pipeline completes successfully

  • Event log shows 0 rows scanned

  • Tables exist but are empty

  • No errors in logs

Decision Tree:

Step 1: Check Source Data

  1. Verify source table has data:

  2. If source is empty, check upstream pipeline:

  3. If the upstream pipeline failed, resolve it first (apply this runbook recursively)

  4. Verify: Source table has records before proceeding

Step 2: Review Filter Conditions

  1. Check transformation SQL for restrictive filters:

  2. Temporarily disable filters to test:

  3. If rows appear, adjust filter logic

  4. Verify: Pipeline processes expected number of rows
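Step 2's "disable filters one at a time" test can be sketched as counting how many rows each predicate would pass on its own: the predicate that passes zero rows is the culprit. The rows and predicates below are illustrative stand-ins for your transformation's WHERE clauses.

```python
# Sketch: find which filter predicate is dropping all the rows.
# Rows and predicates are illustrative stand-ins for real WHERE clauses.

def survivors_per_filter(rows, predicates):
    """Map each named predicate to how many rows would pass it alone."""
    return {name: sum(1 for r in rows if pred(r))
            for name, pred in predicates.items()}

rows = [{"status": "NEW", "score": 0.2}, {"status": "NEW", "score": 0.9}]
predicates = {
    "status_filter": lambda r: r["status"] == "NEW",
    "score_filter": lambda r: r["score"] > 0.95,  # suspiciously strict
}
counts = survivors_per_filter(rows, predicates)
# score_filter passes 0 rows -> that is the filter to adjust
```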

Step 3: Verify Volume Paths

  1. Check volume exists and is accessible:

  2. If volume missing, check pipeline volume configuration:

  3. Verify S3 bucket exists and has data:

  4. If volume path wrong, update and redeploy

When to Escalate: If data exists but the pipeline is not reading it, escalate to the data engineering team


Scenario 4: Data Quality Expectation Failures

Symptoms:

  • Expectation failure warnings in the event log

  • Records dropped by expect_or_drop expectations

  • Expectation failure rate elevated, or at 100% after a model or schema change

Decision Tree:

Step 1: Log and Monitor (Low Failure Rate)

  1. Query failed records:

  2. Analyze failure patterns:

  3. If failure rate is acceptable (<5%), adjust expectation to log only:

  4. Document decision and monitor trend
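The three branches of this decision tree can be summarized as a failure-rate triage. The 5% threshold comes from step 3 above; tune it to your own quality SLOs.

```python
# Sketch: turn an expectation failure rate into the runbook's next action.
# The 5% threshold mirrors step 3 above; adjust to your own SLOs.

def quality_action(failed, total, threshold=0.05):
    if total == 0:
        return "investigate"      # zero rows is its own scenario (Scenario 3)
    rate = failed / total
    if rate < threshold:
        return "log_only"         # acceptable: switch expectation to warn-only
    if rate < 1.0:
        return "investigate"      # high failure rate: find the root cause
    return "rollback"             # complete failure: schema or model issue

assert quality_action(3, 100) == "log_only"
assert quality_action(40, 100) == "investigate"
assert quality_action(100, 100) == "rollback"
```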

Step 2: Investigate Root Cause (High Failure Rate)

  1. Check model output schema:

  2. Test model endpoint with sample data:

  3. If model returns invalid values, roll back to previous version:

  4. Restart pipeline with rolled-back model

Step 3: Schema Change or Model Issue (Complete Failure)

  1. Check if recent model update:

  2. Roll back to last known good model version

  3. If schema changed, update transformation:

  4. Redeploy pipeline with schema fixes

When to Escalate: If data quality issues persist after the model rollback, escalate to the ML engineering team


Step-by-Step Resolution Procedures

Procedure A: Emergency Pipeline Stop

Use when: Pipeline is corrupting data or causing cascading failures

Steps:

  1. Stop the problematic pipeline immediately:

  2. Verify pipeline stopped:

  3. Check downstream pipelines for impact:

  4. Stop dependent pipelines if necessary

  5. Document stop time and reason in incident log

  6. Proceed to root cause analysis
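The emergency stop in step 1 maps to a single REST call. The `/api/2.0/pipelines/{id}/stop` path is the DLT Pipelines API stop endpoint as I recall it; verify the path and authentication against your workspace's API docs before wiring this into tooling.

```python
# Sketch: emergency stop via the DLT Pipelines REST API.
# The endpoint path is an assumption -- confirm against current API docs.

def stop_update_url(host, pipeline_id):
    """Build the URL that stops the pipeline's active update."""
    return f"{host}/api/2.0/pipelines/{pipeline_id}/stop"

url = stop_update_url("https://example.cloud.databricks.com", "1234-abcd")
# POST to `url` with a bearer token, e.g.:
#   requests.post(url, headers={"Authorization": f"Bearer {token}"})
```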


Procedure B: Data Rollback and Recovery

Use when: Pipeline wrote corrupted data to tables

Steps:

  1. Identify affected tables:

  2. Check table history:

  3. Identify last good version (before corruption):

  4. Restore table to previous version:

  5. Verify data integrity:

  6. Document rollback in incident log

  7. Fix pipeline issue before restarting
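Steps 2-4 boil down to picking the newest table version written before the corruption began. `history` below is a hypothetical, simplified view of what `DESCRIBE HISTORY` returns (version plus ISO timestamp); the chosen version is the target for `RESTORE TABLE ... TO VERSION AS OF`.

```python
# Sketch: choose the restore target from (simplified) Delta table history.
# `history` stands in for DESCRIBE HISTORY output; ISO timestamps compare
# correctly as strings.

def last_good_version(history, corruption_started):
    """Newest version committed strictly before corruption began, else None."""
    candidates = [h["version"] for h in history
                  if h["timestamp"] < corruption_started]
    return max(candidates) if candidates else None

history = [
    {"version": 10, "timestamp": "2024-05-01T10:00"},
    {"version": 11, "timestamp": "2024-05-01T11:00"},
    {"version": 12, "timestamp": "2024-05-01T12:00"},  # first bad write
]
target = last_good_version(history, "2024-05-01T12:00")
# target -> 11, i.e. RESTORE TABLE my_table TO VERSION AS OF 11
```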


Procedure C: Progressive Pipeline Restart

Use when: Restarting pipeline after fixes

Steps:

  1. Verify fix deployed:

  2. Start pipeline in test mode (if available):

  3. Monitor for 10 minutes in dev:

    • Check event log for errors

    • Verify rows scanned/written

    • Check data quality metrics

  4. If dev successful, proceed to staging:

  5. Monitor staging for 20 minutes

  6. If staging successful, restart production:

  7. Monitor production closely for first hour

  8. Set alert for anomalies
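The progressive restart above is a gated promotion: each environment only restarts once the previous one has been verified. A minimal sketch of that gate, using the document's dev → staging → prod order:

```python
# Sketch: gate each restart stage on the previous environment's result,
# mirroring the dev -> staging -> prod promotion order above.

STAGES = ["dev", "staging", "prod"]

def next_stage(results):
    """Given {"dev": True, ...}, return the next environment to restart,
    or None if a stage failed (stop and fix) or all stages have passed."""
    for stage in STAGES:
        if stage not in results:
            return stage
        if not results[stage]:
            return None  # verification failed: do not promote further
    return None

assert next_stage({}) == "dev"
assert next_stage({"dev": True}) == "staging"
assert next_stage({"dev": True, "staging": True}) == "prod"
assert next_stage({"dev": False}) is None
```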


Rollback Procedures

Rollback Method 1: Git Revert (Configuration Changes)

Use when: Pipeline failure due to code or configuration changes

Steps:

  1. Identify problematic commit:

  2. Revert the commit:

  3. Push revert:

  4. Wait for CI/CD to deploy:

    • Monitor GitHub Actions

    • Verify deployment to dev → staging → prod

  5. Verify pipeline health after deployment

Timeline: 10-15 minutes (full CI/CD deployment)


Rollback Method 2: Pipeline Reset (Checkpoint Issues)

Use when: Checkpoint corruption or state issues

Steps:

  1. Stop pipeline:

  2. Reset pipeline (clears all state):

  3. Restart from beginning:

  4. Monitor full reprocessing

Timeline: Depends on data volume (hours for large datasets)

Warning: Pipeline will reprocess all data from beginning


Rollback Method 3: Table Restore (Data Corruption)

Use when: Pipeline wrote bad data to tables

See Procedure B above for detailed steps.

Timeline: 2-5 minutes per table


When to Escalate

Level 1: Self-Service (0-30 minutes)

Use: This runbook and the troubleshooting guide

Actions: Follow the scenarios and procedures above

Success Criteria: Pipeline restored within 30 minutes

Level 2: Team Support (30 minutes - 2 hours)

Escalate to: #ml-pipelines Slack channel

Provide:

  • Pipeline ID and name

  • Environment (prod/staging/dev)

  • Error messages from event log

  • Steps already attempted

  • Impact assessment

Contact: [email protected]

Level 3: Production Emergency (2+ hours or critical impact)

Escalate to: Taylor Laing ([email protected])

Use for:

  • Production data pipeline completely down

  • Data corruption affecting downstream systems

  • Multiple pipelines failing simultaneously

  • Customer-facing impact

Actions:

  1. Call emergency contact immediately

  2. Post in #incidents Slack channel

  3. Document all actions taken

  4. Prepare for post-mortem


Prevention Measures

Before Deployment

  1. Test in sandbox first:

  2. Run validation:

  3. Review pipeline SQL:

    • Check for CREATE OR REPLACE on schema changes

    • Verify filter conditions are correct

    • Ensure ai_query has proper timeout settings

  4. Check data expectations:

    • Set realistic thresholds (not too strict)

    • Use expect for warnings, expect_or_drop sparingly

During Deployment

  1. Monitor CI/CD pipeline:

    • Watch GitHub Actions progress

    • Check deployment logs for warnings

  2. Progressive validation:

    • Verify dev deployment before staging

    • Verify staging before production

  3. Staged rollout:

    • Deploy during low-traffic periods

    • Monitor for 1 hour after production deployment

After Deployment

  1. Set up monitoring alerts:

  2. Review data quality metrics:

    • Check expectation failure rates

    • Monitor row counts and processing times

  3. Document changes:

    • Update CHANGELOG.md

    • Note any configuration changes in runbook


Verification Steps

After resolving a pipeline failure, verify:

  • Event log shows no new errors

  • Rows are being scanned and written as expected

  • Data quality expectation failure rates are back to normal

  • Downstream pipelines are healthy

Final Check: Document the incident, resolution steps, and any configuration changes in the incident log


Last updated