Pipeline Failure Runbook
Purpose
This runbook provides step-by-step procedures for handling DLT (Delta Live Tables) pipeline failures in production and non-production environments. Follow these procedures to quickly identify, resolve, and prevent pipeline failures.
Quick Assessment Checklist
When a pipeline failure is detected, work through this checklist:
Is the pipeline failed, stuck in RUNNING, or completing with zero rows?
What errors does the event log show?
Which environment is affected (prod/staging/dev)?
Are upstream sources or downstream pipelines impacted?
Were there recent code, configuration, or model changes?
Common Failure Scenarios
Scenario 1: Pipeline Stuck in RUNNING State
Symptoms:
Pipeline shows "RUNNING" for hours without progress
No rows scanned or written
Event log shows no activity
Zero progress on data processing
Decision Tree:
Step 1: Schema Conflict Resolution
Check for schema conflicts in event log:
Look for the error: INCOMPATIBLE_VIEW_SCHEMA_CHANGE
Stop the pipeline:
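A sketch of these two steps, assuming the unified Databricks CLI; the pipeline ID is a placeholder:

```shell
# Hypothetical pipeline ID -- substitute your own.
PIPELINE_ID="<pipeline-id>"

# Scan recent pipeline events for the schema-conflict error.
databricks pipelines list-pipeline-events "$PIPELINE_ID" \
  | grep -i "INCOMPATIBLE_VIEW_SCHEMA_CHANGE"

# Stop the active update.
databricks pipelines stop "$PIPELINE_ID"
```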
Edit the problematic transformation to use CREATE OR REPLACE:
Redeploy the pipeline:
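The edit and redeploy might look like this sketch; the file path, view name, and columns are hypothetical:

```shell
mkdir -p transformations

# Replace the view definition so a schema change replaces the old view
# instead of conflicting with it. Names and columns are placeholders.
cat > transformations/enriched_events.sql <<'SQL'
CREATE OR REPLACE VIEW enriched_events AS
SELECT id, payload, processed_at
FROM bronze_events;
SQL

# Then redeploy, e.g.: databricks bundle deploy --target dev
```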
Start fresh pipeline update:
Verify: Monitor pipeline progress for 5-10 minutes
Step 2: Checkpoint Corruption
Stop the pipeline:
Reset the pipeline (clears checkpoint state):
Start fresh update:
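These steps as a sketch, assuming the Databricks CLI (pipeline ID is a placeholder); a full refresh clears checkpoint state and reprocesses all source data:

```shell
PIPELINE_ID="<pipeline-id>"

# Stop the current update.
databricks pipelines stop "$PIPELINE_ID"

# Start a fresh update; --full-refresh resets checkpoints and
# reprocesses everything from the beginning.
databricks pipelines start-update "$PIPELINE_ID" --full-refresh
```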
Verify: Check event log shows "Processing batch 1"
Step 3: Resource Exhaustion
Check cluster configuration in pipeline YAML:
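A sketch of the cluster section in a pipeline YAML; the file layout and worker counts are assumptions, so adjust them to your bundle:

```shell
mkdir -p resources

# Raise the autoscale ceiling; values here are illustrative.
cat > resources/pipeline_clusters.yml <<'YAML'
clusters:
  - label: default
    autoscale:
      min_workers: 2
      max_workers: 8
YAML
```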
Increase cluster size if necessary (requires redeployment)
Stop and restart pipeline with new configuration
When to Escalate: If pipeline still stuck after 30 minutes, escalate to Level 2 support
Scenario 2: AI Query Timeout Errors
Symptoms:
ai_query calls fail with timeout errors
Event log shows slow or failed requests to the model serving endpoint
Pipeline runs much longer than usual before failing
Decision Tree:
Step 1: Increase Timeout Settings
Stop the failing pipeline:
Edit pipeline YAML configuration:
Commit and deploy changes:
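A sketch of the timeout change; the configuration key is a placeholder (wire it to whatever setting your transformation actually reads), followed by the commit that triggers CI/CD:

```shell
mkdir -p resources

# Placeholder configuration key -- this is an assumption, not a
# built-in setting; your ai_query wrapper must read it.
cat > resources/pipeline_config.yml <<'YAML'
configuration:
  ai_query_timeout_seconds: "300"
YAML

# Then commit and push to trigger CI/CD, e.g.:
#   git add resources/pipeline_config.yml
#   git commit -m "Increase AI query timeout to 300s"
#   git push origin main
```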
Wait for CI/CD deployment (dev → staging → prod)
Verify: Monitor pipeline runs successfully with new settings
Step 2: Add Retry Logic to Pipeline
Modify transformation to handle failures gracefully:
Create recovery pipeline to retry failed records:
Deploy and restart pipeline
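As a conceptual sketch of the retry-with-backoff pattern (inside the pipeline you would wrap the ai_query call equivalently):

```shell
# Generic retry helper with linear backoff; the attempt count and the
# command to run are supplied by the caller.
retry() {
  local attempts=$1; shift
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "failed after $n attempts" >&2
      return 1
    fi
    sleep $((n * 2))  # back off a little more each time
    n=$((n + 1))
  done
}

# Example: retry a flaky command up to 3 times.
retry 3 true
```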
Step 3: Check Model Endpoint Health
Check endpoint status:
If endpoint is down or unhealthy:
Test endpoint directly:
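A sketch of both checks, assuming the Databricks CLI and the standard serving-endpoint invocation URL; the endpoint name, host variable, and payload are hypothetical:

```shell
ENDPOINT="my-model-endpoint"

# Check endpoint state and config.
databricks serving-endpoints get "$ENDPOINT"

# Send a minimal direct test request.
curl -s -X POST \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["health check"]}' \
  "$DATABRICKS_HOST/serving-endpoints/$ENDPOINT/invocations"
```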
If endpoint responds, restart pipeline
When to Escalate: If endpoint is consistently slow or failing, escalate for model optimization
Scenario 3: Zero Rows Scanned/Written
Symptoms:
Pipeline completes successfully
Event log shows 0 rows scanned
Tables exist but are empty
No errors in logs
Decision Tree:
Step 1: Check Source Data
Verify source table has data:
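One way to run the check from a terminal is the SQL Statement Execution API; the warehouse ID and table name are hypothetical:

```shell
# Count rows in the source table via the SQL Statement Execution API.
databricks api post /api/2.0/sql/statements --json '{
  "warehouse_id": "<warehouse-id>",
  "statement": "SELECT COUNT(*) AS row_count FROM main.bronze.source_events"
}'
```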
If source is empty, check upstream pipeline:
If the upstream pipeline failed, resolve it first (this runbook applies recursively)
Verify: Source table has records before proceeding
Step 2: Review Filter Conditions
Check transformation SQL for restrictive filters:
Temporarily disable filters to test:
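A sketch of the test query; the table, column, and date window are hypothetical. Widen or drop the suspect filter and compare row counts:

```shell
cat > /tmp/debug_filter.sql <<'SQL'
-- Original filter (suspected too restrictive):
--   WHERE event_date = current_date()

-- Test: widen the window and see whether rows appear.
SELECT COUNT(*) AS matching_rows
FROM bronze_events
WHERE event_date >= current_date() - INTERVAL 7 DAYS;
SQL
```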
If rows appear, adjust filter logic
Verify: Pipeline processes expected number of rows
Step 3: Verify Volume Paths
Check volume exists and is accessible:
If volume missing, check pipeline volume configuration:
Verify S3 bucket exists and has data:
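A sketch of both checks; the volume path and bucket name are hypothetical:

```shell
# List files under the Unity Catalog volume the pipeline reads.
databricks fs ls dbfs:/Volumes/main/raw/landing/

# Confirm the backing S3 bucket has objects (requires AWS CLI access).
aws s3 ls s3://my-raw-bucket/landing/
```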
If volume path wrong, update and redeploy
When to Escalate: If data exists but pipeline not reading it, escalate to data engineering team
Scenario 4: Data Quality Expectation Failures
Symptoms:
Event log shows data quality expectation failures
Records are dropped or flagged by expectations
Expectation failure rate is elevated or climbing
Decision Tree:
Step 1: Log and Monitor (Low Failure Rate)
Query failed records:
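A sketch of an event-log query for expectation results; the pipeline ID and JSON field paths are illustrative, so check them against your event log schema:

```shell
cat > /tmp/expectation_failures.sql <<'SQL'
SELECT
  timestamp,
  details:flow_progress.data_quality.expectations AS expectations
FROM event_log('<pipeline-id>')
WHERE event_type = 'flow_progress'
  AND details:flow_progress.data_quality IS NOT NULL
ORDER BY timestamp DESC
LIMIT 20;
SQL
```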
Analyze failure patterns:
If failure rate is acceptable (<5%), adjust expectation to log only:
Document decision and monitor trend
Step 2: Investigate Root Cause (High Failure Rate)
Check model output schema:
Test model endpoint with sample data:
If model returns invalid values, roll back to previous version:
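A sketch of the endpoint rollback, assuming the served entity is a Unity Catalog registered model; the endpoint name, model name, version, and sizing are hypothetical:

```shell
# Point the endpoint back at the previous model version.
databricks serving-endpoints update-config my-model-endpoint --json '{
  "served_entities": [{
    "entity_name": "main.models.classifier",
    "entity_version": "3",
    "workload_size": "Small",
    "scale_to_zero_enabled": true
  }]
}'
```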
Restart pipeline with rolled-back model
Step 3: Schema Change or Model Issue (Complete Failure)
Check if recent model update:
Roll back to last known good model version
If schema changed, update transformation:
Redeploy pipeline with schema fixes
When to Escalate: If data quality issues persist after model rollback, escalate to ML engineering team
Step-by-Step Resolution Procedures
Procedure A: Emergency Pipeline Stop
Use when: Pipeline is corrupting data or causing cascading failures
Steps:
Stop the problematic pipeline immediately:
Verify pipeline stopped:
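A sketch of the stop-and-verify steps, assuming the Databricks CLI (pipeline ID is a placeholder):

```shell
PIPELINE_ID="<pipeline-id>"

databricks pipelines stop "$PIPELINE_ID"

# Confirm no update is running (state should report IDLE).
databricks pipelines get "$PIPELINE_ID"
```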
Check downstream pipelines for impact:
Stop dependent pipelines if necessary
Document stop time and reason in incident log
Proceed to root cause analysis
Procedure B: Data Rollback and Recovery
Use when: Pipeline wrote corrupted data to tables
Steps:
Identify affected tables:
Check table history:
Identify last good version (before corruption):
Restore table to previous version:
Verify data integrity:
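The history/restore/verify steps as a Delta SQL sketch; the table name and version number are hypothetical:

```shell
cat > /tmp/rollback.sql <<'SQL'
-- Review recent writes and pick the last good version.
DESCRIBE HISTORY main.silver.orders LIMIT 20;

-- Restore to the version before the corruption.
RESTORE TABLE main.silver.orders TO VERSION AS OF 42;

-- Spot-check the restored data.
SELECT COUNT(*) FROM main.silver.orders;
SQL
```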
Document rollback in incident log
Fix pipeline issue before restarting
Procedure C: Progressive Pipeline Restart
Use when: Restarting pipeline after fixes
Steps:
Verify fix deployed:
Start pipeline in test mode (if available):
Monitor for 10 minutes in dev:
Check event log for errors
Verify rows scanned/written
Check data quality metrics
If dev successful, proceed to staging:
Monitor staging for 20 minutes
If staging successful, restart production:
Monitor production closely for first hour
Set alert for anomalies
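With Databricks Asset Bundles, the progressive restart above might look like this sketch; the target names and pipeline resource key are assumptions:

```shell
# Dev first; monitor before promoting.
databricks bundle deploy --target dev
databricks bundle run my_pipeline --target dev

# Promote to staging once dev is healthy.
databricks bundle deploy --target staging
databricks bundle run my_pipeline --target staging

# After staging looks healthy, restart production.
databricks bundle deploy --target prod
databricks bundle run my_pipeline --target prod
```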
Rollback Procedures
Rollback Method 1: Git Revert (Configuration Changes)
Use when: Pipeline failure due to code or configuration changes
Steps:
Identify problematic commit:
Revert the commit:
Push revert:
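A sketch of the revert flow; the commit hash is a placeholder:

```shell
git log --oneline -10          # identify the problematic commit
git revert <bad-commit-sha>    # create an inverse commit
git push origin main           # push to trigger CI/CD
```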
Wait for CI/CD to deploy:
Monitor GitHub Actions
Verify deployment to dev → staging → prod
Verify pipeline health after deployment
Timeline: 10-15 minutes (full pipeline)
Rollback Method 2: Pipeline Reset (Checkpoint Issues)
Use when: Checkpoint corruption or state issues
Steps:
Stop pipeline:
Reset pipeline (clears all state):
Restart from beginning:
Monitor full reprocessing
Timeline: Depends on data volume (hours for large datasets)
Warning: Pipeline will reprocess all data from beginning
Rollback Method 3: Table Restore (Data Corruption)
Use when: Pipeline wrote bad data to tables
See Procedure B above for detailed steps.
Timeline: 2-5 minutes per table
When to Escalate
Level 1: Self-Service (0-30 minutes)
Use: This runbook and troubleshooting guide
Actions: Follow scenarios and procedures above
Success Criteria: Pipeline restored within 30 minutes
Level 2: Team Support (30 minutes - 2 hours)
Escalate to: #ml-pipelines Slack channel
Provide:
Pipeline ID and name
Environment (prod/staging/dev)
Error messages from event log
Steps already attempted
Impact assessment
Contact: [email protected]
Level 3: Production Emergency (2+ hours or critical impact)
Escalate to: Taylor Laing ([email protected])
Use for:
Production data pipeline completely down
Data corruption affecting downstream systems
Multiple pipelines failing simultaneously
Customer-facing impact
Actions:
Call emergency contact immediately
Post in #incidents Slack channel
Document all actions taken
Prepare for post-mortem
Prevention Measures
Before Deployment
Test in sandbox first:
Run validation:
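A sketch of the validation step, assuming Databricks Asset Bundles with a dev/sandbox target:

```shell
# Validate the bundle configuration before deploying anywhere.
databricks bundle validate --target dev

# Then deploy to the sandbox/dev target and test there first.
databricks bundle deploy --target dev
```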
Review pipeline SQL:
Check for CREATE OR REPLACE on schema changes
Verify filter conditions are correct
Ensure ai_query has proper timeout settings
Check data expectations:
Set realistic thresholds (not too strict)
Use expect for warnings, expect_or_drop sparingly
During Deployment
Monitor CI/CD pipeline:
Watch GitHub Actions progress
Check deployment logs for warnings
Progressive validation:
Verify dev deployment before staging
Verify staging before production
Staged rollout:
Deploy during low-traffic periods
Monitor for 1 hour after production deployment
After Deployment
Set up monitoring alerts:
Review data quality metrics:
Check expectation failure rates
Monitor row counts and processing times
Document changes:
Update CHANGELOG.md
Note any configuration changes in runbook
Verification Steps
After resolving a pipeline failure, verify:
Pipeline update completes with no errors in the event log
Rows are being scanned and written (non-zero counts)
Data quality expectations pass at acceptable rates
Downstream tables are receiving fresh data
Final Check: Monitor the pipeline through at least one full update cycle before closing the incident
Related Documentation
Troubleshooting Guide - Detailed error messages and solutions
Deployment Guide - Deployment procedures
DLT Pipelines Guide - Pipeline development best practices
Monitoring Guide - Setting up alerts and dashboards
Model Serving Issues Runbook - Model endpoint troubleshooting
Last updated