Troubleshooting Guide

Overview

This guide provides quick-reference solutions for common issues in the ML Pipelines repository. Issues are organized by category with symptoms, causes, and step-by-step solutions.

For Developers: This guide focuses on operational troubleshooting. For in-depth debugging techniques, see the Debugging Guide.

Table of Contents

  • Pipeline Failures

  • Model Serving Issues

  • Permission Errors

  • Authentication Failures

  • S3 Access Issues

  • Schema Conflicts

  • Deployment Failures

  • GitHub Actions Failures

  • How to Read Logs

  • Escalation Procedures

  • Quick Reference

Pipeline Failures

Pipeline Stuck in "RUNNING" State

Symptoms:

  • Pipeline shows "RUNNING" for hours

  • No progress in event log

  • Zero rows scanned/written

Causes:

  1. Schema conflict preventing writes

  2. Streaming configuration too aggressive

  3. Checkpoint state corruption

Solutions:
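A possible recovery sequence, sketched with the Databricks CLI (the pipeline ID and table names are placeholders; verify command availability against your CLI version):

```bash
# Stop the stuck update (pipeline ID is a placeholder)
databricks pipelines stop 1234-abcd-5678

# If a schema conflict is the cause, replace the table definition in the
# pipeline source rather than evolving it in place, e.g. in SQL:
#   CREATE OR REPLACE TABLE my_table AS SELECT ...

# Restart with a full refresh to rebuild streaming state from the source
databricks pipelines start-update 1234-abcd-5678 --full-refresh
```

If the pipeline sticks again immediately, check the event log for schema or checkpoint errors before retrying.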

Prevention:

  • Always use CREATE OR REPLACE for schema changes

  • Set appropriate streaming trigger intervals

  • Monitor event logs regularly


AI Query Timeout Errors

Symptoms:

  • ai_query calls fail with timeout errors

  • Pipeline stages that call the model run far longer than expected

Causes:

  • Model endpoint under heavy load

  • Large batch sizes

  • Complex model inference

Solutions:

Alternative: Use failOnError => false in the ai_query call:
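For example (endpoint and column names are placeholders; with failOnError => false, ai_query surfaces per-row errors in its return value instead of failing the whole query):

```sql
SELECT
  ai_query(
    'my-endpoint',          -- placeholder endpoint name
    prompt_text,
    failOnError => false    -- report errors per row rather than failing the query
  ) AS response
FROM source_table;
```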


Schema Parse Errors (ai_query)

Symptoms:

  • Queries fail with AI_FUNCTION_MODEL_SCHEMA_PARSE_ERROR

Causes:

  • Model returns inconsistent schema

  • Dynamic keys in model output

  • Incorrect data types (raw floats instead of strings)

Solutions:

See Model Deployment Guide for comprehensive fix.

Quick Fix:

  1. Ensure model returns ALL fields as strings

  2. Use golden example for signature inference

  3. Implement two-stage pipeline pattern

Model Fix:
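One way to satisfy step 1, sketched as a hypothetical post-processing helper inside the model's predict() (names are illustrative, not the repository's actual code):

```python
# Hypothetical post-processing step for a model's predict():
# coercing every output field to a string gives ai_query a stable,
# string-only schema to parse, regardless of the types the model produces.
def stringify_outputs(record):
    """Return a copy of the model output with every value cast to str.

    None becomes the empty string so every field has a consistent type;
    the pipeline can re-cast to numerics in a later stage.
    """
    return {key: "" if value is None else str(value) for key, value in record.items()}

# Example: raw model output mixing floats, ints, and None
raw = {"score": 0.93, "label": "positive", "rank": 1, "note": None}
print(stringify_outputs(raw))
# → {'score': '0.93', 'label': 'positive', 'rank': '1', 'note': ''}
```

Logging the model with a signature inferred from a golden example whose fields are already all strings (step 2) then locks that schema in.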

Pipeline Fix:
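A sketch of the two-stage pattern from step 3 (table, column, and endpoint names are placeholders): stage one stores the raw model response as a string; stage two parses and casts it separately, so a bad response cannot fail the model-call step.

```sql
-- Stage 1: capture the raw response without parsing it
CREATE OR REPLACE TABLE stage1_raw AS
SELECT id,
       ai_query('my-endpoint', prompt_text) AS raw_response
FROM source_table;

-- Stage 2: parse and cast in a separate step; try_cast tolerates bad rows
CREATE OR REPLACE TABLE stage2_parsed AS
SELECT id,
       try_cast(raw_response:score AS DECIMAL(10, 2)) AS score
FROM stage1_raw;
```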


Zero Rows Scanned/Written

Symptoms:

  • Pipeline completes but shows 0 rows scanned

  • Tables exist but are empty

Causes:

  1. Schema conflict (most common)

  2. No new data in source

  3. Filter conditions too restrictive

  4. Volume path incorrect

Solutions:
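Diagnostic checks for causes 2–4, assuming the source is a Unity Catalog volume (catalog, schema, and path names are placeholders):

```sql
-- 1. Confirm the source path exists and actually contains files
LIST '/Volumes/my_catalog/my_schema/my_volume/landing/';

-- 2. Confirm the source returns rows before any filters are applied
SELECT count(*) FROM my_catalog.my_schema.source_table;

-- 3. Re-run the pipeline's filter conditions in isolation to see
--    whether they eliminate every row
SELECT count(*)
FROM my_catalog.my_schema.source_table
WHERE event_date >= current_date() - INTERVAL 7 DAYS;
```

If all three look healthy, suspect a schema conflict (cause 1) and check the event log for INCOMPATIBLE errors.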


Expectation Failures

Symptoms:

  • EXPECTATION events with failed records in the pipeline event log

  • Rows dropped or the pipeline halted, depending on the expectation action

Causes:

  • Data quality issues in source

  • Invalid model outputs

  • Schema evolution introducing nulls

Solutions:
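To find which expectation is failing and how often, a query against the DLT event log can help (the pipeline ID is a placeholder; event_log() is the Delta Live Tables event-log table function — verify the details path against your runtime's event schema):

```sql
SELECT timestamp,
       details:flow_progress.data_quality.expectations AS expectations
FROM event_log('1234-abcd-5678')
WHERE event_type = 'flow_progress'
ORDER BY timestamp DESC
LIMIT 20;
```

From there, decide whether to fix the source data, relax the expectation, or change its action (warn vs. drop vs. fail).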

Model Serving Issues

Model Endpoint Not Found

Symptoms:

  • ai_query fails because the named serving endpoint cannot be found

Causes:

  • Model not registered

  • Endpoint not created

  • Wrong catalog/model name

Solutions:
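A quick check with the Databricks CLI (the endpoint name is a placeholder):

```bash
# List all serving endpoints visible to you
databricks serving-endpoints list

# Inspect the endpoint the pipeline references
databricks serving-endpoints get my-endpoint
```

Compare the endpoint name used in ai_query against this list, and confirm the registered model's full catalog.schema.model name matches what was deployed.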


Model Returns Null Results

Symptoms:

  • ai_query returns NULL for all rows

  • No errors in logs

Causes:

  • Model endpoint not ready

  • Invalid input format

  • Model signature mismatch

Solutions:
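A single-row smoke test isolates the endpoint from the pipeline (endpoint name and input are placeholders):

```sql
-- If this also returns NULL, the problem is the endpoint or signature,
-- not the pipeline's data
SELECT ai_query('my-endpoint', 'test input') AS response;
```

If the smoke test returns NULL too, check that the endpoint is in a ready state and that the input format matches the model signature; if it succeeds, inspect the pipeline's input column for nulls or malformed values.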

Permission Errors

Catalog Permission Denied

Symptoms:

  • Queries or deployments fail with "Permission denied on catalog" errors

Causes:

  • Service principal not granted catalog access

  • Wrong service principal used

  • Catalog ownership changed

Solutions:
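Typical Unity Catalog grants for the deployment identity, sketched in SQL (catalog, schema, and principal names are placeholders; run as a catalog owner or admin):

```sql
-- Minimum grants for a service principal that reads and writes tables
GRANT USE CATALOG ON CATALOG ml_catalog TO `my-service-principal`;
GRANT USE SCHEMA ON SCHEMA ml_catalog.pipelines TO `my-service-principal`;
GRANT CREATE TABLE, SELECT, MODIFY
ON SCHEMA ml_catalog.pipelines TO `my-service-principal`;
```

Also confirm the workflow is actually running as the service principal you granted, not a different one.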


Cannot Create Sandbox Catalog

Symptoms:

  • CREATE CATALOG fails with a permission error

Causes:

  • Missing metastore-level permissions

  • Not in developer group

Solutions:
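The missing grant is at the metastore level, sketched below (must be run by a metastore admin; the group name is a placeholder):

```sql
-- Allow members of the developer group to create their own catalogs
GRANT CREATE CATALOG ON METASTORE TO `developers`;
```

If the grant already exists, confirm your user is actually a member of the developer group.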


Volume Access Denied

Symptoms:

  • Reads or writes against the volume path fail with access denied errors

Causes:

  • Volume not created

  • Missing S3 bucket permissions

  • External location not configured

Solutions:
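If the volume is missing or ungranted, a sketch of the fix (catalog, schema, volume, bucket path, and principal are placeholders; the S3 path must be covered by a configured external location):

```sql
-- Create the volume if it does not exist
CREATE EXTERNAL VOLUME ml_catalog.pipelines.landing
LOCATION 's3://my-bucket/landing';

-- Grant access to the principal running the pipeline
GRANT READ VOLUME, WRITE VOLUME
ON VOLUME ml_catalog.pipelines.landing TO `my-service-principal`;
```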

Authentication Failures

OIDC Authentication Failed

Symptoms:

  • The GitHub Actions authentication step fails before deployment starts

Causes:

  • Service principal OIDC not configured

  • GitHub environment mismatch

  • Client ID incorrect

Solutions:
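A hedged sketch of the workflow pieces OIDC federation needs (environment and variable names are placeholders; the exact setup depends on how the service principal's federation policy is configured):

```yaml
permissions:
  id-token: write   # required for GitHub to mint an OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: dev   # must match the environment named in the federation policy
    env:
      # Application (client) ID of the Databricks service principal
      DATABRICKS_CLIENT_ID: ${{ vars.DATABRICKS_CLIENT_ID }}
```

Mismatches between the GitHub environment in the workflow and the one in the federation policy, or a wrong client ID, are the usual culprits.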


OAuth Token Expired

Symptoms:

  • Local CLI commands fail with a token expired error

Causes:

  • Local OAuth token expired (90 days)

  • Profile not configured

Solutions:
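Re-authenticating with the CLI, as sketched below (host and profile name are placeholders):

```bash
# Opens a browser to complete the OAuth flow
databricks auth login --host https://my-workspace.cloud.databricks.com --profile DEFAULT

# Confirm the profile now authenticates
databricks auth describe --profile DEFAULT
```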

S3 Access Issues

S3 Bucket Not Found

Symptoms:

  • Deployment or pipeline runs fail reporting that the S3 bucket does not exist

Causes:

  • Bucket name typo in databricks.yml

  • AWS profile not configured

  • Missing bucket permissions

Solutions:
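Checks from a local shell (bucket and profile names are placeholders):

```bash
# Confirm the bucket exists under the AWS profile the deployment uses
aws s3 ls s3://my-ml-pipelines-bucket/ --profile my-aws-profile

# Compare against the bucket name configured in databricks.yml
grep -n 'bucket' databricks.yml
```

A mismatch between the two usually means a typo in databricks.yml or the wrong AWS profile.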


S3 Access Denied

Symptoms:

  • Reads or writes to S3 fail with Access Denied errors

Causes:

  • IAM role missing permissions

  • Bucket policy restrictive

  • External location misconfigured

Solutions:
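From the Unity Catalog side, a sketch of the checks and grants (location and principal names are placeholders; IAM role and bucket policy fixes happen in AWS):

```sql
-- Check which external location covers the path and who owns it
SHOW EXTERNAL LOCATIONS;

-- Grant file access on the relevant location
GRANT READ FILES, WRITE FILES
ON EXTERNAL LOCATION `ml_pipelines_landing` TO `my-service-principal`;
```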

Schema Conflicts

Incompatible Schema Change

Symptoms:

  • Pipeline fails with INCOMPATIBLE_VIEW_SCHEMA_CHANGE

Causes:

  • Column added/removed without DROP/CREATE

  • Column type changed

  • Column renamed

Solutions:
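The fix the prevention note points at, sketched with placeholder names: redefine the table with the new schema instead of evolving it in place.

```sql
-- Replaces the table definition atomically, avoiding incompatible
-- in-place schema evolution
CREATE OR REPLACE TABLE ml_catalog.pipelines.features AS
SELECT id,
       amount,
       cast(score AS DECIMAL(10, 2)) AS score   -- added or retyped column
FROM ml_catalog.pipelines.source;
```

After the replace, trigger a full refresh so downstream tables rebuild against the new schema.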


Column Type Mismatch

Symptoms:

  • Casts fail, or numeric columns come back NULL after ai_query

Causes:

  • ai_query returns strings but schema expects decimals

  • Null values in numeric columns

  • Empty strings ("") not handled

Solutions:
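A defensive casting pattern that covers all three causes (table and column names are placeholders):

```sql
-- nullif turns empty strings into NULL; try_cast returns NULL instead of
-- failing on malformed values
SELECT try_cast(nullif(score_str, '') AS DECIMAL(10, 2)) AS score
FROM ml_catalog.pipelines.stage1_raw;
```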

Deployment Failures

Bundle Validation Failed

Symptoms:

  • databricks bundle validate reports errors and deployment stops

Causes:

  • YAML syntax errors

  • Invalid variable references

  • Missing required fields

Solutions:
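Reproduce the failure locally before pushing (the target name is a placeholder):

```bash
# Validate with the same target CI uses; YAML syntax errors and bad
# variable references are reported with file and line
databricks bundle validate -t dev
```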


Wheel Build Failed

Symptoms:

  • Deployment fails while building the Python wheel

Causes:

  • Missing pyproject.toml

  • Invalid dependencies

  • UV version too old

Solutions:
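Local checks with UV (assuming the project builds from the repository root):

```bash
# Confirm UV is recent enough, then build the wheel locally
uv --version
uv build

# Inspect the build output
ls dist/
```

If the local build fails too, check that pyproject.toml exists at the expected path and that its dependency specifiers resolve.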

GitHub Actions Failures

Workflow Not Triggering

Symptoms:

  • PR merged but dev deployment didn't run

  • No workflow visible in Actions tab

Causes:

  • Workflow file syntax error

  • Path filters excluding changes

  • GitHub Actions disabled

Solutions:
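Checks with the GitHub CLI (the workflow file name is a placeholder):

```bash
# Confirm the workflow is registered and enabled
gh workflow list

# Check recent runs for the workflow
gh run list --workflow "deploy-dev.yml"
```

If the workflow does not appear at all, the YAML likely has a syntax error; if it appears but never runs, inspect its paths: filters against the files changed in the PR.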


Workflow Fails at Validation Step

Symptoms:

  • The bundle validates locally but the CI validation step fails

Causes:

  • Environment-specific configuration

  • Missing secrets

  • Different Databricks CLI version

Solutions:


Staging Deployment Not Triggering

Symptoms:

  • Dev deployment succeeds

  • Staging deployment doesn't start

Causes:

  • workflow_run condition not met

  • Staging workflow disabled

  • GitHub environment approval pending

Solutions:
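A sketch of the trigger wiring to check (the dev workflow name is a placeholder and must exactly match that workflow's name: field):

```yaml
on:
  workflow_run:
    workflows: ["Deploy to Dev"]   # must match the dev workflow's name: exactly
    types: [completed]

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    # workflow_run fires on failure too; gate on success explicitly
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
```

If the wiring is correct, check the Actions tab for a pending GitHub environment approval holding the run.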

How to Read Logs

DLT Pipeline Event Logs

Location: Databricks UI → Workflows → Delta Live Tables → [Pipeline] → Event Log

Key Fields:

  • Level: INFO, WARN, ERROR

  • Event Type: FLOW_PROGRESS, EXPECTATION, ERROR

  • Message: Detailed error description

Common Patterns:

Useful Filters:

  • Level = ERROR (show only errors)

  • Event Type = EXPECTATION (data quality issues)

  • Search: "INCOMPATIBLE" (schema conflicts)
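The filters above can be expressed as an event-log query (the pipeline ID is a placeholder):

```sql
SELECT timestamp, event_type, message
FROM event_log('1234-abcd-5678')
WHERE level = 'ERROR'
   OR message LIKE '%INCOMPATIBLE%'
ORDER BY timestamp DESC;
```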


GitHub Actions Logs

Location: Repository → Actions → [Workflow Run] → [Job] → [Step]

Key Steps to Check:

  1. Checkout - Did code checkout succeed?

  2. Install Databricks CLI - CLI version logged

  3. Validate Bundle - Validation errors appear here

  4. Deploy - Deployment progress and errors

Download Logs:
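With the GitHub CLI, for example (the run ID is a placeholder):

```bash
# Find the run ID, then inspect its log
gh run list --limit 10
gh run view 123456789 --log

# Or save the full log to a file for searching
gh run view 123456789 --log > run.log
```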


Model Serving Logs

Location: Databricks UI → Machine Learning → Model Serving → [Endpoint] → Logs

Check for:

  • Request failures (4xx, 5xx status codes)

  • Latency spikes

  • Error messages from model code

Query Logs:
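If inference tables are enabled for the endpoint, requests and responses land in a Unity Catalog table you can query. The table and column names below are assumptions based on the standard inference-table schema; check the endpoint's configuration for the actual location.

```sql
-- Recent failed requests against the endpoint's inference table
SELECT timestamp_ms, status_code, request, response
FROM ml_catalog.serving.my_endpoint_payload
WHERE status_code >= 400
ORDER BY timestamp_ms DESC
LIMIT 50;
```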

Escalation Procedures

Level 1: Self-Service (Use This Guide)

Timeframe: 0-30 minutes

Actions:

  1. Identify symptom in this guide

  2. Follow troubleshooting steps

  3. Check logs as directed

  4. Try suggested solutions


Level 2: Team Support

Timeframe: 30 minutes - 2 hours

Contact:

  • Post in #ml-pipelines with a summary of the issue

Information to Provide:

  • Pipeline or workflow run ID and a link to the failing run

  • The full error message and relevant log excerpts

  • What you have already tried from this guide


Level 3: Admin Escalation

Timeframe: 2+ hours or production down

Contact:

Use for:

  • Production outages

  • Data corruption

  • Security incidents

  • Permission issues requiring admin access


Emergency Procedures

For Production Incidents:

  1. Stop the bleeding

    • Stop the affected pipeline or disable the failing workflow

  2. Notify team

    • Post in #ml-pipelines: "Production incident - investigating"

    • Update incident channel with status

  3. Assess impact

    • How many users affected?

    • What data is impacted?

    • Can we rollback?

  4. Execute fix or rollback

  5. Post-mortem

    • Document root cause

    • Update this guide

    • Implement preventive measures

Quick Reference

Common Commands
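The commands referenced throughout this guide, collected in one place (hosts, targets, and IDs are placeholders):

```bash
# Authentication
databricks auth login --host https://my-workspace.cloud.databricks.com

# Bundle lifecycle
databricks bundle validate -t dev
databricks bundle deploy -t dev

# Pipelines
databricks pipelines list-pipelines
databricks pipelines stop 1234-abcd-5678
databricks pipelines start-update 1234-abcd-5678 --full-refresh

# Model serving
databricks serving-endpoints list
```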

Error Code Reference

| Error Code | Meaning | Quick Fix |
| --- | --- | --- |
| INCOMPATIBLE_VIEW_SCHEMA_CHANGE | Schema conflict | Use CREATE OR REPLACE |
| AI_FUNCTION_MODEL_SCHEMA_PARSE_ERROR | Model schema issues | Follow two-stage pattern |
| Permission denied on catalog | Missing grants | Grant catalog permissions |
| Token expired | Auth expired | Run databricks auth login |
| S3 Access Denied | Bucket permissions | Check external location |
| Pipeline stuck in RUNNING | Schema/checkpoint issue | Stop and use CREATE OR REPLACE |
