Troubleshooting Guide

Overview

This guide provides quick-reference solutions for common issues in the ML Pipelines repository. Issues are organized by category with symptoms, causes, and step-by-step solutions.

For Developers: This guide focuses on operational troubleshooting. For in-depth debugging techniques, see the Debugging Guide.

Table of Contents

  • Pipeline Failures

  • Model Serving Issues

  • Permission Errors

  • Authentication Failures

  • S3 Access Issues

  • Schema Conflicts

  • Deployment Failures

  • GitHub Actions Failures

  • How to Read Logs

  • Escalation Procedures

  • Quick Reference

Pipeline Failures

Pipeline Stuck in "RUNNING" State

Symptoms:

  • Pipeline shows "RUNNING" for hours

  • No progress in event log

  • Zero rows scanned/written

Causes:

  1. Schema conflict preventing writes

  2. Streaming configuration too aggressive

  3. Checkpoint state corruption

Solutions:
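A possible recovery sequence, sketched with the Databricks CLI (the pipeline ID and table names are placeholders; verify command availability against your CLI version):

```bash
# Stop the stuck update (pipeline ID is a placeholder)
databricks pipelines stop 1234-abcd-5678

# If a schema conflict is the cause, replace the table definition in the
# pipeline source rather than evolving it in place, e.g. in SQL:
#   CREATE OR REPLACE TABLE my_table AS SELECT ...

# Restart with a full refresh to rebuild streaming state from the source
databricks pipelines start-update 1234-abcd-5678 --full-refresh
```

If the pipeline sticks again immediately, check the event log for schema or checkpoint errors before retrying.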

Prevention:

  • Always use CREATE OR REPLACE for schema changes

  • Set appropriate streaming trigger intervals

  • Monitor event logs regularly


AI Query Timeout Errors

Symptoms:

  • ai_query calls fail with timeout errors

  • Pipeline stages that call the model run far longer than expected

Causes:

  • Model endpoint under heavy load

  • Large batch sizes

  • Complex model inference

Solutions:

Alternative: Use failOnError => false in the ai_query call:
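For example (endpoint and column names are placeholders; with failOnError => false, ai_query surfaces per-row errors in its return value instead of failing the whole query):

```sql
SELECT
  ai_query(
    'my-endpoint',          -- placeholder endpoint name
    prompt_text,
    failOnError => false    -- report errors per row rather than failing the query
  ) AS response
FROM source_table;
```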


Schema Parse Errors (ai_query)

Symptoms:

  • Queries fail with AI_FUNCTION_MODEL_SCHEMA_PARSE_ERROR

Causes:

  • Model returns inconsistent schema

  • Dynamic keys in model output

  • Incorrect data types (raw floats instead of strings)

Solutions:

See Model Deployment Guide for comprehensive fix.

Quick Fix:

  1. Ensure model returns ALL fields as strings

  2. Use golden example for signature inference

  3. Implement two-stage pipeline pattern

Model Fix:
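One way to satisfy step 1, sketched as a hypothetical post-processing helper inside the model's predict() (names are illustrative, not the repository's actual code):

```python
# Hypothetical post-processing step for a model's predict():
# coercing every output field to a string gives ai_query a stable,
# string-only schema to parse, regardless of the types the model produces.
def stringify_outputs(record):
    """Return a copy of the model output with every value cast to str.

    None becomes the empty string so every field has a consistent type;
    the pipeline can re-cast to numerics in a later stage.
    """
    return {key: "" if value is None else str(value) for key, value in record.items()}

# Example: raw model output mixing floats, ints, and None
raw = {"score": 0.93, "label": "positive", "rank": 1, "note": None}
print(stringify_outputs(raw))
# → {'score': '0.93', 'label': 'positive', 'rank': '1', 'note': ''}
```

Logging the model with a signature inferred from a golden example whose fields are already all strings (step 2) then locks that schema in.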

Pipeline Fix:
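A sketch of the two-stage pattern from step 3 (table, column, and endpoint names are placeholders): stage one stores the raw model response as a string; stage two parses and casts it separately, so a bad response cannot fail the model-call step.

```sql
-- Stage 1: capture the raw response without parsing it
CREATE OR REPLACE TABLE stage1_raw AS
SELECT id,
       ai_query('my-endpoint', prompt_text) AS raw_response
FROM source_table;

-- Stage 2: parse and cast in a separate step; try_cast tolerates bad rows
CREATE OR REPLACE TABLE stage2_parsed AS
SELECT id,
       try_cast(raw_response:score AS DECIMAL(10, 2)) AS score
FROM stage1_raw;
```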


Zero Rows Scanned/Written

Symptoms:

  • Pipeline completes but shows 0 rows scanned

  • Tables exist but are empty

Causes:

  1. Schema conflict (most common)

  2. No new data in source

  3. Filter conditions too restrictive

  4. Volume path incorrect

Solutions:
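Diagnostic checks for causes 2–4, assuming the source is a Unity Catalog volume (catalog, schema, and path names are placeholders):

```sql
-- 1. Confirm the source path exists and actually contains files
LIST '/Volumes/my_catalog/my_schema/my_volume/landing/';

-- 2. Confirm the source returns rows before any filters are applied
SELECT count(*) FROM my_catalog.my_schema.source_table;

-- 3. Re-run the pipeline's filter conditions in isolation to see
--    whether they eliminate every row
SELECT count(*)
FROM my_catalog.my_schema.source_table
WHERE event_date >= current_date() - INTERVAL 7 DAYS;
```

If all three look healthy, suspect a schema conflict (cause 1) and check the event log for INCOMPATIBLE errors.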


Expectation Failures

Symptoms:

  • EXPECTATION events with failed records in the pipeline event log

  • Rows dropped or the pipeline halted, depending on the expectation action

Causes:

  • Data quality issues in source

  • Invalid model outputs

  • Schema evolution introducing nulls

Solutions:
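To find which expectation is failing and how often, a query against the DLT event log can help (the pipeline ID is a placeholder; event_log() is the Delta Live Tables event-log table function — verify the details path against your runtime's event schema):

```sql
SELECT timestamp,
       details:flow_progress.data_quality.expectations AS expectations
FROM event_log('1234-abcd-5678')
WHERE event_type = 'flow_progress'
ORDER BY timestamp DESC
LIMIT 20;
```

From there, decide whether to fix the source data, relax the expectation, or change its action (warn vs. drop vs. fail).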

Model Serving Issues

Model Endpoint Not Found

Symptoms:

  • ai_query fails because the named serving endpoint cannot be found

Causes:

  • Model not registered

  • Endpoint not created

  • Wrong catalog/model name

Solutions:
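A quick check with the Databricks CLI (the endpoint name is a placeholder):

```bash
# List all serving endpoints visible to you
databricks serving-endpoints list

# Inspect the endpoint the pipeline references
databricks serving-endpoints get my-endpoint
```

Compare the endpoint name used in ai_query against this list, and confirm the registered model's full catalog.schema.model name matches what was deployed.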


Model Returns Null Results

Symptoms:

  • ai_query returns NULL for all rows

  • No errors in logs

Causes:

  • Model endpoint not ready

  • Invalid input format

  • Model signature mismatch

Solutions:
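A single-row smoke test isolates the endpoint from the pipeline (endpoint name and input are placeholders):

```sql
-- If this also returns NULL, the problem is the endpoint or signature,
-- not the pipeline's data
SELECT ai_query('my-endpoint', 'test input') AS response;
```

If the smoke test returns NULL too, check that the endpoint is in a ready state and that the input format matches the model signature; if it succeeds, inspect the pipeline's input column for nulls or malformed values.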

Permission Errors

Catalog Permission Denied

Symptoms:

  • Queries or deployments fail with "Permission denied on catalog" errors

Causes:

  • Service principal not granted catalog access

  • Wrong service principal used

  • Catalog ownership changed

Solutions:
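Typical Unity Catalog grants for the deployment identity, sketched in SQL (catalog, schema, and principal names are placeholders; run as a catalog owner or admin):

```sql
-- Minimum grants for a service principal that reads and writes tables
GRANT USE CATALOG ON CATALOG ml_catalog TO `my-service-principal`;
GRANT USE SCHEMA ON SCHEMA ml_catalog.pipelines TO `my-service-principal`;
GRANT CREATE TABLE, SELECT, MODIFY
ON SCHEMA ml_catalog.pipelines TO `my-service-principal`;
```

Also confirm the workflow is actually running as the service principal you granted, not a different one.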


Cannot Create Sandbox Catalog

Symptoms:

  • CREATE CATALOG fails with a permission error

Causes:

  • Missing metastore-level permissions

  • Not in developer group

Solutions:
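The missing grant is at the metastore level, sketched below (must be run by a metastore admin; the group name is a placeholder):

```sql
-- Allow members of the developer group to create their own catalogs
GRANT CREATE CATALOG ON METASTORE TO `developers`;
```

If the grant already exists, confirm your user is actually a member of the developer group.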


Volume Access Denied

Symptoms:

  • Reads or writes against the volume path fail with access denied errors

Causes:

  • Volume not created

  • Missing S3 bucket permissions

  • External location not configured

Solutions:
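If the volume is missing or ungranted, a sketch of the fix (catalog, schema, volume, bucket path, and principal are placeholders; the S3 path must be covered by a configured external location):

```sql
-- Create the volume if it does not exist
CREATE EXTERNAL VOLUME ml_catalog.pipelines.landing
LOCATION 's3://my-bucket/landing';

-- Grant access to the principal running the pipeline
GRANT READ VOLUME, WRITE VOLUME
ON VOLUME ml_catalog.pipelines.landing TO `my-service-principal`;
```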

Authentication Failures

OIDC Authentication Failed

Symptoms:

  • The GitHub Actions authentication step fails before deployment starts

Causes:

  • Service principal OIDC not configured

  • GitHub environment mismatch

  • Client ID incorrect

Solutions:
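A hedged sketch of the workflow pieces OIDC federation needs (environment and variable names are placeholders; the exact setup depends on how the service principal's federation policy is configured):

```yaml
permissions:
  id-token: write   # required for GitHub to mint an OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: dev   # must match the environment named in the federation policy
    env:
      # Application (client) ID of the Databricks service principal
      DATABRICKS_CLIENT_ID: ${{ vars.DATABRICKS_CLIENT_ID }}
```

Mismatches between the GitHub environment in the workflow and the one in the federation policy, or a wrong client ID, are the usual culprits.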


OAuth Token Expired

Symptoms:

  • Local CLI commands fail with a token expired error

Causes:

  • Local OAuth token expired (90 days)

  • Profile not configured

Solutions:
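Re-authenticating with the CLI, as sketched below (host and profile name are placeholders):

```bash
# Opens a browser to complete the OAuth flow
databricks auth login --host https://my-workspace.cloud.databricks.com --profile DEFAULT

# Confirm the profile now authenticates
databricks auth describe --profile DEFAULT
```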

S3 Access Issues

S3 Bucket Not Found

Symptoms:

  • Deployment or pipeline runs fail reporting that the S3 bucket does not exist

Causes:

  • Bucket name typo in databricks.yml

  • AWS profile not configured

  • Missing bucket permissions

Solutions:
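Checks from a local shell (bucket and profile names are placeholders):

```bash
# Confirm the bucket exists under the AWS profile the deployment uses
aws s3 ls s3://my-ml-pipelines-bucket/ --profile my-aws-profile

# Compare against the bucket name configured in databricks.yml
grep -n 'bucket' databricks.yml
```

A mismatch between the two usually means a typo in databricks.yml or the wrong AWS profile.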


S3 Access Denied

Symptoms:

  • Reads or writes to S3 fail with Access Denied errors

Causes:

  • IAM role missing permissions

  • Bucket policy restrictive

  • External location misconfigured

Solutions:
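From the Unity Catalog side, a sketch of the checks and grants (location and principal names are placeholders; IAM role and bucket policy fixes happen in AWS):

```sql
-- Check which external location covers the path and who owns it
SHOW EXTERNAL LOCATIONS;

-- Grant file access on the relevant location
GRANT READ FILES, WRITE FILES
ON EXTERNAL LOCATION `ml_pipelines_landing` TO `my-service-principal`;
```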

Schema Conflicts

Incompatible Schema Change

Symptoms:

  • Pipeline fails with INCOMPATIBLE_VIEW_SCHEMA_CHANGE

Causes:

  • Column added/removed without DROP/CREATE

  • Column type changed

  • Column renamed

Solutions:
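The fix the prevention note points at, sketched with placeholder names: redefine the table with the new schema instead of evolving it in place.

```sql
-- Replaces the table definition atomically, avoiding incompatible
-- in-place schema evolution
CREATE OR REPLACE TABLE ml_catalog.pipelines.features AS
SELECT id,
       amount,
       cast(score AS DECIMAL(10, 2)) AS score   -- added or retyped column
FROM ml_catalog.pipelines.source;
```

After the replace, trigger a full refresh so downstream tables rebuild against the new schema.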


Column Type Mismatch

Symptoms:

  • Casts fail, or numeric columns come back NULL after ai_query

Causes:

  • ai_query returns strings but schema expects decimals

  • Null values in numeric columns

  • Empty strings ("") not handled

Solutions:
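A defensive casting pattern that covers all three causes (table and column names are placeholders):

```sql
-- nullif turns empty strings into NULL; try_cast returns NULL instead of
-- failing on malformed values
SELECT try_cast(nullif(score_str, '') AS DECIMAL(10, 2)) AS score
FROM ml_catalog.pipelines.stage1_raw;
```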

Deployment Failures

Bundle Validation Failed

Symptoms:

  • databricks bundle validate reports errors and deployment stops

Causes:

  • YAML syntax errors

  • Invalid variable references

  • Missing required fields

Solutions:
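Reproduce the failure locally before pushing (the target name is a placeholder):

```bash
# Validate with the same target CI uses; YAML syntax errors and bad
# variable references are reported with file and line
databricks bundle validate -t dev
```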


Wheel Build Failed

Symptoms:

  • Deployment fails while building the Python wheel

Causes:

  • Missing pyproject.toml

  • Invalid dependencies

  • UV version too old

Solutions:
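Local checks with UV (assuming the project builds from the repository root):

```bash
# Confirm UV is recent enough, then build the wheel locally
uv --version
uv build

# Inspect the build output
ls dist/
```

If the local build fails too, check that pyproject.toml exists at the expected path and that its dependency specifiers resolve.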

GitHub Actions Failures

Workflow Not Triggering

Symptoms:

  • PR merged but dev deployment didn't run

  • No workflow visible in Actions tab

Causes:

  • Workflow file syntax error

  • Path filters excluding changes

  • GitHub Actions disabled

Solutions:
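Checks with the GitHub CLI (the workflow file name is a placeholder):

```bash
# Confirm the workflow is registered and enabled
gh workflow list

# Check recent runs for the workflow
gh run list --workflow "deploy-dev.yml"
```

If the workflow does not appear at all, the YAML likely has a syntax error; if it appears but never runs, inspect its paths: filters against the files changed in the PR.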


Workflow Fails at Validation Step

Symptoms:

  • The bundle validates locally but the CI validation step fails

Causes:

  • Environment-specific configuration

  • Missing secrets

  • Different Databricks CLI version

Solutions:


Staging Deployment Not Triggering

Symptoms:

  • Dev deployment succeeds

  • Staging deployment doesn't start

Causes:

  • workflow_run condition not met

  • Staging workflow disabled

  • GitHub environment approval pending

Solutions:
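A sketch of the trigger wiring to check (the dev workflow name is a placeholder and must exactly match that workflow's name: field):

```yaml
on:
  workflow_run:
    workflows: ["Deploy to Dev"]   # must match the dev workflow's name: exactly
    types: [completed]

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    # workflow_run fires on failure too; gate on success explicitly
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
```

If the wiring is correct, check the Actions tab for a pending GitHub environment approval holding the run.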

How to Read Logs

DLT Pipeline Event Logs

Location: Databricks UI → Workflows → Delta Live Tables → [Pipeline] → Event Log

Key Fields:

  • Level: INFO, WARN, ERROR

  • Event Type: FLOW_PROGRESS, EXPECTATION, ERROR

  • Message: Detailed error description

Common Patterns:

Useful Filters:

  • Level = ERROR (show only errors)

  • Event Type = EXPECTATION (data quality issues)

  • Search: "INCOMPATIBLE" (schema conflicts)
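The filters above can be expressed as an event-log query (the pipeline ID is a placeholder):

```sql
SELECT timestamp, event_type, message
FROM event_log('1234-abcd-5678')
WHERE level = 'ERROR'
   OR message LIKE '%INCOMPATIBLE%'
ORDER BY timestamp DESC;
```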


GitHub Actions Logs

Location: Repository → Actions → [Workflow Run] → [Job] → [Step]

Key Steps to Check:

  1. Checkout - Did code checkout succeed?

  2. Install Databricks CLI - CLI version logged

  3. Validate Bundle - Validation errors appear here

  4. Deploy - Deployment progress and errors

Download Logs:
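With the GitHub CLI, for example (the run ID is a placeholder):

```bash
# Find the run ID, then inspect its log
gh run list --limit 10
gh run view 123456789 --log

# Or save the full log to a file for searching
gh run view 123456789 --log > run.log
```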


Model Serving Logs

Location: Databricks UI → Machine Learning → Model Serving → [Endpoint] → Logs

Check for:

  • Request failures (4xx, 5xx status codes)

  • Latency spikes

  • Error messages from model code

Query Logs:
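If inference tables are enabled for the endpoint, requests and responses land in a Unity Catalog table you can query. The table and column names below are assumptions based on the standard inference-table schema; check the endpoint's configuration for the actual location.

```sql
-- Recent failed requests against the endpoint's inference table
SELECT timestamp_ms, status_code, request, response
FROM ml_catalog.serving.my_endpoint_payload
WHERE status_code >= 400
ORDER BY timestamp_ms DESC
LIMIT 50;
```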

Escalation Procedures

Level 1: Self-Service (Use This Guide)

Timeframe: 0-30 minutes

Actions:

  1. Identify symptom in this guide

  2. Follow troubleshooting steps

  3. Check logs as directed

  4. Try suggested solutions


Level 2: Team Support

Timeframe: 30 minutes - 2 hours

Contact:

  • Post in #ml-pipelines with a summary of the issue

Information to Provide:

  • Pipeline or workflow run ID and a link to the failing run

  • The full error message and relevant log excerpts

  • What you have already tried from this guide


Level 3: Admin Escalation

Timeframe: 2+ hours or production down

Contact:

Use for:

  • Production outages

  • Data corruption

  • Security incidents

  • Permission issues requiring admin access


Emergency Procedures

For Production Incidents:

  1. Stop the bleeding

    • Stop the affected pipeline or disable the failing workflow

  2. Notify team

    • Post in #ml-pipelines: "Production incident - investigating"

    • Update incident channel with status

  3. Assess impact

    • How many users affected?

    • What data is impacted?

    • Can we rollback?

  4. Execute fix or rollback

  5. Post-mortem

    • Document root cause

    • Update this guide

    • Implement preventive measures

Quick Reference

Common Commands
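The commands referenced throughout this guide, collected in one place (hosts, targets, and IDs are placeholders):

```bash
# Authentication
databricks auth login --host https://my-workspace.cloud.databricks.com

# Bundle lifecycle
databricks bundle validate -t dev
databricks bundle deploy -t dev

# Pipelines
databricks pipelines list-pipelines
databricks pipelines stop 1234-abcd-5678
databricks pipelines start-update 1234-abcd-5678 --full-refresh

# Model serving
databricks serving-endpoints list
```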

Error Code Reference

| Error Code | Meaning | Quick Fix |
| --- | --- | --- |
| INCOMPATIBLE_VIEW_SCHEMA_CHANGE | Schema conflict | Use CREATE OR REPLACE |
| AI_FUNCTION_MODEL_SCHEMA_PARSE_ERROR | Model schema issues | Follow two-stage pattern |
| Permission denied on catalog | Missing grants | Grant catalog permissions |
| Token expired | Auth expired | Run databricks auth login |
| S3 Access Denied | Bucket permissions | Check external location |
| Pipeline stuck in RUNNING | Schema/checkpoint issue | Stop and use CREATE OR REPLACE |
