Debugging Guide

Overview

This guide provides practical debugging strategies for common scenarios in the ML Pipelines project. It's tailored for developers working on DLT pipelines, model training, and ML inference tasks.

Quick Reference: For common operational issues and quick fixes, see the Troubleshooting Guide.

Debugging Philosophy

  1. Reproduce the issue: Can you make it happen consistently?

  2. Isolate the problem: Narrow down to the smallest failing component

  3. Check the logs: Most answers are in the logs

  4. Test hypotheses: Form theories and test them systematically

  5. Ask for help: Don't spend hours stuck; reach out to the team

This debugging guide focuses on developer-oriented debugging techniques.

Quick Reference

| Issue | First Place to Look |
| --- | --- |
| DLT pipeline stuck | Event logs for schema conflicts |
| Model prediction errors | Model serving endpoint logs |
| Null results from ai_query | Model signature and input format |
| Permission denied | Service principal grants |
| Slow pipeline | Spark UI and query plans |
| Bundle deployment fails | Validation errors in workflow logs |

Debugging DLT Pipelines

Accessing DLT Event Logs

Via Databricks UI:

  1. Navigate to Workflows → Delta Live Tables

  2. Select your pipeline

  3. Click Event Log tab

  4. Filter by Level (ERROR, WARN, INFO) or Event Type

Via CLI:
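The original CLI snippet did not survive the export; one way to pull the event log programmatically is the Databricks Python SDK. A minimal sketch (the SDK call is an assumption and the pipeline ID is a placeholder), with a plain-Python helper for isolating errors:

```python
# Sketch: fetch DLT event-log entries and keep only ERROR-level ones.
# The commented SDK call is an assumption (requires databricks-sdk and
# a configured workspace profile); the helper itself is plain Python.

def error_events(events):
    """Return ERROR-level events, newest first."""
    errors = [e for e in events if e.get("level") == "ERROR"]
    return sorted(errors, key=lambda e: e.get("timestamp", ""), reverse=True)

# from databricks.sdk import WorkspaceClient
# w = WorkspaceClient()
# events = [e.as_dict() for e in w.pipelines.list_pipeline_events(pipeline_id="<pipeline-id>")]

sample = [
    {"level": "INFO", "timestamp": "2024-05-01T10:00:00Z", "message": "Flow started"},
    {"level": "ERROR", "timestamp": "2024-05-01T10:05:00Z", "message": "INCOMPATIBLE_VIEW_SCHEMA_CHANGE"},
]
for e in error_events(sample):
    print(e["timestamp"], e["message"])
```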

Common DLT Issues

Pipeline Stuck in RUNNING State

Symptoms:

  • Pipeline shows "RUNNING" for hours

  • No progress in event logs

  • Zero rows scanned/written

Debug Steps:

  1. Check for schema conflicts:

  2. Check table history:

  3. Verify source data exists:

  4. Check streaming checkpoint:

    • Look for checkpoint errors in event logs

    • Checkpoint corruption can cause hangs
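To distinguish a slow update from a truly hung one, it helps to poll pipeline state with a hard timeout instead of watching the UI. A sketch (the state-fetching callable is injected so the loop itself is testable; the SDK call in the comment is an assumption):

```python
import time

def wait_for_state(get_state, done_states=("COMPLETED", "FAILED", "CANCELED"),
                   timeout_s=1800, poll_s=30, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state or we time out."""
    waited = 0
    while waited <= timeout_s:
        state = get_state()
        if state in done_states:
            return state
        sleep(poll_s)
        waited += poll_s
    raise TimeoutError(f"pipeline still in state {state!r} after {timeout_s}s")

# With the Databricks SDK (assumption - adapt to your workspace):
# from databricks.sdk import WorkspaceClient
# w = WorkspaceClient()
# final = wait_for_state(
#     lambda: w.pipelines.get(pipeline_id="<pipeline-id>").latest_updates[0].state.value)
```

A TimeoutError here is your signal to go dig into checkpoints and schema conflicts rather than waiting longer.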

Solution:


Zero Rows Scanned/Written

Symptoms:

  • Pipeline completes successfully

  • Tables exist but are empty

  • Event logs show 0 rows processed

Debug Steps:

  1. Verify source has data:

  2. Check filter conditions:

    • Review WHERE clauses in transformations

    • Temporarily remove filters to test

  3. Validate volume paths:

  4. Check for schema conflicts (most common cause):

    • Look for INCOMPATIBLE_VIEW_SCHEMA_CHANGE errors

    • Table schema doesn't match transformation output
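When a table comes back empty, a quick way to find the offending filter (step 2 above) is to count survivors after each predicate. A sketch in plain Python; on Databricks you would do the equivalent with successive DataFrame `.filter().count()` calls:

```python
def filter_audit(rows, predicates):
    """Apply predicates one at a time and report the surviving row count
    after each, so the filter that drops everything is easy to spot."""
    report = []
    for name, pred in predicates:
        rows = [r for r in rows if pred(r)]
        report.append((name, len(rows)))
    return report

rows = [{"status": "active", "score": 0.2}, {"status": "active", "score": 0.9}]
report = filter_audit(rows, [
    ("status = 'active'", lambda r: r["status"] == "active"),
    ("score > 0.95",      lambda r: r["score"] > 0.95),  # too strict: drops all rows
])
print(report)  # [("status = 'active'", 2), ("score > 0.95", 0)]
```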

Solution:


ai_query Returns NULL

Symptoms:

  • ai_query() calls return NULL for all rows

  • No errors in DLT logs

  • Model endpoint exists and is ready

Debug Steps:

  1. Check model endpoint status:

  2. Test endpoint directly:

  3. Check model signature:

  4. Verify input/output format matches signature:

    • Model expects string but ai_query passes struct?

    • Column name mismatch?
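A frequent cause of silent NULLs is input that does not match the model signature. A sketch of a pre-flight check; the signature dict shape loosely mirrors what MLflow reports for a model's inputs, but treat it as an assumption:

```python
def check_inputs(record, signature_inputs):
    """Compare a candidate request record against the model's declared
    input columns; returns (missing, unexpected) column names."""
    expected = {col["name"] for col in signature_inputs}
    actual = set(record)
    return sorted(expected - actual), sorted(actual - expected)

signature_inputs = [{"name": "text", "type": "string"}]
missing, unexpected = check_inputs({"input": "great product!"}, signature_inputs)
print("missing:", missing, "unexpected:", unexpected)
# A non-empty 'missing' list is exactly the column-name mismatch that
# makes ai_query return NULL instead of raising an error.
```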

Solution: See Model Deployment Guide for comprehensive fix.


Using Databricks Notebooks for Pipeline Debugging

Interactive Debugging Workflow:

  1. Create debugging notebook:

Note: all Databricks notebooks saved as .py files must begin with # Databricks notebook source for the UI to render the cells correctly. This pattern matters because DLT pipelines do not support running Python notebooks (.ipynb), only .py or .sql files.

  2. Run transformations step-by-step:

    • Test each SELECT clause independently

    • Validate data at each stage

    • Identify where data is lost or corrupted

  3. Test ai_query interactively:
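A literal one-row probe isolates ai_query from the pipeline's own data. A sketch (the endpoint name is a placeholder, and the spark call only runs inside a Databricks notebook, so it is commented out):

```python
# Probe ai_query with a literal value so any NULL cannot be blamed on
# upstream data. The endpoint name is a placeholder (assumption).
probe_sql = """
SELECT ai_query(
  'sentiment-endpoint',     -- serving endpoint name
  'I love this product'     -- literal input, known-good string
) AS prediction
"""

# In a Databricks notebook:
# result = spark.sql(probe_sql).collect()[0]["prediction"]
# print(result)  # NULL here points at the endpoint or signature, not your data
print(probe_sql.strip().startswith("SELECT ai_query"))
```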

Debugging Model Training Jobs

Accessing Job Logs

Via Databricks UI:

  1. Navigate to Workflows → Jobs

  2. Select your job (e.g., taylorlaing_register_sentiment_analysis)

  3. Click on a job run

  4. View Output and Logs tabs

Via CLI:
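As with pipelines, the original CLI commands were lost in export; a sketch with the Python SDK (the commented SDK calls are assumptions), plus a testable helper that picks the most recent failed run out of a run list:

```python
def latest_failed_run(runs):
    """Given run dicts with 'start_time' (epoch ms) and 'result_state',
    return the newest failed run, or None if all succeeded."""
    failed = [r for r in runs if r.get("result_state") == "FAILED"]
    return max(failed, key=lambda r: r["start_time"], default=None)

# from databricks.sdk import WorkspaceClient   # assumption - adapt to your workspace
# w = WorkspaceClient()
# runs = [{"start_time": r.start_time, "result_state": r.state.result_state.value}
#         for r in w.jobs.list_runs(job_id=123, completed_only=True)]

runs = [
    {"start_time": 1000, "result_state": "SUCCESS"},
    {"start_time": 2000, "result_state": "FAILED"},
    {"start_time": 1500, "result_state": "FAILED"},
]
print(latest_failed_run(runs))  # the run with start_time 2000
```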

Common Model Training Issues

Model Registration Fails

Symptoms:

  • Job fails during mlflow.pyfunc.log_model()

  • Error: "Model signature invalid"

  • Error: "Artifact not found"

Debug Steps:

  1. Check job logs for exact error message

  2. Validate model signature:

  3. Check artifact paths:

  4. Test model loads locally:
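Step 4 can be done without touching the registry at all: instantiate the pyfunc wrapper class directly and run a prediction. A sketch with a stand-in wrapper class (the class name and the commented MLflow call are assumptions):

```python
# Local smoke test: exercise the pyfunc wrapper without MLflow so that
# registration failures can be separated from bugs in the model itself.
# SentimentModel is a stand-in for your real wrapper class.

class SentimentModel:
    def load_context(self, context):
        self.positive_words = {"love", "great", "good"}

    def predict(self, context, model_input):
        return ["positive" if any(w in text.lower() for w in self.positive_words)
                else "negative" for text in model_input]

model = SentimentModel()
model.load_context(context=None)      # simulate what serving does at startup
preds = model.predict(None, ["I love it", "terrible"])
print(preds)  # ['positive', 'negative']

# With the real artifact (assumption - adjust the model URI):
# import mlflow
# m = mlflow.pyfunc.load_model("models:/<catalog>.<schema>.<model>@champion")
# print(m.predict(["I love it"]))
```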

Solution: See Model Deployment Guide


Model Serving Endpoint Creation Fails

Symptoms:

  • Endpoint creation times out

  • Error: "Endpoint already exists"

  • Endpoint stuck in "NOT_READY" state

Debug Steps:

  1. Check if endpoint exists:

  2. Delete stale endpoint if needed:

  3. Check endpoint configuration:

  4. Wait longer: endpoints can take 10-30 minutes to provision
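Because provisioning is slow, it helps to distinguish "still provisioning" from "stuck" by looking at both state fields. A sketch (the field names follow the serving-endpoints API, but treat them as assumptions):

```python
def endpoint_status(state):
    """Classify a serving-endpoint state dict as ready/provisioning/failed."""
    if state.get("ready") == "READY":
        return "ready"
    if state.get("config_update") == "IN_PROGRESS":
        return "provisioning"   # normal for 10-30 minutes after creation
    return "failed"

# With the Databricks SDK (assumption - adapt to your workspace):
# from databricks.sdk import WorkspaceClient
# ep = WorkspaceClient().serving_endpoints.get(name="<endpoint-name>")
# print(endpoint_status({"ready": ep.state.ready.value,
#                        "config_update": ep.state.config_update.value}))

print(endpoint_status({"ready": "NOT_READY", "config_update": "IN_PROGRESS"}))
```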

Solution:


Interactive Model Debugging

Debug Model in Notebook:

  1. Load model from registry:

  2. Test with various inputs:

  3. Inspect model artifacts:

  4. Debug load_context:
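Steps 2 and 4 combine naturally into an input sweep that records which inputs break the model instead of stopping at the first failure. A sketch (the commented load call is an assumption; any object with a predict method works):

```python
def sweep(predict, inputs):
    """Run predict on each input individually and collect failures,
    so one bad row doesn't hide the rest."""
    failures = []
    for x in inputs:
        try:
            predict([x])
        except Exception as exc:   # broad on purpose: we want every failure
            failures.append((x, repr(exc)))
    return failures

# model = mlflow.pyfunc.load_model("models:/<model>@champion")  # assumption
predict = lambda batch: [s.upper() for s in batch]   # stand-in model
bad = sweep(predict, ["fine", "", None, "also fine"])
print(bad)   # only the None input fails
```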

Reading Databricks Logs

Log Locations

| Log Type | Location | Access Method |
| --- | --- | --- |
| DLT Pipeline Logs | Pipeline Event Log | Databricks UI or CLI |
| Job Logs | Job Run Output | Databricks UI or CLI |
| Model Serving Logs | Endpoint Logs | Databricks UI |
| Spark Driver Logs | Job Run → Logs | Databricks UI |
| Executor Logs | Spark UI → Executors | Databricks UI |

Understanding DLT Event Logs

Event Types:

  • FLOW_PROGRESS: Pipeline execution progress

  • EXPECTATION: Data quality check results

  • ERROR: Pipeline errors

  • UPDATE_PROGRESS: Update status changes

Common Error Patterns:

Filtering Logs:

  • Filter by Level: ERROR, WARN, INFO

  • Search for keywords: "INCOMPATIBLE", "FAILED", "Permission"

  • Filter by Event Type: ERROR, EXPECTATION

Understanding Model Serving Logs

Access Logs:

  1. Databricks UI → Machine Learning → Model Serving

  2. Select endpoint → Logs tab

Log Fields:

  • timestamp: Request timestamp

  • request_id: Unique request identifier

  • status_code: HTTP status (200, 400, 500)

  • execution_time_ms: Prediction latency

  • error_message: Error details (if failed)

Query Logs:
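The log fields above are enough to compute an error rate and latency percentiles once the records are exported. A sketch over a list of log-record dicts (field names as listed above; the p95 computation is a crude nearest-rank approximation):

```python
import statistics

def serving_stats(records):
    """Summarize serving logs: error rate plus p50/p95 latency in ms."""
    latencies = sorted(r["execution_time_ms"] for r in records)
    errors = sum(1 for r in records if r["status_code"] >= 400)
    return {
        "error_rate": errors / len(records),
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))],
    }

records = [
    {"status_code": 200, "execution_time_ms": 40},
    {"status_code": 200, "execution_time_ms": 55},
    {"status_code": 500, "execution_time_ms": 900},
    {"status_code": 200, "execution_time_ms": 48},
]
print(serving_stats(records))
```

A spiking p95 with a flat p50 usually points at cold starts or an occasional oversized request rather than a uniformly slow model.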

Performance Debugging

Debugging Slow DLT Pipelines

Check Spark UI:

  1. Open pipeline run

  2. Click Spark UI button

  3. Navigate to SQL / DataFrame tab

  4. Identify slow stages

Common Performance Issues:

| Issue | Symptom | Solution |
| --- | --- | --- |
| Data skew | Few tasks take much longer | Repartition by key |
| Small files | Many small input files | Optimize with compaction |
| Shuffle spill | Executors spilling to disk | Increase executor memory |
| Wide transformations | Many shuffle operations | Reduce joins/groupBy |
| Large broadcast | Driver OOM | Reduce broadcast join threshold |

Optimization Techniques:
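Data skew, the first row in the table above, can be confirmed numerically from Spark UI task durations before reaching for repartitioning. A sketch of the check, with the usual remedies as comments (the config names are standard Spark/Delta knobs, but tune the values for your data):

```python
import statistics

def is_skewed(task_durations_s, ratio=5.0):
    """Flag a stage as skewed when the slowest task takes 'ratio' times
    longer than the median task - the classic Spark UI signature."""
    med = statistics.median(task_durations_s)
    return med > 0 and max(task_durations_s) / med >= ratio

print(is_skewed([3, 4, 4, 5, 41]))   # one straggler task

# Typical remedies in a notebook/pipeline (sketch):
# df = df.repartition("customer_id")                     # spread the hot key
# spark.conf.set("spark.sql.shuffle.partitions", "200")  # right-size shuffles
# spark.sql("OPTIMIZE my_catalog.my_schema.my_table")    # compact small files
```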

Debugging Slow Model Inference

Check Endpoint Metrics:

Common Causes:

  • Cold start: Endpoint scaling up from zero

  • Model size: Large models load slowly

  • Batch size: Single predictions vs batching

  • Endpoint size: "Small" vs "Medium" workload

Solutions:
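Batching, from the "Common Causes" list above, is usually the cheapest win: send N rows per request instead of one. A sketch of client-side batching (the request function is injected; in practice it would POST to the endpoint's invocations URL):

```python
def batched_predict(rows, call_endpoint, batch_size=32):
    """Group rows into batches and concatenate predictions, cutting
    per-request overhead versus one row per call."""
    preds = []
    for i in range(0, len(rows), batch_size):
        preds.extend(call_endpoint(rows[i:i + batch_size]))
    return preds

calls = []
def fake_endpoint(batch):
    calls.append(len(batch))         # record request sizes for the demo
    return ["ok"] * len(batch)

out = batched_predict(list(range(70)), fake_endpoint, batch_size=32)
print(len(out), calls)   # 70 predictions in 3 requests: [32, 32, 6]
```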

Common Error Patterns

Schema Conflicts

Error:

Solution:

Permission Denied

Error:

Solution:

Model Schema Parse Error

Error:

Solution: Follow two-stage pipeline pattern in Model Deployment Guide

Debugging Checklist

Before asking for help, have you:
