Debugging Guide
Overview
This guide provides practical debugging strategies for common scenarios in the ML Pipelines project. It's tailored for developers working on DLT pipelines, model training, and ML inference tasks.
Quick Reference: For common operational issues and quick fixes, see the Troubleshooting Guide.
Debugging Philosophy
Reproduce the issue: Can you make it happen consistently?
Isolate the problem: Narrow down to the smallest failing component
Check the logs: Most answers are in the logs
Test hypotheses: Form theories and test them systematically
Ask for help: Don't spend hours stuck - reach out to the team
Related Documentation
This debugging guide focuses on developer-oriented debugging techniques. For related information:
Troubleshooting Guide - Quick reference for common operational issues
Model Deployment Guide - Model-specific issues and ai_query troubleshooting
DLT Pipelines Guide - Pipeline development best practices
Getting Started Guide - Day 1 setup and common first-day issues
Quick Reference
| Symptom | Where to look |
| --- | --- |
| DLT pipeline stuck | Event logs for schema conflicts |
| Model prediction errors | Model serving endpoint logs |
| Null results from ai_query | Model signature and input format |
| Permission denied | Service principal grants |
| Slow pipeline | Spark UI and query plans |
| Bundle deployment fails | Validation errors in workflow logs |
Debugging DLT Pipelines
Accessing DLT Event Logs
Via Databricks UI:
Navigate to Workflows → Delta Live Tables
Select your pipeline
Click Event Log tab
Filter by Level (ERROR, WARN, INFO) or Event Type
Via CLI:
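The CLI command itself is elided here; as an equivalent, the event log can be queried directly from a notebook or the SQL editor with the `event_log()` table-valued function (the pipeline ID below is a placeholder):

```sql
-- Show the most recent errors for a pipeline (replace with your pipeline ID)
SELECT timestamp, level, message
FROM event_log("<pipeline-id>")
WHERE level = 'ERROR'
ORDER BY timestamp DESC
LIMIT 20;
```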
Common DLT Issues
Pipeline Stuck in RUNNING State
Symptoms:
Pipeline shows "RUNNING" for hours
No progress in event logs
Zero rows scanned/written
Debug Steps:
Check for schema conflicts:
Check table history:
Verify source data exists:
Check streaming checkpoint:
Look for checkpoint errors in event logs
Checkpoint corruption can cause hangs
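The table-history and source-data checks above can be sketched in SQL (all table names are placeholders):

```sql
-- Recent operations on the target table: look for schema changes or long gaps
DESCRIBE HISTORY my_catalog.my_schema.my_table LIMIT 10;

-- Confirm the source actually has rows to process
SELECT COUNT(*) FROM my_catalog.my_schema.source_table;
```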
Solution: If the checkpoint appears corrupted, stop the update and run a full refresh of the pipeline (or the affected table) to rebuild the streaming state.
Zero Rows Scanned/Written
Symptoms:
Pipeline completes successfully
Tables exist but are empty
Event logs show 0 rows processed
Debug Steps:
Verify source has data:
Check filter conditions:
Review WHERE clauses in transformations
Temporarily remove filters to test
Validate volume paths:
Check for schema conflicts (most common cause):
Look for INCOMPATIBLE_VIEW_SCHEMA_CHANGE errors
Table schema doesn't match transformation output
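A sketch of the source and path checks above (table names, paths, and the filter condition are placeholders):

```sql
-- 1. Does the source have rows at all?
SELECT COUNT(*) FROM my_catalog.my_schema.source_table;

-- 2. Does the volume path exist and contain files?
LIST '/Volumes/my_catalog/my_schema/my_volume/landing/';

-- 3. Re-run the transformation's WHERE clause on its own
SELECT COUNT(*)
FROM my_catalog.my_schema.source_table
WHERE event_date >= current_date() - INTERVAL 7 DAYS;
```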
Solution: Fix the schema conflict by aligning the transformation output with the table schema, then run a full refresh so the table is rebuilt cleanly.
ai_query Returns NULL
Symptoms:
ai_query() calls return NULL for all rows
No errors in DLT logs
Model endpoint exists and is ready
Debug Steps:
Check model endpoint status:
Test endpoint directly:
Check model signature:
Verify input/output format matches signature:
Model expects string but ai_query passes struct?
Column name mismatch?
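A quick interactive check is to call `ai_query` on a literal value first, then on a handful of real rows; this separates endpoint problems from data problems (endpoint name, table, and column are placeholders):

```sql
-- 1. Literal input: isolates the endpoint from your data
SELECT ai_query('my-sentiment-endpoint', 'This product is great');

-- 2. Real rows: surfaces column-name or type mismatches
SELECT text, ai_query('my-sentiment-endpoint', text) AS prediction
FROM my_catalog.my_schema.reviews
LIMIT 5;
```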
Solution: See Model Deployment Guide for comprehensive fix.
Using Databricks Notebooks for Pipeline Debugging
Interactive Debugging Workflow:
Create debugging notebook:
Note: all Databricks notebooks that are .py files must begin with `# Databricks notebook source` for the UI to render the cells correctly. This matters because DLT pipelines do not support Python notebooks (.ipynb), only .py or .sql files.
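A minimal skeleton of such a debugging notebook (table name and transformation are placeholders; this runs only inside a Databricks workspace, where `spark` and `display` are predefined):

```python
# Databricks notebook source

# COMMAND ----------

# Step 1: load the raw source and eyeball a sample
df = spark.read.table("my_catalog.my_schema.source_table")
display(df.limit(10))

# COMMAND ----------

# Step 2: apply ONE transformation at a time and re-check the row count
filtered = df.where("label IS NOT NULL")
print(df.count(), "->", filtered.count())
```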
Run transformations step-by-step:
Test each SELECT clause independently
Validate data at each stage
Identify where data is lost or corrupted
Test ai_query interactively:
Debugging Model Training Jobs
Accessing Job Logs
Via Databricks UI:
Navigate to Workflows → Jobs
Select your job (e.g., taylorlaing_register_sentiment_analysis)
Click on a job run
View Output and Logs tabs
Via CLI:
Common Model Training Issues
Model Registration Fails
Symptoms:
Job fails during mlflow.pyfunc.log_model()
Error: "Model signature invalid"
Error: "Artifact not found"
Debug Steps:
Check job logs for exact error message
Validate model signature:
Check artifact paths:
Test model loads locally:
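The signature check can be sketched without any ML framework: compare the columns the signature declares against one sample payload before calling `mlflow.pyfunc.log_model()`. Everything below (the helper and its names) is a hypothetical illustration, not an MLflow API:

```python
# Hypothetical helper (not an MLflow API): compare the columns a model's
# signature declares against one sample payload before registration.
def validate_input(signature_inputs, sample_row):
    """signature_inputs: list of (name, type) pairs; sample_row: dict."""
    problems = []
    for name, dtype in signature_inputs:
        if name not in sample_row:
            problems.append(f"missing column: {name}")
        elif dtype == "string" and not isinstance(sample_row[name], str):
            problems.append(
                f"{name}: expected string, got {type(sample_row[name]).__name__}"
            )
    return problems

signature = [("text", "string")]
print(validate_input(signature, {"text": "great product"}))  # []
print(validate_input(signature, {"body": "great product"}))  # ['missing column: text']
```

The same idea catches the two failure modes called out under ai_query above: a missing/renamed column and a type mismatch.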
Solution: See Model Deployment Guide
Model Serving Endpoint Creation Fails
Symptoms:
Endpoint creation times out
Error: "Endpoint already exists"
Endpoint stuck in "NOT_READY" state
Debug Steps:
Check if endpoint exists:
Delete stale endpoint if needed:
Check endpoint configuration:
Wait longer - Endpoints can take 10-30 minutes to provision
Solution: If the endpoint stays NOT_READY well past the 30-minute mark, delete it and recreate it against a known-good model version.
Interactive Model Debugging
Debug Model in Notebook:
Load model from registry:
Test with various inputs:
Inspect model artifacts:
Debug load_context:
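The "test with various inputs" step generalizes to a small probe loop. `predict` below is a stand-in for the `predict` method of a model loaded from the registry; the lambda and case names are illustrative only:

```python
def probe(predict, cases):
    """Run predict over named edge cases, capturing errors instead of crashing."""
    results = {}
    for name, value in cases.items():
        try:
            results[name] = ("ok", predict(value))
        except Exception as exc:
            results[name] = ("error", str(exc))
    return results

# Stand-in for mlflow.pyfunc.load_model(...).predict on a single string
predict = lambda s: "positive" if "good" in s else "negative"

cases = {"simple": "good product", "empty": "", "very_long": "x" * 10_000}
for name, outcome in probe(predict, cases).items():
    print(name, outcome)
```

Running every edge case through one loop makes it obvious which input class (empty strings, very long text, wrong types) triggers the failure.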
Reading Databricks Logs
Log Locations
| Log type | Where | Access |
| --- | --- | --- |
| DLT Pipeline Logs | Pipeline Event Log | Databricks UI or CLI |
| Job Logs | Job Run Output | Databricks UI or CLI |
| Model Serving Logs | Endpoint Logs | Databricks UI |
| Spark Driver Logs | Job Run → Logs | Databricks UI |
| Executor Logs | Spark UI → Executors | Databricks UI |
Understanding DLT Event Logs
Event Types:
FLOW_PROGRESS: Pipeline execution progress
EXPECTATION: Data quality check results
ERROR: Pipeline errors
UPDATE_PROGRESS: Update status changes
Common Error Patterns:
Filtering Logs:
Filter by Level: ERROR, WARN, INFO
Search for keywords: "INCOMPATIBLE", "FAILED", "Permission"
Filter by Event Type: ERROR, EXPECTATION
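The same filters apply outside the UI to an exported event log. A minimal, framework-free sketch (field names mirror the DLT event log; the sample entries are made up):

```python
# Filter an exported event log (a list of dicts) by level and keyword,
# the same way the UI filters do.
def filter_events(events, level=None, keyword=None):
    matches = []
    for event in events:
        if level and event.get("level") != level:
            continue
        if keyword and keyword not in event.get("message", ""):
            continue
        matches.append(event)
    return matches

events = [
    {"level": "ERROR", "message": "INCOMPATIBLE_VIEW_SCHEMA_CHANGE on table x"},
    {"level": "INFO", "message": "Flow progress: 1000 rows written"},
]
print(filter_events(events, level="ERROR", keyword="INCOMPATIBLE"))
```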
Understanding Model Serving Logs
Access Logs:
Databricks UI → Machine Learning → Model Serving
Select endpoint → Logs tab
Log Fields:
timestamp: Request timestamp
request_id: Unique request identifier
status_code: HTTP status (200, 400, 500)
execution_time_ms: Prediction latency
error_message: Error details (if failed)
Query Logs:
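With the fields above, a quick summary of exported request logs might look like this (a pure-Python sketch over a made-up sample; real records would come from the endpoint's Logs tab):

```python
def summarize(logs):
    """Aggregate request logs: total requests, error count, median latency."""
    errors = sum(1 for entry in logs if entry["status_code"] >= 400)
    latencies = sorted(entry["execution_time_ms"] for entry in logs)
    p50 = latencies[len(latencies) // 2]
    return {"requests": len(logs), "errors": errors, "p50_ms": p50}

sample = [
    {"status_code": 200, "execution_time_ms": 42},
    {"status_code": 200, "execution_time_ms": 55},
    {"status_code": 500, "execution_time_ms": 3000},
]
print(summarize(sample))  # {'requests': 3, 'errors': 1, 'p50_ms': 55}
```

The median (rather than the mean) keeps one slow timeout from masking otherwise healthy latency.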
Performance Debugging
Debugging Slow DLT Pipelines
Check Spark UI:
Open pipeline run
Click Spark UI button
Navigate to SQL / DataFrame tab
Identify slow stages
Common Performance Issues:
| Issue | Symptom | Fix |
| --- | --- | --- |
| Data skew | Few tasks take much longer | Repartition by key |
| Small files | Many small input files | Optimize with compaction |
| Shuffle spill | Executors spilling to disk | Increase executor memory |
| Wide transformations | Many shuffle operations | Reduce joins/groupBy |
| Large broadcast | Driver OOM | Reduce broadcast join threshold |
Optimization Techniques:
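A few standard levers corresponding to the table above (the table name is a placeholder; the threshold value is illustrative, not a recommendation):

```sql
-- Small files: compact many small files into fewer, larger ones
OPTIMIZE my_catalog.my_schema.my_table;

-- Skew and shuffle sizing: let Adaptive Query Execution handle it
-- (on by default in recent runtimes)
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.skewJoin.enabled = true;

-- Large broadcast: lower the threshold if broadcast joins OOM the driver
SET spark.sql.autoBroadcastJoinThreshold = 10485760; -- 10 MB
```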
Debugging Slow Model Inference
Check Endpoint Metrics:
Common Causes:
Cold start: Endpoint scaling up from zero
Model size: Large models load slowly
Batch size: Single predictions vs batching
Endpoint size: "Small" vs "Medium" workload
Solutions: For cold starts, disable scale-to-zero on latency-sensitive endpoints; batch requests instead of sending single predictions; move to a larger workload size if the endpoint is saturated.
Common Error Patterns
Schema Conflicts
Error: INCOMPATIBLE_VIEW_SCHEMA_CHANGE (the table schema no longer matches the transformation output)
Solution: Align the transformation output with the existing table schema, then run a full refresh of the affected table
Permission Denied
Error: PERMISSION_DENIED or "User does not have permission" on a catalog, schema, table, or volume
Solution: Grant the job's service principal the required privileges (e.g., USE CATALOG, USE SCHEMA, SELECT or MODIFY) on the objects it accesses
Model Schema Parse Error
Error:
Solution: Follow two-stage pipeline pattern in Model Deployment Guide
Debugging Checklist
Before asking for help, have you:
Reproduced the issue consistently?
Checked the relevant event, job, or endpoint logs for the exact error message?
Isolated the smallest failing component (single table, cell, or request)?
Verified permissions and confirmed source data exists?
Written down what you tried and what the logs said?
Related Documentation
Troubleshooting Guide - Common issues and solutions
Model Deployment Guide - Model-specific debugging
DLT Pipelines Guide - Pipeline development best practices
Monitoring Guide - Production monitoring and alerts