Debugging Guide

Overview

This guide provides practical debugging strategies for common scenarios in the ML Pipelines project. It's tailored for developers working on DLT pipelines, model training, and ML inference tasks.

Quick Reference: For common operational issues and quick fixes, see the Troubleshooting Guide.

Debugging Philosophy

  1. Reproduce the issue: Can you make it happen consistently?

  2. Isolate the problem: Narrow down to the smallest failing component

  3. Check the logs: Most answers are in the logs

  4. Test hypotheses: Form theories and test them systematically

  5. Ask for help: Don't spend hours stuck; reach out to the team

This debugging guide focuses on developer-oriented debugging techniques.

Quick Reference

| Issue | First Place to Look |
| --- | --- |
| DLT pipeline stuck | Event logs for schema conflicts |
| Model prediction errors | Model serving endpoint logs |
| Null results from ai_query | Model signature and input format |
| Permission denied | Service principal grants |
| Slow pipeline | Spark UI and query plans |
| Bundle deployment fails | Validation errors in workflow logs |

Debugging DLT Pipelines

Accessing DLT Event Logs

Via Databricks UI:

  1. Navigate to Workflows → Delta Live Tables

  2. Select your pipeline

  3. Click Event Log tab

  4. Filter by Level (ERROR, WARN, INFO) or Event Type

Via CLI:
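The original CLI snippet did not survive the export; one way to pull the event log programmatically is the Databricks Python SDK. A minimal sketch (the SDK call is an assumption and the pipeline ID is a placeholder), with a plain-Python helper for isolating errors:

```python
# Sketch: fetch DLT event-log entries and keep only ERROR-level ones.
# The commented SDK call is an assumption (requires databricks-sdk and
# a configured workspace profile); the helper itself is plain Python.

def error_events(events):
    """Return ERROR-level events, newest first."""
    errors = [e for e in events if e.get("level") == "ERROR"]
    return sorted(errors, key=lambda e: e.get("timestamp", ""), reverse=True)

# from databricks.sdk import WorkspaceClient
# w = WorkspaceClient()
# events = [e.as_dict() for e in w.pipelines.list_pipeline_events(pipeline_id="<pipeline-id>")]

sample = [
    {"level": "INFO", "timestamp": "2024-05-01T10:00:00Z", "message": "Flow started"},
    {"level": "ERROR", "timestamp": "2024-05-01T10:05:00Z", "message": "INCOMPATIBLE_VIEW_SCHEMA_CHANGE"},
]
for e in error_events(sample):
    print(e["timestamp"], e["message"])
```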

Common DLT Issues

Pipeline Stuck in RUNNING State

Symptoms:

  • Pipeline shows "RUNNING" for hours

  • No progress in event logs

  • Zero rows scanned/written

Debug Steps:

  1. Check for schema conflicts:

  2. Check table history:

  3. Verify source data exists:

  4. Check streaming checkpoint:

    • Look for checkpoint errors in event logs

    • Checkpoint corruption can cause hangs
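To distinguish a slow update from a truly hung one, it helps to poll pipeline state with a hard timeout instead of watching the UI. A sketch (the state-fetching callable is injected so the loop itself is testable; the SDK call in the comment is an assumption):

```python
import time

def wait_for_state(get_state, done_states=("COMPLETED", "FAILED", "CANCELED"),
                   timeout_s=1800, poll_s=30, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state or we time out."""
    waited = 0
    while waited <= timeout_s:
        state = get_state()
        if state in done_states:
            return state
        sleep(poll_s)
        waited += poll_s
    raise TimeoutError(f"pipeline still in state {state!r} after {timeout_s}s")

# With the Databricks SDK (assumption - adapt to your workspace):
# from databricks.sdk import WorkspaceClient
# w = WorkspaceClient()
# final = wait_for_state(
#     lambda: w.pipelines.get(pipeline_id="<pipeline-id>").latest_updates[0].state.value)
```

A TimeoutError here is your signal to go dig into checkpoints and schema conflicts rather than waiting longer.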

Solution:


Zero Rows Scanned/Written

Symptoms:

  • Pipeline completes successfully

  • Tables exist but are empty

  • Event logs show 0 rows processed

Debug Steps:

  1. Verify source has data:

  2. Check filter conditions:

    • Review WHERE clauses in transformations

    • Temporarily remove filters to test

  3. Validate volume paths:

  4. Check for schema conflicts (most common cause):

    • Look for INCOMPATIBLE_VIEW_SCHEMA_CHANGE errors

    • Table schema doesn't match transformation output
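When a table comes back empty, a quick way to find the offending filter (step 2 above) is to count survivors after each predicate. A sketch in plain Python; on Databricks you would do the equivalent with successive DataFrame `.filter().count()` calls:

```python
def filter_audit(rows, predicates):
    """Apply predicates one at a time and report the surviving row count
    after each, so the filter that drops everything is easy to spot."""
    report = []
    for name, pred in predicates:
        rows = [r for r in rows if pred(r)]
        report.append((name, len(rows)))
    return report

rows = [{"status": "active", "score": 0.2}, {"status": "active", "score": 0.9}]
report = filter_audit(rows, [
    ("status = 'active'", lambda r: r["status"] == "active"),
    ("score > 0.95",      lambda r: r["score"] > 0.95),  # too strict: drops all rows
])
print(report)  # [("status = 'active'", 2), ("score > 0.95", 0)]
```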

Solution:


ai_query Returns NULL

Symptoms:

  • ai_query() calls return NULL for all rows

  • No errors in DLT logs

  • Model endpoint exists and is ready

Debug Steps:

  1. Check model endpoint status:

  2. Test endpoint directly:

  3. Check model signature:

  4. Verify input/output format matches signature:

    • Model expects string but ai_query passes struct?

    • Column name mismatch?
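A frequent cause of silent NULLs is input that does not match the model signature. A sketch of a pre-flight check; the signature dict shape loosely mirrors what MLflow reports for a model's inputs, but treat it as an assumption:

```python
def check_inputs(record, signature_inputs):
    """Compare a candidate request record against the model's declared
    input columns; returns (missing, unexpected) column names."""
    expected = {col["name"] for col in signature_inputs}
    actual = set(record)
    return sorted(expected - actual), sorted(actual - expected)

signature_inputs = [{"name": "text", "type": "string"}]
missing, unexpected = check_inputs({"input": "great product!"}, signature_inputs)
print("missing:", missing, "unexpected:", unexpected)
# A non-empty 'missing' list is exactly the column-name mismatch that
# makes ai_query return NULL instead of raising an error.
```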

Solution: See Model Deployment Guide for comprehensive fix.


Using Databricks Notebooks for Pipeline Debugging

Interactive Debugging Workflow:

  1. Create debugging notebook:

Note: all Databricks notebooks saved as .py files must begin with # Databricks notebook source for the UI to render the cells correctly. This pattern matters because DLT pipelines do not support running Python notebooks (.ipynb), only .py or .sql files.

  2. Run transformations step-by-step:

    • Test each SELECT clause independently

    • Validate data at each stage

    • Identify where data is lost or corrupted

  3. Test ai_query interactively:
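A literal one-row probe isolates ai_query from the pipeline's own data. A sketch (the endpoint name is a placeholder, and the spark call only runs inside a Databricks notebook, so it is commented out):

```python
# Probe ai_query with a literal value so any NULL cannot be blamed on
# upstream data. The endpoint name is a placeholder (assumption).
probe_sql = """
SELECT ai_query(
  'sentiment-endpoint',     -- serving endpoint name
  'I love this product'     -- literal input, known-good string
) AS prediction
"""

# In a Databricks notebook:
# result = spark.sql(probe_sql).collect()[0]["prediction"]
# print(result)  # NULL here points at the endpoint or signature, not your data
print(probe_sql.strip().startswith("SELECT ai_query"))
```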

Debugging Model Training Jobs

Accessing Job Logs

Via Databricks UI:

  1. Navigate to Workflows → Jobs

  2. Select your job (e.g., taylorlaing_register_sentiment_analysis)

  3. Click on a job run

  4. View Output and Logs tabs

Via CLI:
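As with pipelines, the original CLI commands were lost in export; a sketch with the Python SDK (the commented SDK calls are assumptions), plus a testable helper that picks the most recent failed run out of a run list:

```python
def latest_failed_run(runs):
    """Given run dicts with 'start_time' (epoch ms) and 'result_state',
    return the newest failed run, or None if all succeeded."""
    failed = [r for r in runs if r.get("result_state") == "FAILED"]
    return max(failed, key=lambda r: r["start_time"], default=None)

# from databricks.sdk import WorkspaceClient   # assumption - adapt to your workspace
# w = WorkspaceClient()
# runs = [{"start_time": r.start_time, "result_state": r.state.result_state.value}
#         for r in w.jobs.list_runs(job_id=123, completed_only=True)]

runs = [
    {"start_time": 1000, "result_state": "SUCCESS"},
    {"start_time": 2000, "result_state": "FAILED"},
    {"start_time": 1500, "result_state": "FAILED"},
]
print(latest_failed_run(runs))  # the run with start_time 2000
```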

Common Model Training Issues

Model Registration Fails

Symptoms:

  • Job fails during mlflow.pyfunc.log_model()

  • Error: "Model signature invalid"

  • Error: "Artifact not found"

Debug Steps:

  1. Check job logs for exact error message

  2. Validate model signature:

  3. Check artifact paths:

  4. Test model loads locally:
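Step 4 can be done without touching the registry at all: instantiate the pyfunc wrapper class directly and run a prediction. A sketch with a stand-in wrapper class (the class name and the commented MLflow call are assumptions):

```python
# Local smoke test: exercise the pyfunc wrapper without MLflow so that
# registration failures can be separated from bugs in the model itself.
# SentimentModel is a stand-in for your real wrapper class.

class SentimentModel:
    def load_context(self, context):
        self.positive_words = {"love", "great", "good"}

    def predict(self, context, model_input):
        return ["positive" if any(w in text.lower() for w in self.positive_words)
                else "negative" for text in model_input]

model = SentimentModel()
model.load_context(context=None)      # simulate what serving does at startup
preds = model.predict(None, ["I love it", "terrible"])
print(preds)  # ['positive', 'negative']

# With the real artifact (assumption - adjust the model URI):
# import mlflow
# m = mlflow.pyfunc.load_model("models:/<catalog>.<schema>.<model>@champion")
# print(m.predict(["I love it"]))
```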

Solution: See Model Deployment Guide


Model Serving Endpoint Creation Fails

Symptoms:

  • Endpoint creation times out

  • Error: "Endpoint already exists"

  • Endpoint stuck in "NOT_READY" state

Debug Steps:

  1. Check if endpoint exists:

  2. Delete stale endpoint if needed:

  3. Check endpoint configuration:

  4. Wait longer: endpoints can take 10-30 minutes to provision
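Because provisioning is slow, it helps to distinguish "still provisioning" from "stuck" by looking at both state fields. A sketch (the field names follow the serving-endpoints API, but treat them as assumptions):

```python
def endpoint_status(state):
    """Classify a serving-endpoint state dict as ready/provisioning/failed."""
    if state.get("ready") == "READY":
        return "ready"
    if state.get("config_update") == "IN_PROGRESS":
        return "provisioning"   # normal for 10-30 minutes after creation
    return "failed"

# With the Databricks SDK (assumption - adapt to your workspace):
# from databricks.sdk import WorkspaceClient
# ep = WorkspaceClient().serving_endpoints.get(name="<endpoint-name>")
# print(endpoint_status({"ready": ep.state.ready.value,
#                        "config_update": ep.state.config_update.value}))

print(endpoint_status({"ready": "NOT_READY", "config_update": "IN_PROGRESS"}))
```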

Solution:


Interactive Model Debugging

Debug Model in Notebook:

  1. Load model from registry:

  2. Test with various inputs:

  3. Inspect model artifacts:

  4. Debug load_context:
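Steps 2 and 4 combine naturally into an input sweep that records which inputs break the model instead of stopping at the first failure. A sketch (the commented load call is an assumption; any object with a predict method works):

```python
def sweep(predict, inputs):
    """Run predict on each input individually and collect failures,
    so one bad row doesn't hide the rest."""
    failures = []
    for x in inputs:
        try:
            predict([x])
        except Exception as exc:   # broad on purpose: we want every failure
            failures.append((x, repr(exc)))
    return failures

# model = mlflow.pyfunc.load_model("models:/<model>@champion")  # assumption
predict = lambda batch: [s.upper() for s in batch]   # stand-in model
bad = sweep(predict, ["fine", "", None, "also fine"])
print(bad)   # only the None input fails
```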

Reading Databricks Logs

Log Locations

| Log Type | Location | Access Method |
| --- | --- | --- |
| DLT Pipeline Logs | Pipeline Event Log | Databricks UI or CLI |
| Job Logs | Job Run Output | Databricks UI or CLI |
| Model Serving Logs | Endpoint Logs | Databricks UI |
| Spark Driver Logs | Job Run → Logs | Databricks UI |
| Executor Logs | Spark UI → Executors | Databricks UI |

Understanding DLT Event Logs

Event Types:

  • FLOW_PROGRESS: Pipeline execution progress

  • EXPECTATION: Data quality check results

  • ERROR: Pipeline errors

  • UPDATE_PROGRESS: Update status changes

Common Error Patterns:

Filtering Logs:

  • Filter by Level: ERROR, WARN, INFO

  • Search for keywords: "INCOMPATIBLE", "FAILED", "Permission"

  • Filter by Event Type: ERROR, EXPECTATION

Understanding Model Serving Logs

Access Logs:

  1. Databricks UI → Machine Learning → Model Serving

  2. Select endpoint → Logs tab

Log Fields:

  • timestamp: Request timestamp

  • request_id: Unique request identifier

  • status_code: HTTP status (200, 400, 500)

  • execution_time_ms: Prediction latency

  • error_message: Error details (if failed)

Query Logs:
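The log fields above are enough to compute an error rate and latency percentiles once the records are exported. A sketch over a list of log-record dicts (field names as listed above; the p95 computation is a crude nearest-rank approximation):

```python
import statistics

def serving_stats(records):
    """Summarize serving logs: error rate plus p50/p95 latency in ms."""
    latencies = sorted(r["execution_time_ms"] for r in records)
    errors = sum(1 for r in records if r["status_code"] >= 400)
    return {
        "error_rate": errors / len(records),
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))],
    }

records = [
    {"status_code": 200, "execution_time_ms": 40},
    {"status_code": 200, "execution_time_ms": 55},
    {"status_code": 500, "execution_time_ms": 900},
    {"status_code": 200, "execution_time_ms": 48},
]
print(serving_stats(records))
```

A spiking p95 with a flat p50 usually points at cold starts or an occasional oversized request rather than a uniformly slow model.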

Performance Debugging

Debugging Slow DLT Pipelines

Check Spark UI:

  1. Open pipeline run

  2. Click Spark UI button

  3. Navigate to SQL / DataFrame tab

  4. Identify slow stages

Common Performance Issues:

| Issue | Symptom | Solution |
| --- | --- | --- |
| Data skew | Few tasks take much longer | Repartition by key |
| Small files | Many small input files | Optimize with compaction |
| Shuffle spill | Executors spilling to disk | Increase executor memory |
| Wide transformations | Many shuffle operations | Reduce joins/groupBy |
| Large broadcast | Driver OOM | Reduce broadcast join threshold |

Optimization Techniques:
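Data skew, the first row in the table above, can be confirmed numerically from Spark UI task durations before reaching for repartitioning. A sketch of the check, with the usual remedies as comments (the config names are standard Spark/Delta knobs, but tune the values for your data):

```python
import statistics

def is_skewed(task_durations_s, ratio=5.0):
    """Flag a stage as skewed when the slowest task takes 'ratio' times
    longer than the median task - the classic Spark UI signature."""
    med = statistics.median(task_durations_s)
    return med > 0 and max(task_durations_s) / med >= ratio

print(is_skewed([3, 4, 4, 5, 41]))   # one straggler task

# Typical remedies in a notebook/pipeline (sketch):
# df = df.repartition("customer_id")                     # spread the hot key
# spark.conf.set("spark.sql.shuffle.partitions", "200")  # right-size shuffles
# spark.sql("OPTIMIZE my_catalog.my_schema.my_table")    # compact small files
```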

Debugging Slow Model Inference

Check Endpoint Metrics:

Common Causes:

  • Cold start: Endpoint scaling up from zero

  • Model size: Large models load slowly

  • Batch size: Single predictions vs batching

  • Endpoint size: "Small" vs "Medium" workload

Solutions:
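Batching, from the "Common Causes" list above, is usually the cheapest win: send N rows per request instead of one. A sketch of client-side batching (the request function is injected; in practice it would POST to the endpoint's invocations URL):

```python
def batched_predict(rows, call_endpoint, batch_size=32):
    """Group rows into batches and concatenate predictions, cutting
    per-request overhead versus one row per call."""
    preds = []
    for i in range(0, len(rows), batch_size):
        preds.extend(call_endpoint(rows[i:i + batch_size]))
    return preds

calls = []
def fake_endpoint(batch):
    calls.append(len(batch))         # record request sizes for the demo
    return ["ok"] * len(batch)

out = batched_predict(list(range(70)), fake_endpoint, batch_size=32)
print(len(out), calls)   # 70 predictions in 3 requests: [32, 32, 6]
```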

Common Error Patterns

Schema Conflicts

Error:

Solution:

Permission Denied

Error:

Solution:

Model Schema Parse Error

Error:

Solution: Follow two-stage pipeline pattern in Model Deployment Guide

Debugging Checklist

Before asking for help, have you:
