Delta Live Tables (DLT) Pipeline Development Guide

Overview

Delta Live Tables (DLT) is Databricks' declarative framework for building reliable, maintainable, and testable data pipelines. This guide covers everything you need to develop DLT pipelines in the ML Pipelines repository.

DLT Benefits

Why Use DLT?

  1. Automatic Dependency Management: DLT infers table dependencies from queries

  2. Built-in Data Quality: Expectations validate data at ingestion

  3. Automatic Schema Evolution: Handles schema changes gracefully

  4. Streaming & Batch: Unified API for both modes

  5. Error Handling: Quarantine bad data instead of failing pipelines

  6. Observability: Rich metrics and lineage tracking

  7. Cost Optimization: Serverless compute and auto-scaling
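The declarative style behind these benefits can be sketched in a few lines. This is an illustrative fragment, not an actual pipeline table — table and column names are placeholders, and it only runs inside a DLT pipeline where the dlt module and spark session are provided:

```python
import dlt
from pyspark.sql import functions as F

# DLT infers that this table depends on raw_events and schedules it
# accordingly; the expectation drops rows with a NULL event_id.
@dlt.table(comment="Events with a parsed timestamp")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")
def clean_events():
    return (
        dlt.read("raw_events")
        .withColumn("event_ts", F.to_timestamp("event_ts_raw"))
    )
```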

When to Use DLT

  • Data ingestion from external sources (S3, APIs, databases)

  • Medallion architecture (bronze/silver/gold transformations)

  • ML feature engineering with data quality requirements

  • Streaming analytics with real-time transformations

  • AI model inference with ai_query integration

Pipeline Structure (Bronze/Silver/Gold)

Medallion Architecture

For complete medallion architecture details, data lineage, and layer responsibilities, see Data Flow Architecture.

Quick Summary:

  • Bronze: Raw data ingestion from S3 volumes (minimal transformation)

  • Silver: Cleaned, validated data with AI model inference (ai_query)

  • Gold: Pre-computed aggregations and analytics-ready tables

Directory Structure

Schema Definitions Using table_schemas Package

Bronze Schema Example

Bronze schemas are defined in src/table_schemas/bronze_schemas.py:
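A schema definition in that module might look like the following sketch. The constant and field names here are illustrative, not the actual contents of bronze_schemas.py:

```python
from pyspark.sql.types import (
    StringType,
    StructField,
    StructType,
    TimestampType,
)

# Illustrative bronze schema: keep raw payloads as strings and
# record ingestion time; real field names live in bronze_schemas.py.
BRONZE_EVENTS_SCHEMA = StructType([
    StructField("event_id", StringType(), nullable=False),
    StructField("payload", StringType(), nullable=True),  # raw JSON, unparsed
    StructField("_ingested_at", TimestampType(), nullable=True),
])
```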

Using Schemas in DLT

Python DLT Notebook (run_bronze_data_ingestion.py):
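A bronze ingestion table can then import the shared schema instead of redeclaring it. The import path, constant name, and volume path below are placeholders; the block assumes the notebook-provided spark session and only runs inside a DLT pipeline:

```python
import dlt
# Hypothetical constant name; use whatever bronze_schemas.py exports.
from table_schemas.bronze_schemas import BRONZE_EVENTS_SCHEMA

@dlt.table(comment="Raw events ingested from the S3 volume")
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")        # Auto Loader
        .option("cloudFiles.format", "json")
        .schema(BRONZE_EVENTS_SCHEMA)                # centralized schema
        .load("/Volumes/<catalog>/<schema>/raw_events/")
    )
```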

Benefits of Centralized Schemas

  1. Single source of truth: Schema changes in one place

  2. Reusability: Share schemas across notebooks and tests

  3. Type safety: IDE autocomplete and validation

  4. Documentation: Inline comments in schema definitions

  5. Testing: Use schemas in unit tests

Data Quality with Expectations

Expectation Types

DLT provides three expectation decorators: @dlt.expect (record violations but keep the rows), @dlt.expect_or_drop (drop violating rows), and @dlt.expect_or_fail (fail the update on any violation). Each also has an _all variant that takes a dictionary of named rules.
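The three modes apply the same boolean predicate and differ only in what happens to failing rows. The toy evaluator below (plain Python, no DLT) mimics that behavior on a list of dict rows, purely to make the semantics concrete:

```python
def apply_expectation(rows, predicate, mode):
    """Mimic DLT expectation modes on a list of dict rows.

    mode: 'expect'          -> keep failing rows, count violations
          'expect_or_drop'  -> remove failing rows, count violations
          'expect_or_fail'  -> raise on the first violation
    Returns (rows, violation_count).
    """
    failed = [r for r in rows if not predicate(r)]
    if mode == "expect":
        return rows, len(failed)
    if mode == "expect_or_drop":
        return [r for r in rows if predicate(r)], len(failed)
    if mode == "expect_or_fail":
        if failed:
            raise ValueError(f"{len(failed)} row(s) violated expectation")
        return rows, 0
    raise ValueError(f"unknown mode: {mode}")

rows = [{"id": 1}, {"id": None}]
kept, violations = apply_expectation(
    rows, lambda r: r["id"] is not None, "expect_or_drop"
)
# kept == [{"id": 1}], violations == 1
```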

Common Expectations

Bronze Layer (Lenient):

Silver Layer (Strict):

Complex Expectations:
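The layering strategy above can be sketched as follows. Table, column, and rule names are illustrative; the block only runs inside a DLT pipeline:

```python
import dlt

# Bronze: lenient — record violations without dropping anything.
@dlt.table
@dlt.expect_all({"has_event_id": "event_id IS NOT NULL"})
def bronze_events_checked():
    return dlt.read("bronze_events")

# Silver: strict — drop rows that fail any rule, including a
# complex multi-column predicate.
@dlt.table
@dlt.expect_all_or_drop({
    "valid_score": "score BETWEEN 0 AND 1",
    "ts_ordering": "updated_at IS NULL OR updated_at >= created_at",
})
def silver_events():
    return dlt.read("bronze_events_checked")
```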

Monitoring Expectations
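Expectation pass/fail counts are recorded in the pipeline event log. A hedged sketch of querying it from a notebook — the event_log() table-valued function exists in Databricks SQL, but the exact shape of the details column varies by runtime, so verify against the event log schema in your workspace (table name is a placeholder):

```python
metrics = spark.sql("""
    SELECT timestamp, details
    FROM event_log(TABLE(my_catalog.my_schema.silver_events))
    WHERE event_type = 'flow_progress'
""")
metrics.show(truncate=False)
```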

ai_query Integration Patterns

The ai_query Challenge

The ai_query function has strict schema requirements. See Model Deployment Guide for detailed learnings.

Key Findings:

  1. Models must return 100% stable schemas (no dynamic keys)

  2. All primitives should be stringified in model output

  3. Use two-stage pipeline architecture for reliability

Two-Stage AI Processing Pattern

Stage 1: Raw AI Processing (01_sentiment_features_raw.sql):

Stage 2: Field Extraction (02_sentiment_features.sql):
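The actual stages are SQL files; the plain-Python sketch below mimics the idea with hypothetical field names. Stage 1 persists the raw model response as an uninterpreted string, and Stage 2 parses and casts it, so a malformed response becomes NULL fields instead of a failed pipeline run:

```python
import json

def stage1_raw(response: str) -> dict:
    # Stage 1: store the response verbatim; no parsing, no casting.
    return {"ai_response_raw": response}

def stage2_extract(row: dict) -> dict:
    # Stage 2: parse and cast; a bad response yields NULL-like
    # fields rather than an exception that blocks the pipeline.
    try:
        parsed = json.loads(row["ai_response_raw"])
        return {
            "sentiment": parsed["sentiment"],
            # the model stringifies primitives, so cast here
            "confidence": float(parsed["confidence"]),
        }
    except (ValueError, KeyError, TypeError):
        return {"sentiment": None, "confidence": None}

good = stage2_extract(stage1_raw('{"sentiment": "positive", "confidence": "0.93"}'))
bad = stage2_extract(stage1_raw("not json"))
```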

Benefits of Two-Stage Pattern

  1. Error Isolation: AI failures don't block field extraction

  2. Performance Visibility: See AI processing separately from casting

  3. Debugging: Inspect raw results in Stage 1 table

  4. Schema Evolution: Change field extraction without re-running AI

  5. Cost Efficiency: Reprocess Stage 2 without expensive AI calls

AI Query Configuration
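A hedged configuration sketch, shown via spark.sql from a notebook. The returnType and failOnError named arguments exist in Databricks SQL's ai_query, but behavior (notably the return shape when failOnError is false, where errors are surfaced in the result instead of failing the query) varies by runtime version — check your workspace's reference. Endpoint, catalog, and column names are placeholders:

```python
df = spark.sql("""
    SELECT
        review_id,
        ai_query(
            'sentiment-endpoint',          -- serving endpoint name
            review_text,
            returnType => 'STRING',        -- stable, pre-declared schema
            failOnError => false           -- surface errors per-row
        ) AS ai_response_raw
    FROM my_catalog.my_schema.bronze_reviews
""")
```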

Handling AI Query Errors

Streaming vs Batch Pipelines

Streaming Pipelines

When to Use:

  • Real-time data processing

  • Incremental updates

  • Event-driven architectures

Example:

Python Streaming:
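An illustrative streaming table — names are placeholders and the block only runs inside a DLT pipeline:

```python
import dlt

@dlt.table(comment="Incrementally processed events")
def silver_events_stream():
    # dlt.read_stream makes this incremental: each pipeline update
    # processes only new rows from bronze_events.
    return dlt.read_stream("bronze_events").where("event_id IS NOT NULL")
```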

Batch Pipelines

When to Use:

  • Full table refreshes

  • Historical backfills

  • Aggregations over complete datasets

Example:
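An illustrative batch (full-recompute) table, with placeholder names, runnable only inside a DLT pipeline:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Daily aggregates recomputed in full")
def gold_daily_counts():
    # dlt.read (no _stream) recomputes over the complete input on
    # every update — appropriate for aggregations over full datasets.
    return (
        dlt.read("silver_events")
        .groupBy(F.to_date("event_ts").alias("event_date"))
        .count()
    )
```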

Hybrid Approach

Catalog and Schema References

Dynamic Catalog Resolution

All pipelines must support multiple catalogs (sandbox/dev/staging/prod):
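The resolution logic reduces to building a three-part Unity Catalog name from a configured catalog. In a notebook the catalog would come from pipeline configuration (e.g. spark.conf.get("catalog", "sandbox")); the helper below just shows the name-building step, with placeholder schema/table names:

```python
def fully_qualified(catalog: str, schema: str, table: str) -> str:
    """Build a three-part Unity Catalog table name."""
    return f"{catalog}.{schema}.{table}"

# Same pipeline code resolves to a different catalog per environment.
for env in ("sandbox", "dev", "staging", "prod"):
    name = fully_qualified(env, "sentiment", "silver_events")
    # e.g. "dev.sentiment.silver_events" when env == "dev"
```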

SQL Catalog References

Cross-Catalog Reads

Error Handling

Rescue Columns (Bronze Layer)

Check rescued data:
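Auto Loader places unparseable or mismatched fields in the _rescued_data column; non-null values signal schema drift worth investigating. A notebook-style check (table name is a placeholder; assumes the notebook-provided spark session):

```python
rescued = spark.sql("""
    SELECT _rescued_data, COUNT(*) AS n
    FROM my_catalog.my_schema.bronze_events
    WHERE _rescued_data IS NOT NULL
    GROUP BY _rescued_data
    ORDER BY n DESC
""")
rescued.show(truncate=False)
```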

Quarantine Tables
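A common quarantine pattern is a pair of tables with complementary predicates: the clean table drops failing rows, while a sibling table keeps them for inspection. An illustrative sketch with placeholder names, runnable only inside a DLT pipeline:

```python
import dlt

RULE = "score BETWEEN 0 AND 1"

@dlt.table
@dlt.expect_all_or_drop({"valid_score": RULE})
def silver_events_clean():
    return dlt.read("bronze_events")

@dlt.table(comment="Rows failing quality rules, kept for inspection")
def silver_events_quarantine():
    # Invert the rule so bad rows are preserved instead of lost.
    return dlt.read("bronze_events").where(f"NOT ({RULE})")
```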

Testing DLT Pipelines

Unit Testing Schema Definitions

File: tests/test_schemas.py
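A hypothetical test in that file might import a shared schema and assert on its shape; the constant and field names below are placeholders, not the actual exports of bronze_schemas.py:

```python
from pyspark.sql.types import StructType

# Hypothetical import; use the names bronze_schemas.py actually exports.
from table_schemas.bronze_schemas import BRONZE_EVENTS_SCHEMA

def test_bronze_events_schema_has_required_fields():
    assert isinstance(BRONZE_EVENTS_SCHEMA, StructType)
    names = {f.name for f in BRONZE_EVENTS_SCHEMA.fields}
    assert {"event_id", "_ingested_at"} <= names
```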

Integration Testing Pipelines

Local Testing (before deployment):

Testing in Sandbox

Common Pitfalls and Solutions

1. Schema Evolution Failures

Problem: New fields added but DLT doesn't pick them up

Solution: Use CREATE OR REPLACE to reset schema:

Or enable schema evolution:
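For bronze tables fed by Auto Loader, schema evolution can be enabled at the reader. The option below is a standard Auto Loader setting; paths and format are placeholders:

```python
# addNewColumns: new fields appearing in the source are added to the
# table schema instead of landing in _rescued_data.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("/Volumes/<catalog>/<schema>/raw_events/")
)
```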

2. Catalog Variable Not Propagating

Problem: Pipeline references wrong catalog

Solution: Ensure configuration block passes variables:

3. Streaming State Conflicts

Problem: Pipeline fails with "State store not found"

Solution: Trigger a full refresh of the pipeline, which resets streaming state and checkpoints.

Or delete and recreate the pipeline.

4. ai_query Schema Errors

Problem: AI_FUNCTION_MODEL_SCHEMA_PARSE_ERROR

Solution: Follow two-stage pattern and ensure model returns stable schema:

  • All model outputs must be strings (no raw floats/ints)

  • No dynamic dictionary keys

  • Use golden example for signature inference

See Model Deployment Guide for full details.

5. Memory Errors with Large Batches

Problem: Executor OOM errors during processing

Solution: Reduce batch size:
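For Auto Loader sources, batch size can be capped at the reader. Both options below are standard Auto Loader settings; the values and path are placeholders to tune for your workload:

```python
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Limit how much data each micro-batch pulls in:
    .option("cloudFiles.maxFilesPerTrigger", 100)
    .option("cloudFiles.maxBytesPerTrigger", "1g")
    .load("/Volumes/<catalog>/<schema>/raw_events/")
)
```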

6. Slow Performance

Problem: Pipeline takes hours to process

Solutions: enable Photon, run on serverless compute, switch full refreshes to incremental streaming where possible, and cluster or partition large tables on common filter columns.

Best Practices

1. Pipeline Organization

  • One pipeline per layer: Separate bronze, silver, gold pipelines

  • Logical grouping: Group related transformations (e.g., sentiment_analysis)

  • File naming: Use numbered prefixes for execution order (01_raw.sql, 02_features.sql)

2. Naming Conventions

3. Comments and Documentation

4. Performance Optimization

  • Use serverless for most pipelines (faster startup, auto-scaling)

  • Enable Photon for complex SQL-heavy workloads

  • Use streaming for incremental processing

  • Batch only for full refreshes

5. Data Quality Strategy

Pipeline Configuration Reference

Full pipeline YAML example (sentiment_analysis.pipeline.yml):
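A hedged sketch of what such a file might contain, in Databricks Asset Bundles style — the exact keys depend on your bundle setup and DLT version, and all names, paths, and variables here are placeholders:

```yaml
# Illustrative pipeline resource; verify keys against your bundle schema.
resources:
  pipelines:
    sentiment_analysis:
      name: sentiment-analysis-${bundle.target}
      serverless: true
      catalog: ${var.catalog}       # sandbox/dev/staging/prod
      target: sentiment             # destination schema
      configuration:
        catalog: ${var.catalog}     # exposed to notebooks via spark.conf
      libraries:
        - notebook:
            path: ../notebooks/run_bronze_data_ingestion.py
```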

External Resources
