Delta Live Tables (DLT) Pipeline Development Guide

Overview

Delta Live Tables (DLT) is Databricks' declarative framework for building reliable, maintainable, and testable data pipelines. This guide covers everything you need to develop DLT pipelines in the ML Pipelines repository.

DLT Benefits

Why Use DLT?

  1. Automatic Dependency Management: DLT infers table dependencies from queries

  2. Built-in Data Quality: Expectations validate data at ingestion

  3. Automatic Schema Evolution: Handles schema changes gracefully

  4. Streaming & Batch: Unified API for both modes

  5. Error Handling: Quarantine bad data instead of failing pipelines

  6. Observability: Rich metrics and lineage tracking

  7. Cost Optimization: Serverless compute and auto-scaling
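The declarative style behind these benefits can be sketched in a few lines. This is an illustrative fragment, not an actual pipeline table — table and column names are placeholders, and it only runs inside a DLT pipeline where the dlt module and spark session are provided:

```python
import dlt
from pyspark.sql import functions as F

# DLT infers that this table depends on raw_events and schedules it
# accordingly; the expectation drops rows with a NULL event_id.
@dlt.table(comment="Events with a parsed timestamp")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")
def clean_events():
    return (
        dlt.read("raw_events")
        .withColumn("event_ts", F.to_timestamp("event_ts_raw"))
    )
```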

When to Use DLT

  • Data ingestion from external sources (S3, APIs, databases)

  • Medallion architecture (bronze/silver/gold transformations)

  • ML feature engineering with data quality requirements

  • Streaming analytics with real-time transformations

  • AI model inference with ai_query integration

Pipeline Structure (Bronze/Silver/Gold)

Medallion Architecture

For complete medallion architecture details, data lineage, and layer responsibilities, see Data Flow Architecture.

Quick Summary:

  • Bronze: Raw data ingestion from S3 volumes (minimal transformation)

  • Silver: Cleaned, validated data with AI model inference (ai_query)

  • Gold: Pre-computed aggregations and analytics-ready tables

Directory Structure

Schema Definitions Using table_schemas Package

Bronze Schema Example

Bronze schemas are defined in src/table_schemas/bronze_schemas.py:
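A schema definition in that module might look like the following sketch. The constant and field names here are illustrative, not the actual contents of bronze_schemas.py:

```python
from pyspark.sql.types import (
    StringType,
    StructField,
    StructType,
    TimestampType,
)

# Illustrative bronze schema: keep raw payloads as strings and
# record ingestion time; real field names live in bronze_schemas.py.
BRONZE_EVENTS_SCHEMA = StructType([
    StructField("event_id", StringType(), nullable=False),
    StructField("payload", StringType(), nullable=True),  # raw JSON, unparsed
    StructField("_ingested_at", TimestampType(), nullable=True),
])
```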

Using Schemas in DLT

Python DLT Notebook (run_bronze_data_ingestion.py):
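A bronze ingestion table can then import the shared schema instead of redeclaring it. The import path, constant name, and volume path below are placeholders; the block assumes the notebook-provided spark session and only runs inside a DLT pipeline:

```python
import dlt
# Hypothetical constant name; use whatever bronze_schemas.py exports.
from table_schemas.bronze_schemas import BRONZE_EVENTS_SCHEMA

@dlt.table(comment="Raw events ingested from the S3 volume")
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")        # Auto Loader
        .option("cloudFiles.format", "json")
        .schema(BRONZE_EVENTS_SCHEMA)                # centralized schema
        .load("/Volumes/<catalog>/<schema>/raw_events/")
    )
```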

Benefits of Centralized Schemas

  1. Single source of truth: Schema changes in one place

  2. Reusability: Share schemas across notebooks and tests

  3. Type safety: IDE autocomplete and validation

  4. Documentation: Inline comments in schema definitions

  5. Testing: Use schemas in unit tests

Data Quality with Expectations

Expectation Types

DLT provides three expectation decorators: @dlt.expect (record violations but keep the rows), @dlt.expect_or_drop (drop violating rows), and @dlt.expect_or_fail (fail the update on any violation). Each also has an _all variant that takes a dictionary of named rules.
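The three modes apply the same boolean predicate and differ only in what happens to failing rows. The toy evaluator below (plain Python, no DLT) mimics that behavior on a list of dict rows, purely to make the semantics concrete:

```python
def apply_expectation(rows, predicate, mode):
    """Mimic DLT expectation modes on a list of dict rows.

    mode: 'expect'          -> keep failing rows, count violations
          'expect_or_drop'  -> remove failing rows, count violations
          'expect_or_fail'  -> raise on the first violation
    Returns (rows, violation_count).
    """
    failed = [r for r in rows if not predicate(r)]
    if mode == "expect":
        return rows, len(failed)
    if mode == "expect_or_drop":
        return [r for r in rows if predicate(r)], len(failed)
    if mode == "expect_or_fail":
        if failed:
            raise ValueError(f"{len(failed)} row(s) violated expectation")
        return rows, 0
    raise ValueError(f"unknown mode: {mode}")

rows = [{"id": 1}, {"id": None}]
kept, violations = apply_expectation(
    rows, lambda r: r["id"] is not None, "expect_or_drop"
)
# kept == [{"id": 1}], violations == 1
```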

Common Expectations

Bronze Layer (Lenient):

Silver Layer (Strict):

Complex Expectations:
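The layering strategy above can be sketched as follows. Table, column, and rule names are illustrative; the block only runs inside a DLT pipeline:

```python
import dlt

# Bronze: lenient — record violations without dropping anything.
@dlt.table
@dlt.expect_all({"has_event_id": "event_id IS NOT NULL"})
def bronze_events_checked():
    return dlt.read("bronze_events")

# Silver: strict — drop rows that fail any rule, including a
# complex multi-column predicate.
@dlt.table
@dlt.expect_all_or_drop({
    "valid_score": "score BETWEEN 0 AND 1",
    "ts_ordering": "updated_at IS NULL OR updated_at >= created_at",
})
def silver_events():
    return dlt.read("bronze_events_checked")
```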

Monitoring Expectations
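Expectation pass/fail counts are recorded in the pipeline event log. A hedged sketch of querying it from a notebook — the event_log() table-valued function exists in Databricks SQL, but the exact shape of the details column varies by runtime, so verify against the event log schema in your workspace (table name is a placeholder):

```python
metrics = spark.sql("""
    SELECT timestamp, details
    FROM event_log(TABLE(my_catalog.my_schema.silver_events))
    WHERE event_type = 'flow_progress'
""")
metrics.show(truncate=False)
```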

ai_query Integration Patterns

The ai_query Challenge

The ai_query function has strict schema requirements. See Model Deployment Guide for detailed learnings.

Key Findings:

  1. Models must return 100% stable schemas (no dynamic keys)

  2. All primitives should be stringified in model output

  3. Use two-stage pipeline architecture for reliability

Two-Stage AI Processing Pattern

Stage 1: Raw AI Processing (01_sentiment_features_raw.sql):

Stage 2: Field Extraction (02_sentiment_features.sql):
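The actual stages are SQL files; the plain-Python sketch below mimics the idea with hypothetical field names. Stage 1 persists the raw model response as an uninterpreted string, and Stage 2 parses and casts it, so a malformed response becomes NULL fields instead of a failed pipeline run:

```python
import json

def stage1_raw(response: str) -> dict:
    # Stage 1: store the response verbatim; no parsing, no casting.
    return {"ai_response_raw": response}

def stage2_extract(row: dict) -> dict:
    # Stage 2: parse and cast; a bad response yields NULL-like
    # fields rather than an exception that blocks the pipeline.
    try:
        parsed = json.loads(row["ai_response_raw"])
        return {
            "sentiment": parsed["sentiment"],
            # the model stringifies primitives, so cast here
            "confidence": float(parsed["confidence"]),
        }
    except (ValueError, KeyError, TypeError):
        return {"sentiment": None, "confidence": None}

good = stage2_extract(stage1_raw('{"sentiment": "positive", "confidence": "0.93"}'))
bad = stage2_extract(stage1_raw("not json"))
```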

Benefits of Two-Stage Pattern

  1. Error Isolation: AI failures don't block field extraction

  2. Performance Visibility: See AI processing separately from casting

  3. Debugging: Inspect raw results in Stage 1 table

  4. Schema Evolution: Change field extraction without re-running AI

  5. Cost Efficiency: Reprocess Stage 2 without expensive AI calls

AI Query Configuration
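A hedged configuration sketch, shown via spark.sql from a notebook. The returnType and failOnError named arguments exist in Databricks SQL's ai_query, but behavior (notably the return shape when failOnError is false, where errors are surfaced in the result instead of failing the query) varies by runtime version — check your workspace's reference. Endpoint, catalog, and column names are placeholders:

```python
df = spark.sql("""
    SELECT
        review_id,
        ai_query(
            'sentiment-endpoint',          -- serving endpoint name
            review_text,
            returnType => 'STRING',        -- stable, pre-declared schema
            failOnError => false           -- surface errors per-row
        ) AS ai_response_raw
    FROM my_catalog.my_schema.bronze_reviews
""")
```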

Handling AI Query Errors

Streaming vs Batch Pipelines

Streaming Pipelines

When to Use:

  • Real-time data processing

  • Incremental updates

  • Event-driven architectures

Example:

Python Streaming:
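An illustrative streaming table — names are placeholders and the block only runs inside a DLT pipeline:

```python
import dlt

@dlt.table(comment="Incrementally processed events")
def silver_events_stream():
    # dlt.read_stream makes this incremental: each pipeline update
    # processes only new rows from bronze_events.
    return dlt.read_stream("bronze_events").where("event_id IS NOT NULL")
```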

Batch Pipelines

When to Use:

  • Full table refreshes

  • Historical backfills

  • Aggregations over complete datasets

Example:
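An illustrative batch (full-recompute) table, with placeholder names, runnable only inside a DLT pipeline:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Daily aggregates recomputed in full")
def gold_daily_counts():
    # dlt.read (no _stream) recomputes over the complete input on
    # every update — appropriate for aggregations over full datasets.
    return (
        dlt.read("silver_events")
        .groupBy(F.to_date("event_ts").alias("event_date"))
        .count()
    )
```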

Hybrid Approach

Catalog and Schema References

Dynamic Catalog Resolution

All pipelines must support multiple catalogs (sandbox/dev/staging/prod):
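The resolution logic reduces to building a three-part Unity Catalog name from a configured catalog. In a notebook the catalog would come from pipeline configuration (e.g. spark.conf.get("catalog", "sandbox")); the helper below just shows the name-building step, with placeholder schema/table names:

```python
def fully_qualified(catalog: str, schema: str, table: str) -> str:
    """Build a three-part Unity Catalog table name."""
    return f"{catalog}.{schema}.{table}"

# Same pipeline code resolves to a different catalog per environment.
for env in ("sandbox", "dev", "staging", "prod"):
    name = fully_qualified(env, "sentiment", "silver_events")
    # e.g. "dev.sentiment.silver_events" when env == "dev"
```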

SQL Catalog References

Cross-Catalog Reads

Error Handling

Rescue Columns (Bronze Layer)

Check rescued data:
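Auto Loader places unparseable or mismatched fields in the _rescued_data column; non-null values signal schema drift worth investigating. A notebook-style check (table name is a placeholder; assumes the notebook-provided spark session):

```python
rescued = spark.sql("""
    SELECT _rescued_data, COUNT(*) AS n
    FROM my_catalog.my_schema.bronze_events
    WHERE _rescued_data IS NOT NULL
    GROUP BY _rescued_data
    ORDER BY n DESC
""")
rescued.show(truncate=False)
```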

Quarantine Tables
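A common quarantine pattern is a pair of tables with complementary predicates: the clean table drops failing rows, while a sibling table keeps them for inspection. An illustrative sketch with placeholder names, runnable only inside a DLT pipeline:

```python
import dlt

RULE = "score BETWEEN 0 AND 1"

@dlt.table
@dlt.expect_all_or_drop({"valid_score": RULE})
def silver_events_clean():
    return dlt.read("bronze_events")

@dlt.table(comment="Rows failing quality rules, kept for inspection")
def silver_events_quarantine():
    # Invert the rule so bad rows are preserved instead of lost.
    return dlt.read("bronze_events").where(f"NOT ({RULE})")
```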

Testing DLT Pipelines

Unit Testing Schema Definitions

File: tests/test_schemas.py
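A hypothetical test in that file might import a shared schema and assert on its shape; the constant and field names below are placeholders, not the actual exports of bronze_schemas.py:

```python
from pyspark.sql.types import StructType

# Hypothetical import; use the names bronze_schemas.py actually exports.
from table_schemas.bronze_schemas import BRONZE_EVENTS_SCHEMA

def test_bronze_events_schema_has_required_fields():
    assert isinstance(BRONZE_EVENTS_SCHEMA, StructType)
    names = {f.name for f in BRONZE_EVENTS_SCHEMA.fields}
    assert {"event_id", "_ingested_at"} <= names
```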

Integration Testing Pipelines

Local Testing (before deployment):

Testing in Sandbox

Common Pitfalls and Solutions

1. Schema Evolution Failures

Problem: New fields added but DLT doesn't pick them up

Solution: Use CREATE OR REPLACE to reset schema:

Or enable schema evolution:
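For bronze tables fed by Auto Loader, schema evolution can be enabled at the reader. The option below is a standard Auto Loader setting; paths and format are placeholders:

```python
# addNewColumns: new fields appearing in the source are added to the
# table schema instead of landing in _rescued_data.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("/Volumes/<catalog>/<schema>/raw_events/")
)
```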

2. Catalog Variable Not Propagating

Problem: Pipeline references wrong catalog

Solution: Ensure configuration block passes variables:

3. Streaming State Conflicts

Problem: Pipeline fails with "State store not found"

Solution: Trigger a full refresh of the pipeline, which resets streaming state and checkpoints.

Or delete and recreate the pipeline.

4. ai_query Schema Errors

Problem: AI_FUNCTION_MODEL_SCHEMA_PARSE_ERROR

Solution: Follow two-stage pattern and ensure model returns stable schema:

  • All model outputs must be strings (no raw floats/ints)

  • No dynamic dictionary keys

  • Use golden example for signature inference

See Model Deployment Guide for full details.

5. Memory Errors with Large Batches

Problem: Executor OOM errors during processing

Solution: Reduce batch size:
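For Auto Loader sources, batch size can be capped at the reader. Both options below are standard Auto Loader settings; the values and path are placeholders to tune for your workload:

```python
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Limit how much data each micro-batch pulls in:
    .option("cloudFiles.maxFilesPerTrigger", 100)
    .option("cloudFiles.maxBytesPerTrigger", "1g")
    .load("/Volumes/<catalog>/<schema>/raw_events/")
)
```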

6. Slow Performance

Problem: Pipeline takes hours to process

Solutions: enable Photon, run on serverless compute, switch full refreshes to incremental streaming where possible, and cluster or partition large tables on common filter columns.

Best Practices

1. Pipeline Organization

  • One pipeline per layer: Separate bronze, silver, gold pipelines

  • Logical grouping: Group related transformations (e.g., sentiment_analysis)

  • File naming: Use numbered prefixes for execution order (01_raw.sql, 02_features.sql)

2. Naming Conventions

3. Comments and Documentation

4. Performance Optimization

  • Use serverless for most pipelines (faster startup, auto-scaling)

  • Enable Photon for complex SQL-heavy workloads

  • Use streaming for incremental processing

  • Batch only for full refreshes

5. Data Quality Strategy

Pipeline Configuration Reference

Full pipeline YAML example (sentiment_analysis.pipeline.yml):
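A hedged sketch of what such a file might contain, in Databricks Asset Bundles style — the exact keys depend on your bundle setup and DLT version, and all names, paths, and variables here are placeholders:

```yaml
# Illustrative pipeline resource; verify keys against your bundle schema.
resources:
  pipelines:
    sentiment_analysis:
      name: sentiment-analysis-${bundle.target}
      serverless: true
      catalog: ${var.catalog}       # sandbox/dev/staging/prod
      target: sentiment             # destination schema
      configuration:
        catalog: ${var.catalog}     # exposed to notebooks via spark.conf
      libraries:
        - notebook:
            path: ../notebooks/run_bronze_data_ingestion.py
```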

External Resources
