Configuration Reference

Overview

This document provides a complete reference for all configuration files in the ML Pipelines repository, including databricks.yml, pipeline YAMLs, job YAMLs, and volume YAMLs.

databricks.yml Structure

Location: databricks.yml (repository root)

Top-Level Fields

bundle:
  name: ml_pipelines                    # Bundle identifier
  uuid: d5974c4e-d5be-4c54-9c49-d2d65bf5f637  # Unique bundle ID

variables:
  # Global variables (see Variables Explained section)

artifacts:
  # Python wheel building configuration

include:
  # Resource files to include in bundle

targets:
  # Environment-specific configurations

Complete Example

bundle:
  name: ml_pipelines
  uuid: d5974c4e-d5be-4c54-9c49-d2d65bf5f637

variables:
  catalog_name:
    description: "Unity Catalog name for current deployment"
    default: "default_sandbox"
  environment:
    description: "Environment (sandbox/dev/staging/prod)"
  s3_bucket:
    description: "S3 bucket for external volumes"
  resource_prefix:
    description: "Prefix for resource names (pipelines, jobs, etc.) for isolation"
    default: "default"

artifacts:
  python_artifact:
    type: whl
    build: uv build --wheel

include:
  - resources/pipelines/bronze/*/*.yml
  - resources/pipelines/silver/*/*.yml
  - resources/pipelines/gold/*/*.yml
  - resources/jobs/model_registration/external/*/*.yml
  - resources/jobs/model_registration/internal/*/*.yml
  - resources/volumes/external/*.yml

targets:
  sandbox:
    # Sandbox configuration (see Targets section)
  dev:
    # Dev configuration
  staging:
    # Staging configuration
  prod:
    # Production configuration

Variables Explained

catalog_name

Type: String
Required: Yes
Description: Unity Catalog name for the current deployment

Values by Environment:

sandbox:  "${workspace.current_user.short_name}_sandbox"  # e.g., taylor_sandbox
dev:      "dev"
staging:  "staging"
prod:     "prod"

Usage in Resources:

# In pipeline YAML
catalog: ${var.catalog_name}

# In volume YAML
catalog_name: ${var.catalog_name}

# In SQL
SELECT * FROM ${CATALOG}.bronze.messages

environment

Type: String
Required: Yes
Description: Current environment name for tagging and identification

Values:

sandbox - Local developer testing
dev - Shared integration
staging - Pre-production
prod - Production

Usage:

# In pipeline YAML
name: "pipeline_name_${var.environment}"  # e.g., pipeline_name_dev

# In configuration
"ENVIRONMENT": "${var.environment}"

# In tags
tags:
  environment: ${var.environment}

s3_bucket

Type: String
Required: Yes
Description: S3 bucket name for external volumes

Values by Environment:

sandbox:  "ref-ml-core-dev-workspace-bucket"      # Shared with dev
dev:      "ref-ml-core-dev-workspace-bucket"
staging:  "ref-ml-core-staging-workspace-bucket"
prod:     "ref-ml-core-prod-workspace-bucket"

Usage in Volumes:

storage_location: s3://${var.s3_bucket}/${var.catalog_name}/volumes/bronze/messages/
# Resolves to: s3://ref-ml-core-dev-workspace-bucket/taylor_sandbox/volumes/bronze/messages/

resource_prefix

Type: String
Required: Yes
Description: Prefix for resource names to ensure isolation

Values by Environment:

sandbox:  "${workspace.current_user.short_name}"  # e.g., taylor
dev:      "dev"
staging:  "staging"
prod:     "prod"

Usage:

# Pipeline names
name: "${var.resource_prefix}_bronze_data_ingestion"
# Results in:
#   Sandbox: taylor_bronze_data_ingestion
#   Dev: dev_bronze_data_ingestion
#   Prod: prod_bronze_data_ingestion
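The variable references in this section behave like plain string substitution, which can be sketched in a few lines of Python. This is only an illustration, not how the Databricks CLI is implemented: the `resolve` helper and the `TARGET_VARIABLES` table are hypothetical names, the values are copied from the lists above, and `taylor` stands in for `${workspace.current_user.short_name}`.

```python
import re

# Per-target values copied from the "Values by Environment" lists above.
# "taylor" is a stand-in for ${workspace.current_user.short_name}.
TARGET_VARIABLES = {
    "sandbox": {
        "catalog_name": "taylor_sandbox",
        "environment": "sandbox",
        "s3_bucket": "ref-ml-core-dev-workspace-bucket",
        "resource_prefix": "taylor",
    },
    "dev": {
        "catalog_name": "dev",
        "environment": "dev",
        "s3_bucket": "ref-ml-core-dev-workspace-bucket",
        "resource_prefix": "dev",
    },
    "prod": {
        "catalog_name": "prod",
        "environment": "prod",
        "s3_bucket": "ref-ml-core-prod-workspace-bucket",
        "resource_prefix": "prod",
    },
}

# Matches references of the form ${var.NAME}.
_VAR_REF = re.compile(r"\$\{var\.([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve(template: str, target: str) -> str:
    """Replace every ${var.NAME} in template with the value for target."""
    variables = TARGET_VARIABLES[target]

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"variable '{name}' is not defined for target '{target}'")
        return variables[name]

    return _VAR_REF.sub(substitute, template)
```

Calling `resolve("${var.resource_prefix}_bronze_data_ingestion", "dev")` yields `dev_bronze_data_ingestion`, matching the resolved names listed above.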

Targets (Environments)

Sandbox Target

Purpose: Local developer testing with personal catalog

sandbox:
  mode: production                      # Production mode (stable naming)
  default: true                         # Default when no target specified
  workspace:
    profile: ref-dev                    # Use local ~/.databrickscfg profile
    root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}
  variables:
    catalog_name: ${workspace.current_user.short_name}_sandbox
    resource_prefix: ${workspace.current_user.short_name}
    environment: sandbox
    s3_bucket: ref-ml-core-dev-workspace-bucket
  presets:
    trigger_pause_status: PAUSED        # Don't run on schedule
    jobs_max_concurrent_runs: 10
    artifacts_dynamic_version: true
    tags:
      environment: sandbox
      owner: ${workspace.current_user.short_name}
      managed_by: developer

Key Features:

Uses ${workspace.current_user.short_name} for dynamic catalog naming
Deployed to dev workspace using developer credentials
Resources prefixed with username to avoid conflicts

Dev Target

Purpose: Shared integration testing with CI/CD

dev:
  mode: production
  workspace:
    host: https://dbc-a72d6af9-df3d.cloud.databricks.com
    root_path: /Shared/.bundle/${bundle.name}/${bundle.target}
  run_as:
    service_principal_name: "03ff99cd-a352-40bb-9d33-414c9ad9e7aa"
  variables:
    catalog_name: dev
    environment: dev
    s3_bucket: ref-ml-core-dev-workspace-bucket
    resource_prefix: dev
  presets:
    artifacts_dynamic_version: true
    tags:
      environment: dev
      managed_by: cicd

Key Features:

Runs as service principal (not user)
Deployed to /Shared path
Used by GitHub Actions

Staging Target

Purpose: Pre-production validation

staging:
  mode: production
  workspace:
    host: https://dbc-fab2a42a-8d11.cloud.databricks.com
    root_path: /Shared/.bundle/${bundle.name}/${bundle.target}
  run_as:
    service_principal_name: "93bda7cf-b009-49d8-8e8d-046677c8597e"
  git:
    branch: main
  variables:
    catalog_name: staging
    environment: staging
    s3_bucket: ref-ml-core-staging-workspace-bucket
    resource_prefix: staging
  presets:
    trigger_pause_status: PAUSED        # Manual trigger only
    jobs_max_concurrent_runs: 10
    artifacts_dynamic_version: true
    tags:
      environment: staging
      managed_by: cicd

Key Features:

Separate workspace from dev
Paused schedules (manual execution)
Git branch pinned to main

Prod Target

Purpose: Production workloads

prod:
  mode: production
  workspace:
    host: https://dbc-028d9e53-7ce6.cloud.databricks.com
    root_path: /Shared/.bundle/${bundle.name}/${bundle.target}
  run_as:
    service_principal_name: "2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f"
  git:
    branch: main
  variables:
    catalog_name: prod
    environment: prod
    s3_bucket: ref-ml-core-prod-workspace-bucket
    resource_prefix: prod
  presets:
    trigger_pause_status: UNPAUSED      # Run on schedule
    jobs_max_concurrent_runs: 5         # Limited concurrency
    artifacts_dynamic_version: true
    tags:
      environment: prod
      managed_by: cicd

Key Features:

Dedicated production workspace
Schedules UNPAUSED (automatic execution)
Concurrency limited to 5 for stability

Presets Explained

trigger_pause_status

Values: PAUSED | UNPAUSED
Purpose: Control whether job/pipeline schedules are active

PAUSED:   # Schedules disabled (manual trigger only)
UNPAUSED: # Schedules enabled (runs automatically)

By Environment:

Sandbox: PAUSED (no automatic runs)
Dev: Not set (schedules keep each resource's own setting, which is UNPAUSED unless specified)
Staging: PAUSED (manual validation)
Prod: UNPAUSED (production schedules)

jobs_max_concurrent_runs

Type: Integer
Purpose: Limit how many instances of a job can run simultaneously

jobs_max_concurrent_runs: 10  # Allow up to 10 concurrent runs

Recommendations:

Sandbox: 10 (generous for testing)
Dev: 10 (allow parallel testing)
Staging: 10 (parallel validation)
Prod: 5 (conservative for stability)

artifacts_dynamic_version

Type: Boolean
Purpose: Generate unique artifact versions on each deployment

artifacts_dynamic_version: true  # Recommended for all environments

Effect: Stamps the wheel with a unique version on each deployment, so compute installs the newly built wheel instead of reusing a cached one with the same version number.

Pipeline YAML Reference

Location: resources/pipelines/{layer}/{name}/{name}.pipeline.yml

Basic Structure

resources:
  pipelines:
    pipeline_name:
      name: "${var.resource_prefix}_pipeline-name"
      catalog: ${var.catalog_name}
      target: silver
      libraries:
        - glob:
            include: ./transformations/**
      configuration:
        # Spark configurations
      serverless: true
      continuous: false

Complete Example (Sentiment Analysis)

resources:
  pipelines:
    sentiment_analysis_pipeline:
      name: "${var.resource_prefix}_sentiment-analysis"
      catalog: ${var.catalog_name}
      target: silver

      libraries:
        - glob:
            include: ./transformations/**

      configuration:
        bundle.sourcePath: ${workspace.file_path}/src
        "CATALOG": "${var.catalog_name}"
        "ENVIRONMENT": "${var.environment}"

        # Streaming optimizations
        "spark.sql.streaming.maxBytesPerTrigger": "200MB"
        "spark.sql.streaming.trigger.processingTime": "30 seconds"
        "spark.sql.streaming.maxRowsPerTrigger": "1000"

        # AI query settings
        "spark.databricks.ai.query.timeout": "120s"
        "spark.databricks.ai.query.maxConcurrentRequests": "20"
        "spark.databricks.ai.query.retryPolicy": "exponential"

        # Delta optimizations
        "spark.databricks.delta.autoOptimize.optimizeWrite": "true"
        "spark.databricks.delta.optimizeWrite.enabled": "true"
        "spark.sql.adaptive.enabled": "true"
        "spark.sql.adaptive.coalescePartitions.enabled": "true"

      serverless: true
      continuous: false
      development: false

Field Reference

name

Required: Yes
Type: String
Description: Pipeline display name

Pattern: "${var.resource_prefix}_pipeline-name"

Examples:

Sandbox: taylor_sentiment-analysis
Dev: dev_sentiment-analysis
Prod: prod_sentiment-analysis

catalog

Required: Yes
Type: String
Description: Target Unity Catalog

Value: ${var.catalog_name}

target

Required: Yes
Type: String
Description: Schema name where tables will be created

Values: bronze | silver | gold | models

libraries

Required: Yes
Type: Array
Description: Code to include in pipeline

Options:

# Include SQL/Python files via glob
- glob:
    include: ./transformations/**

# Include specific notebook
- notebook:
    path: ./pipeline.py

# Include Python file
- file:
    path: ./transformation.py

configuration

Required: No
Type: Map[String, String]
Description: Spark and custom configurations

Common Settings:

Variables:

"CATALOG": "${var.catalog_name}"           # Catalog reference
"ENVIRONMENT": "${var.environment}"        # Environment name
bundle.sourcePath: ${workspace.file_path}/src  # Source code path

Streaming:

"spark.sql.streaming.maxBytesPerTrigger": "200MB"      # Batch size
"spark.sql.streaming.trigger.processingTime": "30 seconds"  # Trigger interval
"spark.sql.streaming.maxRowsPerTrigger": "1000"        # Max rows per batch

AI Query:

"spark.databricks.ai.query.timeout": "120s"            # Request timeout
"spark.databricks.ai.query.maxConcurrentRequests": "20"  # Concurrency
"spark.databricks.ai.query.retryPolicy": "exponential"  # Retry strategy

Delta Optimization:

"spark.databricks.delta.autoOptimize.optimizeWrite": "true"
"spark.databricks.delta.optimizeWrite.enabled": "true"
"spark.sql.adaptive.enabled": "true"
"spark.sql.adaptive.coalescePartitions.enabled": "true"

serverless

Required: No
Type: Boolean
Default: false
Description: Use serverless compute

Recommendation: true for faster startup and auto-scaling

continuous

Required: No
Type: Boolean
Default: false
Description: Run pipeline continuously (streaming)

Values:

true: Continuous streaming (always running)
false: Triggered mode (run on schedule/manual)

photon

Required: No
Type: Boolean
Default: false
Description: Enable Photon engine for performance

Recommendation: true for complex SQL-heavy workloads

development

Required: No
Type: Boolean
Default: false
Description: Development mode settings

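Effect: Runs the pipeline in development mode, which favors fast iteration: compute is reused between updates and automatic retries are disabled so errors surface immediately (use for sandbox only)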

Job YAML Reference

Location: resources/jobs/{type}/{name}/{name}.job.yml

Basic Structure

resources:
  jobs:
    job_name:
      name: "${var.resource_prefix}_job_name"
      tasks:
        - task_key: task_name
          notebook_task:
            notebook_path: ./script.py
            base_parameters:
              catalog_name: ${var.catalog_name}
          job_cluster_key: cluster_key
      job_clusters:
        - job_cluster_key: cluster_key
          new_cluster:
            # Cluster configuration

Complete Example (Model Registration)

resources:
  jobs:
    register_emoji_analysis:
      name: "${var.resource_prefix}_register_emoji_analysis"

      tasks:
        - task_key: register_emoji_analysis
          notebook_task:
            notebook_path: ./register_emoji_analysis.py
            base_parameters:
              catalog_name: ${var.catalog_name}
              source_path: ${workspace.file_path}
          disable_auto_optimization: true
          job_cluster_key: register_model_job_cluster

      job_clusters:
        - job_cluster_key: register_model_job_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.2xlarge
            data_security_mode: SINGLE_USER
            autoscale:
              min_workers: 1
              max_workers: 1
            spark_env_vars:
              TOKENIZERS_PARALLELISM: "false"
            spark_conf:
              "spark.sql.adaptive.enabled": "true"
              "spark.sql.adaptive.coalescePartitions.enabled": "true"
              "spark.serializer": "org.apache.spark.serializer.KryoSerializer"
              "spark.sql.execution.arrow.pyspark.enabled": "true"
            custom_tags:
              ResourceClass: "MLProcessing"
              Purpose: "ModelRegistration"
            aws_attributes:
              ebs_volume_type: "GENERAL_PURPOSE_SSD"
              ebs_volume_count: 1
              ebs_volume_size: 128

      performance_target: PERFORMANCE_OPTIMIZED

Field Reference

name

Required: Yes
Type: String
Pattern: "${var.resource_prefix}_job_name"

tasks

Required: Yes
Type: Array
Description: Job task definitions

Task Types:

# Notebook task
notebook_task:
  notebook_path: ./script.py
  base_parameters:
    param1: value1

# Python task
spark_python_task:
  python_file: ./script.py
  parameters:
    - "--arg1"
    - "value1"

# SQL task
sql_task:
  warehouse_id: "<warehouse_id>"
  query:
    query_id: "<query_id>"

job_clusters

Required: No (if using existing cluster)
Type: Array
Description: Cluster definitions for job

Common Configuration:

spark_version: 15.4.x-scala2.12        # LTS runtime
node_type_id: i3.2xlarge               # Instance type
data_security_mode: SINGLE_USER        # Security mode
autoscale:
  min_workers: 1
  max_workers: 4

Volume YAML Reference

Location: resources/volumes/external/{name}.yml

Basic Structure

resources:
  volumes:
    volume_name:
      name: volume_name
      catalog_name: ${var.catalog_name}
      schema_name: bronze
      volume_type: EXTERNAL
      storage_location: s3://${var.s3_bucket}/${var.catalog_name}/volumes/bronze/table/
      comment: "Description"

Complete Example

resources:
  volumes:
    raw_messages:
      name: raw_messages
      catalog_name: ${var.catalog_name}
      schema_name: bronze
      volume_type: EXTERNAL
      storage_location: s3://${var.s3_bucket}/${var.catalog_name}/volumes/bronze/messages/
      comment: "Raw message data from Slack, Teams, Outlook, Gmail, training datasets"

Field Reference

name

Required: Yes
Type: String
Description: Volume name (must match across environments)

catalog_name

Required: Yes
Type: String
Value: ${var.catalog_name}

schema_name

Required: Yes
Type: String
Description: Schema where volume is created

Common Values: bronze (for raw data volumes)

volume_type

Required: Yes
Type: String
Values: EXTERNAL | MANAGED

Use:

EXTERNAL: S3-backed volumes (our use case)
MANAGED: Databricks-managed storage

storage_location

Required: Yes (for EXTERNAL volumes)
Type: String
Pattern: s3://${var.s3_bucket}/${var.catalog_name}/volumes/{schema}/{table}/

Example Resolutions:

Sandbox: s3://ref-ml-core-dev-workspace-bucket/taylor_sandbox/volumes/bronze/messages/
Dev: s3://ref-ml-core-dev-workspace-bucket/dev/volumes/bronze/messages/
Prod: s3://ref-ml-core-prod-workspace-bucket/prod/volumes/bronze/messages/
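A quick way to confirm that a resolved location follows this pattern is to parse it back into its parts. The sketch below is a hypothetical helper, not part of the repository; the `parse_storage_location` name and the strictness of the regular expression (exactly one path segment per component, trailing slash required) are assumptions made for illustration.

```python
import re

# Mirrors the documented pattern:
#   s3://<s3_bucket>/<catalog_name>/volumes/<schema>/<table>/
_LOCATION = re.compile(
    r"^s3://(?P<bucket>[^/]+)/(?P<catalog>[^/]+)"
    r"/volumes/(?P<schema>[^/]+)/(?P<table>[^/]+)/$"
)


def parse_storage_location(location: str) -> dict[str, str]:
    """Split a resolved storage_location into bucket, catalog, schema, table."""
    match = _LOCATION.match(location)
    if match is None:
        raise ValueError(f"storage_location does not follow the pattern: {location}")
    return match.groupdict()
```

For the sandbox resolution above, the helper returns the bucket `ref-ml-core-dev-workspace-bucket`, the catalog `taylor_sandbox`, the schema `bronze`, and the table `messages`; a location missing the `volumes/` segment raises `ValueError`.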

Common Configuration Patterns

Pattern 1: AI-Heavy Pipeline

configuration:
  # AI query optimizations
  "spark.databricks.ai.query.timeout": "180s"
  "spark.databricks.ai.query.maxConcurrentRequests": "20"
  "spark.databricks.ai.query.retryPolicy": "exponential"

  # Streaming for AI workloads
  "spark.sql.streaming.maxBytesPerTrigger": "200MB"
  "spark.sql.streaming.maxRowsPerTrigger": "1000"
  "spark.sql.streaming.trigger.processingTime": "30 seconds"

Pattern 2: Batch Processing Pipeline

configuration:
  # No streaming settings
  "spark.sql.adaptive.enabled": "true"
  "spark.databricks.delta.optimizeWrite.enabled": "true"

continuous: false  # Triggered mode
serverless: true   # Auto-scaling

Pattern 3: Development Pipeline (Sandbox)

configuration:
  # Relaxed settings for fast iteration
  "spark.sql.streaming.maxRowsPerTrigger": "100"  # Small batches

development: true  # Enable dev mode
continuous: false

Pattern 4: Production Pipeline

configuration:
  # Production-grade settings
  "spark.sql.streaming.maxBytesPerTrigger": "500MB"
  "spark.sql.adaptive.enabled": "true"
  "spark.databricks.delta.autoOptimize.optimizeWrite": "true"

serverless: true
continuous: false
development: false  # Full validation

Environment Variable Summary

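| Variable | Sandbox | Dev | Staging | Prod |
| --- | --- | --- | --- | --- |
| catalog_name | {user}_sandbox | dev | staging | prod |
| resource_prefix | {user} | dev | staging | prod |
| environment | sandbox | dev | staging | prod |
| s3_bucket | ref-ml-core-dev-workspace-bucket | ref-ml-core-dev-workspace-bucket | ref-ml-core-staging-workspace-bucket | ref-ml-core-prod-workspace-bucket |
| workspace.host | dbc-a72d6af9-df3d (dev workspace, via ref-dev profile) | dbc-a72d6af9-df3d | dbc-fab2a42a-8d11 | dbc-028d9e53-7ce6 |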

Validation

Validate Configuration

# Validate for specific target
databricks bundle validate -t sandbox
databricks bundle validate -t dev
databricks bundle validate -t staging
databricks bundle validate -t prod

# Or use Makefile
make validate         # Validates sandbox
make validate-dev
make validate-staging
make validate-prod
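Where the Makefile is not available, the same four validations can be scripted. The following is a minimal sketch that only wraps the `databricks bundle validate -t <target>` command shown above; the function names and the injectable `runner` parameter are hypothetical and exist so the loop can be exercised without the CLI installed.

```python
import subprocess

# Targets defined in databricks.yml.
TARGETS = ["sandbox", "dev", "staging", "prod"]


def validate_command(target: str) -> list[str]:
    """Build the CLI invocation for one target."""
    return ["databricks", "bundle", "validate", "-t", target]


def validate_all(targets=TARGETS, runner=subprocess.run) -> dict[str, bool]:
    """Run validation for each target and record whether it exited with 0.

    runner defaults to subprocess.run; pass a stub to dry-run the loop.
    """
    results = {}
    for target in targets:
        completed = runner(validate_command(target))
        results[target] = completed.returncode == 0
    return results
```

Calling `validate_all()` from the repository root returns a mapping such as `{"sandbox": True, ...}`, which a CI step could use to fail the build when any target is invalid.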

Common Validation Errors

Error: Variable 'catalog_name' not defined

# Fix: Add variable to databricks.yml
variables:
  catalog_name:
    description: "Unity Catalog name"

Error: Invalid YAML syntax

# Use yamllint
yamllint databricks.yml

Error: Resource not found

# Ensure file paths in include: match actual files
include:
  - resources/pipelines/bronze/*/*.yml  # Check paths exist
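One way to track down this error is to check which include patterns match no files. The sketch below is a hypothetical helper, not a feature of the Databricks CLI: it takes the patterns from the Complete Example of databricks.yml and expands them with Python's standard `glob` module, assuming that matching is close enough to the bundle's own for a sanity check.

```python
from glob import glob
from pathlib import Path

# include: entries copied from the Complete Example of databricks.yml.
INCLUDE_PATTERNS = [
    "resources/pipelines/bronze/*/*.yml",
    "resources/pipelines/silver/*/*.yml",
    "resources/pipelines/gold/*/*.yml",
    "resources/jobs/model_registration/external/*/*.yml",
    "resources/jobs/model_registration/internal/*/*.yml",
    "resources/volumes/external/*.yml",
]


def unmatched_patterns(root, patterns=INCLUDE_PATTERNS) -> list[str]:
    """Return the include patterns that match no file under root."""
    root = Path(root)
    return [pattern for pattern in patterns if not glob(str(root / pattern))]
```

Running `unmatched_patterns(".")` from the repository root lists the patterns worth inspecting. An empty list means every pattern matched at least one file.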

Related Documentation

Local Development Guide - Using sandbox configuration
DLT Pipelines Guide - Pipeline development details
Deployment Guide - Deploying configurations
CLI Commands Reference - Command-line usage


Last updated 5 months ago
