Configuration Reference

Overview

This document provides a complete reference for all configuration files in the ML Pipelines repository, including databricks.yml, pipeline YAMLs, job YAMLs, and volume YAMLs.

databricks.yml Structure

Location: /Users/taylorlaing/Development/refresh-os/ml-pipelines/databricks.yml

Top-Level Fields
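
A minimal sketch of the top-level layout, assuming a standard Databricks Asset Bundle file. The bundle name and the include glob are assumptions for illustration; the four variables and four targets are the ones documented in this reference.

```yaml
# Sketch only: bundle name and include glob are assumed, not copied from the repository.
bundle:
  name: ml-pipelines

# Pull in pipeline, job, and volume definitions from the resources tree.
include:
  - resources/**/*.yml

variables:
  catalog_name:
    description: Unity Catalog name for the current deployment
  environment:
    description: Current environment name for tagging and identification
  s3_bucket:
    description: S3 bucket name for external volumes
  resource_prefix:
    description: Prefix for resource names to ensure isolation

targets:
  sandbox: {}   # per-target settings are described under Targets (Environments)
  dev: {}
  staging: {}
  prod: {}
```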

Complete Example

Variables Explained

catalog_name

Type: String
Required: Yes
Description: Unity Catalog name for the current deployment

Values by Environment:

  • Sandbox: {user}_sandbox

  • Dev: dev

  • Staging: staging

  • Prod: prod

Usage in Resources:
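
A short sketch of how a resource might reference this variable. The resource key `sentiment_analysis` is illustrative; the `catalog` field and its value follow the Pipeline YAML Reference.

```yaml
# Sketch only: the resource key is hypothetical.
resources:
  pipelines:
    sentiment_analysis:
      name: "${var.resource_prefix}_sentiment-analysis"
      catalog: ${var.catalog_name}   # resolves per target, e.g. dev or prod
```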


environment

Type: String
Required: Yes
Description: Current environment name for tagging and identification

Values:

  • sandbox - Local developer testing

  • dev - Shared integration

  • staging - Pre-production

  • prod - Production

Usage:
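
One plausible use is tagging a resource with the environment name. This sketch assumes a job resource keyed `model_registration`; the key and the choice to tag at the job level are assumptions.

```yaml
# Sketch only: resource key and tag placement are assumptions.
resources:
  jobs:
    model_registration:
      name: "${var.resource_prefix}_model_registration"
      tags:
        environment: ${var.environment}   # sandbox | dev | staging | prod
```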


s3_bucket

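Type: String
Required: Yes
Description: S3 bucket name for external volumes

Values by Environment:

  • Sandbox: ref-ml-core-dev-workspace-bucket

  • Dev: ref-ml-core-dev-workspace-bucket

  • Staging: ref-ml-core-staging-workspace-bucket

  • Prod: ref-ml-core-prod-workspace-bucket

Usage in Volumes: referenced as ${var.s3_bucket} in the storage_location of external volumes (see Volume YAML Reference).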


resource_prefix

Type: String
Required: Yes
Description: Prefix for resource names to ensure isolation

Values by Environment:

  • Sandbox: {user}

  • Dev: dev

  • Staging: staging

  • Prod: prod

Usage:
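
A brief sketch of the prefix applied to resource names, following the name patterns given in the Pipeline and Job references. The resource keys are illustrative.

```yaml
# Sketch only: resource keys are hypothetical; name patterns follow this reference.
resources:
  pipelines:
    sentiment_analysis:
      name: "${var.resource_prefix}_sentiment-analysis"   # e.g. dev_sentiment-analysis
  jobs:
    model_registration:
      name: "${var.resource_prefix}_model_registration"
```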

Targets (Environments)

Sandbox Target

Purpose: Local developer testing with personal catalog

Key Features:

  • Uses ${workspace.current_user.short_name} for dynamic catalog naming

  • Deployed to dev workspace using developer credentials

  • Resources prefixed with username to avoid conflicts
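
The features above might translate into a target definition like the following sketch. Variable values follow the Environment Variable Summary; the workspace host is omitted, and the exact layout of the real target is an assumption.

```yaml
# Sketch only: layout is assumed; values follow the Environment Variable Summary.
targets:
  sandbox:
    variables:
      # Per-developer catalog and prefix derived from the deploying user.
      catalog_name: ${workspace.current_user.short_name}_sandbox
      resource_prefix: ${workspace.current_user.short_name}
      environment: sandbox
      s3_bucket: ref-ml-core-dev-workspace-bucket
```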


Dev Target

Purpose: Shared integration testing with CI/CD

Key Features:

  • Runs as service principal (not user)

  • Deployed to /Shared path

  • Used by GitHub Actions
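
A sketch of how the service principal and /Shared deployment path could be expressed. The application ID is a placeholder, and the exact root path below /Shared is an assumption.

```yaml
# Sketch only: the application ID is a placeholder and the path suffix is assumed.
targets:
  dev:
    workspace:
      root_path: /Shared/.bundle/${bundle.name}/${bundle.target}
    run_as:
      service_principal_name: "00000000-0000-0000-0000-000000000000"
    variables:
      catalog_name: dev
      resource_prefix: dev
      environment: dev
      s3_bucket: ref-ml-core-dev-workspace-bucket
```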


Staging Target

Purpose: Pre-production validation

Key Features:

  • Separate workspace from dev

  • Paused schedules (manual execution)

  • Git branch pinned to main
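
A sketch of the staging-specific settings, assuming the branch pin is expressed through the target's git mapping and the paused schedules through presets. The separate workspace host is omitted.

```yaml
# Sketch only: assumes git.branch and presets are how these features are configured.
targets:
  staging:
    git:
      branch: main                     # deployments are validated against this branch
    presets:
      trigger_pause_status: PAUSED     # schedules do not fire; runs are manual
    variables:
      catalog_name: staging
      resource_prefix: staging
      environment: staging
      s3_bucket: ref-ml-core-staging-workspace-bucket
```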


Prod Target

Purpose: Production workloads

Key Features:

  • Dedicated production workspace

  • Schedules UNPAUSED (automatic execution)

  • Concurrency limited to 5 for stability
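
A sketch of the production presets described above. The workspace host and any deployment mode are omitted because they are not stated here.

```yaml
# Sketch only: shows the two presets named in the list above.
targets:
  prod:
    presets:
      trigger_pause_status: UNPAUSED   # schedules run automatically
      jobs_max_concurrent_runs: 5      # conservative limit for stability
    variables:
      catalog_name: prod
      resource_prefix: prod
      environment: prod
      s3_bucket: ref-ml-core-prod-workspace-bucket
```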

Presets Explained

trigger_pause_status

Values: PAUSED | UNPAUSED
Purpose: Control whether job/pipeline schedules are active

By Environment:

  • Sandbox: PAUSED (no automatic runs)

  • Dev: Not set (defaults to UNPAUSED for testing)

  • Staging: PAUSED (manual validation)

  • Prod: UNPAUSED (production schedules)


jobs_max_concurrent_runs

Type: Integer
Purpose: Limit how many instances of a job can run simultaneously

Recommendations:

  • Sandbox: 10 (generous for testing)

  • Dev: 10 (allow parallel testing)

  • Staging: 10 (parallel validation)

  • Prod: 5 (conservative for stability)


artifacts_dynamic_version

Type: Boolean
Purpose: Generate unique artifact versions on each deployment

Effect: Ensures wheel artifacts are rebuilt on each deployment, preventing caching issues.
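
A minimal sketch of enabling this preset. Which targets actually set it is not stated here, so the choice of dev is an assumption.

```yaml
# Sketch only: the target shown is an assumption.
targets:
  dev:
    presets:
      artifacts_dynamic_version: true   # give each deployment a fresh wheel version
```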


tags

Type: Map[String, String]
Purpose: Tag all resources for filtering and cost tracking

Common Tags:

  • environment: Environment name

  • managed_by: developer | cicd

  • owner: Username (sandbox only)

  • project: ml-pipelines
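
The common tags could be set per target roughly as follows. The pairing of managed_by values with targets (developer for sandbox, cicd for dev) is an assumption based on how each target is deployed.

```yaml
# Sketch only: the managed_by value per target is assumed.
targets:
  sandbox:
    presets:
      tags:
        environment: sandbox
        managed_by: developer
        owner: ${workspace.current_user.short_name}   # sandbox only
        project: ml-pipelines
  dev:
    presets:
      tags:
        environment: dev
        managed_by: cicd
        project: ml-pipelines
```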

Pipeline YAML Reference

Location: resources/pipelines/{layer}/{name}/{name}.pipeline.yml

Basic Structure
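
A minimal sketch of a pipeline resource built from the fields described under Field Reference. The resource key, the library path, and the flag values are illustrative assumptions.

```yaml
# Sketch only: resource key, library path, and flag values are hypothetical.
resources:
  pipelines:
    sentiment_analysis:
      name: "${var.resource_prefix}_sentiment-analysis"
      catalog: ${var.catalog_name}
      target: silver                        # bronze | silver | gold | models
      libraries:
        - notebook:
            path: ./sentiment_analysis.py   # hypothetical path
      serverless: true
      continuous: false                     # triggered mode
      photon: true
      development: false
```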

Complete Example (Sentiment Analysis)

Field Reference

name

Required: Yes
Type: String
Description: Pipeline display name

Pattern: "${var.resource_prefix}_pipeline-name"

Examples:

  • Sandbox: taylor_sentiment-analysis

  • Dev: dev_sentiment-analysis

  • Prod: prod_sentiment-analysis


catalog

Required: Yes
Type: String
Description: Target Unity Catalog

Value: ${var.catalog_name}


target

Required: Yes
Type: String
Description: Schema name where tables will be created

Values: bronze | silver | gold | models


libraries

Required: Yes
Type: Array
Description: Code to include in pipeline

Options:
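
A sketch of the library entry types a pipeline definition commonly accepts (notebook, file, and glob). All paths are hypothetical, and glob support depends on the CLI version in use.

```yaml
# Sketch only: every path below is hypothetical.
resources:
  pipelines:
    sentiment_analysis:
      libraries:
        - notebook:
            path: ./notebooks/sentiment_analysis.py   # a single notebook
        - file:
            path: ./transformations/clean_messages.py # a single source file
        - glob:
            include: ./transformations/**             # every source file under a folder
```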


configuration

Required: No
Type: Map[String, String]
Description: Spark and custom configurations

Common Settings:

Variables:

Streaming:

AI Query:

Delta Optimization:
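
A hedged sketch covering three of the setting groups listed above in a single configuration map. The custom variable keys are hypothetical names that pipeline code would read back itself; the streaming and Delta keys are standard Databricks settings, but whether this project uses them is an assumption. AI Query settings are project-specific and are not sketched.

```yaml
# Sketch only: custom keys are hypothetical; values must be strings.
resources:
  pipelines:
    sentiment_analysis:
      configuration:
        # Variables: pass bundle variables through to pipeline code.
        catalog_name: ${var.catalog_name}
        environment: ${var.environment}
        # Streaming: how often triggered flows check for new data.
        pipelines.trigger.interval: "1 minute"
        # Delta optimization: larger writes and automatic small-file compaction.
        spark.databricks.delta.optimizeWrite.enabled: "true"
        spark.databricks.delta.autoCompact.enabled: "true"
```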


serverless

Required: No
Type: Boolean
Default: false
Description: Use serverless compute

Recommendation: true for faster startup and auto-scaling


continuous

Required: No
Type: Boolean
Default: false
Description: Run pipeline continuously (streaming)

Values:

  • true: Continuous streaming (always running)

  • false: Triggered mode (run on schedule/manual)


photon

Required: No
Type: Boolean
Default: false
Description: Enable Photon engine for performance

Recommendation: true for complex SQL-heavy workloads


development

Required: No
Type: Boolean
Default: false
Description: Run the pipeline in development mode

Effect: Favors fast iteration over resilience (for example, compute is reused between updates and automatic retries are disabled). Use for sandbox only.

Job YAML Reference

Location: resources/jobs/{type}/{name}/{name}.job.yml

Basic Structure

Complete Example (Model Registration)

Field Reference

name

Required: Yes
Type: String
Pattern: "${var.resource_prefix}_job_name"


tasks

Required: Yes
Type: Array
Description: Job task definitions

Task Types:
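
A sketch of three commonly used task types (notebook, Python wheel, and pipeline). Task keys, paths, the package name, and the referenced pipeline key are hypothetical.

```yaml
# Sketch only: task keys, paths, and package name are hypothetical.
resources:
  jobs:
    model_registration:
      name: "${var.resource_prefix}_model_registration"
      tasks:
        - task_key: register_model
          notebook_task:
            notebook_path: ./register_model.py
        - task_key: score_batch
          depends_on:
            - task_key: register_model        # runs after registration succeeds
          python_wheel_task:
            package_name: ml_pipelines
            entry_point: score
        - task_key: refresh_pipeline
          pipeline_task:
            # Refers to a pipeline defined elsewhere in the same bundle.
            pipeline_id: ${resources.pipelines.sentiment_analysis.id}
```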


job_clusters

Required: No (if using existing cluster)
Type: Array
Description: Cluster definitions for job

Common Configuration:
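
A sketch of a job cluster definition and a task that uses it. The runtime version, node type, and worker count are assumptions chosen for an AWS workspace, not values taken from the repository.

```yaml
# Sketch only: runtime, node type, and sizing are assumed values.
resources:
  jobs:
    model_registration:
      job_clusters:
        - job_cluster_key: main
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
            num_workers: 2
      tasks:
        - task_key: register_model
          job_cluster_key: main               # must match a key in job_clusters
          notebook_task:
            notebook_path: ./register_model.py   # hypothetical path
```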

Volume YAML Reference

Location: resources/volumes/external/{name}.yml

Basic Structure
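
A minimal sketch of an external volume built from the fields described under Field Reference. The resource key and the volume name `messages` are taken from the example resolutions in this section; treat the rest as illustrative.

```yaml
# Sketch only: mirrors the fields and pattern documented in this section.
resources:
  volumes:
    messages:
      name: messages
      catalog_name: ${var.catalog_name}
      schema_name: bronze
      volume_type: EXTERNAL
      storage_location: s3://${var.s3_bucket}/${var.catalog_name}/volumes/bronze/messages/
```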

Complete Example

Field Reference

name

Required: Yes
Type: String
Description: Volume name (must match across environments)


catalog_name

Required: Yes
Type: String
Value: ${var.catalog_name}


schema_name

Required: Yes
Type: String
Description: Schema where volume is created

Common Values: bronze (for raw data volumes)


volume_type

Required: Yes
Type: String
Values: EXTERNAL | MANAGED

Use:

  • EXTERNAL: S3-backed volumes (our use case)

  • MANAGED: Databricks-managed storage


storage_location

Required: Yes (for EXTERNAL volumes)
Type: String
Pattern: s3://${var.s3_bucket}/${var.catalog_name}/volumes/{schema}/{table}/

Example Resolutions:

  • Sandbox: s3://ref-ml-core-dev-workspace-bucket/taylor_sandbox/volumes/bronze/messages/

  • Dev: s3://ref-ml-core-dev-workspace-bucket/dev/volumes/bronze/messages/

  • Prod: s3://ref-ml-core-prod-workspace-bucket/prod/volumes/bronze/messages/

Common Configuration Patterns

Pattern 1: AI-Heavy Pipeline

Pattern 2: Batch Processing Pipeline

Pattern 3: Development Pipeline (Sandbox)

Pattern 4: Production Pipeline

Environment Variable Summary

| Variable | Sandbox | Dev | Staging | Prod |
| --- | --- | --- | --- | --- |
| catalog_name | {user}_sandbox | dev | staging | prod |
| resource_prefix | {user} | dev | staging | prod |
| environment | sandbox | dev | staging | prod |
| s3_bucket | ref-ml-core-dev-workspace-bucket | ref-ml-core-dev-workspace-bucket | ref-ml-core-staging-workspace-bucket | ref-ml-core-prod-workspace-bucket |
| workspace.host | dbc-a72d6af9-df3d | dbc-a72d6af9-df3d | dbc-fab2a42a-8d11 | dbc-028d9e53-7ce6 |

Validation

Validate Configuration

Common Validation Errors

Error: Variable 'catalog_name' not defined
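
One likely cause is that the variable is referenced in a resource but never declared at the top level of databricks.yml, or has no value for the selected target. A sketch of a declaration with a per-target override follows; the default value is an assumption.

```yaml
# Sketch only: the default shown is an assumed value.
variables:
  catalog_name:
    description: Unity Catalog name for the current deployment
    default: dev

targets:
  prod:
    variables:
      catalog_name: prod   # overrides the default for this target
```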

Error: Invalid YAML syntax

Error: Resource not found
