Developer Getting Started Guide

Welcome to the ML Pipelines project! This guide will help you get up and running with your local development environment.

Prerequisites

Before you begin, ensure you have the following installed:

Required Tools

The commands in this guide assume the following are installed locally:

  • Git

  • Databricks CLI

  • AWS CLI v2 (with SSO support)

  • Python 3 (the project builds a wheel)

  • GNU Make (for make deploy)

Account Access

You need access to:

  • Databricks Workspace (dev workspace: dbc-a72d6af9-df3d.cloud.databricks.com)

  • AWS Account (for S3 access to volumes)

  • GitHub Repository (refresh-os/ml-pipelines)

Required Credentials

  1. Databricks Profile: ref-dev configured in ~/.databrickscfg

  2. AWS Profile: ref-ml-core configured via aws sso login


Day 1: Initial Setup

Step 1: Clone the Repository
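
Assuming the repository listed in the prerequisites, cloning looks like this (HTTPS shown; use the SSH remote if that is how your GitHub access is set up):

```shell
# Clone the project and move into it
git clone https://github.com/refresh-os/ml-pipelines.git
cd ml-pipelines
```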

Step 2: Configure Databricks CLI

Create or update your ~/.databrickscfg file:
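
A minimal profile entry looks like the following, assuming OAuth-based login (token-based auth would use a token = line instead):

```ini
[ref-dev]
host      = https://dbc-a72d6af9-df3d.cloud.databricks.com
auth_type = databricks-cli
```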

Test the connection:
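
One way to verify the profile works:

```shell
databricks current-user me --profile ref-dev
```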

You should see your user information returned.

Step 3: Configure AWS CLI
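
Assuming your ~/.aws/config already defines the ref-ml-core SSO profile from the prerequisites, authentication looks like:

```shell
# Log in via SSO using the profile from the prerequisites
aws sso login --profile ref-ml-core

# Confirm the credentials resolve
aws sts get-caller-identity --profile ref-ml-core
```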

Step 4: Understand Your Sandbox Catalog

Your personal sandbox catalog is named {your_username}_sandbox, where the username is the local part (before the @) of your Databricks email. For example, if that local part is taylor, your catalog will be taylor_sandbox.
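
The naming rule can be sketched in Python (sandbox_catalog is a hypothetical helper for illustration, not part of the repo):

```python
def sandbox_catalog(databricks_username: str) -> str:
    """Derive the personal sandbox catalog name from a Databricks username (an email)."""
    short_name = databricks_username.split("@")[0]  # local part of the email
    return f"{short_name}_sandbox"

# An email-style username maps to a per-user catalog
print(sandbox_catalog("taylor@example.com"))  # taylor_sandbox
```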

Sandbox Isolation:

  • You have full read/write access to your sandbox catalog

  • You have read-only access to the dev catalog for shared data

  • Your work is completely isolated from other developers


First Deployment

Step 1: Validate the Bundle
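
From the repository root:

```shell
databricks bundle validate --profile ref-dev
```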

This checks your databricks.yml configuration for syntax errors.

Expected Output:

Step 2: Deploy to Sandbox
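
Deployment is wrapped in a Make target:

```shell
make deploy
```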

This command will:

  1. Extract your short username from Databricks

  2. Check if your sandbox catalog exists (creates if needed)

  3. Create S3 volume directories

  4. Build the Python wheel

  5. Deploy pipelines and jobs to your sandbox

Expected Output:

Step 3: Verify Deployment

Check your Databricks workspace:

  1. Navigate to: Workflows > Delta Live Tables

  2. Look for pipelines named like: taylor_bronze_data_ingestion

  3. Click on a pipeline to see its configuration

Check your Unity Catalog:
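
From a SQL editor or notebook, one way to check (using the example username from earlier; substitute your own catalog name):

```sql
-- Replace taylor_sandbox with your own catalog name
SHOW SCHEMAS IN taylor_sandbox;
```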

You should see schemas: bronze, silver, gold, models


Understanding Your Sandbox

Catalog Structure

Data Strategy

Read from dev catalog:

  • dev.bronze.* - Raw data shared across team

  • dev.silver.* - Processed data shared across team

Write to your sandbox:

  • {username}_sandbox.gold.* - Your experimental features

  • {username}_sandbox.models.* - Your experimental models
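
The read/write split can be captured in a small helper (hypothetical, for illustration only; the table names are placeholders and the repo may organize this differently):

```python
def read_table_name(schema: str, table: str) -> str:
    """Shared data is always read from the team-wide dev catalog."""
    return f"dev.{schema}.{table}"

def write_table_name(username: str, schema: str, table: str) -> str:
    """Experimental outputs go to the caller's personal sandbox catalog."""
    return f"{username}_sandbox.{schema}.{table}"

print(read_table_name("silver", "events"))             # dev.silver.events
print(write_table_name("taylor", "gold", "features"))  # taylor_sandbox.gold.features
```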

Why this approach?

  • No data duplication (faster, cheaper)

  • Test against realistic data

  • Complete isolation for your experiments

  • No risk of breaking other developers' work



Daily Development Workflow

Once you've completed the initial setup, this is your day-to-day workflow for making changes.

Making Changes to Pipelines

1. Edit Pipeline Code

Pipelines use DLT SQL files in resources/pipelines/*/transformations/:

Example Change:
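
A typical edit looks something like this (the table and column names below are illustrative only, not actual tables in the repo):

```sql
-- Hypothetical DLT SQL transformation; names are placeholders
CREATE OR REFRESH MATERIALIZED VIEW silver_events AS
SELECT
  *,
  current_timestamp() AS processed_at  -- example change: add a processing timestamp
FROM dev.bronze.events;
```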

2. Redeploy to Sandbox

This rebuilds the wheel and updates your sandbox pipelines.

3. Test the Changes

4. Verify Results

Making Changes to Models

1. Edit Model Code

Models live in src/models/internal/ or src/models/external/:

2. Update Registration Script (if needed)

3. Redeploy and Register

4. Test Model Serving

Testing Changes Locally

Run Unit Tests
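
Assuming tests live in a standard tests/ directory (the exact invocation may also be wrapped in a Make target):

```shell
python -m pytest tests/
```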

Interactive Testing with Notebooks

  1. Create a test notebook in resources/pipelines/*/explorations/

  2. Set catalog context:

  3. Import your code:
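
Steps 2 and 3 together look roughly like this in a notebook cell (the catalog and module names are placeholders; adjust them to your sandbox and the module you are testing):

```python
# Point the session at your sandbox catalog
spark.sql("USE CATALOG taylor_sandbox")  # replace with your own catalog

# Import from the project source (module path is illustrative)
from models.internal import my_model  # hypothetical module name
```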


Reading from Dev Catalog

Available Dev Tables

You can read from the shared dev catalog to avoid data duplication:
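
For example, from any SQL context (the table name is a placeholder; browse the dev catalog for the real tables):

```sql
SELECT * FROM dev.silver.events LIMIT 10;
```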

Reading Data in Pipelines

Reading Data in Notebooks


Debugging Pipeline Failures

Common Failure Patterns

1. Schema Conflicts

Error:

Solution:
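
One option is to trigger a full refresh so DLT rebuilds the table with the new schema (assuming the current Databricks CLI; you can also trigger this from the pipeline UI):

```shell
# <pipeline-id> is a placeholder for your sandbox pipeline's ID
databricks pipelines start-update <pipeline-id> --full-refresh-all --profile ref-dev
```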

Or drop and recreate:
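
For example (the table name is a placeholder for whichever table has the conflicting schema):

```sql
-- Drop the conflicting table, then rerun the pipeline to recreate it
DROP TABLE IF EXISTS taylor_sandbox.silver.events;
```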

2. AI Query Timeouts

Error:

Solution: Add retry logic and timeout configuration in pipeline YAML:
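
As a sketch (pipeline configuration keys are free-form and read by the pipeline code, so the key names below are assumptions; check how this repo's code reads its configuration):

```yaml
# Illustrative only — adjust key names to match the pipeline code
resources:
  pipelines:
    my_pipeline:
      configuration:
        ai_query.timeout_seconds: "300"  # hypothetical setting
        ai_query.max_retries: "3"        # hypothetical setting
```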

3. Volume Access Errors

Error:

Solution: Create volume directories manually:
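
For example, with the Databricks CLI (the path is a placeholder; match it to the volumes your pipelines expect):

```shell
databricks fs mkdir dbfs:/Volumes/taylor_sandbox/bronze/raw --profile ref-dev
```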

Or use make deploy, which creates these volume directories automatically.

For more debugging help, see the Debugging Guide.


Common First-Day Issues

Issue: "Catalog does not exist"

Problem: Your sandbox catalog wasn't created automatically.

Solution:
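
Either rerun make deploy (which creates the catalog if needed), or create it manually:

```sql
-- Replace taylor_sandbox with your own catalog name
CREATE CATALOG IF NOT EXISTS taylor_sandbox;
```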

Issue: "Authentication failed"

Problem: Databricks CLI not configured correctly.

Solution:

  1. Run databricks auth login --host https://dbc-a72d6af9-df3d.cloud.databricks.com

  2. Follow the OAuth flow in your browser

  3. Update ~/.databrickscfg if needed

Issue: "AWS credentials not found"

Problem: AWS SSO session expired.

Solution:
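
Refresh the session with the profile from the prerequisites:

```shell
aws sso login --profile ref-ml-core
```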

Issue: "Permission denied on external location"

Problem: Service principal needs permissions.

Solution: This is a one-time setup issue. Contact the DevOps team; see the Service Principals Guide.


Getting Help

Documentation

Team Communication

  • Questions: Post in team Slack channel

  • Issues: Create GitHub issue

  • Urgent: Escalate to team lead


Checklist


Learn More

For deeper dives into specific topics:


Welcome to the team!
