Developer Getting Started Guide

Welcome to the ML Pipelines project! This guide will help you get up and running with your local development environment.

Prerequisites

Before you begin, ensure you have the following installed:

Required Tools

The commands in this guide assume the following are installed locally:

  • Git

  • Databricks CLI

  • AWS CLI v2 (with SSO support)

  • Python 3 (the project builds a wheel)

  • GNU Make (for make deploy)

Account Access

You need access to:

  • Databricks Workspace (dev workspace: dbc-a72d6af9-df3d.cloud.databricks.com)

  • AWS Account (for S3 access to volumes)

  • GitHub Repository (refresh-os/ml-pipelines)

Required Credentials

  1. Databricks Profile: ref-dev configured in ~/.databrickscfg

  2. AWS Profile: ref-ml-core configured via aws sso login


Day 1: Initial Setup

Step 1: Clone the Repository
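
Assuming the repository listed in the prerequisites, cloning looks like this (HTTPS shown; use the SSH remote if that is how your GitHub access is set up):

```shell
# Clone the project and move into it
git clone https://github.com/refresh-os/ml-pipelines.git
cd ml-pipelines
```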

Step 2: Configure Databricks CLI

Create or update your ~/.databrickscfg file:
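
A minimal profile entry looks like the following, assuming OAuth-based login (token-based auth would use a token = line instead):

```ini
[ref-dev]
host      = https://dbc-a72d6af9-df3d.cloud.databricks.com
auth_type = databricks-cli
```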

Test the connection:
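
One way to verify the profile works:

```shell
databricks current-user me --profile ref-dev
```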

You should see your user information returned.

Step 3: Configure AWS CLI
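
Assuming your ~/.aws/config already defines the ref-ml-core SSO profile from the prerequisites, authentication looks like:

```shell
# Log in via SSO using the profile from the prerequisites
aws sso login --profile ref-ml-core

# Confirm the credentials resolve
aws sts get-caller-identity --profile ref-ml-core
```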

Step 4: Understand Your Sandbox Catalog

Your personal sandbox catalog is named {your_username}_sandbox, where the username is the local part (before the @) of your Databricks email. For example, if that local part is taylor, your catalog will be taylor_sandbox.
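
The naming rule can be sketched in Python (sandbox_catalog is a hypothetical helper for illustration, not part of the repo):

```python
def sandbox_catalog(databricks_username: str) -> str:
    """Derive the personal sandbox catalog name from a Databricks username (an email)."""
    short_name = databricks_username.split("@")[0]  # local part of the email
    return f"{short_name}_sandbox"

# An email-style username maps to a per-user catalog
print(sandbox_catalog("taylor@example.com"))  # taylor_sandbox
```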

Sandbox Isolation:

  • You have full read/write access to your sandbox catalog

  • You have read-only access to the dev catalog for shared data

  • Your work is completely isolated from other developers


First Deployment

Step 1: Validate the Bundle
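
From the repository root:

```shell
databricks bundle validate --profile ref-dev
```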

This checks your databricks.yml configuration for syntax errors.

Expected Output:

Step 2: Deploy to Sandbox
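
Deployment is wrapped in a Make target:

```shell
make deploy
```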

This command will:

  1. Extract your short username from Databricks

  2. Check if your sandbox catalog exists (creates if needed)

  3. Create S3 volume directories

  4. Build the Python wheel

  5. Deploy pipelines and jobs to your sandbox

Expected Output:

Step 3: Verify Deployment

Check your Databricks workspace:

  1. Navigate to: Workflows > Delta Live Tables

  2. Look for pipelines named like: taylor_bronze_data_ingestion

  3. Click on a pipeline to see its configuration

Check your Unity Catalog:
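
From a SQL editor or notebook, one way to check (using the example username from earlier; substitute your own catalog name):

```sql
-- Replace taylor_sandbox with your own catalog name
SHOW SCHEMAS IN taylor_sandbox;
```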

You should see schemas: bronze, silver, gold, models


Understanding Your Sandbox

Catalog Structure

Data Strategy

Read from dev catalog:

  • dev.bronze.* - Raw data shared across team

  • dev.silver.* - Processed data shared across team

Write to your sandbox:

  • {username}_sandbox.gold.* - Your experimental features

  • {username}_sandbox.models.* - Your experimental models
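
The read/write split can be captured in a small helper (hypothetical, for illustration only; the table names are placeholders and the repo may organize this differently):

```python
def read_table_name(schema: str, table: str) -> str:
    """Shared data is always read from the team-wide dev catalog."""
    return f"dev.{schema}.{table}"

def write_table_name(username: str, schema: str, table: str) -> str:
    """Experimental outputs go to the caller's personal sandbox catalog."""
    return f"{username}_sandbox.{schema}.{table}"

print(read_table_name("silver", "events"))             # dev.silver.events
print(write_table_name("taylor", "gold", "features"))  # taylor_sandbox.gold.features
```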

Why this approach?

  • No data duplication (faster, cheaper)

  • Test against realistic data

  • Complete isolation for your experiments

  • No risk of breaking other developers' work



Daily Development Workflow

Once you've completed the initial setup, this is your day-to-day workflow for making changes.

Making Changes to Pipelines

1. Edit Pipeline Code

Pipelines use DLT SQL files in resources/pipelines/*/transformations/:

Example Change:
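
A typical edit looks something like this (the table and column names below are illustrative only, not actual tables in the repo):

```sql
-- Hypothetical DLT SQL transformation; names are placeholders
CREATE OR REFRESH MATERIALIZED VIEW silver_events AS
SELECT
  *,
  current_timestamp() AS processed_at  -- example change: add a processing timestamp
FROM dev.bronze.events;
```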

2. Redeploy to Sandbox

This rebuilds the wheel and updates your sandbox pipelines.

3. Test the Changes

4. Verify Results

Making Changes to Models

1. Edit Model Code

Models live in src/models/internal/ or src/models/external/:

2. Update Registration Script (if needed)

3. Redeploy and Register

4. Test Model Serving

Testing Changes Locally

Run Unit Tests
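
Assuming tests live in a standard tests/ directory (the exact invocation may also be wrapped in a Make target):

```shell
python -m pytest tests/
```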

Interactive Testing with Notebooks

  1. Create a test notebook in resources/pipelines/*/explorations/

  2. Set catalog context:

  3. Import your code:
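
Steps 2 and 3 together look roughly like this in a notebook cell (the catalog and module names are placeholders; adjust them to your sandbox and the module you are testing):

```python
# Point the session at your sandbox catalog
spark.sql("USE CATALOG taylor_sandbox")  # replace with your own catalog

# Import from the project source (module path is illustrative)
from models.internal import my_model  # hypothetical module name
```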


Reading from Dev Catalog

Available Dev Tables

You can read from the shared dev catalog to avoid data duplication:
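
For example, from any SQL context (the table name is a placeholder; browse the dev catalog for the real tables):

```sql
SELECT * FROM dev.silver.events LIMIT 10;
```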

Reading Data in Pipelines

Reading Data in Notebooks


Debugging Pipeline Failures

Common Failure Patterns

1. Schema Conflicts

Error:

Solution:
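
One option is to trigger a full refresh so DLT rebuilds the table with the new schema (assuming the current Databricks CLI; you can also trigger this from the pipeline UI):

```shell
# <pipeline-id> is a placeholder for your sandbox pipeline's ID
databricks pipelines start-update <pipeline-id> --full-refresh-all --profile ref-dev
```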

Or drop and recreate:
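
For example (the table name is a placeholder for whichever table has the conflicting schema):

```sql
-- Drop the conflicting table, then rerun the pipeline to recreate it
DROP TABLE IF EXISTS taylor_sandbox.silver.events;
```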

2. AI Query Timeouts

Error:

Solution: Add retry logic and timeout configuration in pipeline YAML:
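
As a sketch (pipeline configuration keys are free-form and read by the pipeline code, so the key names below are assumptions; check how this repo's code reads its configuration):

```yaml
# Illustrative only — adjust key names to match the pipeline code
resources:
  pipelines:
    my_pipeline:
      configuration:
        ai_query.timeout_seconds: "300"  # hypothetical setting
        ai_query.max_retries: "3"        # hypothetical setting
```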

3. Volume Access Errors

Error:

Solution: Create volume directories manually:
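
For example, with the Databricks CLI (the path is a placeholder; match it to the volumes your pipelines expect):

```shell
databricks fs mkdir dbfs:/Volumes/taylor_sandbox/bronze/raw --profile ref-dev
```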

Or use make deploy, which creates these volume directories automatically.

For more debugging help, see the Debugging Guide.


Common First-Day Issues

Issue: "Catalog does not exist"

Problem: Your sandbox catalog wasn't created automatically.

Solution:
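
Either rerun make deploy (which creates the catalog if needed), or create it manually:

```sql
-- Replace taylor_sandbox with your own catalog name
CREATE CATALOG IF NOT EXISTS taylor_sandbox;
```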

Issue: "Authentication failed"

Problem: Databricks CLI not configured correctly.

Solution:

  1. Run databricks auth login --host https://dbc-a72d6af9-df3d.cloud.databricks.com

  2. Follow the OAuth flow in your browser

  3. Update ~/.databrickscfg if needed

Issue: "AWS credentials not found"

Problem: AWS SSO session expired.

Solution:
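
Refresh the session with the profile from the prerequisites:

```shell
aws sso login --profile ref-ml-core
```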

Issue: "Permission denied on external location"

Problem: Service principal needs permissions.

Solution: This is a one-time setup issue. Contact the DevOps team; see the Service Principals Guide.


Getting Help

Documentation

Team Communication

  • Questions: Post in team Slack channel

  • Issues: Create GitHub issue

  • Urgent: Escalate to team lead


Checklist


Learn More

For deeper dives into specific topics:


Welcome to the team!
