Infrastructure Management Guide

Overview

This guide covers Terraform infrastructure management for the Databricks ML Pipelines platform. The infrastructure lives in the infra-core repository under stacks/ml-databricks, following infrastructure-as-code best practices.

Infrastructure Architecture

Repository Structure

Infrastructure Management: All infrastructure is managed in the infra-core repository.

infra-core/
└── stacks/
    └── ml-databricks/
        ├── main.tf                 # Primary infrastructure definitions
        ├── providers.tf            # Terraform and Databricks providers
        ├── variables.tf            # Input variables
        ├── outputs.tf              # Output values
        ├── schemas.tf              # Unity Catalog schema definitions
        ├── backend.tf              # Remote state configuration
        └── .terraform/             # Terraform working directory

Related Repositories:

Infrastructure Components

The ML Databricks stack manages:

  1. Networking:

    • VPC (10.100.0.0/16 CIDR)

    • Private subnets (2 AZs)

    • Public subnets (NAT gateway)

    • Security groups

    • Route tables and internet gateway

  2. Databricks Workspaces:

    • Dev workspace (dbc-a72d6af9-df3d)

    • Staging workspace (dbc-fab2a42a-8d11)

    • Production workspace (dbc-028d9e53-7ce6)

  3. Data Replication & Disaster Recovery:

    • Primary Region: us-east-1 (US East - N. Virginia)

    • DR Region: us-west-2 (US West - Oregon)

    • S3 Cross-Region Replication: Automatic replication of prod data

    • RPO: 24 hours | RTO: 4 hours

    • Purpose: Business continuity and disaster recovery

    See the infra-core repository for Terraform configuration.

  4. Unity Catalog:

    • Metastore assignment

    • Catalog creation (dev, staging, prod)

    • Schema organization (bronze, silver, gold, models)

    • External locations for S3 volumes

  5. Service Principals:

    • ml-pipelines-dev service principal

    • ml-pipelines-staging service principal

    • ml-pipelines-prod service principal

    • Catalog and workspace permissions

Terraform State Management

Remote State Backend

The infrastructure uses S3 for remote state with DynamoDB for state locking.

Configuration (backend.tf):
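A minimal sketch of the backend block; the bucket and key names are assumptions, while the lock table name matches the State Locking section of this guide:

```hcl
# backend.tf — illustrative sketch; bucket and key names are assumptions
terraform {
  backend "s3" {
    bucket         = "infra-core-terraform-state"              # assumed bucket name
    key            = "stacks/ml-databricks/terraform.tfstate"  # assumed key
    region         = "us-east-1"                               # primary region
    encrypt        = true
    dynamodb_table = "terraform-state-lock"                    # state lock table
  }
}
```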

State Operations

View current state:
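For example:

```shell
terraform state list
```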

Show specific resource:
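For example (the resource address is illustrative):

```shell
terraform state show aws_vpc.main
```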

Pull state locally (for backup):
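For example:

```shell
terraform state pull > terraform.tfstate.backup
```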

State Locking

DynamoDB table prevents concurrent modifications:

  • Table: terraform-state-lock

  • Lock acquired automatically on terraform apply

  • Lock released after operation completes

If state is locked:
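Release a stale lock with `force-unlock`, using the lock ID reported in the error message:

```shell
terraform force-unlock <LOCK_ID>
```

Only do this after confirming no other Terraform run is actually in progress.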


Workspace Management

Listing Workspaces
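With account-level authentication configured, a typical invocation (exact subcommands may vary by Databricks CLI version):

```shell
databricks account workspaces list
```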

Workspace Details
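A plausible invocation (subcommand assumed; use the workspace ID from the list output):

```shell
databricks account workspaces get <WORKSPACE_ID>
```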

Adding New Workspace

  1. Edit variables.tf to add new workspace configuration:

  2. Plan the change:

  3. Apply if plan looks correct:

  4. Verify workspace created:
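The four steps above can be sketched as follows; the shape of the workspace map in variables.tf and the Databricks CLI subcommand are assumptions:

```shell
# 1. In variables.tf, add an entry to the existing workspace map (shape assumed):
#      sandbox = { workspace_name = "ml-sandbox", region = "us-east-1" }
# 2. Plan the change and save it for review:
terraform plan -out=tfplan
# 3. Apply the saved plan if it looks correct:
terraform apply tfplan
# 4. Verify the new workspace appears (CLI subcommand assumed):
databricks account workspaces list
```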


Service Principal Management

Listing Service Principals
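A typical invocation (subcommand assumed, per recent Databricks CLI versions):

```shell
databricks account service-principals list
```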

Service Principal Permissions

Service principals are managed with Terraform, but permissions may sometimes need manual adjustment.

View current permissions:
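For example, from a SQL editor against the workspace (catalog name per the stack):

```sql
SHOW GRANTS ON CATALOG prod;
```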

Grant additional permissions (if needed):
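An illustrative grant; the privileges shown are examples, and in practice service principals are usually referenced by application ID rather than display name:

```sql
-- Example grants for the prod service principal (assumed identifiers)
GRANT USE CATALOG ON CATALOG prod TO `ml-pipelines-prod`;
GRANT USE SCHEMA, SELECT ON SCHEMA prod.gold TO `ml-pipelines-prod`;
```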

Service Principal OIDC Configuration

OIDC configuration is managed in Databricks Account Console, not Terraform:

  1. Log into Account Console: https://accounts.cloud.databricks.com/

  2. Navigate to: Service Principals → [Select SP] → Authentication

  3. Click Add GitHub OIDC Configuration

  4. Configure:

    • Issuer: https://token.actions.githubusercontent.com

    • Audience: https://<workspace>.cloud.databricks.com/oidc/v1/token

    • Subject Pattern: repo:refresh-os/ml-pipelines:environment:{environment}


Unity Catalog Management

Catalog Structure

Managed in Terraform schemas.tf:
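A minimal sketch of one such schema resource; resource names and the comment are assumptions:

```hcl
# schemas.tf — illustrative sketch; one schema per layer, per catalog
resource "databricks_schema" "prod_bronze" {
  catalog_name = "prod"
  name         = "bronze"
  comment      = "Raw ingested data"  # assumed description
}
# ...repeated for silver, gold, and models across dev/staging/prod
```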

Adding New Schema

  1. Edit schemas.tf:

  2. Plan and apply:
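For example:

```shell
terraform plan -out=tfplan
terraform apply tfplan
```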

External Locations for Volumes

External locations connect Unity Catalog to S3:
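A minimal sketch of an external location resource; the bucket URL and credential reference are assumptions:

```hcl
# Illustrative; bucket and credential names are assumptions
resource "databricks_external_location" "prod_volumes" {
  name            = "prod-volumes"
  url             = "s3://ml-pipelines-prod-volumes"
  credential_name = databricks_storage_credential.prod.name
}
```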

Verify external location:
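For example, from a SQL editor (location name per the sketch above is an assumption):

```sql
DESCRIBE EXTERNAL LOCATION `prod-volumes`;
```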


Drift Detection

What is Drift?

Drift occurs when infrastructure state diverges from Terraform configuration due to:

  • Manual changes in Databricks UI

  • Changes made by other tools/scripts

  • Deletions or modifications outside Terraform

Detecting Drift

Run terraform plan regularly to detect drift:

Look for:

  • Resources to be updated (attributes changed outside Terraform)

  • Resources to be created (resources deleted outside Terraform, which Terraform will recreate)

  • Note: resources created outside Terraform do not appear in the plan at all until they are imported

Handling Drift

Option 1: Import Manual Changes into Terraform

If changes were intentional and should be kept:
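For example (the resource address and ID are illustrative; the import ID format depends on the resource type, so check the provider docs):

```shell
terraform import databricks_schema.sandbox "prod.sandbox"
```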

Option 2: Revert Manual Changes

If changes were accidental:
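Since Terraform configuration is the source of truth, a plain plan/apply reverts the drift:

```shell
# plan shows the manual change being reverted; apply restores the declared state
terraform plan
terraform apply
```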

Option 3: Update Terraform to Match Manual Changes

If manual changes reflect new requirements:

  1. Update Terraform code to match manual changes

  2. Run terraform plan to verify no changes needed

  3. Commit Terraform changes to git

Automated Drift Detection

Set up scheduled drift detection:
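One common approach is a scheduled CI job that runs a plan and fails on drift. A minimal GitHub Actions sketch (file path, schedule, and authentication setup are assumptions; credentials configuration is omitted):

```yaml
# .github/workflows/drift-detection.yml — illustrative sketch
name: drift-detection
on:
  schedule:
    - cron: "0 6 * * *"   # daily at 06:00 UTC
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
        working-directory: stacks/ml-databricks
      # exit code 2 (drift) fails the job, surfacing the drift
      - run: terraform plan -detailed-exitcode
        working-directory: stacks/ml-databricks
```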


Infrastructure Changes Workflow

Standard Change Process

Follow this process for all infrastructure changes:

Step 1: Create Feature Branch
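For example (the branch name is illustrative):

```shell
git checkout -b infra/add-sandbox-schema
```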

Step 2: Make Infrastructure Changes

Edit Terraform files (main.tf, schemas.tf, etc.)

Step 3: Validate Syntax
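For example:

```shell
terraform fmt -check
terraform validate
```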

Step 4: Plan Changes
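Save the plan so that the reviewed plan is exactly what gets applied later:

```shell
terraform plan -out=tfplan
```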

Review plan output carefully:

  • What resources will be created?

  • What will be modified?

  • What will be destroyed? (usually none)

  • Are changes expected?

Step 5: Create Pull Request

Create PR in GitHub with:

  • Description of changes

  • Terraform plan output

  • Impact assessment

  • Rollback plan

Step 6: Review and Approval

PR review checklist:

Step 7: Apply Changes

After PR approval:
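For example, applying the saved plan from Step 4:

```shell
terraform apply tfplan
```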

Step 8: Verify Changes
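A clean follow-up plan confirms the applied state matches the configuration:

```shell
terraform plan   # should report "No changes"
```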

Step 9: Document Changes

Update documentation:

  • This guide (if infrastructure changed)

  • Architecture docs (if structure changed)

  • CHANGELOG.md

Emergency Changes

For production incidents requiring immediate infrastructure changes:

  1. Document reason for bypassing normal process

  2. Make change with terraform apply

  3. Verify fix resolves incident

  4. Create follow-up PR with same change for review

  5. Document in incident report


Terraform Troubleshooting

Issue 1: State Lock Error

Error:

Cause: Another Terraform operation is in progress, or a stale lock was left behind

Solution:

Prevention: Ensure only one person runs Terraform at a time


Issue 2: Provider Authentication Error

Error:

Cause: Databricks credentials expired

Solution:
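Re-authenticate with the Databricks CLI (the host URL is illustrative; `auth login` is available in recent CLI versions):

```shell
databricks auth login --host https://accounts.cloud.databricks.com
```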


Issue 3: Resource Already Exists

Error:

Cause: Resource created outside Terraform

Solution:


Issue 4: Workspace Not Found

Error:

Cause: Workspace deleted or ID incorrect

Solution:
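Confirm the real workspace ID, then either correct the configuration or drop the stale state entry (the CLI subcommand and resource address are illustrative):

```shell
databricks account workspaces list
terraform state rm databricks_mws_workspaces.old
```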


State Recovery Procedures

Scenario 1: State File Corrupted

Symptoms: Terraform errors on all operations

Recovery:
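With S3 versioning on the state bucket, a known-good version can be restored; the bucket and key names are assumptions:

```shell
# List versions of the state object
aws s3api list-object-versions \
  --bucket infra-core-terraform-state \
  --key stacks/ml-databricks/terraform.tfstate
# Download a known-good version and push it back as the current state
aws s3api get-object \
  --bucket infra-core-terraform-state \
  --key stacks/ml-databricks/terraform.tfstate \
  --version-id <VERSION_ID> terraform.tfstate.restored
terraform state push terraform.tfstate.restored
```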


Scenario 2: State Out of Sync

Symptoms: Terraform wants to recreate existing resources

Recovery:
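A refresh-only apply reconciles state with the real infrastructure without changing any resources; anything Terraform still wants to recreate can then be re-imported:

```shell
terraform apply -refresh-only
```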


Scenario 3: Lost State File

Symptoms: State file deleted, Terraform treats everything as new

Recovery:


Best Practices

1. Always Run Plan Before Apply

2. Use Targeted Operations Carefully
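Targeted plans skip dependency updates, so reserve them for surgical fixes (the target address is illustrative):

```shell
terraform plan -target=databricks_schema.prod_bronze
```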

3. Keep State Secure

  • State contains sensitive data (credentials, IDs)

  • S3 bucket has encryption enabled

  • Access restricted to infrastructure team

  • Never commit state to git

4. Document All Changes

  • Update this guide for infrastructure changes

  • Add comments to Terraform code

  • Document in PR descriptions

  • Update CHANGELOG.md

5. Test in Dev First

  • Make changes in dev workspace first

  • Validate functionality

  • Then promote to staging and prod

6. Regular State Backups


Useful Commands Reference

Terraform Commands
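A quick reference for the commands used throughout this guide:

```shell
terraform init                     # initialize providers and backend
terraform fmt -check               # check formatting
terraform validate                 # validate syntax
terraform plan -out=tfplan         # preview and save changes
terraform apply tfplan             # apply a saved plan
terraform state list               # list managed resources
terraform state pull               # fetch remote state (e.g. for backup)
terraform import <ADDR> <ID>       # adopt an existing resource
terraform force-unlock <LOCK_ID>   # clear a stale state lock
```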

Databricks CLI Commands
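Common account-level commands (exact subcommands may vary by CLI version):

```shell
databricks account workspaces list           # list workspaces
databricks account service-principals list   # list service principals
databricks catalogs list                     # list Unity Catalog catalogs
```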

AWS CLI Commands
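Useful for inspecting the state backend; the bucket and table names are assumptions (the lock table name follows the State Locking section):

```shell
aws s3 ls s3://infra-core-terraform-state/stacks/ml-databricks/
aws dynamodb scan --table-name terraform-state-lock   # inspect lock entries
```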



Infrastructure Contacts

For infrastructure issues:

For emergencies:

  • Follow incident response procedures

  • Notify #incidents channel

  • Contact on-call engineer
