Infrastructure Management Guide

Overview

This guide covers Terraform infrastructure management for the Databricks ML Pipelines platform. The infrastructure lives in the infra-core repository under stacks/ml-databricks, following infrastructure-as-code best practices.

Infrastructure Architecture

Repository Structure

Infrastructure Management: All infrastructure is managed in the infra-core repository.

infra-core/
└── stacks/
    └── ml-databricks/
        ├── main.tf                 # Primary infrastructure definitions
        ├── providers.tf            # Terraform and Databricks providers
        ├── variables.tf            # Input variables
        ├── outputs.tf              # Output values
        ├── schemas.tf              # Unity Catalog schema definitions
        ├── backend.tf              # Remote state configuration
        └── .terraform/             # Terraform working directory

Related Repositories:

Infrastructure Components

The ML Databricks stack manages:

  1. Networking:

    • VPC (10.100.0.0/16 CIDR)

    • Private subnets (2 AZs)

    • Public subnets (NAT gateway)

    • Security groups

    • Route tables and internet gateway

  2. Databricks Workspaces:

    • Dev workspace (dbc-a72d6af9-df3d)

    • Staging workspace (dbc-fab2a42a-8d11)

    • Production workspace (dbc-028d9e53-7ce6)

  3. Data Replication & Disaster Recovery:

    • Primary Region: us-east-1 (US East - N. Virginia)

    • DR Region: us-west-2 (US West - Oregon)

    • S3 Cross-Region Replication: Automatic replication of prod data

    • RPO: 24 hours | RTO: 4 hours

    • Purpose: Business continuity and disaster recovery

    See the infra-core repository for Terraform configuration.

  4. Unity Catalog:

    • Metastore assignment

    • Catalog creation (dev, staging, prod)

    • Schema organization (bronze, silver, gold, models)

    • External locations for S3 volumes

  5. Service Principals:

    • ml-pipelines-dev service principal

    • ml-pipelines-staging service principal

    • ml-pipelines-prod service principal

    • Catalog and workspace permissions

Terraform State Management

Remote State Backend

The infrastructure uses S3 for remote state with DynamoDB for state locking.

Configuration (backend.tf):
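A minimal sketch of the backend block; the bucket and key names are assumptions, while the lock table name matches the State Locking section of this guide:

```hcl
# backend.tf — illustrative sketch; bucket and key names are assumptions
terraform {
  backend "s3" {
    bucket         = "infra-core-terraform-state"              # assumed bucket name
    key            = "stacks/ml-databricks/terraform.tfstate"  # assumed key
    region         = "us-east-1"                               # primary region
    encrypt        = true
    dynamodb_table = "terraform-state-lock"                    # state lock table
  }
}
```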

State Operations

View current state:
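For example:

```shell
terraform state list
```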

Show specific resource:
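For example (the resource address is illustrative):

```shell
terraform state show aws_vpc.main
```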

Pull state locally (for backup):
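For example:

```shell
terraform state pull > terraform.tfstate.backup
```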

State Locking

DynamoDB table prevents concurrent modifications:

  • Table: terraform-state-lock

  • Lock acquired automatically on terraform apply

  • Lock released after operation completes

If state is locked:
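Release a stale lock with `force-unlock`, using the lock ID reported in the error message:

```shell
terraform force-unlock <LOCK_ID>
```

Only do this after confirming no other Terraform run is actually in progress.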


Workspace Management

Listing Workspaces
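With account-level authentication configured, a typical invocation (exact subcommands may vary by Databricks CLI version):

```shell
databricks account workspaces list
```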

Workspace Details
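A plausible invocation (subcommand assumed; use the workspace ID from the list output):

```shell
databricks account workspaces get <WORKSPACE_ID>
```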

Adding New Workspace

  1. Edit variables.tf to add new workspace configuration:

  2. Plan the change:

  3. Apply if plan looks correct:

  4. Verify workspace created:
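The four steps above can be sketched as follows; the shape of the workspace map in variables.tf and the Databricks CLI subcommand are assumptions:

```shell
# 1. In variables.tf, add an entry to the existing workspace map (shape assumed):
#      sandbox = { workspace_name = "ml-sandbox", region = "us-east-1" }
# 2. Plan the change and save it for review:
terraform plan -out=tfplan
# 3. Apply the saved plan if it looks correct:
terraform apply tfplan
# 4. Verify the new workspace appears (CLI subcommand assumed):
databricks account workspaces list
```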


Service Principal Management

Listing Service Principals
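A typical invocation (subcommand assumed, per recent Databricks CLI versions):

```shell
databricks account service-principals list
```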

Service Principal Permissions

Service principals are managed with Terraform, but permissions may sometimes need manual adjustment.

View current permissions:
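For example, from a SQL editor against the workspace (catalog name per the stack):

```sql
SHOW GRANTS ON CATALOG prod;
```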

Grant additional permissions (if needed):
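An illustrative grant; the privileges shown are examples, and in practice service principals are usually referenced by application ID rather than display name:

```sql
-- Example grants for the prod service principal (assumed identifiers)
GRANT USE CATALOG ON CATALOG prod TO `ml-pipelines-prod`;
GRANT USE SCHEMA, SELECT ON SCHEMA prod.gold TO `ml-pipelines-prod`;
```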

Service Principal OIDC Configuration

OIDC configuration is managed in Databricks Account Console, not Terraform:

  1. Log into Account Console: https://accounts.cloud.databricks.com/

  2. Navigate to: Service Principals → [Select SP] → Authentication

  3. Click Add GitHub OIDC Configuration

  4. Configure:

    • Issuer: https://token.actions.githubusercontent.com

    • Audience: https://<workspace>.cloud.databricks.com/oidc/v1/token

    • Subject Pattern: repo:refresh-os/ml-pipelines:environment:{environment}


Unity Catalog Management

Catalog Structure

Managed in Terraform schemas.tf:
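A minimal sketch of one such schema resource; resource names and the comment are assumptions:

```hcl
# schemas.tf — illustrative sketch; one schema per layer, per catalog
resource "databricks_schema" "prod_bronze" {
  catalog_name = "prod"
  name         = "bronze"
  comment      = "Raw ingested data"  # assumed description
}
# ...repeated for silver, gold, and models across dev/staging/prod
```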

Adding New Schema

  1. Edit schemas.tf:

  2. Plan and apply:
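For example:

```shell
terraform plan -out=tfplan
terraform apply tfplan
```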

External Locations for Volumes

External locations connect Unity Catalog to S3:
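A minimal sketch of an external location resource; the bucket URL and credential reference are assumptions:

```hcl
# Illustrative; bucket and credential names are assumptions
resource "databricks_external_location" "prod_volumes" {
  name            = "prod-volumes"
  url             = "s3://ml-pipelines-prod-volumes"
  credential_name = databricks_storage_credential.prod.name
}
```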

Verify external location:
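For example, from a SQL editor (location name per the sketch above is an assumption):

```sql
DESCRIBE EXTERNAL LOCATION `prod-volumes`;
```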


Drift Detection

What is Drift?

Drift occurs when infrastructure state diverges from Terraform configuration due to:

  • Manual changes in Databricks UI

  • Changes made by other tools/scripts

  • Deletions or modifications outside Terraform

Detecting Drift

Run terraform plan regularly to detect drift:

Look for:

  • Resources to be updated (attributes changed outside Terraform)

  • Resources to be created (resources deleted outside Terraform, which Terraform will recreate)

  • Note: resources created outside Terraform do not appear in the plan at all until they are imported

Handling Drift

Option 1: Import Manual Changes into Terraform

If changes were intentional and should be kept:
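For example (the resource address and ID are illustrative; the import ID format depends on the resource type, so check the provider docs):

```shell
terraform import databricks_schema.sandbox "prod.sandbox"
```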

Option 2: Revert Manual Changes

If changes were accidental:
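Since Terraform configuration is the source of truth, a plain plan/apply reverts the drift:

```shell
# plan shows the manual change being reverted; apply restores the declared state
terraform plan
terraform apply
```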

Option 3: Update Terraform to Match Manual Changes

If manual changes reflect new requirements:

  1. Update Terraform code to match manual changes

  2. Run terraform plan to verify no changes needed

  3. Commit Terraform changes to git

Automated Drift Detection

Set up scheduled drift detection:
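One common approach is a scheduled CI job that runs a plan and fails on drift. A minimal GitHub Actions sketch (file path, schedule, and authentication setup are assumptions; credentials configuration is omitted):

```yaml
# .github/workflows/drift-detection.yml — illustrative sketch
name: drift-detection
on:
  schedule:
    - cron: "0 6 * * *"   # daily at 06:00 UTC
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
        working-directory: stacks/ml-databricks
      # exit code 2 (drift) fails the job, surfacing the drift
      - run: terraform plan -detailed-exitcode
        working-directory: stacks/ml-databricks
```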


Infrastructure Changes Workflow

Standard Change Process

Follow this process for all infrastructure changes:

Step 1: Create Feature Branch
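For example (the branch name is illustrative):

```shell
git checkout -b infra/add-sandbox-schema
```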

Step 2: Make Infrastructure Changes

Edit Terraform files (main.tf, schemas.tf, etc.)

Step 3: Validate Syntax
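For example:

```shell
terraform fmt -check
terraform validate
```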

Step 4: Plan Changes
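Save the plan so that the reviewed plan is exactly what gets applied later:

```shell
terraform plan -out=tfplan
```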

Review plan output carefully:

  • What resources will be created?

  • What will be modified?

  • What will be destroyed? (usually none)

  • Are changes expected?

Step 5: Create Pull Request

Create PR in GitHub with:

  • Description of changes

  • Terraform plan output

  • Impact assessment

  • Rollback plan

Step 6: Review and Approval

PR review checklist:

Step 7: Apply Changes

After PR approval:
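For example, applying the saved plan from Step 4:

```shell
terraform apply tfplan
```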

Step 8: Verify Changes
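A clean follow-up plan confirms the applied state matches the configuration:

```shell
terraform plan   # should report "No changes"
```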

Step 9: Document Changes

Update documentation:

  • This guide (if infrastructure changed)

  • Architecture docs (if structure changed)

  • CHANGELOG.md

Emergency Changes

For production incidents requiring immediate infrastructure changes:

  1. Document reason for bypassing normal process

  2. Make change with terraform apply

  3. Verify fix resolves incident

  4. Create follow-up PR with same change for review

  5. Document in incident report


Terraform Troubleshooting

Issue 1: State Lock Error

Error:

Cause: Another Terraform operation is in progress, or a stale lock was left behind

Solution:

Prevention: Ensure only one person runs Terraform at a time


Issue 2: Provider Authentication Error

Error:

Cause: Databricks credentials expired

Solution:
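Re-authenticate with the Databricks CLI (the host URL is illustrative; `auth login` is available in recent CLI versions):

```shell
databricks auth login --host https://accounts.cloud.databricks.com
```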


Issue 3: Resource Already Exists

Error:

Cause: Resource created outside Terraform

Solution:


Issue 4: Workspace Not Found

Error:

Cause: Workspace deleted or ID incorrect

Solution:
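Confirm the real workspace ID, then either correct the configuration or drop the stale state entry (the CLI subcommand and resource address are illustrative):

```shell
databricks account workspaces list
terraform state rm databricks_mws_workspaces.old
```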


State Recovery Procedures

Scenario 1: State File Corrupted

Symptoms: Terraform errors on all operations

Recovery:
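With S3 versioning on the state bucket, a known-good version can be restored; the bucket and key names are assumptions:

```shell
# List versions of the state object
aws s3api list-object-versions \
  --bucket infra-core-terraform-state \
  --key stacks/ml-databricks/terraform.tfstate
# Download a known-good version and push it back as the current state
aws s3api get-object \
  --bucket infra-core-terraform-state \
  --key stacks/ml-databricks/terraform.tfstate \
  --version-id <VERSION_ID> terraform.tfstate.restored
terraform state push terraform.tfstate.restored
```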


Scenario 2: State Out of Sync

Symptoms: Terraform wants to recreate existing resources

Recovery:
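A refresh-only apply reconciles state with the real infrastructure without changing any resources; anything Terraform still wants to recreate can then be re-imported:

```shell
terraform apply -refresh-only
```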


Scenario 3: Lost State File

Symptoms: State file deleted, Terraform treats everything as new

Recovery:


Best Practices

1. Always Run Plan Before Apply

2. Use Targeted Operations Carefully
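Targeted plans skip dependency updates, so reserve them for surgical fixes (the target address is illustrative):

```shell
terraform plan -target=databricks_schema.prod_bronze
```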

3. Keep State Secure

  • State contains sensitive data (credentials, IDs)

  • S3 bucket has encryption enabled

  • Access restricted to infrastructure team

  • Never commit state to git

4. Document All Changes

  • Update this guide for infrastructure changes

  • Add comments to Terraform code

  • Document in PR descriptions

  • Update CHANGELOG.md

5. Test in Dev First

  • Make changes in dev workspace first

  • Validate functionality

  • Then promote to staging and prod

6. Regular State Backups


Useful Commands Reference

Terraform Commands
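A quick reference for the commands used throughout this guide:

```shell
terraform init                     # initialize providers and backend
terraform fmt -check               # check formatting
terraform validate                 # validate syntax
terraform plan -out=tfplan         # preview and save changes
terraform apply tfplan             # apply a saved plan
terraform state list               # list managed resources
terraform state pull               # fetch remote state (e.g. for backup)
terraform import <ADDR> <ID>       # adopt an existing resource
terraform force-unlock <LOCK_ID>   # clear a stale state lock
```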

Databricks CLI Commands
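Common account-level commands (exact subcommands may vary by CLI version):

```shell
databricks account workspaces list           # list workspaces
databricks account service-principals list   # list service principals
databricks catalogs list                     # list Unity Catalog catalogs
```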

AWS CLI Commands
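Useful for inspecting the state backend; the bucket and table names are assumptions (the lock table name follows the State Locking section):

```shell
aws s3 ls s3://infra-core-terraform-state/stacks/ml-databricks/
aws dynamodb scan --table-name terraform-state-lock   # inspect lock entries
```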



Infrastructure Contacts

For infrastructure issues:

For emergencies:

  • Follow incident response procedures

  • Notify #incidents channel

  • Contact on-call engineer
