Secret Rotation Runbook

Purpose

This runbook provides step-by-step procedures for rotating service principal credentials, GitHub OIDC configurations, and other secrets used in the ML Pipelines infrastructure. Regular rotation limits how long a leaked credential remains usable and reduces the risk of unauthorized access.

When to Rotate

  • Service Principal OIDC: Review the trust configuration every 180 days (the tokens themselves are issued automatically via GitHub OIDC)

  • GitHub Personal Access Tokens: Every 90 days

  • Databricks Personal Access Tokens: Every 90 days (if used)

  • AWS IAM Credentials: Every 90 days (if not using roles)
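
The intervals above can be turned into due dates with a small helper. This is a minimal sketch, assuming the last rotation date of each credential is recorded somewhere; the `ROTATION_DAYS` mapping and its key names are illustrative labels, not identifiers used by the pipeline.

```python
from datetime import date, timedelta

# Rotation intervals in days, taken from the schedule above.
# The keys are illustrative labels (an assumption), not pipeline identifiers.
ROTATION_DAYS = {
    "service_principal_oidc": 180,
    "github_pat": 90,
    "databricks_pat": 90,
    "aws_iam_credentials": 90,
}


def next_rotation(kind: str, last_rotated: date) -> date:
    """Return the date by which a credential of this kind should be rotated."""
    return last_rotated + timedelta(days=ROTATION_DAYS[kind])


def is_overdue(kind: str, last_rotated: date, today: date) -> bool:
    """True once the rotation deadline has passed."""
    return today > next_rotation(kind, last_rotated)
```

For example, `next_rotation("github_pat", date(2025, 1, 1))` gives 1 April 2025, which is 90 days later.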

Immediate Rotation (Security Events)

Rotate immediately if:

  • Credential exposure suspected or confirmed

  • Employee offboarding (within 24 hours)

  • Security audit findings

  • Compromised GitHub repository

  • Unauthorized access detected

  • Compliance requirement

Pre-Rotation Checklist

Before rotating any credentials:


Procedure 1: Service Principal OIDC Configuration

Overview

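GitHub OIDC uses federated authentication, eliminating the need for long-lived secrets. GitHub Actions issues a short-lived OIDC token for each workflow run, and Databricks validates it against a trust configuration (issuer, audience, and subject pattern). Because no secret is stored, this procedure does not rotate a credential; it reviews and updates that trust configuration.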

When to Rotate

  • Changing GitHub repository structure

  • Updating environment names

  • Modifying security policies

  • Annually as best practice

Step-by-Step Procedure

Step 1: Document Current Configuration

  1. Log in to the Databricks Account Console:

  2. Navigate to: Service Principals → Select service principal

  3. Click Authentication → GitHub OIDC

  4. Document current settings:

  5. Save configuration to secure location

Step 2: Verify GitHub Workflow Configuration

  1. Check workflow files match OIDC config:

  2. Verify environment names:

Step 3: Update OIDC Configuration (If Needed)

  1. In Databricks Account Console → Service Principal → Authentication

  2. Click Edit on GitHub OIDC configuration

  3. Update fields:

    • Issuer: https://token.actions.githubusercontent.com (rarely changes)

    • Audience: Workspace-specific OIDC URL

    • Subject Pattern: repo:refresh-os/ml-pipelines:environment:{environment}

  4. Click Save
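
Before saving, it can help to check that the subject pattern expands to the value GitHub will put in the token for each environment. The sketch below is illustrative only: it builds the expected `sub` claim from the pattern above and compares it with the claims of a token. `decode_claims` reads the token payload without verifying the signature, so it is suitable for debugging and never for authentication decisions. The function names are hypothetical.

```python
import base64
import json

ISSUER = "https://token.actions.githubusercontent.com"
SUBJECT_TEMPLATE = "repo:refresh-os/ml-pipelines:environment:{environment}"


def expected_subject(environment: str) -> str:
    """Expand the subject pattern for one GitHub environment name."""
    return SUBJECT_TEMPLATE.format(environment=environment)


def decode_claims(jwt: str) -> dict:
    """Decode a JWT payload WITHOUT verifying the signature (debugging only)."""
    payload = jwt.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def claims_match(claims: dict, environment: str) -> bool:
    """Check the issuer and subject claims against the trust configuration."""
    return (
        claims.get("iss") == ISSUER
        and claims.get("sub") == expected_subject(environment)
    )
```

A mismatch here usually means the GitHub environment name in the workflow differs from the one in the subject pattern.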

Step 4: Test OIDC Authentication

  1. Trigger test workflow in dev:

  2. Monitor GitHub Actions log for authentication:

  3. If successful, merge test branch

  4. If it fails, check the error message and roll back (see Rollback Procedure below)
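
One way to trigger the test workflow without pushing a commit is the GitHub REST API's workflow dispatch endpoint. This is a sketch under assumptions: the workflow file name and branch passed by the caller are hypothetical, the workflow must declare a `workflow_dispatch` trigger, and the token needs permission to run workflows in the repository.

```python
import json
import urllib.request


def build_dispatch_request(token: str, workflow_file: str, ref: str,
                           repo: str = "refresh-os/ml-pipelines") -> urllib.request.Request:
    """Build a POST request for the workflow dispatch endpoint."""
    url = (
        f"https://api.github.com/repos/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    body = json.dumps({"ref": ref}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "X-GitHub-Api-Version": "2022-11-28",
        },
    )


def trigger_workflow(token: str, workflow_file: str, ref: str) -> int:
    """Send the dispatch request; GitHub answers 204 No Content on success."""
    request = build_dispatch_request(token, workflow_file, ref)
    with urllib.request.urlopen(request) as response:
        return response.status
```

The run itself still has to be watched in the Actions tab; the dispatch call only queues it.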

Step 5: Verify All Environments

  1. Test dev environment:

  2. Test staging environment (after dev succeeds)

  3. Test production environment (after staging succeeds)

Step 6: Update Documentation

  1. Update this runbook with new configuration (if changed)

  2. Record rotation in security log:

Rollback Procedure

If OIDC authentication fails:

  1. Revert to previous subject pattern in Databricks Account Console

  2. Verify previous configuration works:

  3. If still failing, check:

    • Service principal still exists

    • Workspace URL correct

    • GitHub environment name matches

Verification

Timeline: 15-30 minutes (all environments)


Procedure 2: GitHub Personal Access Token (GH_PAT)

Overview

The GitHub Personal Access Token (stored as the GH_PAT repository secret) is used in workflows to check out code with submodules. Rotate it every 90 days.

When to Rotate

  • Every 90 days (scheduled)

  • Suspected exposure

  • Employee offboarding

  • Token expiration approaching

Step-by-Step Procedure

Step 1: Create New GitHub PAT

  1. Go to GitHub → Settings → Developer settings → Personal access tokens → Tokens (classic)

  2. Click Generate new token (classic)

  3. Configure token:

    • Note: ML Pipelines CI/CD - 2025-Q4

    • Expiration: 90 days

    • Scopes:

      • repo (full control of private repositories)

      • workflow (update GitHub Action workflows)

  4. Click Generate token

  5. Copy token immediately (shown only once)

Step 2: Update GitHub Secret

  1. Go to repository → Settings → Secrets and variables → Actions

  2. Find GH_PAT secret

  3. Click Update

  4. Paste new token value

  5. Click Update secret

Step 3: Test New Token

  1. Trigger a workflow that uses GH_PAT:

  2. Monitor workflow for submodule checkout:

  3. Verify no authentication errors
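
The scopes of a classic token can also be checked directly, independent of any workflow. For classic tokens, GitHub returns the granted scopes in the `X-OAuth-Scopes` response header of authenticated API calls. The sketch below assumes that header is present and compares it with the two scopes listed in Step 1; the function names are hypothetical.

```python
import urllib.request
from typing import Optional

# Scopes this runbook asks for when creating the token (Step 1).
REQUIRED_SCOPES = {"repo", "workflow"}


def parse_scopes(header_value: Optional[str]) -> set:
    """Split a comma-separated X-OAuth-Scopes header into a set of scope names."""
    if not header_value:
        return set()
    return {scope.strip() for scope in header_value.split(",") if scope.strip()}


def missing_scopes(header_value: Optional[str]) -> set:
    """Return the required scopes that the token does not have."""
    return REQUIRED_SCOPES - parse_scopes(header_value)


def fetch_token_scopes(token: str) -> Optional[str]:
    """Call the GitHub API with the token and return the raw scopes header."""
    request = urllib.request.Request(
        "https://api.github.com/user",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return response.headers.get("X-OAuth-Scopes")
```

An empty result from `missing_scopes(fetch_token_scopes(token))` means both scopes are present. A 401 error from the call indicates the token is invalid, expired, or revoked.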

Step 4: Revoke Old Token

  1. Wait 24 hours to ensure new token works

  2. Go to GitHub → Settings → Developer settings → Personal access tokens

  3. Find old token (by note with previous date)

  4. Click Delete

  5. Confirm deletion

Step 5: Update Rotation Schedule

  1. Set reminder for 90 days:

  2. Document in security log

Rollback Procedure

If new token fails:

  1. Restore the GH_PAT secret to the old token value (only possible if the old token has not been revoked and its value is still available)

  2. Verify old token works:

  3. Investigate why new token failed:

    • Check token scopes

    • Verify token not expired

    • Ensure token associated with correct account

  4. Create another new token with correct configuration

Verification

Timeline: 10-15 minutes


Procedure 3: Service Principal Permissions Audit

Overview

Periodically audit service principal permissions to ensure least privilege access and identify unused permissions.

When to Audit

  • Quarterly (every 90 days)

  • After major infrastructure changes

  • Before compliance audits

  • After adding new environments

Step-by-Step Procedure

Step 1: List All Service Principals

  1. Get service principal IDs:
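
Service principals can be listed through the Databricks workspace SCIM API. The following sketch assumes a workspace host URL and an admin token supplied by the caller, and that the pipeline's service principals share the `ml-pipelines-` name prefix used in the permissions table of this runbook. It maps each display name to its application ID.

```python
import json
import urllib.request

SCIM_PATH = "/api/2.0/preview/scim/v2/ServicePrincipals"


def build_list_request(host: str, token: str) -> urllib.request.Request:
    """Build a GET request for the workspace SCIM service principal list."""
    return urllib.request.Request(
        host.rstrip("/") + SCIM_PATH,
        headers={"Authorization": f"Bearer {token}"},
    )


def principal_ids(scim_response: dict, prefix: str = "ml-pipelines-") -> dict:
    """Map displayName to applicationId for principals matching the prefix."""
    return {
        resource["displayName"]: resource.get("applicationId")
        for resource in scim_response.get("Resources", [])
        if resource.get("displayName", "").startswith(prefix)
    }


def list_principals(host: str, token: str) -> dict:
    """Fetch the list and return the matching service principals."""
    with urllib.request.urlopen(build_list_request(host, token)) as response:
        return principal_ids(json.load(response))
```

Large workspaces may paginate this list; the sketch reads only the first page.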

Step 2: Audit Catalog Permissions

For each environment:

Expected permissions:

  Service Principal       Catalog          Permissions
  ml-pipelines-dev        dev              ALL PRIVILEGES
  ml-pipelines-staging    staging          ALL PRIVILEGES
  ml-pipelines-staging    prod             USE CATALOG, SELECT (read-only)
  ml-pipelines-prod       prod             ALL PRIVILEGES
  ml-pipelines-prod       staging.models   USE CATALOG, SELECT (read-only)
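
Comparing actual grants against the expected permissions listed above is a set difference per service principal and catalog. The sketch below encodes those expected permissions and reports excess and missing privileges. It assumes the actual grants have already been collected and normalized to the same spelling (for example "USE CATALOG" rather than an API-style "USE_CATALOG"); how they are collected is outside this example.

```python
# Expected grants from this runbook, keyed by (service principal, securable).
EXPECTED = {
    ("ml-pipelines-dev", "dev"): {"ALL PRIVILEGES"},
    ("ml-pipelines-staging", "staging"): {"ALL PRIVILEGES"},
    ("ml-pipelines-staging", "prod"): {"USE CATALOG", "SELECT"},
    ("ml-pipelines-prod", "prod"): {"ALL PRIVILEGES"},
    ("ml-pipelines-prod", "staging.models"): {"USE CATALOG", "SELECT"},
}


def audit_grants(actual: dict) -> dict:
    """Return {(principal, securable): {"excess": set, "missing": set}}.

    Only pairs that deviate from EXPECTED appear in the result, including
    pairs that have grants but are not expected to have any.
    """
    findings = {}
    for key in sorted(set(EXPECTED) | set(actual)):
        expected = EXPECTED.get(key, set())
        granted = actual.get(key, set())
        excess = granted - expected
        missing = expected - granted
        if excess or missing:
            findings[key] = {"excess": excess, "missing": missing}
    return findings
```

An empty result means the grants match the table exactly. Entries under "excess" are candidates for revocation in Step 6, and entries under "missing" are candidates for new grants.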

Step 3: Audit Workspace Permissions

Check service principal can:

  • Create and manage jobs

  • Create and manage DLT pipelines

  • Create and manage model serving endpoints

  • Read/write to bundle directories

Step 4: Audit S3 Permissions

Verify external location access:

Step 5: Document Findings

Create audit report:

Step 6: Remediate Issues (If Found)

If excess permissions found:

If missing permissions:

Verification

Timeline: 30-45 minutes (all service principals)


Procedure 4: Emergency Credential Revocation

Overview

Use this procedure when credentials are known or suspected to be compromised.

Immediate Actions (0-15 minutes)

Step 1: Revoke Compromised Credentials

For GitHub PAT:

For Service Principal:

Step 2: Notify Team

Post in Slack immediately:
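
If the team uses a Slack incoming webhook, the notification can be sent from a script so that it goes out even while people are busy revoking credentials. This sketch makes several assumptions: the webhook URL is supplied by the caller, and the message wording is a placeholder rather than the team's official template. The alert names the credential but must never contain its value.

```python
import json
import urllib.request


def build_alert(credential: str, environment: str, reporter: str) -> dict:
    """Build the Slack message payload (wording is a placeholder)."""
    text = (
        f"SECURITY INCIDENT: {credential} in {environment} is suspected "
        f"compromised and is being revoked. Reported by {reporter}. "
        "CI/CD may fail until new credentials are in place."
    )
    return {"text": text}


def build_slack_request(webhook_url: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for a Slack incoming webhook."""
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def send_alert(webhook_url: str, payload: dict) -> int:
    """Send the alert and return the HTTP status code."""
    with urllib.request.urlopen(build_slack_request(webhook_url, payload)) as response:
        return response.status
```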

Step 3: Assess Blast Radius

Check what systems are affected:

  • CI/CD pipelines blocked?

  • Manual deployments still work?

  • Production systems impacted?

  • Data access compromised?

Step 4: Review Access Logs

Check for unauthorized access:

Look for:

  • Unexpected API calls

  • Access from unknown IPs

  • Data exports or downloads

  • Permission changes
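
Once audit log records have been exported, the checks above can be applied mechanically. The sketch below is illustrative: the record shape (`source_ip`, `action`, `user` keys), the allowlisted network, and the action names are all hypothetical placeholders to be replaced with the fields and values of the actual log source.

```python
import ipaddress

# Hypothetical allowlist of networks that legitimate traffic comes from.
KNOWN_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]

# Hypothetical action names standing in for exports and permission changes.
SENSITIVE_ACTIONS = {"data_export", "permission_change"}


def flag_suspicious(events, known_networks=KNOWN_NETWORKS):
    """Return (event, reasons) pairs for events that deserve a closer look."""
    flagged = []
    for event in events:
        reasons = []
        address = ipaddress.ip_address(event["source_ip"])
        if not any(address in network for network in known_networks):
            reasons.append("unknown_ip")
        if event["action"] in SENSITIVE_ACTIONS:
            reasons.append("sensitive_action")
        if reasons:
            flagged.append((event, reasons))
    return flagged
```

A flagged event is a prompt for manual review, not proof of compromise.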

Recovery Actions (15-45 minutes)

Step 5: Create New Credentials

Follow Procedure 1 (OIDC) or Procedure 2 (PAT) above to create new credentials.

Step 6: Update All Systems

Update in priority order:

  1. Production environment (highest priority)

  2. Staging environment

  3. Dev environment

  4. Documentation

Step 7: Verify Security

Post-Incident Actions (1-24 hours)

Step 8: Root Cause Analysis

Document how compromise occurred:

  • Was token committed to repository?

  • Was token in logs?

  • Was token shared in Slack?

  • Was token phished?

Step 9: Implement Preventive Measures

Based on root cause:

  • Add git-secrets pre-commit hook

  • Update .gitignore

  • Security training

  • Process improvements
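
To illustrate what a pre-commit secret check does, the sketch below scans text for strings shaped like tokens. It is not git-secrets and does not replace it. The patterns are assumptions based on commonly observed token formats (classic GitHub tokens beginning with `ghp_`, Databricks tokens beginning with `dapi`) and may miss other formats.

```python
import re

# Assumed token shapes; adjust to the formats actually in use.
PATTERNS = {
    "github_pat_classic": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "databricks_pat": re.compile(r"\bdapi[0-9a-f]{32}\b"),
}


def scan(text: str) -> list:
    """Return (line_number, pattern_name) for each suspected secret.

    The matched value itself is never returned, so scan results can be
    logged or pasted into a ticket safely.
    """
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((line_number, name))
    return hits
```

A hook built on this idea would run `scan` over the staged changes and reject the commit when the result is non-empty.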

Step 10: Incident Report

Create incident report:

Verification

Timeline: 45-60 minutes (emergency rotation)


Post-Rotation Validation

After any credential rotation:

Validation Checklist

Test Commands


Rotation Schedule Template

Use this template to track rotations:



Emergency Contacts

For credential rotation issues:

  • Taylor Laing: [email protected] (Account Admin)

  • #ml-pipelines: Slack channel for team support

  • #security: Slack channel for security incidents

For security incidents:

  • Immediately notify #security channel

  • Follow incident response procedures

  • Document all actions taken
