Secret Rotation Runbook
Purpose
This runbook provides step-by-step procedures for rotating service principal credentials, GitHub OIDC configurations, and other secrets used in the ML Pipelines infrastructure. Regular rotation limits how long a leaked credential remains usable and reduces the risk of unauthorized access.
When to Rotate
Scheduled Rotation (Recommended)
Service Principal OIDC: Review the trust configuration every 180 days (tokens themselves are issued automatically via GitHub OIDC)
GitHub Personal Access Tokens: Every 90 days
Databricks Personal Access Tokens: Every 90 days (if used)
AWS IAM Credentials: Every 90 days (if not using roles)
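As a minimal sketch of how these intervals could be tracked, the following Python computes the next due date for each credential type. The dictionary keys and the `next_rotation` and `is_overdue` helpers are illustrative assumptions, not part of any existing tooling for this project.

```python
from datetime import date, timedelta

# Rotation intervals in days, copied from the schedule above.
# The keys are illustrative labels only; they are not identifiers used elsewhere.
ROTATION_INTERVAL_DAYS = {
    "service_principal_oidc": 180,
    "github_pat": 90,
    "databricks_pat": 90,
    "aws_iam_credentials": 90,
}


def next_rotation(credential: str, last_rotated: date) -> date:
    """Return the date by which the credential should next be rotated."""
    return last_rotated + timedelta(days=ROTATION_INTERVAL_DAYS[credential])


def is_overdue(credential: str, last_rotated: date, today: date) -> bool:
    """Return True if the scheduled rotation date has already passed."""
    return today > next_rotation(credential, last_rotated)
```

For example, `next_rotation("github_pat", date(2025, 1, 1))` returns `date(2025, 4, 1)`.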
Immediate Rotation (Security Events)
Rotate immediately if:
Credential exposure suspected or confirmed
Employee offboarding (within 24 hours)
Security audit findings
Compromised GitHub repository
Unauthorized access detected
Compliance requirement
Pre-Rotation Checklist
Before rotating any credentials:
Procedure 1: Service Principal OIDC Configuration
Overview
GitHub OIDC uses federated authentication, eliminating the need for long-lived secrets. The OIDC token is automatically issued by GitHub Actions and validated by Databricks. This procedure updates the trust configuration.
When to Rotate
Changing GitHub repository structure
Updating environment names
Modifying security policies
Annually as best practice
Step-by-Step Procedure
Step 1: Document Current Configuration
Log into Databricks Account Console:
Navigate to: Service Principals → Select service principal
Click Authentication → GitHub OIDC
Document current settings:
Save configuration to secure location
Step 2: Verify GitHub Workflow Configuration
Check workflow files match OIDC config:
Verify environment names:
Step 3: Update OIDC Configuration (If Needed)
In Databricks Account Console → Service Principal → Authentication
Click Edit on GitHub OIDC configuration
Update fields:
Issuer: https://token.actions.githubusercontent.com (rarely changes)
Audience: Workspace-specific OIDC URL
Subject Pattern: repo:refresh-os/ml-pipelines:environment:{environment}
Click Save
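To make the subject pattern concrete, here is a small sketch that expands it for each environment and checks whether a given subject claim would match. The environment names in `ENVIRONMENTS` are assumptions based on the dev, staging, and production steps in this runbook; substitute the actual GitHub environment names. The helper names are hypothetical.

```python
# Subject pattern as documented in Step 3.
SUBJECT_PATTERN = "repo:refresh-os/ml-pipelines:environment:{environment}"

# Assumed GitHub environment names; confirm against the repository settings.
ENVIRONMENTS = ("dev", "staging", "production")


def expected_subject(environment: str) -> str:
    """Expand the subject pattern for one environment."""
    return SUBJECT_PATTERN.format(environment=environment)


def subject_is_trusted(subject: str, environments=ENVIRONMENTS) -> bool:
    """Return True if the subject claim matches one of the expected environments."""
    return subject in {expected_subject(env) for env in environments}
```

A workflow job that does not declare an environment produces a different subject (for example one ending in a branch ref), so `subject_is_trusted` returns `False` for it. That is one common reason authentication fails after an environment is renamed.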
Step 4: Test OIDC Authentication
Trigger test workflow in dev:
Monitor GitHub Actions log for authentication:
If successful, merge test branch
If failed, check error message and rollback
Step 5: Verify All Environments
Test dev environment:
Test staging environment (after dev succeeds)
Test production environment (after staging succeeds)
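The gating in this step (each environment is tested only after the previous one succeeds) can be sketched as follows. The `check` callable is a hypothetical stand-in for whatever triggers the test workflow and reports success; it is not an existing function in this project.

```python
from typing import Callable, List, Sequence


def verify_in_order(environments: Sequence[str], check: Callable[[str], bool]) -> List[str]:
    """Run `check` on each environment in order and stop at the first failure.

    Returns the environments that passed, so the caller can see how far
    verification got before stopping.
    """
    passed: List[str] = []
    for environment in environments:
        if not check(environment):
            break  # do not touch later environments after a failure
        passed.append(environment)
    return passed
```

If the returned list is shorter than the input, the first environment missing from it is the one that failed and the point at which to start the rollback procedure.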
Step 6: Update Documentation
Update this runbook with new configuration (if changed)
Record rotation in security log:
Rollback Procedure
If OIDC authentication fails:
Revert to previous subject pattern in Databricks Account Console
Verify previous configuration works:
If still failing, check:
Service principal still exists
Workspace URL correct
GitHub environment name matches
Verification
Timeline: 15-30 minutes (all environments)
Procedure 2: GitHub Personal Access Token (GH_PAT)
Overview
The GitHub Personal Access Token (GH_PAT) is used in workflows to check out code with submodules. Rotate this token every 90 days.
When to Rotate
Every 90 days (scheduled)
Suspected exposure
Employee offboarding
Token expiration approaching
Step-by-Step Procedure
Step 1: Create New GitHub PAT
Go to GitHub → Settings → Developer settings → Personal access tokens → Tokens (classic)
Click Generate new token (classic)
Configure token:
Note: ML Pipelines CI/CD - 2025-Q4
Expiration: 90 days
Scopes: repo (full control of private repositories), workflow (update GitHub Action workflows)
Click Generate token
Copy token immediately (shown only once)
Step 2: Update GitHub Secret
Go to repository → Settings → Secrets and variables → Actions
Find the GH_PAT secret
Click Update
Paste new token value
Click Update secret
Step 3: Test New Token
Trigger a workflow that uses GH_PAT:
Monitor workflow for submodule checkout:
Verify no authentication errors
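In addition to watching the workflow, the new token's scopes can be checked directly. For classic tokens, the GitHub REST API returns the granted scopes in the `X-OAuth-Scopes` response header. The sketch below assumes the `repo` and `workflow` scopes listed in Step 1; the function names are hypothetical, and `fetch_scope_header` makes a real network call, so run it only from a trusted machine.

```python
import urllib.request

# Scopes required by the token configuration in Step 1.
REQUIRED_SCOPES = frozenset({"repo", "workflow"})


def parse_scopes(header_value: str) -> set:
    """Split a comma-separated X-OAuth-Scopes header into a set of scope names."""
    return {scope.strip() for scope in header_value.split(",") if scope.strip()}


def missing_scopes(header_value: str, required=REQUIRED_SCOPES) -> set:
    """Return the required scopes that the token does not have."""
    return set(required) - parse_scopes(header_value)


def fetch_scope_header(token: str) -> str:
    """Call the GitHub API with the token and return its X-OAuth-Scopes header.

    urlopen raises urllib.error.HTTPError (401) if the token is invalid or expired.
    """
    request = urllib.request.Request(
        "https://api.github.com/user",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get("X-OAuth-Scopes", "")

# Example usage (token read from the environment, never hard-coded):
#   import os
#   print(missing_scopes(fetch_scope_header(os.environ["GH_PAT"])))
```

An empty result means both required scopes are present. The same check helps during rollback when investigating why a new token failed.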
Step 4: Revoke Old Token
Wait 24 hours to ensure new token works
Go to GitHub → Settings → Developer settings → Personal access tokens
Find old token (by note with previous date)
Click Delete
Confirm deletion
Step 5: Update Rotation Schedule
Set reminder for 90 days:
Document in security log
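One possible shape for a security log entry, sketched as a JSON line with the next due date computed from the 90-day interval. All field names and the `make_record` helper are assumptions for illustration; use whatever format the team's security log already follows.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date, timedelta


@dataclass
class RotationRecord:
    # Field names are hypothetical; adapt them to the existing log format.
    credential: str
    rotated_on: str
    next_rotation_due: str
    rotated_by: str
    notes: str = ""


def make_record(credential: str, rotated_on: date, rotated_by: str,
                interval_days: int = 90, notes: str = "") -> RotationRecord:
    """Build a log record and compute when the next rotation is due."""
    due = rotated_on + timedelta(days=interval_days)
    return RotationRecord(credential, rotated_on.isoformat(), due.isoformat(),
                          rotated_by, notes)


def to_log_line(record: RotationRecord) -> str:
    """Serialize the record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

The record stores only metadata about the rotation. The token value itself should never be written to the log.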
Rollback Procedure
If new token fails:
Re-update the GH_PAT secret with the old token (if not yet revoked)
Verify old token works:
Investigate why new token failed:
Check token scopes
Verify token not expired
Ensure token associated with correct account
Create another new token with correct configuration
Verification
Timeline: 10-15 minutes
Procedure 3: Service Principal Permissions Audit
Overview
Periodically audit service principal permissions to ensure least privilege access and identify unused permissions.
When to Audit
Quarterly (every 90 days)
After major infrastructure changes
Before compliance audits
After adding new environments
Step-by-Step Procedure
Step 1: List All Service Principals
Get service principal IDs:
Step 2: Audit Catalog Permissions
For each environment:
Expected permissions:
ml-pipelines-dev on dev: ALL PRIVILEGES
ml-pipelines-staging on staging: ALL PRIVILEGES
ml-pipelines-staging on prod: USE CATALOG, SELECT (read-only)
ml-pipelines-prod on prod: ALL PRIVILEGES
ml-pipelines-prod on staging.models: USE CATALOG, SELECT (read-only)
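The comparison against these expected permissions can be sketched as a simple set difference. The `actual` mapping is assumed to have been collected separately (for example by transcribing the output of grant queries); how it is gathered is outside this sketch, and `audit_grants` is a hypothetical helper.

```python
# Expected grants from the list above, keyed by (service principal, securable).
EXPECTED_GRANTS = {
    ("ml-pipelines-dev", "dev"): {"ALL PRIVILEGES"},
    ("ml-pipelines-staging", "staging"): {"ALL PRIVILEGES"},
    ("ml-pipelines-staging", "prod"): {"USE CATALOG", "SELECT"},
    ("ml-pipelines-prod", "prod"): {"ALL PRIVILEGES"},
    ("ml-pipelines-prod", "staging.models"): {"USE CATALOG", "SELECT"},
}


def audit_grants(actual: dict) -> dict:
    """Compare actual grants with EXPECTED_GRANTS.

    Returns only the entries that differ, each with sorted lists of
    excess and missing privileges. Privilege names are compared literally,
    so ALL PRIVILEGES is not expanded into individual privileges.
    """
    findings = {}
    for key in sorted(set(EXPECTED_GRANTS) | set(actual)):
        expected = EXPECTED_GRANTS.get(key, set())
        found = actual.get(key, set())
        excess, missing = found - expected, expected - found
        if excess or missing:
            findings[key] = {"excess": sorted(excess), "missing": sorted(missing)}
    return findings
```

An empty result means the actual grants match the expected ones exactly. Any entry under `excess` is a candidate for revocation, and any entry under `missing` is a candidate for a new grant.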
Step 3: Audit Workspace Permissions
Check service principal can:
Create and manage jobs
Create and manage DLT pipelines
Create and manage model serving endpoints
Read/write to bundle directories
Step 4: Audit S3 Permissions
Verify external location access:
Step 5: Document Findings
Create audit report:
Step 6: Remediate Issues (If Found)
If excess permissions found:
If missing permissions:
Verification
Timeline: 30-45 minutes (all service principals)
Procedure 4: Emergency Credential Revocation
Overview
Use this procedure when credentials are compromised or a compromise is suspected.
Immediate Actions (0-15 minutes)
Step 1: Revoke Compromised Credentials
For GitHub PAT:
For Service Principal:
Step 2: Notify Team
Post in Slack immediately:
Step 3: Assess Blast Radius
Check what systems are affected:
CI/CD pipelines blocked?
Manual deployments still work?
Production systems impacted?
Data access compromised?
Step 4: Review Access Logs
Check for unauthorized access:
Look for:
Unexpected API calls
Access from unknown IPs
Data exports or downloads
Permission changes
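As an illustration of the "access from unknown IPs" check, the sketch below filters exported log events against an allowlist of known networks. The `source_ip` field name and the event structure are assumptions; real audit log exports will use their own schema. The addresses in the usage note come from ranges reserved for documentation.

```python
import ipaddress


def is_known_ip(address: str, allowed_networks) -> bool:
    """Return True if the address falls inside any allowed network (CIDR notation)."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(network) for network in allowed_networks)


def flag_unknown_sources(events, allowed_networks):
    """Return the events whose source IP is outside every allowed network.

    Each event is assumed to be a dict with a "source_ip" key (hypothetical name).
    """
    return [
        event for event in events
        if not is_known_ip(event["source_ip"], allowed_networks)
    ]
```

For example, with `allowed_networks = ["192.0.2.0/24"]`, an event from `192.0.2.10` is treated as known while one from `203.0.113.5` is flagged for review.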
Recovery Actions (15-45 minutes)
Step 5: Create New Credentials
Follow Procedure 1 (OIDC) or Procedure 2 (PAT) above to create new credentials.
Step 6: Update All Systems
Update in priority order:
Production environment (highest priority)
Staging environment
Dev environment
Documentation
Step 7: Verify Security
Post-Incident Actions (1-24 hours)
Step 8: Root Cause Analysis
Document how compromise occurred:
Was token committed to repository?
Was token in logs?
Was token shared in Slack?
Was token phished?
Step 9: Implement Preventive Measures
Based on root cause:
Add git-secrets pre-commit hook
Update .gitignore
Security training
Process improvements
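To show the kind of check a pre-commit hook performs, here is a minimal sketch that scans text for strings resembling classic GitHub tokens, which begin with `ghp_`. The length bound in the pattern is a loose assumption, and this is not a replacement for git-secrets or a dedicated scanner; `find_token_like_strings` is a hypothetical helper.

```python
import re

# Classic GitHub personal access tokens start with "ghp_".
# The minimum length used here is an assumption chosen to limit false positives.
TOKEN_PATTERN = re.compile(r"ghp_[A-Za-z0-9]{20,}")


def find_token_like_strings(text: str):
    """Return (line_number, redacted_match) for each token-like string found.

    Matches are truncated to their first 8 characters so the scanner's own
    output does not leak the secret.
    """
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in TOKEN_PATTERN.finditer(line):
            hits.append((number, match.group()[:8] + "..."))
    return hits
```

A hook built on this idea would run the scan over staged file contents and reject the commit when the result is non-empty.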
Step 10: Incident Report
Create incident report:
Verification
Timeline: 45-60 minutes (emergency rotation)
Post-Rotation Validation
After any credential rotation:
Validation Checklist
Test Commands
Rotation Schedule Template
Use this template to track rotations:
Related Documentation
Service Principals Guide - Detailed setup and permissions
Deployment Guide - CI/CD authentication
Security Architecture - Security model overview
Troubleshooting Guide - Authentication errors
CI/CD Pipeline Architecture - GitHub OIDC details
Emergency Contacts
For credential rotation issues:
Taylor Laing: [email protected] (Account Admin)
#ml-pipelines: Slack channel for team support
#security: Slack channel for security incidents
For security incidents:
Immediately notify #security channel
Follow incident response procedures
Document all actions taken
Last updated