ADR-003: Service Principal Per Environment with GitHub OIDC
Status: Accepted
Date: 2025-09-20
Decision Makers: CTO
Technical Story: CI/CD authentication and authorization strategy
Context
The platform required an automated deployment strategy that would:
Deploy to multiple environments (dev/staging/prod) via GitHub Actions
Maintain least-privilege access (no single credential with access to all environments)
Avoid storing long-lived secrets in GitHub
Support audit requirements (who deployed what, when)
Enable emergency credential revocation without breaking all deployments
Key security requirements:
No hardcoded credentials in GitHub repository
No long-lived secrets (passwords, tokens)
Clear audit trail of deployments
Ability to revoke access per environment independently
Support for future compliance certifications (SOC 2)
Decision
Implement one service principal per environment, authenticated via GitHub OIDC (passwordless):
Service Principals:
ml-pipelines-dev (UUID: 03ff99cd-a352-40bb-9d33-414c9ad9e7aa)
Full access to dev workspace and dev catalog
Used by: dev deployment workflow
ml-pipelines-staging (UUID: 93bda7cf-b009-49d8-8e8d-046677c8597e)
Full access to staging workspace and staging catalog
Used by: staging deployment workflow
ml-pipelines-prod (UUID: 2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f)
Full access to prod workspace and prod catalog
Used by: prod deployment workflow
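The one-to-one mapping between environments and service principals can be sketched as a small lookup table. This is a minimal illustration, not part of the deployed tooling: the ServicePrincipal type, the SERVICE_PRINCIPALS table, and the service_principal_for helper are hypothetical names, and only the principal names and UUIDs come from this ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServicePrincipal:
    """Hypothetical record type; fields mirror what this ADR lists."""
    name: str
    uuid: str


# Keys are the short environment names used in this ADR.
SERVICE_PRINCIPALS = {
    "dev": ServicePrincipal(
        "ml-pipelines-dev", "03ff99cd-a352-40bb-9d33-414c9ad9e7aa"
    ),
    "staging": ServicePrincipal(
        "ml-pipelines-staging", "93bda7cf-b009-49d8-8e8d-046677c8597e"
    ),
    "prod": ServicePrincipal(
        "ml-pipelines-prod", "2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f"
    ),
}


def service_principal_for(environment: str) -> ServicePrincipal:
    """Return the single service principal scoped to `environment`."""
    try:
        return SERVICE_PRINCIPALS[environment]
    except KeyError:
        # Unknown environments fail loudly instead of falling back
        # to a shared credential.
        raise ValueError(
            f"no service principal defined for environment {environment!r}"
        ) from None
```

Because the lookup never falls back to a default, a workflow that names an unknown environment fails instead of borrowing another environment's identity.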
Authentication Method: GitHub OIDC (OpenID Connect)
GitHub generates short-lived OIDC tokens
Tokens include repository, workflow, and environment context
Databricks validates tokens against federation policy
No secrets stored in GitHub
Federation Policy Pattern:
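As a rough sketch of the pattern, a federation policy can be thought of as three expected claim values (issuer, audience, subject) that an incoming token must match. The code below only illustrates that comparison and is not how Databricks implements validation. It assumes the token's signature has already been verified elsewhere. FederationPolicy, claims_match_policy, and EXAMPLE_AUDIENCE are hypothetical names, and the audience value is a placeholder for the workspace OIDC URL.

```python
from dataclasses import dataclass

GITHUB_ISSUER = "https://token.actions.githubusercontent.com"
# Hypothetical placeholder: the real audience is the workspace OIDC URL.
EXAMPLE_AUDIENCE = "https://workspace.example.invalid/oidc"


@dataclass(frozen=True)
class FederationPolicy:
    """Hypothetical model of one policy: the claim values a token must carry."""
    issuer: str
    audience: str
    subject: str


def claims_match_policy(claims: dict, policy: FederationPolicy) -> bool:
    """Compare already-verified token claims against a policy.

    Signature verification is assumed to have happened before this call.
    """
    aud = claims.get("aud")
    # The JWT "aud" claim may be a single string or a list of strings.
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    return (
        claims.get("iss") == policy.issuer
        and policy.audience in audiences
        and claims.get("sub") == policy.subject
    )


# One policy per environment; the dev policy is shown as an example.
dev_policy = FederationPolicy(
    issuer=GITHUB_ISSUER,
    audience=EXAMPLE_AUDIENCE,
    subject="repo:refresh-os/ml-pipelines:environment:development",
)
```

A token issued to the staging workflow carries a different subject, so it would not match dev_policy. That exact-match check is what isolates the environments from one another.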
Consequences
Positive
No secrets in GitHub: Passwordless authentication eliminates secret management
Least privilege: Each service principal only accesses its environment
Independent revocation: Can revoke prod access without affecting dev/staging
Audit trail: OIDC tokens include workflow context (repo, branch, environment)
Short-lived tokens: OIDC tokens expire in minutes (vs. long-lived passwords)
Compliance ready: Passwordless auth supports SOC 2, ISO 27001 requirements
Scalable: Easy to add new environments (just create new service principal + policy)
Emergency response: Can disable service principal without touching GitHub
Negative
Initial complexity: OIDC setup more complex than password-based auth
Federation policy management: Policies must be configured in Databricks account console
Debugging complexity: OIDC token issues can be harder to troubleshoot
Provider lock-in: Tied to GitHub Actions as the token issuer (although OIDC itself is an industry standard)
Neutral
GitHub environment requirement: Workflows must run in GitHub environments so that the token subject matches the federation policy's subject pattern
Account-level permissions: Requires Databricks account admin to set up federation policies
Alternatives Considered
Option 1: Single Service Principal with PAT (Personal Access Token)
Structure:
One service principal: ml-pipelines
Long-lived PAT stored in GitHub secrets
Full access to all environments
Pros:
Simpler setup
One credential to manage
Cons:
Violates least privilege
PAT is a long-lived secret (security risk)
No environment isolation
Single point of failure
Cannot revoke access to one environment independently
Why rejected: Security risk too high. Single compromised PAT would grant access to production. No way to limit blast radius.
Option 2: GitHub App with Installation Access Tokens
Structure:
GitHub App installed on repository
App generates short-lived tokens
Databricks service principals use app tokens
Pros:
Short-lived tokens
GitHub-native authentication
Cons:
Requires GitHub App setup and maintenance
More complex than OIDC
Still requires service principal secrets on Databricks side
Why rejected: OIDC is simpler to operate and avoids the service principal secrets this option still requires on the Databricks side. GitHub OIDC is also becoming an industry standard for CI/CD.
Option 3: Separate GitHub Repositories per Environment
Structure:
Different GitHub repositories for dev/staging/prod
Each repo has own credentials
Physical separation
Pros:
Ultimate separation
Clear access boundaries
Cons:
Code duplication
Hard to promote code between environments
Terrible developer experience
Doesn't scale
Why rejected: Code duplication and poor developer experience are deal-breakers.
Implementation Notes
GitHub Workflow Configuration
Dev Deployment:
OIDC Token Contents (example):
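When debugging, the claims carried by a token can be inspected by decoding its payload segment. The helper below is a minimal sketch that does not verify the signature, so it is suitable for troubleshooting only, never for authorization decisions. The claim names (iss, sub, repository, environment) are ones GitHub includes in its OIDC tokens, but the values and the fake token are illustrative. decode_jwt_payload and _b64url are hypothetical helper names.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT of the form header.payload.signature")
    payload = parts[1]
    # JWT segments are base64url-encoded with padding stripped; restore it.
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def _b64url(data: dict) -> str:
    """Encode a dict as an unpadded base64url JSON segment (demo only)."""
    raw = json.dumps(data).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# Illustrative values only; a real token carries more claims than these.
ILLUSTRATIVE_CLAIMS = {
    "iss": "https://token.actions.githubusercontent.com",
    "sub": "repo:refresh-os/ml-pipelines:environment:development",
    "repository": "refresh-os/ml-pipelines",
    "environment": "development",
}

# An unsigned stand-in token built locally so the example is self-contained.
fake_token = ".".join(
    [_b64url({"alg": "none"}), _b64url(ILLUSTRATIVE_CLAIMS), "signature"]
)

claims = decode_jwt_payload(fake_token)
print(claims["sub"])
```

Comparing the decoded sub claim with the subject configured in the federation policy is usually the quickest first check when a deployment fails to authenticate.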
Federation Policy Configuration
Created in Databricks Account Console (not in Terraform initially):
Issuer: https://token.actions.githubusercontent.com
Audience: Workspace OIDC URL
Subject pattern: repo:refresh-os/ml-pipelines:environment:{development|staging|production}
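The braces in the subject pattern are read here as shorthand for three exact subjects, one per GitHub environment; that reading is an assumption. A small sketch of the expansion, with hypothetical helper names (subject_for, SUBJECTS):

```python
REPOSITORY = "refresh-os/ml-pipelines"
GITHUB_ENVIRONMENTS = ("development", "staging", "production")


def subject_for(environment: str) -> str:
    """Build the exact OIDC subject for one GitHub environment."""
    if environment not in GITHUB_ENVIRONMENTS:
        raise ValueError(f"unknown GitHub environment: {environment!r}")
    return f"repo:{REPOSITORY}:environment:{environment}"


# One subject per environment, each used in exactly one federation policy.
SUBJECTS = {env: subject_for(env) for env in GITHUB_ENVIRONMENTS}

for env, subject in SUBJECTS.items():
    print(f"{env}: {subject}")
```

The GitHub environment names (development, staging, production) differ from the short names used in the service principal names (dev, staging, prod), so each policy has to use the GitHub environment name exactly as it is configured in the repository.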
Future Enhancement: Add federation policies to Terraform for infrastructure-as-code.
Permission Model
Service Principal Workspace Access:
Service Principal Catalog Access:
Emergency Procedures
If Service Principal Compromised:
Immediate: Disable service principal in Databricks account console
Within 1 hour: Rotate service principal (create new, update workflows)
Within 24 hours: Audit all deployments since potential compromise
Post-mortem: Document incident and improve security
If GitHub OIDC Broken:
Temporary: Create temporary PAT for emergency deployments (document exception)
Investigation: Work with Databricks support to fix OIDC
Resolution: Revert to OIDC once fixed, revoke temporary PAT
Related Decisions
References
Internal security review (Sept 2025)
Last updated