ADR-003: Service Principal Per Environment with GitHub OIDC

Status: Accepted

Date: 2025-09-20

Decision Makers: CTO

Technical Story: CI/CD authentication and authorization strategy

Context

The platform required an automated deployment strategy that would:

  1. Deploy to multiple environments (dev/staging/prod) via GitHub Actions

  2. Maintain least-privilege access (no single credential with access to all environments)

  3. Avoid storing long-lived secrets in GitHub

  4. Support audit requirements (who deployed what, when)

  5. Enable emergency credential revocation without breaking all deployments

Key security requirements:

  • No hardcoded credentials in GitHub repository

  • No long-lived secrets (passwords, tokens)

  • Clear audit trail of deployments

  • Ability to revoke access per environment independently

  • Support for future compliance certifications (SOC 2)

Decision

Use one service principal per environment, each authenticated through GitHub OIDC (passwordless, no stored credentials):

Service Principals:

  1. ml-pipelines-dev (UUID: 03ff99cd-a352-40bb-9d33-414c9ad9e7aa)

    • Full access to dev workspace and dev catalog

    • Used by: dev deployment workflow

  2. ml-pipelines-staging (UUID: 93bda7cf-b009-49d8-8e8d-046677c8597e)

    • Full access to staging workspace and staging catalog

    • Used by: staging deployment workflow

  3. ml-pipelines-prod (UUID: 2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f)

    • Full access to prod workspace and prod catalog

    • Used by: prod deployment workflow

Authentication Method: GitHub OIDC (OpenID Connect)

  • GitHub generates short-lived OIDC tokens

  • Tokens include repository, workflow, and environment context

  • Databricks validates tokens against federation policy

  • No secrets stored in GitHub

Federation Policy Pattern:
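
One way to picture the pattern is one policy per service principal, each pinned to a single GitHub environment. The sketch below is illustrative, not the configured policy: the `oidc_policy` field names are assumed to follow the shape Databricks documents for federation policies and should be checked against the current API, the audience is a placeholder, and the mapping of dev/staging/prod to the GitHub environment names development/staging/production is an assumption based on the subject pattern in the implementation notes. The helper names `build_policy` and `token_matches` are hypothetical.

```python
"""Sketch: one federation policy per service principal (illustrative only)."""

ISSUER = "https://token.actions.githubusercontent.com"
REPO = "refresh-os/ml-pipelines"

# Service principal UUIDs from this ADR, keyed by GitHub environment name.
# Assumption: dev/staging/prod correspond to development/staging/production.
SERVICE_PRINCIPALS = {
    "development": "03ff99cd-a352-40bb-9d33-414c9ad9e7aa",
    "staging": "93bda7cf-b009-49d8-8e8d-046677c8597e",
    "production": "2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f",
}


def build_policy(environment: str, audience: str) -> dict:
    """Return a policy document that accepts tokens from one GitHub environment."""
    return {
        "oidc_policy": {  # assumed field names, verify against Databricks docs
            "issuer": ISSUER,
            "audiences": [audience],
            "subject": f"repo:{REPO}:environment:{environment}",
        }
    }


def token_matches(claims: dict, policy: dict) -> bool:
    """Simplified check: issuer, audience, and subject must match exactly.

    Treats the `aud` claim as a single string; real validation also verifies
    the token signature and expiry, which this sketch omits.
    """
    rule = policy["oidc_policy"]
    return (
        claims.get("iss") == rule["issuer"]
        and claims.get("aud") in rule["audiences"]
        and claims.get("sub") == rule["subject"]
    )


if __name__ == "__main__":
    audience = "<audience-configured-in-policy>"  # placeholder, not a real value
    for env, sp_id in SERVICE_PRINCIPALS.items():
        print(sp_id, build_policy(env, audience)["oidc_policy"]["subject"])
```

Because the subject is matched exactly, a token issued for the development environment cannot satisfy the production policy, which is what keeps the three service principals isolated.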

Consequences

Positive

  • No secrets in GitHub: Passwordless authentication removes the need to store or rotate deployment credentials

  • Least privilege: Each service principal only accesses its environment

  • Independent revocation: Can revoke prod access without affecting dev/staging

  • Audit trail: OIDC tokens include workflow context (repo, branch, environment)

  • Short-lived tokens: OIDC tokens expire in minutes (vs. long-lived passwords)

  • Compliance support: Passwordless auth and per-environment isolation help meet access-control requirements for SOC 2 and ISO 27001

  • Scalable: Adding an environment only requires a new service principal and a matching federation policy

  • Emergency response: Can disable service principal without touching GitHub

Negative

  • Initial complexity: OIDC setup is more complex than password-based auth

  • Federation policy management: Policies must be configured in the Databricks account console

  • Debugging complexity: OIDC token issues can be harder to troubleshoot

  • Provider lock-in: Federation policies are tied to GitHub Actions as the token issuer (although OIDC itself is an open, widely adopted standard)

Neutral

  • GitHub environment requirement: Workflows must run in GitHub environments so that the token subject carries the environment name

  • Account-level permissions: A Databricks account admin is required to set up federation policies

Alternatives Considered

Option 1: Single Service Principal with PAT (Personal Access Token)

Structure:

  • One service principal: ml-pipelines

  • Long-lived PAT stored in GitHub secrets

  • Full access to all environments

Pros:

  • Simpler setup

  • One credential to manage

Cons:

  • Violates least privilege

  • PAT is a long-lived secret (security risk)

  • No environment isolation

  • Single point of failure

  • Cannot revoke access to one environment independently

Why rejected: The security risk is too high. A single compromised PAT would grant access to every environment, including production, with no way to limit the blast radius.

Option 2: GitHub App with Installation Access Tokens

Structure:

  • GitHub App installed on repository

  • App generates short-lived tokens

  • Databricks service principals use app tokens

Pros:

  • Short-lived tokens

  • GitHub-native authentication

Cons:

  • Requires GitHub App setup and maintenance

  • More complex than OIDC

  • Still requires service principal secrets on the Databricks side

Why rejected: OIDC is simpler and leaves no secret to manage on the Databricks side. GitHub OIDC is also becoming an industry standard for CI/CD authentication.

Option 3: Separate GitHub Repositories per Environment

Structure:

  • Different GitHub repositories for dev/staging/prod

  • Each repo has its own credentials

  • Physical separation

Pros:

  • Ultimate separation

  • Clear access boundaries

Cons:

  • Code duplication

  • Hard to promote code between environments

  • Terrible developer experience

  • Doesn't scale

Why rejected: Code duplication and poor developer experience are deal-breakers.

Implementation Notes

GitHub Workflow Configuration

Dev Deployment:
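
The workflow definition itself is YAML and is not reproduced here. As a sketch of the token exchange the dev job relies on: when a job is granted the `id-token: write` permission, the runner exposes `ACTIONS_ID_TOKEN_REQUEST_URL` and `ACTIONS_ID_TOKEN_REQUEST_TOKEN`, and a token can be requested by calling that URL with a bearer header. The helper names below and the audience value passed in are hypothetical, and deployment tooling with built-in GitHub OIDC support would normally perform this step itself.

```python
"""Sketch: requesting a GitHub OIDC token from inside an Actions job."""
import json
import os
import urllib.parse
import urllib.request


def build_token_request(request_url: str, request_token: str, audience: str):
    """Build the HTTP request for an ID token with the given audience."""
    # The runner-provided URL usually already carries a query string.
    separator = "&" if "?" in request_url else "?"
    encoded_audience = urllib.parse.quote(audience, safe="")
    url = f"{request_url}{separator}audience={encoded_audience}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {request_token}"}
    )


def fetch_oidc_token(audience: str) -> str:
    """Fetch a short-lived token; only works on a runner with id-token: write."""
    request = build_token_request(
        os.environ["ACTIONS_ID_TOKEN_REQUEST_URL"],
        os.environ["ACTIONS_ID_TOKEN_REQUEST_TOKEN"],
        audience,
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        # GitHub returns a JSON object whose "value" field holds the JWT.
        return json.load(response)["value"]
```

Nothing in this exchange is stored in GitHub secrets: both environment variables are injected per job run and expire with it.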

OIDC Token Contents (example):
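
The claims below illustrate the kind of payload GitHub issues for a job bound to an environment. The claim names are ones GitHub documents for Actions OIDC tokens; every value other than the issuer and repository is invented for illustration, and the environment name `development` is assumed from the subject pattern in this ADR. The helpers `decode_payload` and `encode_unsigned` are hypothetical and useful for debugging only.

```python
"""Sketch: inspecting the payload of a GitHub OIDC token (no signature check)."""
import base64
import json

# Illustrative claims; values other than issuer and repository are invented.
EXAMPLE_CLAIMS = {
    "iss": "https://token.actions.githubusercontent.com",
    "sub": "repo:refresh-os/ml-pipelines:environment:development",
    "aud": "<audience-configured-in-policy>",
    "repository": "refresh-os/ml-pipelines",
    "repository_owner": "refresh-os",
    "environment": "development",
    "ref": "refs/heads/main",
    "workflow": "deploy-dev",
    "actor": "example-user",
    "event_name": "push",
}


def decode_payload(token: str) -> dict:
    """Decode the middle JWT segment. This does NOT verify the signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments drop base64 padding, so restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def encode_unsigned(claims: dict) -> str:
    """Build an unsigned, token-shaped string for local experiments only."""

    def b64(data: dict) -> str:
        raw = json.dumps(data).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    return f"{b64({'alg': 'none'})}.{b64(claims)}."


if __name__ == "__main__":
    token = encode_unsigned(EXAMPLE_CLAIMS)
    print(decode_payload(token)["sub"])
```

When a deployment fails to authenticate, comparing the decoded `sub`, `aud`, and `iss` values with the federation policy is usually the fastest first check.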

Federation Policy Configuration

Created manually in the Databricks account console (not yet managed in Terraform):

  • Issuer: https://token.actions.githubusercontent.com

  • Audience: Workspace OIDC URL

  • Subject pattern: repo:refresh-os/ml-pipelines:environment:<environment>, where <environment> is development, staging, or production

Future Enhancement: Add federation policies to Terraform for infrastructure-as-code.

Permission Model

Service Principal Workspace Access:

Service Principal Catalog Access:
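
The grants themselves are not reproduced here. As a rough sketch of the intended isolation, the code below assumes one catalog per environment whose name matches the environment (`dev`, `staging`, `prod` are hypothetical catalog names) and assumes "full access" is expressed as `ALL PRIVILEGES`; the actual privileges granted may be narrower. The function names are illustrative.

```python
"""Sketch: per-environment catalog grants and an isolation check."""

# UUIDs from this ADR; the catalog names (dict keys) are hypothetical.
PRINCIPALS = {
    "dev": "03ff99cd-a352-40bb-9d33-414c9ad9e7aa",
    "staging": "93bda7cf-b009-49d8-8e8d-046677c8597e",
    "prod": "2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f",
}


def catalog_grant(environment: str) -> str:
    """Return a Unity Catalog style GRANT statement for one environment."""
    principal = PRINCIPALS[environment]
    return f"GRANT ALL PRIVILEGES ON CATALOG {environment} TO `{principal}`;"


def may_access(principal: str, environment: str) -> bool:
    """A principal may touch only the environment it was created for."""
    return PRINCIPALS.get(environment) == principal


if __name__ == "__main__":
    for env in PRINCIPALS:
        print(catalog_grant(env))
```

Each statement names exactly one catalog and one principal, so revoking production access never requires touching the dev or staging grants.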

Emergency Procedures

If Service Principal Compromised:

  1. Immediate: Disable service principal in Databricks account console

  2. Within 1 hour: Rotate service principal (create new, update workflows)

  3. Within 24 hours: Audit all deployments since potential compromise

  4. Post-mortem: Document incident and improve security
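
The time limits in the steps above can be turned into concrete deadlines once a compromise is detected. This is a minimal sketch for an incident checklist; the step labels are shorthand and the function name is hypothetical.

```python
"""Sketch: response deadlines after a suspected service principal compromise."""
from datetime import datetime, timedelta, timezone

# Offsets mirror the timed steps above; the post-mortem has no fixed deadline.
RESPONSE_STEPS = [
    ("disable service principal", timedelta(0)),
    ("rotate service principal", timedelta(hours=1)),
    ("audit deployments", timedelta(hours=24)),
]


def response_deadlines(detected_at: datetime) -> list[tuple[str, datetime]]:
    """Return (step, deadline) pairs measured from the detection time."""
    return [(step, detected_at + offset) for step, offset in RESPONSE_STEPS]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for step, deadline in response_deadlines(now):
        print(f"{deadline.isoformat()}  {step}")
```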

If GitHub OIDC Broken:

  1. Temporary: Create temporary PAT for emergency deployments (document exception)

  2. Investigation: Work with Databricks support to fix OIDC

  3. Resolution: Revert to OIDC once fixed, revoke temporary PAT

References

Last updated