ADR-002: Unity Catalog for Environment Isolation

Status: Accepted

Date: 2025-09-15

Decision Makers: CTO

Technical Story: Environment isolation and data governance strategy

Context

The platform required a strategy for:

Isolating data between environments (sandbox/dev/staging/prod)
Enabling data sharing without duplication
Providing fine-grained access control
Supporting regulatory compliance and audit requirements
Maintaining data lineage and governance

Key requirements:

Developers should not accidentally modify production data
Sandbox should reuse dev data without duplication (cost savings)
Clear ownership and permissions model
Support for future compliance needs (SOC 2, GDPR)

Decision

Use Databricks Unity Catalog with catalog-level isolation for environment separation:

Catalog Structure:

Unity Catalog Metastore (shared across all workspaces)
├── {username}_sandbox (per developer)
│   ├── bronze/    (empty - reads from dev)
│   ├── silver/    (empty - reads from dev)
│   ├── gold/      (developer experiments)
│   └── models/    (developer models)
├── dev
│   ├── bronze/    (shared raw data)
│   ├── silver/    (shared processed data)
│   ├── gold/      (dev features)
│   └── models/    (dev models)
├── staging
│   ├── bronze/    (reads from prod OR staging-specific)
│   ├── silver/    (reads from prod OR staging-specific)
│   ├── gold/      (staging validation)
│   └── models/    (staged models)
└── prod
    ├── bronze/    (production raw data)
    ├── silver/    (production processed)
    ├── gold/      (production features)
    └── models/    (production models - promoted from staging)

Schema Naming: Simple names (bronze, silver, gold) without environment prefixes, since the catalog already provides isolation.

Alternative Rejected: Schema prefixes (e.g., dev_bronze) were rejected because catalog-level isolation gives a cleaner permission boundary and scales better as developers are added.
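
As a minimal sketch, the structure above might be created in Databricks SQL as follows. It assumes the metastore has a default managed storage location (otherwise each CREATE CATALOG would also need a MANAGED LOCATION clause), and the comments on each catalog are illustrative only. taylor_sandbox is the example sandbox name used elsewhere in this ADR.

```sql
-- Shared environment catalogs, one per environment.
CREATE CATALOG IF NOT EXISTS dev     COMMENT 'Shared development data';
CREATE CATALOG IF NOT EXISTS staging COMMENT 'Pre-production validation';
CREATE CATALOG IF NOT EXISTS prod    COMMENT 'Production data and models';

-- The same four schemas in every catalog.
-- Shown for dev; repeat for staging and prod.
CREATE SCHEMA IF NOT EXISTS dev.bronze;
CREATE SCHEMA IF NOT EXISTS dev.silver;
CREATE SCHEMA IF NOT EXISTS dev.gold;
CREATE SCHEMA IF NOT EXISTS dev.models;

-- Per-developer sandbox catalog with the same schema layout.
CREATE CATALOG IF NOT EXISTS taylor_sandbox COMMENT 'Sandbox for taylor';
CREATE SCHEMA IF NOT EXISTS taylor_sandbox.bronze;
CREATE SCHEMA IF NOT EXISTS taylor_sandbox.silver;
CREATE SCHEMA IF NOT EXISTS taylor_sandbox.gold;
CREATE SCHEMA IF NOT EXISTS taylor_sandbox.models;
```

Because every statement uses IF NOT EXISTS, the script can be re-run safely when a new developer or environment is added.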

Consequences

Positive

Strong isolation: Catalog boundaries prevent accidental cross-environment access
Zero-copy data sharing: Sandbox can read from dev without storage duplication
Clear ownership: Each catalog has defined owners (users or service principals)
Fine-grained permissions: Unity Catalog supports row filters and column masks if row- or column-level security is needed
Audit trail: Data access is recorded in Unity Catalog audit logs
Compliance ready: Access controls, audit logs, and lineage support SOC 2 and GDPR requirements (and HIPAA, should it apply)
Future-proof: Can add new catalogs/environments without restructuring
Metadata search: Central catalog makes data discovery easier
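
To illustrate the audit-trail point, the query below sketches how recent Unity Catalog activity could be reviewed. It assumes system tables are enabled for the account and that the caller has SELECT on system.access.audit; the 7-day window and 100-row limit are arbitrary choices for the example.

```sql
-- Recent Unity Catalog events, newest first.
SELECT
  event_time,
  user_identity.email AS principal,   -- who performed the action
  action_name,                        -- which Unity Catalog action was performed
  request_params                      -- map of request details for the action
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND event_date >= date_sub(current_date(), 7)
ORDER BY event_time DESC
LIMIT 100;
```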

Negative

Catalog count: N+3 catalogs (N developers + dev + staging + prod)
Permission complexity: Each catalog needs permission configuration
Cross-catalog queries: Require explicit catalog references (e.g., SELECT * FROM dev.bronze.messages)
Learning curve: Developers must understand catalog concept

Neutral

Shared metastore: All catalogs share metadata (good for discovery, potential single point of failure)
Naming conventions: Need clear standards to avoid confusion

Alternatives Considered

Option 1: Schema-Level Isolation (Single Catalog)

Structure:

catalog: ml_pipelines
├── dev_bronze
├── dev_silver
├── staging_bronze
├── prod_bronze
└── ...

Pros:

Simpler (single catalog)
Fewer permission boundaries

Cons:

Schema name pollution (dev_taylor_bronze, etc.)
Harder to enforce isolation (all schemas in one catalog)
No clear ownership boundaries
Doesn't scale well with many developers

Why rejected: Schema-level isolation doesn't provide strong enough boundaries. It is too easy to accidentally query the wrong schema (e.g., dev_bronze instead of prod_bronze).

Option 2: Separate Metastores per Environment

Structure:

Dev metastore → dev catalog
Staging metastore → staging catalog
Prod metastore → prod catalog

Pros:

Ultimate isolation
Complete separation of metadata

Cons:

Cannot query across metastores with plain three-part names; sharing would require Delta Sharing (no simple zero-copy reads)
Complex setup and management
No central metadata discovery
Much higher cost (separate infrastructure per metastore)

Why rejected: Over-engineered for the use case. Unity Catalog provides sufficient isolation at catalog level without the overhead of separate metastores.

Option 3: Workspace-Level Isolation (No Unity Catalog)

Structure:

Each workspace has own data
No shared metadata

Pros:

Complete workspace isolation
Simple permissions (workspace-level)

Cons:

No zero-copy data sharing
No centralized governance
Hard to share data across environments
Loses Unity Catalog benefits (lineage, search, audit)

Why rejected: Databricks positions Unity Catalog as its standard governance layer. Building without it would mean taking on migration debt from day one.

Implementation Notes

Data Sharing Pattern

Sandbox → Dev:

-- In sandbox pipeline, read from dev
SELECT * FROM dev.bronze.messages;

-- Process and write to sandbox
INSERT INTO taylor_sandbox.gold.features
SELECT ... FROM dev.bronze.messages;

Staging → Prod (optional):

-- Staging can read prod data for validation
SELECT * FROM prod.bronze.messages LIMIT 1000;

Permission Model

Sandbox Catalogs:

Owner: Individual developer (full access)
Read access: Databricks - Account Admin group (for support)

Dev Catalog:

Owner: ml-pipelines-dev service principal
Write access: ml-pipelines-dev service principal
Read access: All developers (including pipelines that write to their sandbox catalogs)

Staging Catalog:

Owner: ml-pipelines-staging service principal
Write access: ml-pipelines-staging service principal
Read access: Developers (run-only), staging service principal

Prod Catalog:

Owner: ml-pipelines-prod service principal
Write access: ml-pipelines-prod service principal
Read access: Service principals only (no direct developer access)
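
The permission model above could be expressed roughly as the grants below. This is a sketch under several assumptions: `developers` is a hypothetical account group containing all developers, `taylor@example.com` is a placeholder user, and the service principal names stand in for whatever identifier the workspace requires (Databricks generally expects a service principal's application ID in these statements). The "run-only" qualifier on staging is not modelled here; developers simply get read access.

```sql
-- Dev: owned by the dev service principal (ownership implies full privileges),
-- readable by all developers.
ALTER CATALOG dev OWNER TO `ml-pipelines-dev`;
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG dev TO `developers`;

-- Staging: owned by the staging service principal, read-only for developers.
ALTER CATALOG staging OWNER TO `ml-pipelines-staging`;
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG staging TO `developers`;

-- Prod: owned by the prod service principal; no grants to developers at all.
ALTER CATALOG prod OWNER TO `ml-pipelines-prod`;

-- Sandbox: owned by the individual developer, readable by the admin group for support.
ALTER CATALOG taylor_sandbox OWNER TO `taylor@example.com`;
GRANT USE CATALOG, USE SCHEMA, SELECT
  ON CATALOG taylor_sandbox TO `Databricks - Account Admin`;

-- Verify that no developer-facing grants exist on prod.
SHOW GRANTS ON CATALOG prod;
```

Granting at the catalog level relies on privilege inheritance, so the grants apply to every current and future schema and table in that catalog without per-schema statements.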

Schema Structure

Each catalog has consistent schema structure:

bronze: Raw ingested data
silver: Cleaned and transformed data
gold: Business-ready features and aggregations
models: ML model artifacts and metadata

This consistency makes code portable across environments.
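
A short sketch of what that portability looks like in practice: with identical schema and table names in every catalog, only the active catalog changes between environments. The messages table is the example used elsewhere in this ADR; the column alias is illustrative.

```sql
-- Run against dev.
USE CATALOG dev;
SELECT COUNT(*) AS message_count FROM bronze.messages;  -- resolves to dev.bronze.messages

-- The identical statement against prod.
USE CATALOG prod;
SELECT COUNT(*) AS message_count FROM bronze.messages;  -- resolves to prod.bronze.messages

-- Confirm which catalog unqualified names currently resolve against.
SELECT current_catalog();
```

Sandbox pipelines are the exception: because the sandbox bronze and silver schemas are empty, reads of shared data still need the explicit dev. prefix.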

Naming Conventions

Catalog Names:

Sandbox: {databricks_username_before_@}_sandbox (e.g., taylor_sandbox)
Shared: dev, staging, prod (simple names)

Table Names:

No environment prefixes (catalog provides context)
Descriptive names: messages, calendar_events, work_items
Avoid abbreviations unless industry-standard
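
The sandbox naming rule can be sketched as a single expression over the current user. The lower-casing and the replacement of characters other than letters, digits, and underscores are assumptions added here so that usernames such as first.last still yield a valid unquoted identifier; the ADR itself only specifies the part before the @.

```sql
-- Derive the sandbox catalog name from the logged-in user.
SELECT
  current_user() AS username,
  concat(
    regexp_replace(
      lower(split_part(current_user(), '@', 1)),  -- part before the @
      '[^a-z0-9_]',                               -- assumption: sanitize other characters
      '_'
    ),
    '_sandbox'
  ) AS sandbox_catalog;
```

For a user logged in as taylor@example.com, sandbox_catalog evaluates to taylor_sandbox.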
