ADR-002: Unity Catalog for Environment Isolation
Status: Accepted
Date: 2025-09-15
Decision Makers: CTO
Technical Story: Environment isolation and data governance strategy
Context
The platform required a strategy for:
Isolating data between environments (sandbox/dev/staging/prod)
Enabling data sharing without duplication
Providing fine-grained access control
Supporting regulatory compliance and audit requirements
Maintaining data lineage and governance
Key requirements:
Developers should not accidentally modify production data
Sandbox should reuse dev data without duplication (cost savings)
Clear ownership and permissions model
Support for future compliance needs (SOC 2, GDPR)
Decision
Use Databricks Unity Catalog with catalog-level isolation for environment separation:
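Catalog Structure:
{username}_sandbox: One sandbox catalog per developer (e.g., taylor_sandbox)
dev, staging, prod: One shared catalog per environment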
Schema Naming: Simple names (bronze, silver, gold) without environment prefixes, since catalog provides isolation.
Alternative Rejected: Schema prefixes (e.g., dev_bronze) were rejected because catalog-level isolation gives each environment its own ownership and permission boundary and keeps schema names identical across environments.
Consequences
Positive
Strong isolation: Catalog boundaries prevent accidental cross-environment access
Zero-copy data sharing: Sandbox can read from dev without storage duplication
Clear ownership: Each catalog has defined owners (users or service principals)
Fine-grained permissions: Unity Catalog supports row filters and column masks for row- and column-level security if needed
Audit trail: All data access logged in Unity Catalog audit logs
Compliance support: Access controls and audit logs support SOC 2, GDPR, and HIPAA requirements
Future-proof: Can add new catalogs/environments without restructuring
Metadata search: Central catalog makes data discovery easier
Negative
Catalog count: N+3 catalogs (N developers + dev + staging + prod)
Permission complexity: Each catalog needs permission configuration
Cross-catalog queries: Require explicit catalog references (e.g., SELECT * FROM dev.bronze.messages)
Learning curve: Developers must understand the catalog concept
Neutral
Shared metastore: All catalogs share metadata (good for discovery, potential single point of failure)
Naming conventions: Need clear standards to avoid confusion
Alternatives Considered
Option 1: Schema-Level Isolation (Single Catalog)
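Structure: A single catalog with environment-prefixed schemas (e.g., dev_bronze, prod_bronze, dev_taylor_bronze)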
Pros:
Simpler (single catalog)
Fewer permission boundaries
Cons:
Schema name pollution (dev_taylor_bronze, etc.)
Harder to enforce isolation (all schemas in one catalog)
No clear ownership boundaries
Doesn't scale well with many developers
Why rejected: Schema-level isolation doesn't provide strong enough boundaries. It is too easy to accidentally query the wrong schema (e.g., dev_bronze instead of prod_bronze).
Option 2: Separate Metastores per Environment
Structure:
Dev metastore → dev catalog
Staging metastore → staging catalog
Prod metastore → prod catalog
Pros:
Ultimate isolation
Complete separation of metadata
Cons:
Cannot share data across metastores (no zero-copy)
Complex setup and management
No central metadata discovery
Much higher cost (separate infrastructure per metastore)
Why rejected: Over-engineered for the use case. Unity Catalog provides sufficient isolation at catalog level without the overhead of separate metastores.
Option 3: Workspace-Level Isolation (No Unity Catalog)
Structure:
Each workspace has its own data
No shared metadata
Pros:
Complete workspace isolation
Simple permissions (workspace-level)
Cons:
No zero-copy data sharing
No centralized governance
Hard to share data across environments
Loses Unity Catalog benefits (lineage, search, audit)
Why rejected: Unity Catalog is the future of Databricks data governance. Not using it would be technical debt from day one.
Implementation Notes
Data Sharing Pattern
Sandbox → Dev:
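One way this pattern could look, as a minimal sketch: a view in the sandbox catalog that selects from the matching dev table, so the sandbox holds only a query definition and no copied data. The helper name sandbox_view_sql is hypothetical, and the sketch assumes the developer already has read access on dev; other zero-copy options may fit better.

```python
def sandbox_view_sql(sandbox_catalog: str, schema: str, table: str,
                     source_catalog: str = "dev") -> str:
    """Build a CREATE VIEW statement exposing a dev table inside a sandbox.

    A view stores only a query definition, so no table data is duplicated.
    """
    target = f"{sandbox_catalog}.{schema}.{table}"
    source = f"{source_catalog}.{schema}.{table}"
    return f"CREATE VIEW IF NOT EXISTS {target} AS SELECT * FROM {source}"


# Example: expose dev.bronze.messages inside Taylor's sandbox.
statement = sandbox_view_sql("taylor_sandbox", "bronze", "messages")
print(statement)
```

In a Databricks notebook the resulting string could be run with spark.sql(statement); that step is left out so the sketch stays runnable anywhere.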
Staging → Prod (optional):
Permission Model
Sandbox Catalogs:
Owner: Individual developer (full access)
Read access: Databricks - Account Admin group (for support)
Dev Catalog:
Owner: ml-pipelines-dev service principal
Write access: ml-pipelines-dev service principal
Read access: All developers, dev sandbox catalogs
Staging Catalog:
Owner: ml-pipelines-staging service principal
Write access: ml-pipelines-staging service principal
Read access: Developers (run-only), staging service principal
Prod Catalog:
Owner: ml-pipelines-prod service principal
Write access: ml-pipelines-prod service principal
Read access: Service principals only (no direct developer access)
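The shared-catalog rules above could be captured as data and turned into Unity Catalog SQL. The following is a sketch under stated assumptions, not the project's actual provisioning code: the group name developers and the helper permission_statements are hypothetical, "run-only" access on staging is modeled as plain read access, and service principals are written by display name even though Databricks may expect their application ID.

```python
# Read-only privileges granted at catalog level.
READ_PRIVILEGES = ("USE CATALOG", "USE SCHEMA", "SELECT")

# Hypothetical encoding of the shared-catalog permission model.
# "developers" is an assumed group name, not one defined in this ADR.
PERMISSION_MODEL = {
    "dev": {"owner": "ml-pipelines-dev", "readers": ["developers"]},
    "staging": {"owner": "ml-pipelines-staging", "readers": ["developers"]},
    # No developer readers on prod: service principals only.
    "prod": {"owner": "ml-pipelines-prod", "readers": []},
}


def permission_statements(model: dict) -> list:
    """Return ALTER CATALOG / GRANT statements for each catalog in the model."""
    statements = []
    privileges = ", ".join(READ_PRIVILEGES)
    for catalog, rules in model.items():
        statements.append(f"ALTER CATALOG {catalog} OWNER TO `{rules['owner']}`")
        for reader in rules["readers"]:
            statements.append(
                f"GRANT {privileges} ON CATALOG {catalog} TO `{reader}`"
            )
    return statements


for sql in permission_statements(PERMISSION_MODEL):
    print(sql)
```

Keeping the model in one dictionary makes it easy to review that prod has no developer grants before any statement is executed.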
Schema Structure
Each catalog has consistent schema structure:
bronze: Raw ingested data
silver: Cleaned and transformed data
gold: Business-ready features and aggregations
models: ML model artifacts and metadata
This consistency makes code portable across environments.
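As an illustration of that portability, here is a minimal sketch in which the catalog is the only value that changes between environments. The helpers table_name and silver_messages_query are hypothetical names chosen for this example; the schema and table names come from this ADR.

```python
def table_name(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified three-level name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def silver_messages_query(catalog: str) -> str:
    """Build the same query for any environment; only the catalog differs."""
    return f"SELECT * FROM {table_name(catalog, 'silver', 'messages')}"


# The same code path serves a sandbox, dev, staging, and prod.
for catalog in ("taylor_sandbox", "dev", "staging", "prod"):
    print(silver_messages_query(catalog))
```

The catalog would typically come from job or environment configuration rather than being hard-coded, which is an assumption about deployment and not something this ADR specifies.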
Naming Conventions
Catalog Names:
Sandbox: {databricks_username_before_@}_sandbox (e.g., taylor_sandbox)
Shared: dev, staging, prod (simple names)
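The sandbox naming rule can be sketched as a small helper. The function name sandbox_catalog_name is hypothetical, and two behaviors are assumptions that go beyond the convention as stated: the name is lowercased, and characters outside a-z, 0-9, and underscore are replaced with underscores so the catalog name needs no backtick quoting.

```python
import re


def sandbox_catalog_name(username: str) -> str:
    """Derive a sandbox catalog name from a Databricks username (an email)."""
    # Keep only the part before the "@".
    local_part = username.split("@", 1)[0].lower()
    # Assumption: normalize dots, dashes, etc. to underscores.
    safe = re.sub(r"[^a-z0-9_]", "_", local_part)
    return f"{safe}_sandbox"


print(sandbox_catalog_name("taylor@example.com"))
```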
Table Names:
No environment prefixes (catalog provides context)
Descriptive names: messages, calendar_events, work_items
Avoid abbreviations unless industry-standard