ADR-002: Unity Catalog for Environment Isolation

Status: Accepted

Date: 2025-09-15

Decision Makers: CTO

Technical Story: Environment isolation and data governance strategy

Context

The platform required a strategy for:

  1. Isolating data between environments (sandbox/dev/staging/prod)

  2. Enabling data sharing without duplication

  3. Providing fine-grained access control

  4. Supporting regulatory compliance and audit requirements

  5. Maintaining data lineage and governance

Key requirements:

  • Developers should not accidentally modify production data

  • Sandbox should reuse dev data without duplication (cost savings)

  • Clear ownership and permissions model

  • Support for future compliance needs (SOC 2, GDPR)

Decision

Use Databricks Unity Catalog with catalog-level isolation for environment separation:

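Catalog Structure: One sandbox catalog per developer ({username}_sandbox) plus three shared catalogs (dev, staging, prod).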

Schema Naming: Simple names (bronze, silver, gold) without environment prefixes, since catalog provides isolation.

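Alternative Rejected: Schema prefixes within a single catalog (e.g., dev_bronze) were rejected because catalog-level isolation gives each environment its own permission boundary and avoids a growing list of prefixed schemas as developers are added.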
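
The layout this decision implies can be sketched as follows. This is a minimal illustration, not the platform's actual provisioning code: the helper name layout_statements and the developer list are hypothetical, and the generated statements would still have to be run by a principal allowed to create catalogs in the metastore.

```python
# Hypothetical helper: emits DDL for catalog-level environment isolation.
SHARED_CATALOGS = ("dev", "staging", "prod")
SCHEMAS = ("bronze", "silver", "gold")  # simple names, no environment prefix


def layout_statements(developers):
    """Return CREATE statements for N sandbox catalogs plus the 3 shared ones."""
    catalogs = [f"{user}_sandbox" for user in developers] + list(SHARED_CATALOGS)
    statements = []
    for catalog in catalogs:
        statements.append(f"CREATE CATALOG IF NOT EXISTS {catalog}")
        for schema in SCHEMAS:
            statements.append(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
    return statements
```

Calling layout_statements(["taylor"]) yields statements for taylor_sandbox, dev, staging, and prod, each with the same three schemas, which is what lets the same code run against any environment by changing only the catalog.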

Consequences

Positive

  • Strong isolation: Catalog boundaries prevent accidental cross-environment access

  • Zero-copy data sharing: Sandbox can read from dev without storage duplication

  • Clear ownership: Each catalog has defined owners (users or service principals)

  • Fine-grained permissions: Unity Catalog supports row filters and column masks where table-level grants are not enough

  • Audit trail: All data access is logged in Unity Catalog audit logs

  • Compliance support: Access control, audit logs, and lineage provide evidence for SOC 2, GDPR, and HIPAA requirements

  • Future-proof: Can add new catalogs/environments without restructuring

  • Metadata search: Central catalog makes data discovery easier

Negative

  • Catalog count: N+3 catalogs to manage (one sandbox per developer for N developers, plus dev, staging, and prod); 10 developers means 13 catalogs

  • Permission complexity: Each catalog needs permission configuration

  • Cross-catalog queries: Require explicit catalog references (e.g., SELECT * FROM dev.bronze.messages)

  • Learning curve: Developers must understand the catalog concept and three-level names (catalog.schema.table)

Neutral

  • Shared metastore: All catalogs share metadata (good for discovery, potential single point of failure)

  • Naming conventions: Need clear standards to avoid confusion

Alternatives Considered

Option 1: Schema-Level Isolation (Single Catalog)

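Structure: A single catalog with environment-prefixed schemas (e.g., dev_bronze, prod_bronze, dev_taylor_bronze).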

Pros:

  • Simpler (single catalog)

  • Fewer permission boundaries

Cons:

  • Schema name pollution (dev_taylor_bronze, etc.)

  • Harder to enforce isolation (all schemas in one catalog)

  • No clear ownership boundaries

  • Doesn't scale well with many developers

Why rejected: Schema-level isolation doesn't provide strong enough boundaries. It is too easy to accidentally query the wrong schema (e.g., dev_bronze instead of prod_bronze).

Option 2: Separate Metastores per Environment

Structure:

  • Dev metastore → dev catalog

  • Staging metastore → staging catalog

  • Prod metastore → prod catalog

Pros:

  • Ultimate isolation

  • Complete separation of metadata

Cons:

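  • No direct cross-environment reads: tables in another metastore cannot be queried by three-level name, so sharing would require Delta Sharing or copies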

  • Complex setup and management

  • No central metadata discovery

  • Much higher cost (separate infrastructure per metastore)

Why rejected: Over-engineered for the use case. Unity Catalog provides sufficient isolation at catalog level without the overhead of separate metastores.

Option 3: Workspace-Level Isolation (No Unity Catalog)

Structure:

  • Each workspace has own data

  • No shared metadata

Pros:

  • Complete workspace isolation

  • Simple permissions (workspace-level)

Cons:

  • No zero-copy data sharing

  • No centralized governance

  • Hard to share data across environments

  • Loses Unity Catalog benefits (lineage, search, audit)

Why rejected: Unity Catalog is the governance layer Databricks is building on, and this option gives up its lineage, search, and audit features. Not using it would be technical debt from day one.

Implementation Notes

Data Sharing Pattern

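Sandbox → Dev: Sandbox workloads read dev tables in place through explicit catalog references (e.g., dev.bronze.messages); nothing is copied into the sandbox catalog.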
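
One way the Sandbox → Dev pattern could be expressed is sketched below. It assumes sharing is done with read grants on dev plus an optional view in the sandbox; the helper name sandbox_read_statements and the principal taylor@example.com are hypothetical, and the pattern actually used on the platform may differ.

```python
# Hypothetical helper: lets a developer read a dev table from their sandbox
# without copying data. A view stores only a query, not the rows.
def sandbox_read_statements(developer, principal, table="messages", schema="bronze"):
    sandbox = f"{developer}_sandbox"
    source = f"dev.{schema}.{table}"
    return [
        # Read-only path into the dev catalog for this developer.
        f"GRANT USE CATALOG ON CATALOG dev TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA dev.{schema} TO `{principal}`",
        f"GRANT SELECT ON TABLE {source} TO `{principal}`",
        # Optional convenience: same table name resolves inside the sandbox.
        f"CREATE VIEW IF NOT EXISTS {sandbox}.{schema}.{table} AS SELECT * FROM {source}",
    ]
```

With the view in place, code written against {catalog}.bronze.messages works unchanged in the sandbox while the underlying storage stays in dev.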

Staging → Prod (optional):

Permission Model

Sandbox Catalogs:

  • Owner: Individual developer (full access)

  • Read access: Databricks - Account Admin group (for support)

Dev Catalog:

  • Owner: ml-pipelines-dev service principal

  • Write access: ml-pipelines-dev service principal

  • Read access: All developers (including queries issued from their sandbox catalogs)

Staging Catalog:

  • Owner: ml-pipelines-staging service principal

  • Write access: ml-pipelines-staging service principal

  • Read access: Developers (run-only), staging service principal

Prod Catalog:

  • Owner: ml-pipelines-prod service principal

  • Write access: ml-pipelines-prod service principal

  • Read access: Service principals only (no direct developer access)
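
The shared-catalog part of this model can be written down as data and turned into grants, as in the sketch below. It is an illustration under assumptions: the developers group name and the grant_statements helper are hypothetical, "run-only" staging access is approximated as plain read access, and in a real workspace a service principal is usually referenced by its application ID rather than a display name such as ml-pipelines-dev.

```python
# Owners are taken from the permission model above; reader group is assumed.
PERMISSIONS = {
    "dev": {"owner": "ml-pipelines-dev", "readers": ["developers"]},
    "staging": {"owner": "ml-pipelines-staging", "readers": ["developers"]},
    "prod": {"owner": "ml-pipelines-prod", "readers": []},  # no developer access
}


def grant_statements(permissions):
    """Return ownership and read-grant statements for each shared catalog."""
    statements = []
    for catalog, spec in permissions.items():
        statements.append(f"ALTER CATALOG {catalog} OWNER TO `{spec['owner']}`")
        for reader in spec["readers"]:
            # Privileges granted on a catalog are inherited by its schemas and tables.
            statements.append(
                f"GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG {catalog} TO `{reader}`"
            )
    return statements
```

Keeping the model as a dictionary makes the prod rule easy to review: the absence of any reader entry is what keeps developers out.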

Schema Structure

Each catalog has consistent schema structure:

  • bronze: Raw ingested data

  • silver: Cleaned and transformed data

  • gold: Business-ready features and aggregations

  • models: ML model artifacts and metadata

This consistency makes code portable across environments.
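
A small sketch of what that portability looks like in practice, assuming pipelines receive the target catalog as a parameter; the helper name table_fqn is hypothetical.

```python
# Schemas that exist in every catalog, per the structure above.
ALLOWED_SCHEMAS = {"bronze", "silver", "gold", "models"}


def table_fqn(catalog, schema, table):
    """Build a three-level name; only the catalog varies between environments."""
    if schema not in ALLOWED_SCHEMAS:
        # Catches leftovers such as "dev_bronze" from a prefix-based layout.
        raise ValueError(f"unknown schema: {schema!r}")
    return f"{catalog}.{schema}.{table}"
```

The same call with catalog set to dev, staging, prod, or taylor_sandbox resolves to the matching table in that environment, so no environment names appear in pipeline logic.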

Naming Conventions

Catalog Names:

  • Sandbox: {databricks_username_before_@}_sandbox (e.g., taylor_sandbox)

  • Shared: dev, staging, prod (simple names)
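
The sandbox naming rule can be sketched as a small function. The function name sandbox_catalog_name is hypothetical, and the normalization step (lowercasing and replacing characters outside a-z, 0-9, and underscore) is an assumption added so that usernames such as first.last produce names that need no quoting; the convention above does not specify it.

```python
import re


def sandbox_catalog_name(username):
    """Derive {username_before_@}_sandbox from a Databricks username."""
    local_part = username.split("@", 1)[0]
    # Assumption: replace anything outside [a-z0-9_] with "_".
    safe = re.sub(r"[^a-z0-9_]", "_", local_part.lower())
    return f"{safe}_sandbox"
```

For example, taylor@example.com maps to taylor_sandbox, matching the example in the convention.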

Table Names:

  • No environment prefixes (catalog provides context)

  • Descriptive names: messages, calendar_events, work_items

  • Avoid abbreviations unless industry-standard
