Unity Catalog Architecture

Overview

This document describes the Unity Catalog structure used by the ML Pipelines platform: catalog organization, schema design, permission model, governance policies, and cross-catalog access patterns.

Unity Catalog Hierarchy

Unity Catalog Metastore (us-east-1)

├── {username}_sandbox (per developer)
│   ├── .../
│   ├── gold/
│   │   ├── Tables: experimental features
│   │   └── Views: ad-hoc analysis
│   └── models/
│       └── ML models: experimental versions

├── dev
│   ├── bronze/
│   │   ├── messages
│   │   ├── user_activity
│   │   └── metadata
│   ├── silver/
│   │   ├── messages (cleaned)
│   │   ├── users
│   │   └── interactions
│   ├── gold/
│   │   ├── sentiment_features
│   │   ├── ml_features
│   │   └── daily_metrics
│   └── models/
│       ├── sentiment_analysis
│       └── emotion_detection

├── staging (pre-production)
│   ├── bronze/ (reads from prod)
│   ├── silver/ (reads from prod)
│   ├── gold/
│   │   ├── sentiment_features
│   │   └── ml_features
│   └── models/
│       ├── sentiment_analysis (trained on prod data)
│       └── emotion_detection

└── prod (production)
    ├── bronze/
    │   ├── messages
    │   ├── user_activity
    │   └── metadata
    ├── silver/
    │   ├── messages
    │   ├── users
    │   └── interactions
    ├── gold/
    │   ├── sentiment_features
    │   ├── ml_features
    │   ├── daily_metrics
    │   └── weekly_aggregations
    └── models/
        ├── sentiment_analysis (promoted from staging)
        └── emotion_detection

Catalog Structure

Sandbox Catalogs

Purpose: Individual developer experimentation and testing

Naming Convention: {username}_sandbox

  • Examples: taylor_sandbox, william_sandbox

Created: Automatically, the first time a developer runs make deploy

Lifecycle:

  • Created: On-demand per developer

  • Retained: Indefinitely (until explicitly dropped)

  • Cleaned: Developer responsibility

Schemas:

Data Source: Reads from dev.bronze.* and dev.silver.* (no data duplication)

Permissions:


Dev Catalog

Purpose: Shared development and integration testing

Catalog Name: dev

Schemas:

Managed By: ml-pipelines-dev service principal (via CI/CD)

Data Sources:

  • S3: s3://ref-ml-core-dev-workspace-bucket/dev/volumes/

  • Or: Sample/synthetic data for testing

Permissions:


Staging Catalog

Purpose: Pre-production validation with production data

Catalog Name: staging

Schemas:

Managed By: ml-pipelines-staging service principal

Key Principle: Trains models on production data to ensure realistic validation

Data Sources:

  • Reads from prod.bronze.* and prod.silver.* (read-only)

  • Writes to staging.gold.* and staging.models.*

Permissions:


Prod Catalog

Purpose: Production workloads serving real users

Catalog Name: prod

Schemas:

Managed By: ml-pipelines-prod service principal

Key Principle: Models are PROMOTED from staging, not retrained

Data Sources:

  • S3: s3://ref-ml-core-prod-workspace-bucket/prod/volumes/

  • Real production data

Permissions:


Schema Organization

Bronze Schema

Purpose: Raw data ingestion from source systems

Tables:

  • messages: Social media messages

  • user_activity: User interaction events

  • metadata: System metadata and configs

Characteristics:

  • Append-only (no updates/deletes)

  • Minimal transformation

  • Schema evolution enabled

  • Partitioned by ingestion date

Example Table:
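The example table itself is not reproduced here. As a rough sketch of how the characteristics above could map onto a Delta table definition, the Python snippet below assembles a DDL string. The column names and the ingestion_date partition column are hypothetical. delta.appendOnly is a standard Delta table property that rejects updates and deletes; whether the platform sets it is an assumption, and it would have to be reconciled with any retention purge on this layer.

```python
def bronze_table_ddl(catalog: str, table: str, columns: dict[str, str]) -> str:
    """Assemble a CREATE TABLE statement for an append-only bronze table."""
    # Add the partition column (hypothetical name) to the caller's columns.
    cols = {**columns, "ingestion_date": "DATE"}
    col_sql = ",\n  ".join(f"{name} {dtype}" for name, dtype in cols.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {catalog}.bronze.{table} (\n  {col_sql}\n)\n"
        "USING DELTA\n"
        "PARTITIONED BY (ingestion_date)\n"
        # Assumption: append-only is enforced through a Delta table property.
        "TBLPROPERTIES ('delta.appendOnly' = 'true')"
    )


ddl = bronze_table_ddl(
    "dev",
    "messages",
    {"message_id": "STRING", "raw_payload": "STRING", "ingested_at": "TIMESTAMP"},
)
print(ddl)
```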


Silver Schema

Purpose: Cleaned, validated, and deduplicated data

Tables:

  • messages: Cleaned messages with parsed metadata

  • users: User profiles

  • interactions: User interaction events

Characteristics:

  • Strong schema enforcement

  • Data quality expectations

  • Deduplication

  • Type conversions

Example Table:
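The example table is not reproduced here. To illustrate the characteristics listed above (quality expectations, type conversion, deduplication), here is a minimal pure-Python sketch of a bronze-to-silver cleaning step. The field names and the keep-the-latest-row rule are assumptions; the real pipeline presumably does this in Spark or DLT.

```python
from datetime import datetime


def clean_messages(raw_rows: list[dict]) -> list[dict]:
    """Validate, convert types, and deduplicate raw message rows."""
    latest: dict[str, dict] = {}
    for row in raw_rows:
        # Data quality expectation: drop rows without an id or a body.
        if not row.get("message_id") or not row.get("body"):
            continue
        cleaned = {
            "message_id": str(row["message_id"]),
            "body": row["body"].strip(),
            # Type conversion: ISO-8601 string -> datetime.
            "created_at": datetime.fromisoformat(row["created_at"]),
        }
        # Deduplication: keep the most recent row per message_id.
        current = latest.get(cleaned["message_id"])
        if current is None or cleaned["created_at"] > current["created_at"]:
            latest[cleaned["message_id"]] = cleaned
    return sorted(latest.values(), key=lambda r: r["message_id"])


raw = [
    {"message_id": "m1", "body": " hello ", "created_at": "2024-01-01T10:00:00"},
    {"message_id": "m1", "body": "hello again", "created_at": "2024-01-01T11:00:00"},
    {"message_id": None, "body": "orphan", "created_at": "2024-01-01T12:00:00"},
    {"message_id": "m2", "body": "bye", "created_at": "2024-01-02T09:00:00"},
]
rows = clean_messages(raw)
```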


Gold Schema

Purpose: Business-ready datasets for analytics and ML

Tables:

  • sentiment_features: ML features with sentiment predictions

  • ml_features: Feature tables for model training

  • daily_metrics: Aggregated business metrics

  • weekly_aggregations: Time-series aggregations

Characteristics:

  • Denormalized for performance

  • ML predictions included

  • Business metrics computed

  • Optimized for queries

Example Table:
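The example table is not reproduced here. As an illustration of a denormalized, precomputed gold table, the sketch below rolls scored messages up into one row per day, in the spirit of daily_metrics. The input fields (date, sentiment) and the output columns are hypothetical.

```python
from collections import Counter, defaultdict


def daily_sentiment_metrics(rows: list[dict]) -> list[dict]:
    """Aggregate scored messages into one metrics row per day."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        counts[row["date"]][row["sentiment"]] += 1
    metrics = []
    for day, by_label in sorted(counts.items()):
        total = sum(by_label.values())
        metrics.append(
            {
                "date": day,
                "message_count": total,
                "positive_share": by_label["positive"] / total,
            }
        )
    return metrics


scored = [
    {"date": "2024-01-01", "sentiment": "positive"},
    {"date": "2024-01-01", "sentiment": "negative"},
    {"date": "2024-01-02", "sentiment": "positive"},
]
metrics = daily_sentiment_metrics(scored)
```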


Models Schema

Purpose: ML model registry (Unity Catalog integration)

Models (examples):

  • sentiment_analysis: Sentiment classification model

  • emotion_detection: Emotion recognition model

  • toxicity_detection: Content moderation model

Characteristics:

  • Versioned (MLflow integration)

  • Aliased (champion, challenger, archive)

  • Tracked lineage

  • Signature enforcement

Model Naming:

Model Metadata:
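The naming and metadata listings are not reproduced here. Assuming models follow the same three-level naming as tables ({catalog}.models.{model_name}) and are loaded by alias, a small helper could build the names as follows. The models:/name@alias form is the MLflow URI syntax for aliases; the helper functions themselves are hypothetical.

```python
VALID_ALIASES = {"champion", "challenger", "archive"}


def model_full_name(catalog: str, model: str) -> str:
    """Three-level registered model name in the models schema."""
    return f"{catalog}.models.{model}"


def model_uri(catalog: str, model: str, alias: str) -> str:
    """MLflow URI that resolves a registered model by alias."""
    if alias not in VALID_ALIASES:
        raise ValueError(f"unknown alias: {alias!r}")
    return f"models:/{model_full_name(catalog, model)}@{alias}"


print(model_uri("prod", "sentiment_analysis", "champion"))
```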


Permission Model

Least Privilege Principle

Each principal has minimal required permissions:

Service Principals:

  • Dev: Full access to dev only

  • Staging: Full access to staging, read-only to prod

  • Prod: Full access to prod, read-only to staging.models

Users:

  • Developers: Read-only to dev, full access to own sandbox

  • Leads: Full access to dev and staging, read-only to prod

  • Analysts: Read-only to *.gold schemas
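The access rules above can be written down as a small policy table, which is handy for reviewing or unit-testing a grant plan before applying it. This is only a sketch: the role keys, the scope syntax, and the is_allowed helper are illustrative and are not part of Unity Catalog.

```python
# role -> list of (scope, level). A scope is "catalog" or "catalog.schema";
# "*" matches any catalog and "{own}" expands to the caller's username.
POLICY = {
    "dev_service_principal": [("dev", "full")],
    "staging_service_principal": [("staging", "full"), ("prod", "read")],
    "prod_service_principal": [("prod", "full"), ("staging.models", "read")],
    "developer": [("dev", "read"), ("{own}_sandbox", "full")],
    "lead": [("dev", "full"), ("staging", "full"), ("prod", "read")],
    "analyst": [("*.gold", "read")],
}


def is_allowed(role: str, action: str, catalog: str, schema: str, username: str = "") -> bool:
    """Return True if the role may perform action ('read' or 'write') on catalog.schema."""
    for scope, level in POLICY[role]:
        scope = scope.replace("{own}", username)
        scope_catalog, _, scope_schema = scope.partition(".")
        catalog_ok = scope_catalog in ("*", catalog)
        schema_ok = scope_schema in ("", schema)  # empty means the whole catalog
        if catalog_ok and schema_ok and (level == "full" or action == "read"):
            return True
    return False


print(is_allowed("staging_service_principal", "write", "prod", "silver"))
```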

Grant Statements

Create Catalog:

Grant Catalog Permissions:

Revoke Permissions:
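The statements themselves are not reproduced here. The following sketch generates the three statement types named above from Python. The GRANT and REVOKE forms follow standard Unity Catalog SQL. The developers and former-contractors groups are hypothetical, and ml-pipelines-dev is used as a display name; a real workspace may need the service principal's application ID instead.

```python
def create_catalog(name: str, comment: str) -> str:
    """CREATE CATALOG statement carrying a purpose comment."""
    if "'" in comment:
        raise ValueError("comment must not contain single quotes")
    return f"CREATE CATALOG IF NOT EXISTS {name} COMMENT '{comment}'"


def grant(privileges: list[str], securable: str, name: str, principal: str) -> str:
    """GRANT statement; the principal is backtick-quoted."""
    return f"GRANT {', '.join(privileges)} ON {securable} {name} TO `{principal}`"


def revoke(privileges: list[str], securable: str, name: str, principal: str) -> str:
    """REVOKE statement mirroring grant()."""
    return f"REVOKE {', '.join(privileges)} ON {securable} {name} FROM `{principal}`"


statements = [
    create_catalog("dev", "Shared development and integration testing"),
    grant(["ALL PRIVILEGES"], "CATALOG", "dev", "ml-pipelines-dev"),
    # Read-only access, granted once at catalog level and inherited by schemas.
    grant(["USE CATALOG", "USE SCHEMA", "SELECT"], "CATALOG", "dev", "developers"),
    revoke(["SELECT"], "CATALOG", "dev", "former-contractors"),
]
for stmt in statements:
    print(stmt + ";")
```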


Isolation Strategy

Environment Isolation

Read-Write Isolation:

  • Each environment has its own catalog

  • No cross-environment writes

  • Cross-environment reads are allowed only where needed: staging reads from prod (training data), and prod reads from staging.models (model promotion)

Diagram:

Developer Isolation

Sandbox Catalogs:

  • One catalog per developer

  • Zero conflicts between developers

  • Reads from shared dev for data

Benefits:

  • Rapid experimentation

  • Safe testing

  • No waiting for shared resources

  • Easy cleanup


Naming Conventions

Catalog Names

Pattern: {environment} or {username}_sandbox

Examples:

  • dev

  • staging

  • prod

  • taylor_sandbox

  • william_sandbox

Schema Names

Medallion Layers:

  • bronze - Raw ingestion

  • silver - Cleaned data

  • gold - Business-ready

Special Schemas:

  • models - ML model registry

  • monitoring - Observability data

  • experiments - Ad-hoc experiments

Table Names

Pattern: {entity} or {domain}_{entity} (snake_case)

Examples:

  • messages

  • user_activity

  • sentiment_features

  • daily_metrics

Model Names

Pattern: {model_purpose} (snake_case)

Examples:

  • sentiment_analysis

  • emotion_detection

  • toxicity_detection
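These conventions are easy to check mechanically, for example in CI before a deploy. The validator below is a sketch: the regular expression is one reasonable reading of snake_case, and the rule that the username part of a sandbox catalog must itself be snake_case is an assumption.

```python
import re

ENVIRONMENTS = {"dev", "staging", "prod"}
SANDBOX_SUFFIX = "_sandbox"
# Lowercase words separated by single underscores, starting with a letter.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def is_snake_case(name: str) -> bool:
    """Check schema, table, and model names."""
    return bool(SNAKE_CASE.match(name))


def is_valid_catalog(name: str) -> bool:
    """Accept an environment name or {username}_sandbox."""
    if name in ENVIRONMENTS:
        return True
    if not name.endswith(SANDBOX_SUFFIX):
        return False
    return is_snake_case(name[: -len(SANDBOX_SUFFIX)])


for candidate in ("dev", "taylor_sandbox", "Taylor_sandbox", "qa"):
    print(candidate, is_valid_catalog(candidate))
```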


Governance Policies

Data Classification

  • Public: No restrictions (e.g., aggregated metrics)

  • Internal: Restricted to employees (e.g., user behavior)

  • Confidential: Restricted access (e.g., PII)

  • Restricted: Admin-only (e.g., security logs)

Implementation (via tags):
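The tagging statements are not reproduced here. A minimal sketch, assuming each table carries a single tag whose key is classification (the key name is an assumption), using the Unity Catalog ALTER TABLE ... SET TAGS syntax:

```python
CLASSIFICATIONS = ("public", "internal", "confidential", "restricted")


def classification_tag_sql(table: str, level: str) -> str:
    """Build an ALTER TABLE statement that tags a table with its classification."""
    if level not in CLASSIFICATIONS:
        raise ValueError(f"unknown classification: {level!r}")
    if table.count(".") != 2:
        raise ValueError("table must be a three-part name: catalog.schema.table")
    return f"ALTER TABLE {table} SET TAGS ('classification' = '{level}')"


print(classification_tag_sql("prod.silver.users", "confidential"))
```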

Data Retention

  • Bronze: 90 days (regulatory compliance)

  • Silver: 365 days (historical analysis)

  • Gold: 730 days (long-term trends)

  • Models: Permanent (all versions retained)

Enforcement:
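The enforcement mechanism is not reproduced here. Whatever job performs the purge needs a cutoff date per layer; the sketch below computes it from the retention periods above. How rows older than the cutoff are actually removed (scheduled job, table properties, or something else) is not specified in this document.

```python
from datetime import date, timedelta
from typing import Optional

# Retention in days per schema; None means keep forever.
RETENTION_DAYS = {"bronze": 90, "silver": 365, "gold": 730, "models": None}


def retention_cutoff(schema: str, today: date) -> Optional[date]:
    """Oldest date still inside the retention window, or None if permanent."""
    days = RETENTION_DAYS[schema]
    if days is None:
        return None
    return today - timedelta(days=days)


print(retention_cutoff("bronze", date(2024, 4, 30)))
```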

Data Quality

Expectations: Defined in DLT pipelines

  • Bronze: Permissive (log failures)

  • Silver: Enforcing (drop invalid)

  • Gold: Strict (fail pipeline)
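The three enforcement levels correspond to the three DLT expectation decorators: dlt.expect (record the violation, keep the row), dlt.expect_or_drop, and dlt.expect_or_fail. Since DLT code only runs inside a pipeline, the sketch below simulates the same semantics in plain Python; the function and exception names are hypothetical.

```python
class ExpectationFailed(Exception):
    """Raised when a strict (gold) expectation is violated."""


# Layer -> behaviour on a violated expectation.
MODES = {"bronze": "warn", "silver": "drop", "gold": "fail"}


def apply_expectation(layer, rows, predicate):
    """Return (kept_rows, violation_count) according to the layer's mode."""
    mode = MODES[layer]
    kept, violations = [], 0
    for row in rows:
        if predicate(row):
            kept.append(row)
            continue
        violations += 1
        if mode == "fail":
            raise ExpectationFailed(f"{layer}: invalid row {row!r}")
        if mode == "warn":
            kept.append(row)  # keep the row, only record the failure
    return kept, violations


sample = [{"id": 1}, {"id": None}]
has_id = lambda row: row["id"] is not None
print(apply_expectation("bronze", sample, has_id))
print(apply_expectation("silver", sample, has_id))
```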


Cross-Catalog Access Patterns

Pattern 1: Staging Reads Prod (Model Training)

Use Case: Train models on production data

Implementation:

Permissions:
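The implementation and permission listings are not reproduced here. As a sketch of what this pattern implies, the snippet below builds read-only grants on prod.bronze and prod.silver for the staging service principal, plus a statement that writes a feature table into staging.gold. The SELECT * source query is a placeholder; the real feature logic is not described in this document.

```python
def staging_training_plan(principal: str = "ml-pipelines-staging") -> dict:
    """Grants and build statement for training in staging on prod data."""
    grants = [f"GRANT USE CATALOG ON CATALOG prod TO `{principal}`"]
    for schema in ("bronze", "silver"):
        grants.append(f"GRANT USE SCHEMA ON SCHEMA prod.{schema} TO `{principal}`")
        grants.append(f"GRANT SELECT ON SCHEMA prod.{schema} TO `{principal}`")
    # Reads cross the catalog boundary; writes stay inside staging.
    build = (
        "CREATE OR REPLACE TABLE staging.gold.ml_features AS "
        "SELECT * FROM prod.silver.messages"
    )
    return {"grants": grants, "build": build}


plan = staging_training_plan()
for stmt in plan["grants"] + [plan["build"]]:
    print(stmt + ";")
```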


Pattern 2: Prod Reads Staging (Model Promotion)

Use Case: Promote model binary from staging to prod

Implementation:

Permissions:
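The implementation and permission listings are not reproduced here. One plausible way to promote without retraining is to copy the aliased staging version into the prod registry. The sketch below only assembles the arguments and grants. The key names src_model_uri and dst_name match the parameters of MlflowClient.copy_model_version in recent MLflow releases, but using that call, and granting EXECUTE at schema level, are assumptions about this platform.

```python
def promotion_plan(model: str, alias: str = "champion",
                   principal: str = "ml-pipelines-prod") -> dict:
    """Describe a staging -> prod promotion without executing it."""
    return {
        # Source: the aliased version in the staging registry.
        "src_model_uri": f"models:/staging.models.{model}@{alias}",
        # Destination: same model name in the prod registry.
        "dst_name": f"prod.models.{model}",
        # Read-only access the prod principal needs on staging.models.
        "grants": [
            f"GRANT USE CATALOG ON CATALOG staging TO `{principal}`",
            f"GRANT USE SCHEMA ON SCHEMA staging.models TO `{principal}`",
            f"GRANT EXECUTE ON SCHEMA staging.models TO `{principal}`",
        ],
    }


promo = promotion_plan("sentiment_analysis")
print(promo["src_model_uri"], "->", promo["dst_name"])
```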


Pattern 3: Sandbox Reads Dev (Data Exploration)

Use Case: Developer experiments with shared dev data

Implementation:

Permissions:
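The implementation and permission listings are not reproduced here. A sketch of the pattern: the developer gets read-only access to a dev schema and defines a view in their own sandbox, so no data is copied. The view name recent_messages, the choice of dev.silver.messages, and the use of an email address as the principal are illustrative.

```python
def sandbox_exploration_statements(username: str, principal: str) -> list[str]:
    """Read-only grants on dev.silver plus a sandbox view over shared data."""
    sandbox = f"{username}_sandbox"
    return [
        f"GRANT USE CATALOG ON CATALOG dev TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA dev.silver TO `{principal}`",
        f"GRANT SELECT ON SCHEMA dev.silver TO `{principal}`",
        # A view reads dev data in place instead of duplicating it.
        f"CREATE OR REPLACE VIEW {sandbox}.gold.recent_messages AS "
        "SELECT * FROM dev.silver.messages",
    ]


stmts = sandbox_exploration_statements("taylor", "taylor@example.com")
for stmt in stmts:
    print(stmt + ";")
```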


Best Practices

1. Always Use Three-Part Names

2. Grant Permissions at Catalog Level (Not Table)

3. Use Service Principals for Automation

4. Document Catalog Purpose

5. Tag Resources for Discovery
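The first practice above (three-part names) can be enforced in code that builds queries, so a table reference never depends on the session's current catalog or schema. A minimal sketch with a hypothetical helper:

```python
def parse_table_name(name: str) -> tuple[str, str, str]:
    """Split catalog.schema.table, rejecting anything that is not three-part."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {name!r}")
    catalog, schema, table = parts
    return catalog, schema, table


print(parse_table_name("prod.gold.daily_metrics"))
```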

