Unity Catalog Architecture

Overview

This document describes the Unity Catalog structure used by the ML Pipelines platform: catalog organization, schema design, permission model, governance policies, and cross-catalog access patterns.

Unity Catalog Hierarchy

Unity Catalog Metastore (us-east-1)
│
├── {username}_sandbox (per developer)
│   ├── .../
│   ├── gold/
│   │   ├── Tables: experimental features
│   │   └── Views: ad-hoc analysis
│   └── models/
│       └── ML models: experimental versions
│
├── dev
│   ├── bronze/
│   │   ├── messages
│   │   ├── user_activity
│   │   └── metadata
│   ├── silver/
│   │   ├── messages (cleaned)
│   │   ├── users
│   │   └── interactions
│   ├── gold/
│   │   ├── sentiment_features
│   │   ├── ml_features
│   │   └── daily_metrics
│   └── models/
│       ├── sentiment_analysis
│       └── emotion_detection
│
├── staging (pre-production)
│   ├── bronze/ (reads from prod)
│   ├── silver/ (reads from prod)
│   ├── gold/
│   │   ├── sentiment_features
│   │   └── ml_features
│   └── models/
│       ├── sentiment_analysis (trained on prod data)
│       └── emotion_detection
│
└── prod (production)
    ├── bronze/
    │   ├── messages
    │   ├── user_activity
    │   └── metadata
    ├── silver/
    │   ├── messages
    │   ├── users
    │   └── interactions
    ├── gold/
    │   ├── sentiment_features
    │   ├── ml_features
    │   ├── daily_metrics
    │   └── weekly_aggregations
    └── models/
        ├── sentiment_analysis (promoted from staging)
        └── emotion_detection

Catalog Structure

Sandbox Catalogs

Purpose: Individual developer experimentation and testing

Naming Convention: {username}_sandbox

Examples: taylor_sandbox, william_sandbox

Created: Automatically, the first time the developer runs make deploy

Lifecycle:

Created: On-demand per developer
Retained: Indefinitely (until explicitly dropped)
Cleaned: Developer responsibility
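The {username}_sandbox convention can be expressed as a small helper. This is a minimal sketch, not the logic that make deploy actually runs: the function name and the normalization rules (taking the part before "@", lowercasing, replacing other characters with underscores) are assumptions made for illustration.

```python
import re


def sandbox_catalog_name(user: str) -> str:
    """Derive a catalog name of the form {username}_sandbox.

    Assumption: an email address is reduced to its local part, and any
    character outside [a-z0-9_] is replaced with an underscore.
    """
    local_part = user.split("@", 1)[0].lower()
    username = re.sub(r"[^a-z0-9_]", "_", local_part)
    if not username:
        raise ValueError("cannot derive a username from an empty value")
    return f"{username}_sandbox"


print(sandbox_catalog_name("taylor"))
print(sandbox_catalog_name("William"))
```

Normalizing to lowercase with underscores keeps sandbox names consistent with the snake_case style used for the other catalog objects.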

Schemas:

{username}_sandbox/
├── gold/           # Experimental features and transformations
└── models/         # Experimental ML models

Data Source: Reads from dev.bronze.* and dev.silver.* (no data duplication)

Permissions:

-- Full access to owner
GRANT ALL PRIVILEGES ON CATALOG taylor_sandbox TO `[email protected]`;

-- Read access to dev (USE SCHEMA is required in addition to SELECT)
GRANT USE CATALOG ON CATALOG dev TO `[email protected]`;
GRANT USE SCHEMA, SELECT ON SCHEMA dev.bronze TO `[email protected]`;
GRANT USE SCHEMA, SELECT ON SCHEMA dev.silver TO `[email protected]`;
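Because every sandbox needs the same set of grants, the statements can be generated rather than written by hand. The sketch below is one possible way to do that. The function name and the example principal are hypothetical, and the statements use the ON CATALOG / ON SCHEMA securable form of GRANT.

```python
def sandbox_grants(principal: str, sandbox_catalog: str) -> list[str]:
    """Build GRANT statements: full access to the sandbox, read-only access to dev."""
    who = f"`{principal}`"  # principals are quoted with backticks
    statements = [
        f"GRANT ALL PRIVILEGES ON CATALOG {sandbox_catalog} TO {who};",
        f"GRANT USE CATALOG ON CATALOG dev TO {who};",
    ]
    # Sandboxes read from the shared bronze and silver layers only.
    for schema in ("bronze", "silver"):
        statements.append(
            f"GRANT USE SCHEMA, SELECT ON SCHEMA dev.{schema} TO {who};"
        )
    return statements


# Hypothetical principal, for illustration only.
for statement in sandbox_grants("taylor@example.com", "taylor_sandbox"):
    print(statement)
```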

Dev Catalog

Purpose: Shared development and integration testing

Catalog Name: dev

Schemas:

dev/
├── bronze/         # Raw data ingestion
├── silver/         # Cleaned and validated
├── gold/           # Feature engineering
└── models/         # Development model versions

Managed By: ml-pipelines-dev service principal (via CI/CD)

Data Sources:

S3: s3://ref-ml-core-dev-workspace-bucket/dev/volumes/
Or: Sample/synthetic data for testing

Permissions:

-- Service principal: full access
GRANT ALL PRIVILEGES ON CATALOG dev TO `03ff99cd-a352-40bb-9d33-414c9ad9e7aa`;

-- Developers: read-only
GRANT USE CATALOG ON CATALOG dev TO `developers`;
GRANT USE SCHEMA, SELECT ON CATALOG dev TO `developers`;

Staging Catalog

Purpose: Pre-production validation with production data

Catalog Name: staging

Schemas:

staging/
├── bronze/         # Links to prod.bronze (via views)
├── silver/         # Links to prod.silver (via views)
├── gold/           # Staging transformations (on prod data)
└── models/         # Models trained on prod data

Managed By: ml-pipelines-staging service principal

Key Principle: Trains models on production data to ensure realistic validation

Data Sources:

Reads from prod.bronze.* and prod.silver.* (read-only)
Writes to staging.gold.* and staging.models.*

Permissions:

-- Service principal: full access to staging
GRANT ALL PRIVILEGES ON CATALOG staging TO `93bda7cf-b009-49d8-8e8d-046677c8597e`;

-- Service principal: read-only to prod
GRANT USE CATALOG ON CATALOG prod TO `93bda7cf-b009-49d8-8e8d-046677c8597e`;
GRANT USE SCHEMA, SELECT ON SCHEMA prod.bronze TO `93bda7cf-b009-49d8-8e8d-046677c8597e`;
GRANT USE SCHEMA, SELECT ON SCHEMA prod.silver TO `93bda7cf-b009-49d8-8e8d-046677c8597e`;

-- Databricks - Staging group: can run jobs, view results
-- CAN_RUN and CAN_VIEW are workspace object permissions (set on the jobs and pipelines
-- themselves), not Unity Catalog privileges. The grants below cover read access to results.
GRANT USE CATALOG ON CATALOG staging TO `Databricks - Staging`;
GRANT USE SCHEMA, SELECT ON CATALOG staging TO `Databricks - Staging`;

Prod Catalog

Purpose: Production workloads serving real users

Catalog Name: prod

Schemas:

prod/
├── bronze/         # Production raw data
├── silver/         # Production processed data
├── gold/           # Production features and metrics
└── models/         # Production models (promoted from staging)

Managed By: ml-pipelines-prod service principal

Key Principle: Models are PROMOTED from staging, not retrained

Data Sources:

S3: s3://ref-ml-core-prod-workspace-bucket/prod/volumes/
Real production data

Permissions:

-- Service principal: full access to prod
GRANT ALL PRIVILEGES ON CATALOG prod TO `2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f`;

-- Service principal: read-only to staging.models (for promotion)
-- EXECUTE is the privilege that allows loading registered models
GRANT USE CATALOG ON CATALOG staging TO `2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f`;
GRANT USE SCHEMA, EXECUTE ON SCHEMA staging.models TO `2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f`;

-- Lead developer: read-only (debugging)
GRANT USE CATALOG ON CATALOG prod TO `[email protected]`;
GRANT USE SCHEMA, SELECT ON CATALOG prod TO `[email protected]`;

-- Analysts: read-only to gold
GRANT USE CATALOG ON CATALOG prod TO `analysts`;
GRANT USE SCHEMA, SELECT ON SCHEMA prod.gold TO `analysts`;

Schema Organization

Bronze Schema

Purpose: Raw data ingestion from source systems

Tables:

messages: Social media messages
user_activity: User interaction events
metadata: System metadata and configs

Characteristics:

Append-only (no updates/deletes)
Minimal transformation
Schema evolution enabled
Partitioned by ingestion date

Example Table:

CREATE TABLE bronze.messages (
    message_id STRING NOT NULL,
    platform STRING NOT NULL,
    content STRING,
    author_id STRING,
    created_date TIMESTAMP,
    metadata STRING,
    _ingestion_timestamp TIMESTAMP,
    _source_file STRING,
    -- Partitioning works on columns, not expressions, so the date is a generated column
    _ingestion_date DATE GENERATED ALWAYS AS (CAST(_ingestion_timestamp AS DATE))
)
PARTITIONED BY (_ingestion_date)
TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

Silver Schema

Purpose: Cleaned, validated, and deduplicated data

Tables:

messages: Cleaned messages with parsed metadata
users: User profiles
interactions: User interaction events

Characteristics:

Strong schema enforcement
Data quality expectations
Deduplication
Type conversions

Example Table:

CREATE TABLE silver.messages (
    -- Primary key columns must be declared NOT NULL
    message_id STRING NOT NULL PRIMARY KEY,
    platform STRING NOT NULL,
    content STRING NOT NULL,
    content_length INT,
    author_id STRING,
    created_date TIMESTAMP NOT NULL,
    language STRING,
    metadata STRUCT<
        source: STRING,
        device: STRING,
        location: STRING
    >,
    _processing_timestamp TIMESTAMP,
    -- Partitioning works on columns, not expressions, so the day is a generated column
    created_day DATE GENERATED ALWAYS AS (CAST(created_date AS DATE))
)
PARTITIONED BY (created_day)
TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true'
);

Gold Schema

Purpose: Business-ready datasets for analytics and ML

Tables:

sentiment_features: ML features with sentiment predictions
ml_features: Feature tables for model training
daily_metrics: Aggregated business metrics
weekly_aggregations: Time-series aggregations

Characteristics:

Denormalized for performance
ML predictions included
Business metrics computed
Optimized for queries

Example Table:

CREATE TABLE gold.sentiment_features (
    message_id STRING NOT NULL PRIMARY KEY,
    content STRING,
    sentiment STRING,
    sentiment_score DECIMAL(10,6),
    emotion STRING,
    toxicity_score DECIMAL(10,6),
    key_phrases ARRAY<STRING>,
    entities ARRAY<STRUCT<text:STRING, type:STRING>>,
    created_date TIMESTAMP,
    _prediction_timestamp TIMESTAMP
)
CLUSTER BY (sentiment, created_date);

Models Schema

Purpose: ML model registry (Unity Catalog integration)

Models (examples):

sentiment_analysis: Sentiment classification model
emotion_detection: Emotion recognition model
toxicity_detection: Content moderation model

Characteristics:

Versioned (MLflow integration)
Aliased (champion, challenger, archive)
Tracked lineage
Signature enforcement

Model Naming:

{catalog}.models.{model_name}

Examples:
- dev.models.sentiment_analysis
- staging.models.sentiment_analysis
- prod.models.sentiment_analysis
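The naming pattern lends itself to a pair of small helpers, so that pipeline code never assembles model names by hand. This is a sketch: both function names are hypothetical. The models:/...@alias URI form is the one used in the promotion example later in this document.

```python
def model_full_name(catalog: str, model_name: str) -> str:
    """Return the three-part registry name {catalog}.models.{model_name}."""
    return f"{catalog}.models.{model_name}"


def model_alias_uri(catalog: str, model_name: str, alias: str) -> str:
    """Return a models:/ URI that refers to a model version by alias."""
    return f"models:/{model_full_name(catalog, model_name)}@{alias}"


for catalog in ("dev", "staging", "prod"):
    print(model_full_name(catalog, "sentiment_analysis"))

print(model_alias_uri("staging", "sentiment_analysis", "champion"))
```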

Model Metadata:

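import mlflow
from mlflow.tracking import MlflowClient

# Register models in Unity Catalog rather than the workspace registry
mlflow.set_registry_uri("databricks-uc")

# Model registration with tags
# (catalog, model, signature, and requirements are defined earlier in the pipeline)
mlflow.pyfunc.log_model(
    python_model=model,
    artifact_path="model",
    registered_model_name=f"{catalog}.models.sentiment_analysis",
    signature=signature,
    pip_requirements=requirements
)

# Set tags on the registered version
client = MlflowClient()
client.set_model_version_tag(
    name=f"{catalog}.models.sentiment_analysis",
    version="1",
    key="git_commit",
    value="abc123"
)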

Permission Model

Least Privilege Principle

Each principal has minimal required permissions:

Service Principals:

Dev: Full access to dev only
Staging: Full access to staging, read-only to prod
Prod: Full access to prod, read-only to staging.models

Users:

Developers: Read-only to dev, full access to own sandbox
Lead: Full access to dev and staging, read-only to prod
Analysts: Read-only to *.gold schemas
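The matrix above can be captured as data and checked in tests before any grants are applied. The following is a minimal sketch under stated assumptions: the principal labels (sp-dev, sp-staging, and so on) are hypothetical, the policy mirrors the summary lists rather than the individual GRANT statements, and the per-developer sandbox rule is left out for brevity.

```python
from fnmatch import fnmatch

# principal -> list of (catalog pattern, schema pattern, access level)
POLICY = {
    "sp-dev": [("dev", "*", "full")],
    "sp-staging": [("staging", "*", "full"), ("prod", "*", "read")],
    "sp-prod": [("prod", "*", "full"), ("staging", "models", "read")],
    "developers": [("dev", "*", "read")],
    "lead": [("dev", "*", "full"), ("staging", "*", "full"), ("prod", "*", "read")],
    "analysts": [("*", "gold", "read")],
}


def access_level(principal: str, catalog: str, schema: str) -> str:
    """Return "full", "read", or "none" for a principal on catalog.schema."""
    for catalog_pattern, schema_pattern, level in POLICY.get(principal, []):
        if fnmatch(catalog, catalog_pattern) and fnmatch(schema, schema_pattern):
            return level
    return "none"  # least privilege: anything not listed is denied


print(access_level("sp-staging", "prod", "bronze"))
print(access_level("sp-prod", "staging", "gold"))
```

Defaulting to "none" makes the least-privilege rule explicit: access exists only where the policy names it.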

Grant Statements

Create Catalog:

CREATE CATALOG IF NOT EXISTS dev
COMMENT 'Development catalog for ML pipelines';

Grant Catalog Permissions:

-- Service principal
GRANT ALL PRIVILEGES ON CATALOG dev TO `03ff99cd-a352-40bb-9d33-414c9ad9e7aa`;

-- Group
GRANT USE CATALOG ON CATALOG dev TO `developers`;
GRANT USE SCHEMA, SELECT ON CATALOG dev TO `developers`;

-- Individual user
GRANT ALL PRIVILEGES ON CATALOG taylor_sandbox TO `[email protected]`;

Revoke Permissions:

REVOKE SELECT ON dev.bronze.messages FROM `[email protected]`;

Isolation Strategy

Environment Isolation

Read-Write Isolation:

Each environment has its own catalog
No cross-environment writes
Exception: Staging reads from prod (training data)

Diagram (⊗ marks a denied operation):

Sandbox → Read → Dev ⊗ Write
Dev → ⊗ Read/Write → Staging
Dev → ⊗ Read/Write → Prod

Staging → Read → Prod ⊗ Write
Staging → ⊗ Read/Write → Dev

Prod → ⊗ Read/Write → Dev
Prod → ⊗ Read/Write → Staging
Prod → Read → Staging.models ⊗ Write (promotion only)

Developer Isolation

Sandbox Catalogs:

One catalog per developer
Zero conflicts between developers
Reads from shared dev for data

Benefits:

Rapid experimentation
Safe testing
No waiting for shared resources
Easy cleanup

Naming Conventions

Catalog Names

Pattern: {environment} or {username}_sandbox

Examples:

dev
staging
prod
taylor_sandbox
william_sandbox

Schema Names

Medallion Layers:

bronze - Raw ingestion
silver - Cleaned data
gold - Business-ready

Special Schemas:

models - ML model registry
monitoring - Observability data
experiments - Ad-hoc experiments

Table Names

Pattern: {domain}_{entity} (snake_case)

Examples:

messages
user_activity
sentiment_features
daily_metrics
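A naming rule like this is easy to enforce in CI. The check below is a sketch of one way to validate snake_case names. The regular expression is an assumption about what the platform accepts (lowercase letters and digits, single underscores as separators, a leading letter), not a rule stated in this document.

```python
import re

# Lowercase words separated by single underscores, starting with a letter.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def is_snake_case(name: str) -> bool:
    """Return True if the whole name is snake_case."""
    return SNAKE_CASE.fullmatch(name) is not None


for candidate in ("messages", "user_activity", "UserActivity", "daily__metrics"):
    print(candidate, is_snake_case(candidate))
```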

Model Names

Pattern: {model_purpose} (snake_case)

Examples:

sentiment_analysis
emotion_detection
toxicity_classifier

Governance Policies

Data Classification

Public: No restrictions (e.g., aggregated metrics)
Internal: Restricted to employees (e.g., user behavior)
Confidential: Restricted access (e.g., PII)
Restricted: Admin-only (e.g., security logs)

Implementation (via tags):

ALTER TABLE prod.silver.messages
SET TAGS ('classification' = 'internal', 'pii' = 'true');
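To keep classification values limited to the four levels listed above, the tagging statement can be generated by a helper that rejects anything else. This is a sketch, and the function name is hypothetical. It emits the same ALTER TABLE ... SET TAGS form shown in the example.

```python
CLASSIFICATIONS = ("public", "internal", "confidential", "restricted")


def classification_tag_sql(table: str, classification: str, pii: bool) -> str:
    """Build an ALTER TABLE ... SET TAGS statement for a classified table."""
    level = classification.lower()
    if level not in CLASSIFICATIONS:
        raise ValueError(f"unknown classification: {classification!r}")
    pii_value = "true" if pii else "false"
    return (
        f"ALTER TABLE {table}\n"
        f"SET TAGS ('classification' = '{level}', 'pii' = '{pii_value}');"
    )


print(classification_tag_sql("prod.silver.messages", "internal", pii=True))
```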

Data Retention

Bronze: 90 days (regulatory compliance)
Silver: 365 days (historical analysis)
Gold: 730 days (long-term trends)
Models: Permanent (all versions retained)

Enforcement:

-- Automated cleanup job
DELETE FROM bronze.messages
WHERE _ingestion_timestamp < CURRENT_DATE() - INTERVAL 90 DAYS;

VACUUM bronze.messages RETAIN 168 HOURS;
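The retention periods translate directly into cutoff dates. The sketch below computes the cutoff for each layer. The names are hypothetical, and the values are taken from the retention list above, with None standing for "retained permanently".

```python
from datetime import date, timedelta
from typing import Optional

# Retention in days per layer; None means the layer is never cleaned up.
RETENTION_DAYS = {"bronze": 90, "silver": 365, "gold": 730, "models": None}


def retention_cutoff(layer: str, today: date) -> Optional[date]:
    """Return the date before which rows in a layer are eligible for deletion."""
    days = RETENTION_DAYS[layer]
    if days is None:
        return None
    return today - timedelta(days=days)


print(retention_cutoff("bronze", date(2024, 4, 1)))  # 2024-01-02
print(retention_cutoff("models", date(2024, 4, 1)))  # None
```

The bronze cutoff corresponds to the CURRENT_DATE() - INTERVAL 90 DAYS expression in the cleanup job.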

Data Quality

Expectations: Defined in DLT pipelines

Bronze: Permissive (log failures)
Silver: Enforcing (drop invalid)
Gold: Strict (fail pipeline)
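The three enforcement levels correspond to the behaviors DLT offers through its expect, expect_or_drop, and expect_or_fail decorators. The plain-Python sketch below only illustrates the semantics of each level. It is not DLT code, and the function and mode names are hypothetical.

```python
from typing import Callable

# Enforcement mode per layer, following the list above.
MODE_BY_LAYER = {"bronze": "warn", "silver": "drop", "gold": "fail"}


def apply_expectation(
    rows: list[dict], is_valid: Callable[[dict], bool], layer: str
) -> tuple[list[dict], int]:
    """Apply one expectation to rows and return (kept rows, failure count)."""
    mode = MODE_BY_LAYER[layer]
    kept, failures = [], 0
    for row in rows:
        if is_valid(row):
            kept.append(row)
            continue
        failures += 1
        if mode == "fail":
            raise ValueError(f"{layer}: expectation violated by {row!r}")
        if mode == "warn":
            kept.append(row)  # permissive: keep the row, only record the failure
        # mode == "drop": the invalid row is discarded
    return kept, failures


rows = [{"content": "hello"}, {"content": None}]
has_content = lambda row: row["content"] is not None

print(apply_expectation(rows, has_content, "bronze"))
print(apply_expectation(rows, has_content, "silver"))
```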

Cross-Catalog Access Patterns

Pattern 1: Staging Reads Prod (Model Training)

Use Case: Train models on production data

Implementation:

-- In staging workspace
CREATE VIEW staging.bronze.messages AS
SELECT * FROM prod.bronze.messages;

-- Use in training pipeline
CREATE TABLE staging.gold.ml_features AS
SELECT
    message_id,
    content,
    sentiment_label
FROM staging.bronze.messages;  -- Actually reading from prod

Permissions:

GRANT USE CATALOG ON CATALOG prod TO `93bda7cf-b009-49d8-8e8d-046677c8597e`;
GRANT USE SCHEMA, SELECT ON SCHEMA prod.bronze TO `93bda7cf-b009-49d8-8e8d-046677c8597e`;
GRANT USE SCHEMA, SELECT ON SCHEMA prod.silver TO `93bda7cf-b009-49d8-8e8d-046677c8597e`;
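The same view pattern applies to every bronze and silver table that staging reads from prod, so the statements can be generated from a table list. This is a sketch: the function name is hypothetical, the table list is copied from the hierarchy at the top of this document, and IF NOT EXISTS is added so the script can be re-run.

```python
# Tables that staging exposes as views over prod, per the catalog hierarchy.
PROD_TABLES = {
    "bronze": ["messages", "user_activity", "metadata"],
    "silver": ["messages", "users", "interactions"],
}


def staging_view_statements(tables: dict[str, list[str]]) -> list[str]:
    """Build one CREATE VIEW statement per prod table mirrored into staging."""
    statements = []
    for schema, names in tables.items():
        for name in names:
            statements.append(
                f"CREATE VIEW IF NOT EXISTS staging.{schema}.{name} AS "
                f"SELECT * FROM prod.{schema}.{name};"
            )
    return statements


for statement in staging_view_statements(PROD_TABLES):
    print(statement)
```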

Pattern 2: Prod Reads Staging (Model Promotion)

Use Case: Promote model binary from staging to prod

Implementation:

import mlflow
from mlflow.tracking import MlflowClient

# Target the Unity Catalog model registry
mlflow.set_registry_uri("databricks-uc")

client = MlflowClient()

# Copy model from staging to prod
staging_model_uri = "models:/staging.models.sentiment_analysis@champion"

mlflow.register_model(
    model_uri=staging_model_uri,
    name="prod.models.sentiment_analysis",
    tags={"promoted_from": "staging", "version": "2"}
)

Permissions:

GRANT USE CATALOG ON CATALOG staging TO `2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f`;
GRANT USE SCHEMA, EXECUTE ON SCHEMA staging.models TO `2af4b95a-d80f-4da6-bfba-9ec5a8a8ec9f`;

Pattern 3: Sandbox Reads Dev (Data Exploration)

Use Case: Developer experiments with shared dev data

Implementation:

# In sandbox notebook
dev_messages = spark.table("dev.bronze.messages")

# Transform and write to sandbox
experimental_features = (
    dev_messages
        .select("message_id", "content")
        .withColumn("feature_1", ...)
)

experimental_features.write.saveAsTable("taylor_sandbox.gold.features_v1")

Permissions:

-- Developer reads from dev
GRANT USE CATALOG ON CATALOG dev TO `[email protected]`;
GRANT USE SCHEMA, SELECT ON SCHEMA dev.bronze TO `[email protected]`;
GRANT USE SCHEMA, SELECT ON SCHEMA dev.silver TO `[email protected]`;

-- Developer writes to own sandbox
GRANT ALL PRIVILEGES ON CATALOG taylor_sandbox TO `[email protected]`;

Best Practices

1. Always Use Three-Part Names

-- Good: Explicit catalog and schema
SELECT * FROM prod.silver.messages;

-- Bad: Ambiguous (depends on current catalog)
SELECT * FROM silver.messages;
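Pipeline code can enforce this rule by refusing any table reference that is not fully qualified. The helper below is a minimal sketch with a hypothetical name. It does not handle backtick-quoted identifiers that contain dots.

```python
def split_qualified_name(name: str) -> tuple[str, str, str]:
    """Split catalog.schema.table, rejecting names that are not three-part."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {name!r}")
    catalog, schema, table = parts
    return catalog, schema, table


print(split_qualified_name("prod.silver.messages"))

try:
    split_qualified_name("silver.messages")
except ValueError as error:
    print(error)
```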

2. Grant Permissions at Catalog Level (Not Table)

-- Good: Manage permissions at catalog level
GRANT SELECT ON CATALOG dev TO `developers`;

-- Avoid: Granting per-table (hard to maintain)
GRANT SELECT ON dev.bronze.messages TO `developers`;
GRANT SELECT ON dev.bronze.user_activity TO `developers`;
...

3. Use Service Principals for Automation

-- Good: Service principal owns resources
GRANT ALL PRIVILEGES ON CATALOG dev TO `03ff99cd-a352-40bb-9d33-414c9ad9e7aa`;

-- Bad: User account owns resources
GRANT ALL PRIVILEGES ON CATALOG dev TO `[email protected]`;

4. Document Catalog Purpose

CREATE CATALOG dev
COMMENT 'Development catalog for ML pipelines. Managed by ml-pipelines-dev service principal.';

COMMENT ON SCHEMA dev.bronze IS 'Raw data ingestion layer. Append-only tables.';

5. Tag Resources for Discovery

ALTER TABLE prod.gold.sentiment_features SET TAGS (
    'layer' = 'gold',
    'domain' = 'sentiment',
    'pii' = 'false',
    'owner' = 'ml-team'
);

Related Documentation

ARCHITECTURE.md - Overall architecture
Security Architecture - Permission model details
Data Flow Architecture - Data pipeline structure
Service Principals Guide - Authentication setup
Deployment Guide - Environment management
