Data Flow Architecture

Overview

This document provides detailed documentation of data flow through the ML Pipelines platform, covering the medallion architecture (bronze, silver, gold layers), data lineage, schema evolution, and performance considerations.

Pipeline Orchestration

The ML Pipelines platform uses automated orchestration to coordinate data processing from raw ingestion through final reporting. The Data Ingestion and Analysis Orchestration job executes a directed acyclic graph (DAG) of 8 tasks across 4 sequential stages.

Orchestration Schedule:

  • Development/Staging: Daily at 2:00 AM UTC

  • Production: Every 6 hours (00:00, 06:00, 12:00, 18:00 UTC)

Pipeline Execution Flow

Stage 1 (Parallel): Data Ingestion
├── bronze_data_ingestion (DLT)      → Bronze layer tables
└── neon_db_replication (Job)        → Bronze reference tables

Stage 2 (Parallel): Feature Extraction
├── emoji_analysis (DLT)              → Silver layer features
├── feature_analysis (DLT)            → Silver layer features
├── sentiment_analysis (DLT)          → Silver layer features
└── linguistic_analysis (DLT)         → Silver layer features

Stage 3: Aggregation
└── psychosocial_analysis (DLT)       → Gold layer aggregations

Stage 4: Reporting
└── risk_analysis_report (DLT)        → Report layer outputs

Key Characteristics:

  • Parallel execution within stages (Stage 1: 2 parallel tasks, Stage 2: 4 parallel tasks)

  • Sequential progression across stages (ensures data dependencies)

  • Automatic retry and timeout handling (2 retries per task, 30-minute timeouts)

  • Typical execution time: 20-33 minutes end-to-end

For detailed orchestration documentation, see the Orchestration Job Documentation.

Medallion Architecture

The platform implements the Delta Lake medallion architecture pattern with three layers:


Bronze Layer: Raw Ingestion

Purpose

The Bronze layer stores raw data exactly as received from source systems with minimal transformation. This provides a historical record and enables reprocessing if needed.

Data Sources

1. External Volumes (S3) - Event/Activity Data:

Ingestion Method: Streaming via Auto Loader (CloudFiles)

2. Neon PostgreSQL - Reference Data:

Reference data replicated from the app-web Neon PostgreSQL database via incremental batch sync:

Ingestion Method: Batch replication via JDBC with incremental sync (watermark-based)

  • Frequency: Every 2-6 hours (configurable per table)

  • PII Protection: Source queries exclude PII fields (name, email) - see ADR-004

  • Watermark Tracking: bronze.sync_watermarks table tracks last sync timestamp per table

Ingestion Pattern:
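The watermark-based incremental sync above can be sketched in plain Python. This is an illustrative stand-in for the real JDBC replication job; the `updated_at` column and in-memory stand-ins for `bronze.sync_watermarks` and the bronze table are assumptions, not the actual schema.

```python
from datetime import datetime

watermarks = {}    # stands in for bronze.sync_watermarks
bronze_users = {}  # stands in for a bronze reference table, keyed by id

def sync_table(table_name, source_rows):
    """Pull only rows changed since the last watermark, upsert them into
    the bronze copy, then advance the watermark."""
    last_sync = watermarks.get(table_name, datetime.min)
    changed = [r for r in source_rows if r["updated_at"] > last_sync]
    for row in changed:
        bronze_users[row["id"]] = row  # upsert by primary key
    if changed:
        watermarks[table_name] = max(r["updated_at"] for r in changed)
    return len(changed)

rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 2)},
]
sync_table("users", rows)              # first sync pulls both rows
rows.append({"id": 3, "updated_at": datetime(2024, 1, 3)})
second = sync_table("users", rows)     # incremental sync pulls only row 3
```

Because each run only reads rows newer than the stored watermark, re-running the sync is cheap and idempotent.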

Bronze Tables

Schema Structure:
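A hypothetical shape for a bronze row is the raw payload plus audit columns. The field names (`_source_file`, `_ingested_at`, `_rescued_data`) follow common Auto Loader conventions and are assumptions, not the platform's confirmed schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class BronzeRecord:
    raw_payload: str                     # record exactly as received (JSON text)
    _source_file: str                    # file the record came from (audit trail)
    _ingested_at: datetime               # ingestion timestamp (audit trail)
    _rescued_data: Optional[str] = None  # data that failed to parse, if any

rec = BronzeRecord(
    raw_payload='{"event": "message_sent"}',
    _source_file="s3://bucket/events/2024/01/01/part-000.json",
    _ingested_at=datetime(2024, 1, 1, 2, 0),
)
```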

Bronze Layer Characteristics

  • No data validation: Accept all data, even malformed

  • Append-only: Never delete or update

  • Schema evolution: Permit schema changes automatically

  • Audit trail: Track source files and ingestion time

  • Idempotent: Re-running produces the same result

Data Quality

Minimal quality checks at bronze:

  • File format validation (JSON, CSV parseable)

  • Required fields present (source file path)

  • Timestamp of ingestion captured


Silver Layer: Cleaned, Validated & Enriched with Model Predictions

Purpose

The Silver layer contains cleaned, validated, and deduplicated data with enforced schemas. This layer also includes model predictions (e.g., sentiment analysis, emoji detection, linguistic features) applied to individual records. This is the foundation for analytics and ML feature engineering.

Transformation Pipeline

Example 1: Basic Cleaning (Silver Messages):
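A minimal sketch of the cleaning step, in pure Python rather than the actual DLT code (field names are assumptions): drop rows missing required fields, trim text, keep only the enforced columns.

```python
def clean_messages(rows):
    """Drop invalid rows and normalize the remaining fields."""
    cleaned = []
    for row in rows:
        if not row.get("message_id") or row.get("text") is None:
            continue  # silver drops records that fail validation
        cleaned.append({
            "message_id": row["message_id"],
            "text": row["text"].strip(),
            "user_id": row.get("user_id"),
        })
    return cleaned

silver = clean_messages([
    {"message_id": "m1", "text": "  hello  ", "user_id": "u1"},
    {"message_id": None, "text": "orphaned"},  # dropped: no primary key
    {"message_id": "m2", "text": None},        # dropped: no text
])
```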

Example 2: Silver with Model Predictions (Sentiment Analysis):
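The per-record prediction pattern can be sketched as follows; `score_sentiment` is a toy stand-in for the real model or serving-endpoint call, and all names are illustrative:

```python
def score_sentiment(text):
    """Toy sentiment model: stands in for the real model inference."""
    positive = {"great", "love", "good"}
    hits = sum(word in positive for word in text.lower().split())
    return "positive" if hits else "neutral"

def add_sentiment(rows):
    # Silver pattern: each cleaned record gains a prediction column.
    return [{**row, "sentiment": score_sentiment(row["text"])} for row in rows]

scored = add_sentiment([
    {"message_id": "m1", "text": "I love this"},
    {"message_id": "m2", "text": "meeting at noon"},
])
```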

This demonstrates the key silver layer pattern: cleaned data + model predictions on individual records.

Silver Tables

Schema Structure (Enriched with Neon Reference Data):

Enrichment Pattern:
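The enrichment join can be sketched as a left join of silver events against the replicated Neon reference data on `user_id` (table and column names are assumptions):

```python
# Stands in for a bronze reference table replicated from Neon, keyed by user_id.
users = {"u1": {"cohort": "2024-A", "locale": "en"}}

def enrich(events, reference):
    """Left-join events with reference attributes; unknown users get nulls."""
    out = []
    for ev in events:
        ref = reference.get(ev["user_id"], {})
        out.append({**ev, "cohort": ref.get("cohort"), "locale": ref.get("locale")})
    return out

enriched = enrich(
    [{"message_id": "m1", "user_id": "u1"},
     {"message_id": "m2", "user_id": "u9"}],  # no matching reference row
    users,
)
```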

Data Quality Expectations

Silver layer enforces strict quality:

Expectation Types:

Deduplication Strategy

Primary Key Deduplication:
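Primary-key deduplication keeps only the latest version of each key (last write wins by an ordering column). A sketch, with illustrative column names:

```python
def dedupe_latest(rows, key="message_id", order="updated_at"):
    """Keep the row with the highest `order` value per `key`."""
    latest = {}
    for row in rows:
        cur = latest.get(row[key])
        if cur is None or row[order] > cur[order]:
            latest[row[key]] = row
    return list(latest.values())

deduped = dedupe_latest([
    {"message_id": "m1", "updated_at": 1, "text": "draft"},
    {"message_id": "m1", "updated_at": 2, "text": "final"},  # wins for m1
    {"message_id": "m2", "updated_at": 1, "text": "other"},
])
```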

Time-Window Deduplication (for near-duplicates):
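Near-duplicates (same user and text arriving within a short window) can be collapsed with a sliding time comparison. The 60-second window here is an assumption for illustration:

```python
def dedupe_window(rows, window_seconds=60):
    """Drop rows whose (user, text) signature was seen within the window."""
    rows = sorted(rows, key=lambda r: r["ts"])
    kept, last_seen = [], {}
    for row in rows:
        sig = (row["user_id"], row["text"])
        prev = last_seen.get(sig)
        if prev is None or row["ts"] - prev > window_seconds:
            kept.append(row)
        last_seen[sig] = row["ts"]
    return kept

kept = dedupe_window([
    {"user_id": "u1", "text": "hi", "ts": 0},
    {"user_id": "u1", "text": "hi", "ts": 5},    # duplicate within 60s: dropped
    {"user_id": "u1", "text": "hi", "ts": 120},  # far enough apart: kept
])
```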


Gold Layer: Business-Ready Aggregations

Purpose

The Gold layer contains business-ready datasets optimized for analytics, reporting, and ML training. Gold focuses on aggregations and multi-table joins that combine bronze, silver, and gold data to create trend analysis and business insights. Unlike silver (which adds predictions to individual records), gold combines data across time windows, entities, or dimensions.

Gold Tables

Example: Aggregated Metrics (Joining Silver Predictions):
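The gold pattern of rolling silver predictions up across a time dimension can be sketched as a per-user daily aggregation (column names are assumptions):

```python
from collections import defaultdict

def daily_sentiment(rows):
    """Aggregate per-record sentiment predictions to (user, day) metrics."""
    agg = defaultdict(lambda: {"messages": 0, "positive": 0})
    for row in rows:
        key = (row["user_id"], row["date"])
        agg[key]["messages"] += 1
        agg[key]["positive"] += row["sentiment"] == "positive"
    return {
        key: {**m, "positive_ratio": m["positive"] / m["messages"]}
        for key, m in agg.items()
    }

gold = daily_sentiment([
    {"user_id": "u1", "date": "2024-01-01", "sentiment": "positive"},
    {"user_id": "u1", "date": "2024-01-01", "sentiment": "neutral"},
])
```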

ML Feature Engineering

Feature Pipeline (Aggregating Multiple Silver Predictions):
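Feature engineering combines several silver prediction tables into one feature row per user. The feature names below are illustrative assumptions, not the platform's actual feature set:

```python
def build_features(user_id, sentiment_rows, emoji_rows, linguistic_rows):
    """Join per-record predictions from multiple silver tables into one
    per-user feature vector."""
    sent = [r for r in sentiment_rows if r["user_id"] == user_id]
    emo = [r for r in emoji_rows if r["user_id"] == user_id]
    ling = [r for r in linguistic_rows if r["user_id"] == user_id]
    return {
        "user_id": user_id,
        "positive_ratio": sum(r["sentiment"] == "positive" for r in sent) / max(len(sent), 1),
        "emoji_count": sum(r["emoji_count"] for r in emo),
        "avg_word_count": sum(r["word_count"] for r in ling) / max(len(ling), 1),
    }

features = build_features(
    "u1",
    [{"user_id": "u1", "sentiment": "positive"}],
    [{"user_id": "u1", "emoji_count": 3}],
    [{"user_id": "u1", "word_count": 12}],
)
```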

Note on AI Query Usage

ai_query calls AI models hosted on a model serving endpoint directly from Spark SQL, drastically reducing complexity and enabling efficient batch inference. The gold layer reads predictions from silver, joins them with raw metrics from bronze, and aggregates them:


Data Lineage

Pipeline Dependencies

The orchestration job coordinates the following pipeline dependencies:

Table Dependencies

Example lineage for message processing:
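An illustrative lineage graph for message processing, expressed as downstream edges (the table names are assumptions following this document's layer naming):

```python
lineage = {
    "bronze.messages": ["silver.messages"],
    "silver.messages": ["silver.sentiment_predictions"],
    "silver.sentiment_predictions": ["gold.daily_sentiment"],
    "gold.daily_sentiment": ["report.risk_analysis"],
}

def downstream(table, graph):
    """All tables reachable from `table`, i.e. its downstream dependents."""
    out, stack = set(), [table]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

deps = downstream("bronze.messages", lineage)
```

Walking the graph this way answers the impact-analysis question "what breaks downstream if this table changes?".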

Tracking Lineage

Unity Catalog Lineage:

  • Automatically tracked by Delta Live Tables

  • View in Databricks UI: Data Explorer → Table → Lineage tab

  • Shows upstream and downstream dependencies

Column-Level Lineage: Unity Catalog also captures column-level lineage automatically for supported operations, visible from the same Lineage tab.


Schema Evolution

Handling Schema Changes

Option 1: Schema Inference (Bronze):

Option 2: Explicit Schema (Silver):

Option 3: Schema Evolution with Merge Schema:
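Conceptually, merge-schema appends widen the table schema to the union of existing and incoming columns instead of failing; existing rows read back null for columns they never had. A pure-Python illustration of that behavior (not Delta Lake itself):

```python
def append_with_merge_schema(table_schema, table_rows, new_rows):
    """Append rows whose schema may have extra columns, widening the
    table schema to the union of old and new column sets."""
    incoming = set().union(*(row.keys() for row in new_rows))
    merged_schema = table_schema | incoming
    for row in new_rows:
        # Columns absent from a given row come back as None (null).
        table_rows.append({col: row.get(col) for col in merged_schema})
    return merged_schema

schema = {"message_id", "text"}
rows = []
schema = append_with_merge_schema(schema, rows, [
    {"message_id": "m1", "text": "hi", "language": "en"},  # new column
])
```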

Schema Migration Strategy

Step 1: Add Column (Non-Breaking):

Step 2: Deprecate Column (Breaking):

Step 3: Change Column Type (Breaking):


Data Quality Gates

Quality Framework

Three-Level Quality System:

  1. Bronze: Permissive (log failures, accept all)

  2. Silver: Enforcing (drop invalid records)

  3. Gold: Strict (fail pipeline on quality issues)

Expectation Patterns

Pattern 1: Required Fields:

Pattern 2: Value Ranges:

Pattern 3: Referential Integrity:

Pattern 4: Custom Validation:

Quarantine Tables

Failed records go to quarantine for analysis:
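The four expectation patterns and the quarantine flow can be sketched together: each expectation is a named predicate, and rows failing any check are routed to a quarantine table annotated with which checks failed. All names and thresholds here are assumptions:

```python
valid_users = {"u1"}  # stands in for the reference table used for integrity checks

EXPECTATIONS = {
    "required_fields": lambda r: bool(r.get("message_id") and r.get("user_id")),
    "score_in_range": lambda r: 0.0 <= r.get("score", 0.0) <= 1.0,
    "known_user": lambda r: r.get("user_id") in valid_users,
    "text_not_too_long": lambda r: len(r.get("text", "")) <= 10_000,
}

def validate(rows):
    """Split rows into passed and quarantined, recording failed checks."""
    passed, quarantine = [], []
    for row in rows:
        failures = [name for name, check in EXPECTATIONS.items() if not check(row)]
        if failures:
            quarantine.append({**row, "_failed_expectations": failures})
        else:
            passed.append(row)
    return passed, quarantine

passed, quarantined = validate([
    {"message_id": "m1", "user_id": "u1", "score": 0.8, "text": "ok"},
    {"message_id": "m2", "user_id": "u9", "score": 1.7, "text": "bad"},
])
```

Keeping the failed expectation names on each quarantined row makes later analysis (and reprocessing after a fix) straightforward.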


Performance Considerations

Partitioning Strategy

Time-Based Partitioning:

Z-Ordering (for non-partitioned queries):

Streaming Optimizations

Trigger Intervals:

Batch Size Control:

Query Performance

Bloom Filters (for point lookups):

Caching (for repeated queries):


Data Retention

Retention Policies

  • Bronze: 90 days (regulatory compliance)

  • Silver: 365 days (historical analysis)

  • Gold: 730 days (long-term trends)

Implementing Retention

Automated Cleanup Job:
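The core of the cleanup job is selecting partitions older than each layer's cutoff. A sketch of that selection logic (the actual deletion would be a Delta DELETE plus VACUUM, which is omitted here):

```python
from datetime import date, timedelta

RETENTION_DAYS = {"bronze": 90, "silver": 365, "gold": 730}

def expired_partitions(layer, partitions, today):
    """Return date partitions older than the layer's retention cutoff."""
    cutoff = today - timedelta(days=RETENTION_DAYS[layer])
    return [p for p in partitions if p < cutoff]

stale = expired_partitions(
    "bronze",
    [date(2024, 1, 1), date(2024, 5, 1)],
    today=date(2024, 6, 1),  # cutoff = 2024-03-03 for the 90-day bronze policy
)
```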


Monitoring Data Flow

Key Metrics

Pipeline Metrics:

  • Rows scanned per layer

  • Processing latency

  • Data freshness

  • Quality expectation failures

Query:

Alerting

Data Freshness Alert:
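A freshness alert fires when the newest ingested record is older than a lag threshold. A minimal sketch; the 8-hour threshold is an assumption chosen to sit above the 6-hour production schedule:

```python
from datetime import datetime, timedelta

def freshness_alert(last_ingested_at, now, max_lag=timedelta(hours=8)):
    """Return (alert_fired, observed_lag) for the newest ingested record."""
    lag = now - last_ingested_at
    return lag > max_lag, lag

stale, lag = freshness_alert(
    last_ingested_at=datetime(2024, 1, 1, 0, 0),
    now=datetime(2024, 1, 1, 12, 0),  # 12h lag exceeds the 8h threshold
)
```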

Quality Alert:

