Data Flow Architecture

Overview

This document provides detailed documentation of data flow through the ML Pipelines platform, covering the medallion architecture (bronze, silver, gold layers), data lineage, schema evolution, and performance considerations.

Pipeline Orchestration

The ML Pipelines platform uses automated orchestration to coordinate data processing from raw ingestion through final reporting. The Data Ingestion and Analysis Orchestration job executes a directed acyclic graph (DAG) of 8 tasks across 4 sequential stages.

Orchestration Schedule:

  • Development/Staging: Daily at 2:00 AM UTC

  • Production: Every 6 hours (00:00, 06:00, 12:00, 18:00 UTC)

Pipeline Execution Flow

Stage 1 (Parallel): Data Ingestion
├── bronze_data_ingestion (DLT)      → Bronze layer tables
└── neon_db_replication (Job)        → Bronze reference tables

Stage 2 (Parallel): Feature Extraction
├── emoji_analysis (DLT)              → Silver layer features
├── feature_analysis (DLT)            → Silver layer features
├── sentiment_analysis (DLT)          → Silver layer features
└── linguistic_analysis (DLT)         → Silver layer features

Stage 3: Aggregation
└── psychosocial_analysis (DLT)       → Gold layer aggregations

Stage 4: Reporting
└── risk_analysis_report (DLT)        → Report layer outputs

Key Characteristics:

  • Parallel execution within stages (Stage 1: 2 parallel tasks, Stage 2: 4 parallel tasks)

  • Sequential progression across stages (ensures data dependencies)

  • Automatic retry and timeout handling (2 retries per task, 30-minute timeouts)

  • Typical execution time: 20-33 minutes end-to-end

For detailed orchestration documentation, see the Orchestration Job Documentation.

Medallion Architecture

The platform implements the Delta Lake medallion architecture pattern with three layers:


Bronze Layer: Raw Ingestion

Purpose

The Bronze layer stores raw data exactly as received from source systems with minimal transformation. This provides a historical record and enables reprocessing if needed.

Data Sources

1. External Volumes (S3) - Event/Activity Data:

Ingestion Method: Streaming via Auto Loader (CloudFiles)

2. Neon PostgreSQL - Reference Data:

Reference data replicated from the app-web Neon PostgreSQL database via incremental batch sync:

Ingestion Method: Batch replication via JDBC with incremental sync (watermark-based)

  • Frequency: Every 2-6 hours (configurable per table)

  • PII Protection: Source queries exclude PII fields (name, email) - see ADR-004

  • Watermark Tracking: bronze.sync_watermarks table tracks last sync timestamp per table

Ingestion Pattern:
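The watermark-based incremental sync above can be sketched in plain Python. This is an illustrative stand-in for the real JDBC replication job; the `updated_at` column and in-memory stand-ins for `bronze.sync_watermarks` and the bronze table are assumptions, not the actual schema.

```python
from datetime import datetime

watermarks = {}    # stands in for bronze.sync_watermarks
bronze_users = {}  # stands in for a bronze reference table, keyed by id

def sync_table(table_name, source_rows):
    """Pull only rows changed since the last watermark, upsert them into
    the bronze copy, then advance the watermark."""
    last_sync = watermarks.get(table_name, datetime.min)
    changed = [r for r in source_rows if r["updated_at"] > last_sync]
    for row in changed:
        bronze_users[row["id"]] = row  # upsert by primary key
    if changed:
        watermarks[table_name] = max(r["updated_at"] for r in changed)
    return len(changed)

rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 2)},
]
sync_table("users", rows)              # first sync pulls both rows
rows.append({"id": 3, "updated_at": datetime(2024, 1, 3)})
second = sync_table("users", rows)     # incremental sync pulls only row 3
```

Because each run only reads rows newer than the stored watermark, re-running the sync is cheap and idempotent.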

Bronze Tables

Schema Structure:
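A hypothetical shape for a bronze row is the raw payload plus audit columns. The field names (`_source_file`, `_ingested_at`, `_rescued_data`) follow common Auto Loader conventions and are assumptions, not the platform's confirmed schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class BronzeRecord:
    raw_payload: str                     # record exactly as received (JSON text)
    _source_file: str                    # file the record came from (audit trail)
    _ingested_at: datetime               # ingestion timestamp (audit trail)
    _rescued_data: Optional[str] = None  # data that failed to parse, if any

rec = BronzeRecord(
    raw_payload='{"event": "message_sent"}',
    _source_file="s3://bucket/events/2024/01/01/part-000.json",
    _ingested_at=datetime(2024, 1, 1, 2, 0),
)
```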

Bronze Layer Characteristics

  • No data validation: Accept all data, even malformed

  • Append-only: Never delete or update

  • Schema evolution: Permit schema changes automatically

  • Audit trail: Track source files and ingestion time

  • Idempotent: Re-running produces the same result

Data Quality

Minimal quality checks at bronze:

  • File format validation (JSON, CSV parseable)

  • Required fields present (source file path)

  • Timestamp of ingestion captured


Silver Layer: Cleaned, Validated & Enriched with Model Predictions

Purpose

The Silver layer contains cleaned, validated, and deduplicated data with enforced schemas. This layer also includes model predictions (e.g., sentiment analysis, emoji detection, linguistic features) applied to individual records. This is the foundation for analytics and ML feature engineering.

Transformation Pipeline

Example 1: Basic Cleaning (Silver Messages):
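A minimal sketch of the cleaning step, in pure Python rather than the actual DLT code (field names are assumptions): drop rows missing required fields, trim text, keep only the enforced columns.

```python
def clean_messages(rows):
    """Drop invalid rows and normalize the remaining fields."""
    cleaned = []
    for row in rows:
        if not row.get("message_id") or row.get("text") is None:
            continue  # silver drops records that fail validation
        cleaned.append({
            "message_id": row["message_id"],
            "text": row["text"].strip(),
            "user_id": row.get("user_id"),
        })
    return cleaned

silver = clean_messages([
    {"message_id": "m1", "text": "  hello  ", "user_id": "u1"},
    {"message_id": None, "text": "orphaned"},  # dropped: no primary key
    {"message_id": "m2", "text": None},        # dropped: no text
])
```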

Example 2: Silver with Model Predictions (Sentiment Analysis):
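The per-record prediction pattern can be sketched as follows; `score_sentiment` is a toy stand-in for the real model or serving-endpoint call, and all names are illustrative:

```python
def score_sentiment(text):
    """Toy sentiment model: stands in for the real model inference."""
    positive = {"great", "love", "good"}
    hits = sum(word in positive for word in text.lower().split())
    return "positive" if hits else "neutral"

def add_sentiment(rows):
    # Silver pattern: each cleaned record gains a prediction column.
    return [{**row, "sentiment": score_sentiment(row["text"])} for row in rows]

scored = add_sentiment([
    {"message_id": "m1", "text": "I love this"},
    {"message_id": "m2", "text": "meeting at noon"},
])
```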

This demonstrates the key silver layer pattern: cleaned data + model predictions on individual records.

Silver Tables

Schema Structure (Enriched with Neon Reference Data):

Enrichment Pattern:
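The enrichment join can be sketched as a left join of silver events against the replicated Neon reference data on `user_id` (table and column names are assumptions):

```python
# Stands in for a bronze reference table replicated from Neon, keyed by user_id.
users = {"u1": {"cohort": "2024-A", "locale": "en"}}

def enrich(events, reference):
    """Left-join events with reference attributes; unknown users get nulls."""
    out = []
    for ev in events:
        ref = reference.get(ev["user_id"], {})
        out.append({**ev, "cohort": ref.get("cohort"), "locale": ref.get("locale")})
    return out

enriched = enrich(
    [{"message_id": "m1", "user_id": "u1"},
     {"message_id": "m2", "user_id": "u9"}],  # no matching reference row
    users,
)
```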

Data Quality Expectations

Silver layer enforces strict quality:

Expectation Types:

Deduplication Strategy

Primary Key Deduplication:
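Primary-key deduplication keeps only the latest version of each key (last write wins by an ordering column). A sketch, with illustrative column names:

```python
def dedupe_latest(rows, key="message_id", order="updated_at"):
    """Keep the row with the highest `order` value per `key`."""
    latest = {}
    for row in rows:
        cur = latest.get(row[key])
        if cur is None or row[order] > cur[order]:
            latest[row[key]] = row
    return list(latest.values())

deduped = dedupe_latest([
    {"message_id": "m1", "updated_at": 1, "text": "draft"},
    {"message_id": "m1", "updated_at": 2, "text": "final"},  # wins for m1
    {"message_id": "m2", "updated_at": 1, "text": "other"},
])
```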

Time-Window Deduplication (for near-duplicates):
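Near-duplicates (same user and text arriving within a short window) can be collapsed with a sliding time comparison. The 60-second window here is an assumption for illustration:

```python
def dedupe_window(rows, window_seconds=60):
    """Drop rows whose (user, text) signature was seen within the window."""
    rows = sorted(rows, key=lambda r: r["ts"])
    kept, last_seen = [], {}
    for row in rows:
        sig = (row["user_id"], row["text"])
        prev = last_seen.get(sig)
        if prev is None or row["ts"] - prev > window_seconds:
            kept.append(row)
        last_seen[sig] = row["ts"]
    return kept

kept = dedupe_window([
    {"user_id": "u1", "text": "hi", "ts": 0},
    {"user_id": "u1", "text": "hi", "ts": 5},    # duplicate within 60s: dropped
    {"user_id": "u1", "text": "hi", "ts": 120},  # far enough apart: kept
])
```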


Gold Layer: Business-Ready Aggregations

Purpose

The Gold layer contains business-ready datasets optimized for analytics, reporting, and ML training. Gold focuses on aggregations and multi-table joins that combine bronze, silver, and gold data to create trend analysis and business insights. Unlike silver (which adds predictions to individual records), gold combines data across time windows, entities, or dimensions.

Gold Tables

Example: Aggregated Metrics (Joining Silver Predictions):
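The gold pattern of rolling silver predictions up across a time dimension can be sketched as a per-user daily aggregation (column names are assumptions):

```python
from collections import defaultdict

def daily_sentiment(rows):
    """Aggregate per-record sentiment predictions to (user, day) metrics."""
    agg = defaultdict(lambda: {"messages": 0, "positive": 0})
    for row in rows:
        key = (row["user_id"], row["date"])
        agg[key]["messages"] += 1
        agg[key]["positive"] += row["sentiment"] == "positive"
    return {
        key: {**m, "positive_ratio": m["positive"] / m["messages"]}
        for key, m in agg.items()
    }

gold = daily_sentiment([
    {"user_id": "u1", "date": "2024-01-01", "sentiment": "positive"},
    {"user_id": "u1", "date": "2024-01-01", "sentiment": "neutral"},
])
```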

ML Feature Engineering

Feature Pipeline (Aggregating Multiple Silver Predictions):
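Feature engineering combines several silver prediction tables into one feature row per user. The feature names below are illustrative assumptions, not the platform's actual feature set:

```python
def build_features(user_id, sentiment_rows, emoji_rows, linguistic_rows):
    """Join per-record predictions from multiple silver tables into one
    per-user feature vector."""
    sent = [r for r in sentiment_rows if r["user_id"] == user_id]
    emo = [r for r in emoji_rows if r["user_id"] == user_id]
    ling = [r for r in linguistic_rows if r["user_id"] == user_id]
    return {
        "user_id": user_id,
        "positive_ratio": sum(r["sentiment"] == "positive" for r in sent) / max(len(sent), 1),
        "emoji_count": sum(r["emoji_count"] for r in emo),
        "avg_word_count": sum(r["word_count"] for r in ling) / max(len(ling), 1),
    }

features = build_features(
    "u1",
    [{"user_id": "u1", "sentiment": "positive"}],
    [{"user_id": "u1", "emoji_count": 3}],
    [{"user_id": "u1", "word_count": 12}],
)
```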

Note on AI Query Usage

ai_query calls AI models hosted on a model serving endpoint directly from Spark SQL, drastically reducing complexity and enabling efficient batch inference. The gold layer reads predictions from silver, joins them with raw metrics from bronze, and aggregates them:


Data Lineage

Pipeline Dependencies

The orchestration job coordinates the following pipeline dependencies:

Table Dependencies

Example lineage for message processing:
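An illustrative lineage graph for message processing, expressed as downstream edges (the table names are assumptions following this document's layer naming):

```python
lineage = {
    "bronze.messages": ["silver.messages"],
    "silver.messages": ["silver.sentiment_predictions"],
    "silver.sentiment_predictions": ["gold.daily_sentiment"],
    "gold.daily_sentiment": ["report.risk_analysis"],
}

def downstream(table, graph):
    """All tables reachable from `table`, i.e. its downstream dependents."""
    out, stack = set(), [table]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

deps = downstream("bronze.messages", lineage)
```

Walking the graph this way answers the impact-analysis question "what breaks downstream if this table changes?".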

Tracking Lineage

Unity Catalog Lineage:

  • Automatically tracked by Delta Live Tables

  • View in Databricks UI: Data Explorer → Table → Lineage tab

  • Shows upstream and downstream dependencies

Column-Level Lineage: Unity Catalog also captures column-level lineage automatically for supported operations, visible from the same Lineage tab.


Schema Evolution

Handling Schema Changes

Option 1: Schema Inference (Bronze):

Option 2: Explicit Schema (Silver):

Option 3: Schema Evolution with Merge Schema:
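Conceptually, merge-schema appends widen the table schema to the union of existing and incoming columns instead of failing; existing rows read back null for columns they never had. A pure-Python illustration of that behavior (not Delta Lake itself):

```python
def append_with_merge_schema(table_schema, table_rows, new_rows):
    """Append rows whose schema may have extra columns, widening the
    table schema to the union of old and new column sets."""
    incoming = set().union(*(row.keys() for row in new_rows))
    merged_schema = table_schema | incoming
    for row in new_rows:
        # Columns absent from a given row come back as None (null).
        table_rows.append({col: row.get(col) for col in merged_schema})
    return merged_schema

schema = {"message_id", "text"}
rows = []
schema = append_with_merge_schema(schema, rows, [
    {"message_id": "m1", "text": "hi", "language": "en"},  # new column
])
```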

Schema Migration Strategy

Step 1: Add Column (Non-Breaking):

Step 2: Deprecate Column (Breaking):

Step 3: Change Column Type (Breaking):


Data Quality Gates

Quality Framework

Three-Level Quality System:

  1. Bronze: Permissive (log failures, accept all)

  2. Silver: Enforcing (drop invalid records)

  3. Gold: Strict (fail pipeline on quality issues)

Expectation Patterns

Pattern 1: Required Fields:

Pattern 2: Value Ranges:

Pattern 3: Referential Integrity:

Pattern 4: Custom Validation:

Quarantine Tables

Failed records go to quarantine for analysis:
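The four expectation patterns and the quarantine flow can be sketched together: each expectation is a named predicate, and rows failing any check are routed to a quarantine table annotated with which checks failed. All names and thresholds here are assumptions:

```python
valid_users = {"u1"}  # stands in for the reference table used for integrity checks

EXPECTATIONS = {
    "required_fields": lambda r: bool(r.get("message_id") and r.get("user_id")),
    "score_in_range": lambda r: 0.0 <= r.get("score", 0.0) <= 1.0,
    "known_user": lambda r: r.get("user_id") in valid_users,
    "text_not_too_long": lambda r: len(r.get("text", "")) <= 10_000,
}

def validate(rows):
    """Split rows into passed and quarantined, recording failed checks."""
    passed, quarantine = [], []
    for row in rows:
        failures = [name for name, check in EXPECTATIONS.items() if not check(row)]
        if failures:
            quarantine.append({**row, "_failed_expectations": failures})
        else:
            passed.append(row)
    return passed, quarantine

passed, quarantined = validate([
    {"message_id": "m1", "user_id": "u1", "score": 0.8, "text": "ok"},
    {"message_id": "m2", "user_id": "u9", "score": 1.7, "text": "bad"},
])
```

Keeping the failed expectation names on each quarantined row makes later analysis (and reprocessing after a fix) straightforward.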


Performance Considerations

Partitioning Strategy

Time-Based Partitioning:

Z-Ordering (for non-partitioned queries):

Streaming Optimizations

Trigger Intervals:

Batch Size Control:

Query Performance

Bloom Filters (for point lookups):

Caching (for repeated queries):


Data Retention

Retention Policies

  • Bronze: 90 days (regulatory compliance)

  • Silver: 365 days (historical analysis)

  • Gold: 730 days (long-term trends)

Implementing Retention

Automated Cleanup Job:
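The core of the cleanup job is selecting partitions older than each layer's cutoff. A sketch of that selection logic (the actual deletion would be a Delta DELETE plus VACUUM, which is omitted here):

```python
from datetime import date, timedelta

RETENTION_DAYS = {"bronze": 90, "silver": 365, "gold": 730}

def expired_partitions(layer, partitions, today):
    """Return date partitions older than the layer's retention cutoff."""
    cutoff = today - timedelta(days=RETENTION_DAYS[layer])
    return [p for p in partitions if p < cutoff]

stale = expired_partitions(
    "bronze",
    [date(2024, 1, 1), date(2024, 5, 1)],
    today=date(2024, 6, 1),  # cutoff = 2024-03-03 for the 90-day bronze policy
)
```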


Monitoring Data Flow

Key Metrics

Pipeline Metrics:

  • Rows scanned per layer

  • Processing latency

  • Data freshness

  • Quality expectation failures

Query:

Alerting

Data Freshness Alert:
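A freshness alert fires when the newest ingested record is older than a lag threshold. A minimal sketch; the 8-hour threshold is an assumption chosen to sit above the 6-hour production schedule:

```python
from datetime import datetime, timedelta

def freshness_alert(last_ingested_at, now, max_lag=timedelta(hours=8)):
    """Return (alert_fired, observed_lag) for the newest ingested record."""
    lag = now - last_ingested_at
    return lag > max_lag, lag

stale, lag = freshness_alert(
    last_ingested_at=datetime(2024, 1, 1, 0, 0),
    now=datetime(2024, 1, 1, 12, 0),  # 12h lag exceeds the 8h threshold
)
```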

Quality Alert:

