Monitoring and Observability

Overview

This guide covers monitoring and observability for the ML Pipelines project. Effective monitoring enables early detection of issues, supports performance optimization, and helps maintain production reliability.

Monitoring Philosophy

  1. Monitor outcomes, not just outputs: Track business metrics, not just system metrics

  2. Alert on actionable issues: Only alert when human intervention is needed

  3. Use dashboards for exploration: Build dashboards for ad-hoc investigation

  4. Log for debugging: Comprehensive logs enable root cause analysis

  5. Monitor the full stack: Data → Pipelines → Models → Predictions

Databricks Monitoring Overview

Built-in Monitoring

Databricks provides several monitoring capabilities:

  • DLT System Tables: Pipeline execution metrics and data quality

  • Job Run History: Job success/failure rates and durations

  • Model Serving Metrics: Endpoint latency and throughput

  • Unity Catalog Audit Logs: Access and change tracking

  • Spark UI: Query performance and resource usage


Pipeline Monitoring

DLT System Tables

Delta Live Tables automatically creates system tables with metrics for all pipeline runs.

Available System Tables:

  • system.events: All pipeline events (starts, completions, errors)

  • system.expectations: Data quality check results

  • system.flow_progress: Row-level processing progress

  • system.pipeline_updates: Pipeline update history

Monitoring Pipeline Health

Query Pipeline Success Rate:
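A minimal sketch in Python, runnable in a Databricks notebook where spark and display are provided by the runtime. The table name system.events comes from the list above; the event_type and timestamp columns are assumptions, so adjust them to match your event log schema:

```python
# Daily pipeline success rate over the last 7 days.
# Column names (event_type, timestamp) are assumptions; adjust as needed.
success_rate = spark.sql("""
    SELECT
        date_trunc('day', timestamp) AS day,
        count_if(event_type = 'update_completed') AS succeeded,
        count_if(event_type = 'update_failed')    AS failed,
        round(100.0 * count_if(event_type = 'update_completed') / count(*), 2)
            AS success_rate_pct
    FROM system.events
    WHERE event_type IN ('update_completed', 'update_failed')
      AND timestamp >= current_date() - INTERVAL 7 DAYS
    GROUP BY 1
    ORDER BY 1 DESC
""")
display(success_rate)
```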

Monitor Pipeline Duration:
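A sketch over the same assumed event log, treating the span between an update's first and last events as its runtime (update_id is an assumed column):

```python
# Duration per pipeline update, derived from first/last event timestamps.
durations = spark.sql("""
    SELECT
        update_id,
        min(timestamp) AS started_at,
        max(timestamp) AS finished_at,
        (unix_timestamp(max(timestamp)) - unix_timestamp(min(timestamp))) / 60
            AS duration_minutes
    FROM system.events
    WHERE timestamp >= current_date() - INTERVAL 7 DAYS
    GROUP BY update_id
    ORDER BY started_at DESC
""")
display(durations)
```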

Monitor Data Quality (Expectations):
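A sketch against system.expectations, assuming one row per expectation evaluation; the passed_records, failed_records, and evaluated_at columns are assumptions:

```python
# Pass rate per dataset and expectation over the last 7 days.
quality = spark.sql("""
    SELECT
        dataset,
        expectation_name,
        sum(passed_records) AS passed,
        sum(failed_records) AS failed,
        round(100.0 * sum(passed_records)
              / nullif(sum(passed_records) + sum(failed_records), 0), 2)
            AS pass_rate_pct
    FROM system.expectations
    WHERE evaluated_at >= current_date() - INTERVAL 7 DAYS
    GROUP BY dataset, expectation_name
    ORDER BY pass_rate_pct ASC
""")
display(quality)
```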

Alerting on Pipeline Failures

Recommended Alerts (query sketches below):

  1. Pipeline Failure Alert: fire whenever a production pipeline update fails

  2. Data Quality Alert: fire when expectation failures exceed the allowed threshold

  3. Pipeline Duration Alert: fire when a run takes more than twice its normal duration
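Minimal sketches of the three alert queries, reusing the assumed schemas from the examples above. Each query returns rows only when its condition holds, which pairs with the "Results returned" trigger described under Alert Configuration:

```python
# 1. Pipeline Failure Alert: any failed update in the last hour.
pipeline_failures = spark.sql("""
    SELECT update_id, timestamp, message
    FROM system.events
    WHERE event_type = 'update_failed'
      AND timestamp >= current_timestamp() - INTERVAL 1 HOUR
""")

# 2. Data Quality Alert: any expectation with failures in the last hour.
quality_breaches = spark.sql("""
    SELECT dataset, expectation_name, sum(failed_records) AS failed
    FROM system.expectations
    WHERE evaluated_at >= current_timestamp() - INTERVAL 1 HOUR
    GROUP BY dataset, expectation_name
    HAVING sum(failed_records) > 0
""")

# 3. Pipeline Duration Alert: runs exceeding ~2x the 30-minute target.
slow_runs = spark.sql("""
    SELECT update_id,
           (unix_timestamp(max(timestamp)) - unix_timestamp(min(timestamp))) / 60
               AS duration_minutes
    FROM system.events
    WHERE timestamp >= current_date() - INTERVAL 1 DAY
    GROUP BY update_id
    HAVING (unix_timestamp(max(timestamp)) - unix_timestamp(min(timestamp))) / 60 > 60
""")
```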


Job Monitoring

Job Run Metrics

Query Job Success Rate:
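A sketch using the jobs system table, assuming system.lakeflow.job_run_timeline is enabled in the workspace; column names such as result_state and period_end_time may vary by release:

```python
# Per-job success rate over the last 7 days.
job_success = spark.sql("""
    SELECT
        job_id,
        count_if(result_state = 'SUCCEEDED')  AS succeeded,
        count_if(result_state != 'SUCCEEDED') AS failed,
        round(100.0 * count_if(result_state = 'SUCCEEDED') / count(*), 2)
            AS success_rate_pct
    FROM system.lakeflow.job_run_timeline
    WHERE period_end_time >= current_date() - INTERVAL 7 DAYS
      AND result_state IS NOT NULL
    GROUP BY job_id
""")
display(job_success)
```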

Monitor Job Duration:
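A companion sketch over the same assumed table, deriving run duration from the period start/end timestamps:

```python
# Run durations from the assumed jobs table, longest first.
job_durations = spark.sql("""
    SELECT
        job_id,
        run_id,
        (unix_timestamp(period_end_time) - unix_timestamp(period_start_time)) / 60
            AS duration_minutes
    FROM system.lakeflow.job_run_timeline
    WHERE period_end_time >= current_date() - INTERVAL 7 DAYS
    ORDER BY duration_minutes DESC
""")
display(job_durations)
```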


Model Performance Monitoring

Model Serving Metrics

Monitor Endpoint Latency:
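A sketch assuming request logs land in an inference table; the name ml.serving.inference_log and the request_time and execution_time_ms columns are hypothetical (if you enable Databricks inference tables on the endpoint, substitute its actual name and columns):

```python
# Hourly latency percentiles over the last 24 hours.
latency = spark.sql("""
    SELECT
        date_trunc('hour', request_time)    AS hour,
        percentile(execution_time_ms, 0.50) AS p50_ms,
        percentile(execution_time_ms, 0.95) AS p95_ms,
        percentile(execution_time_ms, 0.99) AS p99_ms
    FROM ml.serving.inference_log
    WHERE request_time >= current_timestamp() - INTERVAL 24 HOURS
    GROUP BY 1
    ORDER BY 1 DESC
""")
display(latency)
```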

Monitor Error Rate:
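Error rate from the same hypothetical inference log, assuming an HTTP-style status_code column:

```python
# Hourly error rate: share of requests returning 4xx/5xx responses.
error_rate = spark.sql("""
    SELECT
        date_trunc('hour', request_time) AS hour,
        round(100.0 * count_if(status_code >= 400) / count(*), 2)
            AS error_rate_pct
    FROM ml.serving.inference_log
    WHERE request_time >= current_timestamp() - INTERVAL 24 HOURS
    GROUP BY 1
    ORDER BY 1 DESC
""")
display(error_rate)
```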

Model Prediction Quality

Monitor Prediction Distribution:
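A sketch assuming predictions are logged to a Delta table; ml.monitoring.predictions and its prediction and prediction_time columns are hypothetical names:

```python
# Share of each predicted class over the last 7 days.
distribution = spark.sql("""
    SELECT
        prediction,
        count(*) AS n,
        round(100.0 * count(*) / sum(count(*)) OVER (), 2) AS pct
    FROM ml.monitoring.predictions
    WHERE prediction_time >= current_date() - INTERVAL 7 DAYS
    GROUP BY prediction
    ORDER BY n DESC
""")
display(distribution)
```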

Monitor Prediction Confidence:
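Confidence tracking over the same hypothetical table, assuming a confidence column; the 0.60 cutoff mirrors the alert threshold in the metrics tables below:

```python
# Daily average confidence plus a count of low-confidence predictions.
confidence = spark.sql("""
    SELECT
        date_trunc('day', prediction_time) AS day,
        avg(confidence)                    AS avg_confidence,
        count_if(confidence < 0.60)        AS low_confidence_count
    FROM ml.monitoring.predictions
    WHERE prediction_time >= current_date() - INTERVAL 30 DAYS
    GROUP BY 1
    ORDER BY 1 DESC
""")
display(confidence)
```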

Detect Data Drift:
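A deliberately simple drift check comparing a feature's recent mean against a stored training baseline. The table names, the baseline layout (feature, mean, std), and feature_x are all hypothetical; production systems might use PSI, KS tests, or Databricks Lakehouse Monitoring instead:

```python
from pyspark.sql import functions as F

# Recent statistics for one feature (hypothetical table and column names).
recent = (
    spark.table("ml.monitoring.predictions")
    .where(F.expr("prediction_time >= current_date() - INTERVAL 7 DAYS"))
    .agg(F.mean("feature_x").alias("mean"), F.stddev("feature_x").alias("std"))
)

# Baseline statistics recorded at training time (hypothetical layout).
baseline = spark.table("ml.monitoring.training_baseline")
b = baseline.where(F.col("feature") == "feature_x").first()
r = recent.first()

# Flag drift when the recent mean moves more than 3 baseline std devs.
drift = abs(r["mean"] - b["mean"]) / max(b["std"], 1e-9)
if drift > 3:
    print(f"Drift detected on feature_x: {drift:.2f} std devs from baseline")
```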


Cost Monitoring

Tracking Compute Costs

Monitor DBU Usage (Databricks UI):

  1. Navigate to Account Console

  2. Click Usage → Billable Usage

  3. Filter by workspace, cluster, or job

Query Cluster Usage:
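A sketch against the billing system table; system.billing.usage is available when system tables are enabled, but the usage_metadata fields vary by record type, so treat the column references as assumptions:

```python
# DBU consumption per cluster over the last 30 days.
usage = spark.sql("""
    SELECT
        usage_metadata.cluster_id AS cluster_id,
        sku_name,
        sum(usage_quantity)       AS dbus
    FROM system.billing.usage
    WHERE usage_date >= current_date() - INTERVAL 30 DAYS
    GROUP BY usage_metadata.cluster_id, sku_name
    ORDER BY dbus DESC
""")
display(usage)
```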

Cost Optimization Opportunities

  1. Terminate idle clusters: Set auto-termination to 10-30 minutes

  2. Right-size clusters: Monitor CPU/memory utilization in Spark UI

  3. Use spot instances: Enable spot instances for non-critical workloads

  4. Optimize file sizes: Compact small files in Delta tables

  5. Cache frequent queries: Use Delta cache for hot data

Identify Expensive Jobs:
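The same assumed billing table, grouped by the job that incurred the usage:

```python
# Top 10 jobs by DBU consumption over the last 30 days.
expensive_jobs = spark.sql("""
    SELECT
        usage_metadata.job_id AS job_id,
        sum(usage_quantity)   AS dbus
    FROM system.billing.usage
    WHERE usage_date >= current_date() - INTERVAL 30 DAYS
      AND usage_metadata.job_id IS NOT NULL
    GROUP BY usage_metadata.job_id
    ORDER BY dbus DESC
    LIMIT 10
""")
display(expensive_jobs)
```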


Log Aggregation

Centralized Logging Strategy

Current State: Logs are stored within Databricks, scoped to each individual job or pipeline.

Recommended Approach (Future Enhancement):

  • Export logs to S3 for long-term retention

  • Use AWS CloudWatch or similar for centralized log aggregation

  • Set up log parsing and indexing for search

Accessing Logs

DLT Pipeline Logs:

  • UI: Workflows → Delta Live Tables → [Pipeline] → Event Log

  • CLI: databricks pipelines list-updates <pipeline_id>

Job Logs:

  • UI: Workflows → Jobs → [Job] → [Run] → Logs

  • CLI: databricks jobs get-run-output --run-id <run_id>

Model Serving Logs:

  • UI: Machine Learning → Model Serving → [Endpoint] → Logs

Spark Logs:

  • UI: Job Run → Spark UI → Executors → Logs


Alert Configuration

Alerting Strategy

Alert Levels:

  1. Critical: Immediate action required (page on-call)

    • Production pipeline failure

    • Model serving endpoint down

    • Data corruption detected

  2. Warning: Investigate soon (Slack notification)

    • Pipeline duration > 2x normal

    • Error rate > 5%

    • Data quality expectations failing

  3. Info: For awareness (email digest)

    • Deployment completed

    • New model version deployed

    • Weekly performance summary

Setting Up Alerts (Databricks SQL)

Create Alert (Databricks UI):

  1. Navigate to SQL → Alerts

  2. Click Create Alert

  3. Define query and trigger conditions

  4. Configure notification channels (email, Slack, PagerDuty)

Example Alert Query:
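A sketch of a failure-detection query, reusing the assumed system.events schema from earlier; paste the SQL into the alert editor, optionally previewing it in a notebook first:

```python
# Returns a row per production pipeline failure in the last hour, so the
# "Results returned" condition below fires only when something failed.
ALERT_QUERY = """
    SELECT update_id, timestamp, message
    FROM system.events
    WHERE event_type = 'update_failed'
      AND timestamp >= current_timestamp() - INTERVAL 1 HOUR
"""

# Preview the results before saving the query in the alert UI.
display(spark.sql(ALERT_QUERY))
```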

Trigger Condition: "Results returned" (alert if query returns any rows)

Notification: Send to #ml-alerts Slack channel


Dashboard Creation

1. Pipeline Health Dashboard

Metrics:

  • Pipeline success rate (last 7 days)

  • Average pipeline duration

  • Data quality metrics (expectation failures)

  • Row counts processed

  • Error messages (last 10 failures)

Query Examples:
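Two sketches for dashboard widgets, again assuming the system.events schema used throughout this guide: a 7-day success-rate trend and the ten most recent failure messages:

```python
# Widget 1: daily success-rate trend for the last 7 days.
trend = spark.sql("""
    SELECT
        date_trunc('day', timestamp) AS day,
        round(100.0 * count_if(event_type = 'update_completed') / count(*), 2)
            AS success_rate_pct
    FROM system.events
    WHERE event_type IN ('update_completed', 'update_failed')
      AND timestamp >= current_date() - INTERVAL 7 DAYS
    GROUP BY 1
    ORDER BY 1
""")

# Widget 2: the ten most recent failure messages.
recent_errors = spark.sql("""
    SELECT timestamp, update_id, message
    FROM system.events
    WHERE event_type = 'update_failed'
    ORDER BY timestamp DESC
    LIMIT 10
""")

display(trend)
display(recent_errors)
```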

2. Model Performance Dashboard

Metrics:

  • Request count and throughput

  • Latency percentiles (p50, p95, p99)

  • Error rate

  • Prediction distribution

  • Confidence scores

  • Data drift detection

3. Cost Dashboard

Metrics:

  • Total DBU usage by job

  • Cluster utilization

  • Most expensive jobs

  • Cost trends over time


Key Metrics to Track

Pipeline Metrics

| Metric       | Description                     | Target  | Alert Threshold |
|--------------|---------------------------------|---------|-----------------|
| Success Rate | % of successful pipeline runs   | >99%    | <95%            |
| Duration     | Average pipeline execution time | <30 min | >60 min         |
| Data Quality | % of rows passing expectations  | >95%    | <90%            |
| Throughput   | Rows processed per minute       | >1000   | <500            |

Model Metrics

| Metric        | Description                   | Target | Alert Threshold |
|---------------|-------------------------------|--------|-----------------|
| Latency (p95) | 95th percentile response time | <500ms | >1000ms         |
| Error Rate    | % of failed predictions       | <1%    | >5%             |
| Throughput    | Predictions per second        | >50    | <10             |
| Confidence    | Average prediction confidence | >0.80  | <0.60           |

Data Metrics

| Metric         | Description                     | Target  | Alert Threshold |
|----------------|---------------------------------|---------|-----------------|
| Null Rate      | % of null values in key columns | <1%     | >5%             |
| Duplicate Rate | % of duplicate records          | 0%      | >1%             |
| Freshness      | Time since last data update     | <1 hour | >4 hours        |
| Volume         | Daily row count                 | Stable  | >20% change     |


Production Monitoring Checklist

Before deploying to production, ensure:

  • Pipeline failure, data quality, and duration alerts are configured and routed to the right channels

  • Pipeline health, model performance, and cost dashboards are in place

  • System tables are enabled and the team has permission to query them

  • Notification channels (email, Slack, PagerDuty) have been tested end to end

  • Model predictions are logged so prediction quality and data drift can be monitored

  • Log access and retention meet the team's debugging needs


Monitoring Tools and Integrations

Databricks Native Tools

  • Databricks SQL: Ad-hoc queries and dashboards

  • System Tables: DLT metrics and audit logs

  • Spark UI: Performance debugging

  • MLflow: Model tracking and versioning

External Integrations

  • Slack: Real-time alert notifications

  • PagerDuty: On-call incident management

  • DataDog / New Relic: APM and infrastructure monitoring

  • AWS CloudWatch: Log aggregation and metrics

  • Grafana: Custom dashboards and visualization

Integration Examples

Slack Webhook Alert:
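A minimal sketch posting an alert to the #ml-alerts channel through a Slack incoming webhook; the webhook URL is a placeholder and should live in a Databricks secret scope rather than in code:

```python
import requests

# Placeholder: fetch the real URL from a secret scope, e.g.
# dbutils.secrets.get("ml-pipelines", "slack-webhook").
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."

def send_slack_alert(message: str) -> None:
    """Post a plain-text alert message to Slack and fail loudly on errors."""
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
    resp.raise_for_status()

send_slack_alert(":rotating_light: Production pipeline failed - see event log")
```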


Troubleshooting Monitoring Issues

Alerts Not Triggering

Symptoms: Expected alert didn't fire

Debug Steps:

  1. Verify alert query returns results

  2. Check alert trigger condition

  3. Verify notification channel configured

  4. Check alert schedule (is it paused?)

Dashboard Loading Slowly

Symptoms: Dashboard takes >30s to load

Optimizations:

  • Add date filters to queries (e.g., last 7 days)

  • Use aggregations instead of row-level data

  • Cache dashboard results (if available)

  • Optimize underlying Delta tables (OPTIMIZE, Z-ORDER)

Missing Metrics

Symptoms: Metrics not appearing in system tables

Causes:

  • System tables not enabled for workspace

  • Insufficient permissions to read system tables

  • Metrics not available for older pipeline versions

Solution: Enable system tables in the workspace settings and verify that you have permission to read them

