Monitoring and Observability

Overview

This guide covers monitoring and observability for the ML Pipelines project. Effective monitoring enables early detection of issues, supports performance optimization, and helps maintain production reliability.

Monitoring Philosophy

  1. Monitor outcomes, not just outputs: Track business metrics, not just system metrics

  2. Alert on actionable issues: Only alert when human intervention is needed

  3. Use dashboards for exploration: Build dashboards for ad-hoc investigation

  4. Log for debugging: Comprehensive logs enable root cause analysis

  5. Monitor the full stack: Data → Pipelines → Models → Predictions

Databricks Monitoring Overview

Built-in Monitoring

Databricks provides several monitoring capabilities:

  • DLT System Tables: Pipeline execution metrics and data quality

  • Job Run History: Job success/failure rates and durations

  • Model Serving Metrics: Endpoint latency and throughput

  • Unity Catalog Audit Logs: Access and change tracking

  • Spark UI: Query performance and resource usage


Pipeline Monitoring

DLT System Tables

Delta Live Tables automatically creates system tables with metrics for all pipeline runs.

Available System Tables:

  • system.events: All pipeline events (starts, completions, errors)

  • system.expectations: Data quality check results

  • system.flow_progress: Row-level processing progress

  • system.pipeline_updates: Pipeline update history

Monitoring Pipeline Health

Query Pipeline Success Rate:
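A minimal sketch in Python, runnable in a Databricks notebook where spark and display are provided by the runtime. The table name system.events comes from the list above; the event_type and timestamp columns are assumptions, so adjust them to match your event log schema:

```python
# Daily pipeline success rate over the last 7 days.
# Column names (event_type, timestamp) are assumptions; adjust as needed.
success_rate = spark.sql("""
    SELECT
        date_trunc('day', timestamp) AS day,
        count_if(event_type = 'update_completed') AS succeeded,
        count_if(event_type = 'update_failed')    AS failed,
        round(100.0 * count_if(event_type = 'update_completed') / count(*), 2)
            AS success_rate_pct
    FROM system.events
    WHERE event_type IN ('update_completed', 'update_failed')
      AND timestamp >= current_date() - INTERVAL 7 DAYS
    GROUP BY 1
    ORDER BY 1 DESC
""")
display(success_rate)
```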

Monitor Pipeline Duration:
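A sketch over the same assumed event log, treating the span between an update's first and last events as its runtime (update_id is an assumed column):

```python
# Duration per pipeline update, derived from first/last event timestamps.
durations = spark.sql("""
    SELECT
        update_id,
        min(timestamp) AS started_at,
        max(timestamp) AS finished_at,
        (unix_timestamp(max(timestamp)) - unix_timestamp(min(timestamp))) / 60
            AS duration_minutes
    FROM system.events
    WHERE timestamp >= current_date() - INTERVAL 7 DAYS
    GROUP BY update_id
    ORDER BY started_at DESC
""")
display(durations)
```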

Monitor Data Quality (Expectations):
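A sketch against system.expectations, assuming one row per expectation evaluation; the passed_records, failed_records, and evaluated_at columns are assumptions:

```python
# Pass rate per dataset and expectation over the last 7 days.
quality = spark.sql("""
    SELECT
        dataset,
        expectation_name,
        sum(passed_records) AS passed,
        sum(failed_records) AS failed,
        round(100.0 * sum(passed_records)
              / nullif(sum(passed_records) + sum(failed_records), 0), 2)
            AS pass_rate_pct
    FROM system.expectations
    WHERE evaluated_at >= current_date() - INTERVAL 7 DAYS
    GROUP BY dataset, expectation_name
    ORDER BY pass_rate_pct ASC
""")
display(quality)
```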

Alerting on Pipeline Failures

Recommended Alerts (query sketches below):

  1. Pipeline Failure Alert: fire whenever a production pipeline update fails

  2. Data Quality Alert: fire when expectation failures exceed the allowed threshold

  3. Pipeline Duration Alert: fire when a run takes more than twice its normal duration
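Minimal sketches of the three alert queries, reusing the assumed schemas from the examples above. Each query returns rows only when its condition holds, which pairs with the "Results returned" trigger described under Alert Configuration:

```python
# 1. Pipeline Failure Alert: any failed update in the last hour.
pipeline_failures = spark.sql("""
    SELECT update_id, timestamp, message
    FROM system.events
    WHERE event_type = 'update_failed'
      AND timestamp >= current_timestamp() - INTERVAL 1 HOUR
""")

# 2. Data Quality Alert: any expectation with failures in the last hour.
quality_breaches = spark.sql("""
    SELECT dataset, expectation_name, sum(failed_records) AS failed
    FROM system.expectations
    WHERE evaluated_at >= current_timestamp() - INTERVAL 1 HOUR
    GROUP BY dataset, expectation_name
    HAVING sum(failed_records) > 0
""")

# 3. Pipeline Duration Alert: runs exceeding ~2x the 30-minute target.
slow_runs = spark.sql("""
    SELECT update_id,
           (unix_timestamp(max(timestamp)) - unix_timestamp(min(timestamp))) / 60
               AS duration_minutes
    FROM system.events
    WHERE timestamp >= current_date() - INTERVAL 1 DAY
    GROUP BY update_id
    HAVING (unix_timestamp(max(timestamp)) - unix_timestamp(min(timestamp))) / 60 > 60
""")
```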


Job Monitoring

Job Run Metrics

Query Job Success Rate:
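A sketch using the jobs system table, assuming system.lakeflow.job_run_timeline is enabled in the workspace; column names such as result_state and period_end_time may vary by release:

```python
# Per-job success rate over the last 7 days.
job_success = spark.sql("""
    SELECT
        job_id,
        count_if(result_state = 'SUCCEEDED')  AS succeeded,
        count_if(result_state != 'SUCCEEDED') AS failed,
        round(100.0 * count_if(result_state = 'SUCCEEDED') / count(*), 2)
            AS success_rate_pct
    FROM system.lakeflow.job_run_timeline
    WHERE period_end_time >= current_date() - INTERVAL 7 DAYS
      AND result_state IS NOT NULL
    GROUP BY job_id
""")
display(job_success)
```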

Monitor Job Duration:
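A companion sketch over the same assumed table, deriving run duration from the period start/end timestamps:

```python
# Run durations from the assumed jobs table, longest first.
job_durations = spark.sql("""
    SELECT
        job_id,
        run_id,
        (unix_timestamp(period_end_time) - unix_timestamp(period_start_time)) / 60
            AS duration_minutes
    FROM system.lakeflow.job_run_timeline
    WHERE period_end_time >= current_date() - INTERVAL 7 DAYS
    ORDER BY duration_minutes DESC
""")
display(job_durations)
```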


Model Performance Monitoring

Model Serving Metrics

Monitor Endpoint Latency:
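A sketch assuming request logs land in an inference table; the name ml.serving.inference_log and the request_time and execution_time_ms columns are hypothetical (if you enable Databricks inference tables on the endpoint, substitute its actual name and columns):

```python
# Hourly latency percentiles over the last 24 hours.
latency = spark.sql("""
    SELECT
        date_trunc('hour', request_time)    AS hour,
        percentile(execution_time_ms, 0.50) AS p50_ms,
        percentile(execution_time_ms, 0.95) AS p95_ms,
        percentile(execution_time_ms, 0.99) AS p99_ms
    FROM ml.serving.inference_log
    WHERE request_time >= current_timestamp() - INTERVAL 24 HOURS
    GROUP BY 1
    ORDER BY 1 DESC
""")
display(latency)
```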

Monitor Error Rate:
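Error rate from the same hypothetical inference log, assuming an HTTP-style status_code column:

```python
# Hourly error rate: share of requests returning 4xx/5xx responses.
error_rate = spark.sql("""
    SELECT
        date_trunc('hour', request_time) AS hour,
        round(100.0 * count_if(status_code >= 400) / count(*), 2)
            AS error_rate_pct
    FROM ml.serving.inference_log
    WHERE request_time >= current_timestamp() - INTERVAL 24 HOURS
    GROUP BY 1
    ORDER BY 1 DESC
""")
display(error_rate)
```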

Model Prediction Quality

Monitor Prediction Distribution:
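A sketch assuming predictions are logged to a Delta table; ml.monitoring.predictions and its prediction and prediction_time columns are hypothetical names:

```python
# Share of each predicted class over the last 7 days.
distribution = spark.sql("""
    SELECT
        prediction,
        count(*) AS n,
        round(100.0 * count(*) / sum(count(*)) OVER (), 2) AS pct
    FROM ml.monitoring.predictions
    WHERE prediction_time >= current_date() - INTERVAL 7 DAYS
    GROUP BY prediction
    ORDER BY n DESC
""")
display(distribution)
```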

Monitor Prediction Confidence:
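Confidence tracking over the same hypothetical table, assuming a confidence column; the 0.60 cutoff mirrors the alert threshold in the metrics tables below:

```python
# Daily average confidence plus a count of low-confidence predictions.
confidence = spark.sql("""
    SELECT
        date_trunc('day', prediction_time) AS day,
        avg(confidence)                    AS avg_confidence,
        count_if(confidence < 0.60)        AS low_confidence_count
    FROM ml.monitoring.predictions
    WHERE prediction_time >= current_date() - INTERVAL 30 DAYS
    GROUP BY 1
    ORDER BY 1 DESC
""")
display(confidence)
```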

Detect Data Drift:
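A deliberately simple drift check comparing a feature's recent mean against a stored training baseline. The table names, the baseline layout (feature, mean, std), and feature_x are all hypothetical; production systems might use PSI, KS tests, or Databricks Lakehouse Monitoring instead:

```python
from pyspark.sql import functions as F

# Recent statistics for one feature (hypothetical table and column names).
recent = (
    spark.table("ml.monitoring.predictions")
    .where(F.expr("prediction_time >= current_date() - INTERVAL 7 DAYS"))
    .agg(F.mean("feature_x").alias("mean"), F.stddev("feature_x").alias("std"))
)

# Baseline statistics recorded at training time (hypothetical layout).
baseline = spark.table("ml.monitoring.training_baseline")
b = baseline.where(F.col("feature") == "feature_x").first()
r = recent.first()

# Flag drift when the recent mean moves more than 3 baseline std devs.
drift = abs(r["mean"] - b["mean"]) / max(b["std"], 1e-9)
if drift > 3:
    print(f"Drift detected on feature_x: {drift:.2f} std devs from baseline")
```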


Cost Monitoring

Tracking Compute Costs

Monitor DBU Usage (Databricks UI):

  1. Navigate to Account Console

  2. Click Usage → Billable Usage

  3. Filter by workspace, cluster, or job

Query Cluster Usage:
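A sketch against the billing system table; system.billing.usage is available when system tables are enabled, but the usage_metadata fields vary by record type, so treat the column references as assumptions:

```python
# DBU consumption per cluster over the last 30 days.
usage = spark.sql("""
    SELECT
        usage_metadata.cluster_id AS cluster_id,
        sku_name,
        sum(usage_quantity)       AS dbus
    FROM system.billing.usage
    WHERE usage_date >= current_date() - INTERVAL 30 DAYS
    GROUP BY usage_metadata.cluster_id, sku_name
    ORDER BY dbus DESC
""")
display(usage)
```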

Cost Optimization Opportunities

  1. Terminate idle clusters: Set auto-termination to 10-30 minutes

  2. Right-size clusters: Monitor CPU/memory utilization in Spark UI

  3. Use spot instances: Enable spot instances for non-critical workloads

  4. Optimize file sizes: Compact small files in Delta tables

  5. Cache frequent queries: Use Delta cache for hot data

Identify Expensive Jobs:
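The same assumed billing table, grouped by the job that incurred the usage:

```python
# Top 10 jobs by DBU consumption over the last 30 days.
expensive_jobs = spark.sql("""
    SELECT
        usage_metadata.job_id AS job_id,
        sum(usage_quantity)   AS dbus
    FROM system.billing.usage
    WHERE usage_date >= current_date() - INTERVAL 30 DAYS
      AND usage_metadata.job_id IS NOT NULL
    GROUP BY usage_metadata.job_id
    ORDER BY dbus DESC
    LIMIT 10
""")
display(expensive_jobs)
```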


Log Aggregation

Centralized Logging Strategy

Current State: Logs are stored within Databricks, scoped to each individual job or pipeline.

Recommended Approach (Future Enhancement):

  • Export logs to S3 for long-term retention

  • Use AWS CloudWatch or similar for centralized log aggregation

  • Set up log parsing and indexing for search

Accessing Logs

DLT Pipeline Logs:

  • UI: Workflows → Delta Live Tables → [Pipeline] → Event Log

  • CLI: databricks pipelines list-updates <pipeline_id>

Job Logs:

  • UI: Workflows → Jobs → [Job] → [Run] → Logs

  • CLI: databricks jobs get-run-output --run-id <run_id>

Model Serving Logs:

  • UI: Machine Learning → Model Serving → [Endpoint] → Logs

Spark Logs:

  • UI: Job Run → Spark UI → Executors → Logs


Alert Configuration

Alerting Strategy

Alert Levels:

  1. Critical: Immediate action required (page on-call)

    • Production pipeline failure

    • Model serving endpoint down

    • Data corruption detected

  2. Warning: Investigate soon (Slack notification)

    • Pipeline duration > 2x normal

    • Error rate > 5%

    • Data quality expectations failing

  3. Info: For awareness (email digest)

    • Deployment completed

    • New model version deployed

    • Weekly performance summary

Setting Up Alerts (Databricks SQL)

Create Alert (Databricks UI):

  1. Navigate to SQL → Alerts

  2. Click Create Alert

  3. Define query and trigger conditions

  4. Configure notification channels (email, Slack, PagerDuty)

Example Alert Query:
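A sketch of a failure-detection query, reusing the assumed system.events schema from earlier; paste the SQL into the alert editor, optionally previewing it in a notebook first:

```python
# Returns a row per production pipeline failure in the last hour, so the
# "Results returned" condition below fires only when something failed.
ALERT_QUERY = """
    SELECT update_id, timestamp, message
    FROM system.events
    WHERE event_type = 'update_failed'
      AND timestamp >= current_timestamp() - INTERVAL 1 HOUR
"""

# Preview the results before saving the query in the alert UI.
display(spark.sql(ALERT_QUERY))
```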

Trigger Condition: "Results returned" (alert if query returns any rows)

Notification: Send to #ml-alerts Slack channel


Dashboard Creation

1. Pipeline Health Dashboard

Metrics:

  • Pipeline success rate (last 7 days)

  • Average pipeline duration

  • Data quality metrics (expectation failures)

  • Row counts processed

  • Error messages (last 10 failures)

Query Examples:
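Two sketches for dashboard widgets, again assuming the system.events schema used throughout this guide: a 7-day success-rate trend and the ten most recent failure messages:

```python
# Widget 1: daily success-rate trend for the last 7 days.
trend = spark.sql("""
    SELECT
        date_trunc('day', timestamp) AS day,
        round(100.0 * count_if(event_type = 'update_completed') / count(*), 2)
            AS success_rate_pct
    FROM system.events
    WHERE event_type IN ('update_completed', 'update_failed')
      AND timestamp >= current_date() - INTERVAL 7 DAYS
    GROUP BY 1
    ORDER BY 1
""")

# Widget 2: the ten most recent failure messages.
recent_errors = spark.sql("""
    SELECT timestamp, update_id, message
    FROM system.events
    WHERE event_type = 'update_failed'
    ORDER BY timestamp DESC
    LIMIT 10
""")

display(trend)
display(recent_errors)
```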

2. Model Performance Dashboard

Metrics:

  • Request count and throughput

  • Latency percentiles (p50, p95, p99)

  • Error rate

  • Prediction distribution

  • Confidence scores

  • Data drift detection

3. Cost Dashboard

Metrics:

  • Total DBU usage by job

  • Cluster utilization

  • Most expensive jobs

  • Cost trends over time


Key Metrics to Track

Pipeline Metrics

| Metric       | Description                     | Target  | Alert Threshold |
|--------------|---------------------------------|---------|-----------------|
| Success Rate | % of successful pipeline runs   | >99%    | <95%            |
| Duration     | Average pipeline execution time | <30 min | >60 min         |
| Data Quality | % of rows passing expectations  | >95%    | <90%            |
| Throughput   | Rows processed per minute       | >1000   | <500            |

Model Metrics

| Metric        | Description                   | Target | Alert Threshold |
|---------------|-------------------------------|--------|-----------------|
| Latency (p95) | 95th percentile response time | <500ms | >1000ms         |
| Error Rate    | % of failed predictions       | <1%    | >5%             |
| Throughput    | Predictions per second        | >50    | <10             |
| Confidence    | Average prediction confidence | >0.80  | <0.60           |

Data Metrics

| Metric         | Description                     | Target  | Alert Threshold |
|----------------|---------------------------------|---------|-----------------|
| Null Rate      | % of null values in key columns | <1%     | >5%             |
| Duplicate Rate | % of duplicate records          | 0%      | >1%             |
| Freshness      | Time since last data update     | <1 hour | >4 hours        |
| Volume         | Daily row count                 | Stable  | >20% change     |


Production Monitoring Checklist

Before deploying to production, ensure:

  • Pipeline failure, data quality, and duration alerts are configured and routed to the right channels

  • Pipeline health, model performance, and cost dashboards are in place

  • System tables are enabled and the team has permission to query them

  • Notification channels (email, Slack, PagerDuty) have been tested end to end

  • Model predictions are logged so prediction quality and data drift can be monitored

  • Log access and retention meet the team's debugging needs


Monitoring Tools and Integrations

Databricks Native Tools

  • Databricks SQL: Ad-hoc queries and dashboards

  • System Tables: DLT metrics and audit logs

  • Spark UI: Performance debugging

  • MLflow: Model tracking and versioning

External Integrations

  • Slack: Real-time alert notifications

  • PagerDuty: On-call incident management

  • DataDog / New Relic: APM and infrastructure monitoring

  • AWS CloudWatch: Log aggregation and metrics

  • Grafana: Custom dashboards and visualization

Integration Examples

Slack Webhook Alert:
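A minimal sketch posting an alert to the #ml-alerts channel through a Slack incoming webhook; the webhook URL is a placeholder and should live in a Databricks secret scope rather than in code:

```python
import requests

# Placeholder: fetch the real URL from a secret scope, e.g.
# dbutils.secrets.get("ml-pipelines", "slack-webhook").
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."

def send_slack_alert(message: str) -> None:
    """Post a plain-text alert message to Slack and fail loudly on errors."""
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
    resp.raise_for_status()

send_slack_alert(":rotating_light: Production pipeline failed - see event log")
```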


Troubleshooting Monitoring Issues

Alerts Not Triggering

Symptoms: Expected alert didn't fire

Debug Steps:

  1. Verify alert query returns results

  2. Check alert trigger condition

  3. Verify notification channel configured

  4. Check alert schedule (is it paused?)

Dashboard Loading Slowly

Symptoms: Dashboard takes >30s to load

Optimizations:

  • Add date filters to queries (e.g., last 7 days)

  • Use aggregations instead of row-level data

  • Cache dashboard results (if available)

  • Optimize underlying Delta tables (OPTIMIZE, Z-ORDER)

Missing Metrics

Symptoms: Metrics not appearing in system tables

Causes:

  • System tables not enabled for workspace

  • Insufficient permissions to read system tables

  • Metrics not available for older pipeline versions

Solution: Enable system tables in the workspace settings and verify that you have permission to read them

