Testing Guide

Overview

This guide covers testing strategies and best practices for the ML Pipelines project. Testing is essential for maintaining code quality, catching bugs early, and ensuring reliable production deployments.

Testing Philosophy

Test Pyramid

           ┌─────────────────┐
           │   E2E Tests     │  ← Few, slow, expensive
           │   (Manual)      │
           └─────────────────┘
         ┌───────────────────────┐
         │  Integration Tests    │  ← Some, medium speed
         │  (DLT, Model, API)    │
         └───────────────────────┘
    ┌──────────────────────────────────┐
    │       Unit Tests                 │  ← Many, fast, cheap
    │  (Functions, Classes, Logic)     │
    └──────────────────────────────────┘

Testing Principles

  1. Write tests before code reaches production: all code merged to main should have tests

  2. Test behavior, not implementation: Focus on what code does, not how

  3. Keep tests fast: Unit tests should run in seconds

  4. Make tests deterministic: No flaky tests, no random failures

  5. Test edge cases: Handle nulls, empty strings, boundary values

  6. Mock external dependencies: Don't hit production APIs in tests

Current Test Status

Note: As of October 2025, this repository has minimal test coverage. Tests are being added incrementally as features are developed.

Existing Tests:

  • /tests/main_test.py: Basic smoke test (122 lines)

  • /tests/conftest.py: Pytest configuration (2002 lines)

Test Coverage Goal: Target 80% code coverage for critical paths.

Unit Testing

What to Unit Test

  • Model logic: Prediction functions, preprocessing, postprocessing

  • Utility functions: Data transformations, validation, formatting

  • Configuration parsing: YAML loading, parameter validation

  • Data schemas: Table schema definitions, type conversions

Unit Test Structure

Directory Structure:
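A layout like the following keeps unit and integration tests separate. Only main_test.py and conftest.py exist today; the other file names are illustrative:

```
tests/
├── conftest.py          # shared pytest fixtures and configuration
├── main_test.py         # existing smoke test
├── unit/
│   ├── test_models.py
│   ├── test_utils.py
│   └── test_config.py
└── integration/
    ├── test_dlt_pipeline.py
    └── test_model_registration.py
```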

Example Unit Test

Testing a Model Class:
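A minimal sketch of a model-class unit test. The SentimentModel class and its predict API are illustrative stand-ins, not the project's actual model:

```python
# Hypothetical model class standing in for a real project model.
class SentimentModel:
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def predict(self, texts: list[str]) -> list[str]:
        # Toy logic: label by presence of the word "good".
        return ["positive" if "good" in t.lower() else "negative" for t in texts]


def test_predict_returns_label_per_input():
    # Arrange
    model = SentimentModel()
    texts = ["This is good", "This is bad"]
    # Act
    labels = model.predict(texts)
    # Assert
    assert labels == ["positive", "negative"]


def test_predict_handles_empty_input():
    model = SentimentModel()
    assert model.predict([]) == []
```

Note the AAA (Arrange, Act, Assert) structure and the explicit empty-input edge case.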

Testing Utility Functions:
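A sketch for utility-function tests; normalize_text is a hypothetical helper used here to show null and empty-string edge cases:

```python
# Hypothetical utility standing in for a real transformation helper.
def normalize_text(value):
    """Trim whitespace and lowercase; None becomes an empty string."""
    if value is None:
        return ""
    return value.strip().lower()


def test_normalize_text_strips_and_lowercases():
    assert normalize_text("  Hello World  ") == "hello world"


def test_normalize_text_handles_none_and_empty():
    assert normalize_text(None) == ""
    assert normalize_text("") == ""
```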

Running Unit Tests

Run all tests:
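From the repository root, assuming pytest is installed:

```shell
pytest tests/
```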

Run specific test file:
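Using the existing smoke-test file as the example:

```shell
pytest tests/main_test.py
```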

Run specific test:
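The test name after `::` is illustrative:

```shell
pytest tests/main_test.py::test_smoke
```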

Run with coverage:
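Requires the pytest-cov plugin; the package name passed to `--cov` is an assumption:

```shell
pytest --cov=ml_pipelines tests/
```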

Run with verbose output:
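The `-v` flag prints each test name as it runs:

```shell
pytest -v tests/
```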

Integration Testing

What to Integration Test

  • DLT pipelines: End-to-end pipeline execution

  • Model registration: MLflow registration and endpoint creation

  • API endpoints: Model serving endpoint responses

  • Database operations: Table creation, data insertion, queries

DLT Pipeline Testing

Testing Strategy:

  1. Create test data in sandbox environment

  2. Run DLT pipeline on test data

  3. Validate output tables have expected schema and data

Example DLT Pipeline Test:
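DLT table functions are difficult to execute outside a Databricks pipeline, so one approach is to keep the transformation logic in a plain function that the @dlt.table-decorated function wraps, and unit test that function directly. Everything below is a sketch: the function name and fields are illustrative, and plain dicts stand in for Spark rows so the logic runs without a Spark session:

```python
# Illustrative: in the pipeline, a @dlt.table function would apply
# clean_events() to a DataFrame; here plain dicts stand in for rows.
def clean_events(rows):
    """Drop rows with a null user_id and normalize event names."""
    return [
        {**row, "event": row["event"].strip().lower()}
        for row in rows
        if row.get("user_id") is not None
    ]


def test_clean_events_drops_null_user_ids():
    rows = [
        {"user_id": 1, "event": " Click "},
        {"user_id": None, "event": "view"},
    ]
    cleaned = clean_events(rows)
    assert len(cleaned) == 1
    assert cleaned[0]["event"] == "click"
```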

Model Registration Testing

Testing Strategy:

  1. Register model to sandbox catalog

  2. Validate model metadata

  3. Test model prediction via MLflow

  4. Test model serving endpoint

Example Model Registration Test:
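A sketch of a registration test with MLflow mocked out, so it runs without a tracking server. register_model is a hypothetical project helper wrapping the MLflow client; the model and catalog names are placeholders:

```python
from unittest.mock import MagicMock


# Hypothetical project helper that wraps the MLflow client.
def register_model(client, model_uri, name):
    version = client.create_model_version(name=name, source=model_uri)
    return version.version


def test_register_model_returns_new_version():
    # Arrange: a stand-in for mlflow.tracking.MlflowClient
    client = MagicMock()
    client.create_model_version.return_value.version = "3"
    # Act
    version = register_model(client, "runs:/abc123/model", "sandbox.models.demo")
    # Assert
    assert version == "3"
    client.create_model_version.assert_called_once_with(
        name="sandbox.models.demo", source="runs:/abc123/model"
    )
```

An end-to-end variant of this test would use a real MlflowClient against the sandbox catalog instead of the mock.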

Running Integration Tests

Run all integration tests:
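Assuming integration tests live under tests/integration/ as sketched earlier:

```shell
pytest tests/integration/
```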

Run specific integration test:
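The file name is illustrative:

```shell
pytest tests/integration/test_dlt_pipeline.py
```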

Note: Integration tests require:

  • Databricks authentication configured

  • Access to sandbox environment

  • Test data in sandbox catalog
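One common way to satisfy the authentication requirement is the environment variables recognized by Databricks tooling (the values below are placeholders):

```shell
export DATABRICKS_HOST="https://<your-workspace>.cloud.databricks.com"
export DATABRICKS_TOKEN="<personal-access-token>"
```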

Model Testing

Model Validation

What to Validate:

  • Model loads successfully from MLflow

  • Model signature is correct

  • Model predictions have expected format

  • Model handles edge cases (empty input, nulls, special characters)

  • Model performance meets baseline metrics

Example Model Validation:
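A sketch of format and edge-case validation. StubModel stands in for a model that would in practice be loaded via mlflow.pyfunc.load_model; the checks mirror the list above:

```python
# A stand-in model; in practice this would be loaded from MLflow.
class StubModel:
    def predict(self, inputs):
        # Toy deterministic logic producing a 0/1 class label per input.
        return [len(str(x)) % 2 for x in inputs]


def test_predictions_have_expected_format():
    model = StubModel()
    preds = model.predict(["a", "bb", ""])
    # One prediction per input, each a known class label
    assert len(preds) == 3
    assert all(p in (0, 1) for p in preds)


def test_model_handles_edge_case_inputs():
    model = StubModel()
    # Empty strings, whitespace, and unusual characters should not raise
    preds = model.predict(["", "   ", "héllo ☃"])
    assert len(preds) == 3
```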

Performance Testing

Latency Testing:
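A sketch of a latency harness using only the standard library. The latency budget and the stand-in predict function are illustrative; a real test would call the serving endpoint:

```python
import statistics
import time


def measure_latency(predict_fn, payload, n_requests=100):
    """Return rough p50/p95 latency in milliseconds over n_requests calls."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        predict_fn(payload)
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return {
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[int(0.95 * len(latencies)) - 1],
    }


def test_latency_within_budget():
    # Stand-in for a real endpoint call; the 1000 ms budget is illustrative
    result = measure_latency(lambda x: sum(range(1000)), None, n_requests=50)
    assert result["p95_ms"] < 1000
```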

Test Data Management

Test Data Strategy

Principle: Use realistic but small datasets for testing.

Test Data Sources:

  1. Synthetic data: Generated programmatically

  2. Anonymized samples: Real data with PII removed

  3. Fixture data: Hand-crafted examples for edge cases

Creating Test Data

Fixture Approach:
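A sketch of a fixture that would live in conftest.py. The record fields are illustrative; the plain builder function lets the same data be reused outside pytest:

```python
import pytest


def build_sample_reviews():
    """Plain builder so the data can also be used outside pytest."""
    return [
        {"id": 1, "text": "Great product", "label": "positive"},
        {"id": 2, "text": "", "label": "negative"},  # edge case: empty text
    ]


@pytest.fixture
def sample_reviews():
    return build_sample_reviews()
```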

Using Fixtures in Tests:
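A sketch of fixture injection: pytest matches the argument name to a fixture of the same name. The fixture is repeated inline here so the example is self-contained; in practice it would live in conftest.py:

```python
import pytest


@pytest.fixture
def sample_reviews():
    # In practice defined once in conftest.py and shared across test files
    return [
        {"id": 1, "text": "Great product", "label": "positive"},
        {"id": 2, "text": "", "label": "negative"},
    ]


def test_labels_are_valid(sample_reviews):
    # pytest injects the fixture by matching the parameter name
    assert all(r["label"] in ("positive", "negative") for r in sample_reviews)
```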

Test Data in Databricks

Sandbox Test Tables:
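Illustrative SQL for seeding a small test table in the sandbox; the catalog, schema, and table names are placeholders:

```sql
CREATE TABLE IF NOT EXISTS sandbox.test_data.events_sample AS
SELECT * FROM prod.events.raw_events LIMIT 1000;
```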

Cleanup Test Data:
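Matching cleanup for the illustrative table above (names are placeholders):

```sql
DROP TABLE IF EXISTS sandbox.test_data.events_sample;
```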

CI/CD Test Automation

Tests in PR Validation

Current State: PR validation workflow validates bundle configuration but doesn't run tests.

Recommended Addition (add to .github/workflows/ml_pipelines_pr_validate.yml):
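A sketch of test steps that could be added to the workflow's job; the Python version and paths are assumptions:

```yaml
- name: Set up Python
  uses: actions/setup-python@v5
  with:
    python-version: "3.11"

- name: Install test dependencies
  run: pip install pytest pytest-cov

- name: Run unit tests
  run: pytest tests/ --cov --cov-report=term-missing
```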

Test Requirements for PR Merge

Recommended Policy:

  • All unit tests must pass

  • Code coverage must not decrease

  • New features must include tests

  • Bug fixes must include regression tests

Mocking Strategies

Mocking External Dependencies

Mock Databricks API:
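One approach is to pass the client into the code under test, so a mock can replace a real databricks.sdk.WorkspaceClient. The helper below is hypothetical:

```python
from unittest.mock import MagicMock


# Hypothetical helper that takes the client as a parameter; dependency
# injection makes it easy to substitute a mock in tests.
def list_job_names(client):
    return [job.settings.name for job in client.jobs.list()]


def test_list_job_names_with_mocked_client():
    client = MagicMock()
    job = MagicMock()
    job.settings.name = "nightly-training"
    client.jobs.list.return_value = [job]

    assert list_job_names(client) == ["nightly-training"]
```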

Mock Model Predictions:
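A sketch for testing downstream logic without loading an expensive model; the prediction labels are illustrative:

```python
from unittest.mock import MagicMock


def test_pipeline_with_mocked_model():
    # Stand-in for a model loaded from MLflow
    model = MagicMock()
    model.predict.return_value = ["positive", "negative"]

    # Hypothetical downstream logic under test
    predictions = model.predict(["good", "bad"])
    positives = sum(1 for p in predictions if p == "positive")

    assert positives == 1
    model.predict.assert_called_once()
```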

Mocking Databricks Utilities

Mock dbutils:
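dbutils only exists inside Databricks notebooks, so tests substitute a mock. The get_environment helper is hypothetical; the secrets and widgets calls mirror the real dbutils API:

```python
from unittest.mock import MagicMock

# dbutils is only available inside Databricks; a MagicMock stands in
dbutils = MagicMock()
dbutils.secrets.get.return_value = "fake-token"
dbutils.widgets.get.return_value = "sandbox"


def get_environment():
    # Hypothetical code under test that reads a notebook widget
    return dbutils.widgets.get("env")


def test_get_environment_uses_widget_value():
    assert get_environment() == "sandbox"
    assert dbutils.secrets.get("scope", "key") == "fake-token"
```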

Coverage Requirements

Target Coverage Levels

  • Critical paths: 90%+ coverage

    • Model prediction logic

    • Data validation functions

    • Configuration parsing

  • Standard code: 80%+ coverage

    • Utility functions

    • Data transformations

    • API handlers

  • Infrastructure code: 60%+ coverage

    • Deployment scripts

    • CI/CD utilities

Measuring Coverage

Generate coverage report:
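Requires pytest-cov; the package name is an assumption:

```shell
pytest --cov=ml_pipelines --cov-report=term-missing tests/
```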

View HTML report:
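The HTML report is written to htmlcov/ by default:

```shell
pytest --cov=ml_pipelines --cov-report=html tests/
open htmlcov/index.html   # macOS; use xdg-open on Linux
```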

Check specific file coverage:
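Using the coverage.py CLI directly; the path pattern is illustrative:

```shell
coverage report --include="ml_pipelines/models/*"
```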

Testing Best Practices

Do's

  • Write tests first (TDD when possible)

  • Test one thing per test

  • Use descriptive test names

  • Follow AAA pattern (Arrange, Act, Assert)

  • Use fixtures for setup/teardown

  • Mock external dependencies

  • Test edge cases and error conditions

  • Keep tests fast

  • Make tests deterministic

  • Clean up test data
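The AAA pattern from the list above, in a minimal test (the function under test is illustrative):

```python
def add_tax(amount, rate):
    """Hypothetical helper: apply a tax rate and round to cents."""
    return round(amount * (1 + rate), 2)


def test_add_tax_applies_rate():
    # Arrange
    amount, rate = 100.0, 0.07
    # Act
    total = add_tax(amount, rate)
    # Assert
    assert total == 107.0
```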

Don'ts

  • Don't test implementation details

  • Don't write flaky tests

  • Don't skip test cleanup

  • Don't hardcode credentials in tests

  • Don't test external APIs directly

  • Don't duplicate test code

  • Don't ignore failing tests

  • Don't test framework code

  • Don't write tests that depend on order
