Glossary

Overview

This glossary defines terminology, acronyms, and concepts used throughout the ML Pipelines project.


A

ai_query()

Databricks SQL function that invokes ML model serving endpoints directly from SQL, including Delta Live Tables pipelines. Enables real-time inference on streaming data.

Example: ai_query('sentiment_analysis', message_content)
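A minimal sketch of how ai_query might appear inside a silver-layer query; the table and column names here are illustrative, not necessarily this project's actual schema.

```sql
-- Hypothetical silver-layer query: score each message with the
-- sentiment_analysis serving endpoint (column names are illustrative).
SELECT
  message_id,
  message_content,
  ai_query('sentiment_analysis', message_content) AS sentiment_raw
FROM STREAM(dev.bronze.messages);
```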

Alias (MLflow)

Named reference to a specific model version in MLflow Model Registry (e.g., champion, challenger, archive).

Artifacts (MLflow)

Files associated with an ML model run, including model binaries, configuration files, and supplementary data.


B

Bronze Layer

First layer in medallion architecture containing raw, unprocessed data ingested from source systems. No cleaning or transformation applied.

Example: dev.bronze.messages contains raw message data exactly as received.

See Also: Data Flow Architecture for complete medallion architecture details.

Bundle (DAB)

Databricks Asset Bundle - a collection of Databricks resources (jobs, pipelines, models) defined in YAML and deployed as a unit.

See Also: Configuration Reference for databricks.yml structure.
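A minimal databricks.yml sketch showing the bundle/targets/resources shape; the names, host, and paths are illustrative assumptions, not this project's actual configuration.

```yaml
# Sketch of a databricks.yml bundle definition (names and paths assumed).
bundle:
  name: ml-pipelines

targets:
  dev:
    mode: development
    workspace:
      host: https://example.cloud.databricks.com

resources:
  jobs:
    register_sentiment_analysis:
      name: register_sentiment_analysis
      tasks:
        - task_key: train
          notebook_task:
            notebook_path: ./notebooks/train.py
```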

Binary Promotion

Copying a trained model binary from one environment to another without retraining. Ensures the same model tested in staging runs in production.

See Also: Model Promotion Architecture for complete model lifecycle.


C

Catalog

Top-level container in Unity Catalog that organizes schemas and tables. Provides namespace isolation between environments.

Examples: dev, staging, prod, taylorlaing_sandbox

See Also: Unity Catalog Architecture for complete catalog structure and permissions.

Champion (Model)

The current production model version, designated by the champion alias in MLflow Model Registry.

CI/CD

Continuous Integration / Continuous Deployment. Automated process of testing and deploying code changes.

Cluster

Collection of compute resources (VMs) in Databricks that execute code. Can be interactive or job-specific.


D

DAB

Databricks Asset Bundle. Infrastructure-as-code framework for deploying Databricks resources (jobs, pipelines, models) across environments.

Data Drift

Change in data distribution over time that can degrade model performance.

Example: If training data had 60% positive sentiment but production data has 80% positive, the model may perform worse.
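The example above can be sketched as a simple proportion check; the 0.1 threshold is an arbitrary illustrative choice, and real drift monitoring would use a statistical test or a metric such as PSI.

```python
# Illustrative drift check: compare the positive-sentiment rate in
# training data against production traffic (numbers from the example
# above; the threshold is an arbitrary choice).
def drift_exceeds(train_rate: float, prod_rate: float, threshold: float = 0.1) -> bool:
    """Flag drift when the class proportion shifts more than `threshold`."""
    return abs(train_rate - prod_rate) > threshold

print(drift_exceeds(0.60, 0.80))  # 0.20 shift -> True
```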

DBU

Databricks Unit. Pricing unit for Databricks compute. Usage is measured in DBUs per hour.

Delta Lake

Open-source storage layer providing ACID transactions, time travel, and schema enforcement for data lakes.

Delta Live Tables (DLT)

Databricks framework for building and managing data pipelines with declarative SQL/Python. Automatically handles dependencies, quality checks, and monitoring.


E

Endpoint (Model Serving)

REST API that serves predictions from a deployed ML model. Created from registered models in Unity Catalog.

Example: sentiment_analysis endpoint serves predictions from prod.models.sentiment_analysis.

Expectation (DLT)

Data quality constraint in Delta Live Tables that validates data meets specified conditions.

Example: @dlt.expect("valid_score", "score BETWEEN 0 AND 1")
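The check an expectation performs can be illustrated in plain Python; real expectations are evaluated per row by the DLT runtime, not by code like this.

```python
# Pure-Python sketch of what an expectation such as
# @dlt.expect("valid_score", "score BETWEEN 0 AND 1") verifies per row.
def valid_score(row: dict) -> bool:
    return 0 <= row.get("score", -1) <= 1

rows = [{"score": 0.7}, {"score": 1.3}]
passed = [r for r in rows if valid_score(r)]
print(len(passed))  # 1 row satisfies the constraint
```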

External Location

Unity Catalog object that defines credentials and permissions for accessing external storage (S3, Azure Blob, GCS).


F

Feature Engineering

Process of transforming raw data into features suitable for ML models.

Example: Converting raw text messages into sentiment scores and emotional features.


G

GitHub OIDC

OpenID Connect authentication between GitHub Actions and Databricks. Enables passwordless CI/CD without storing credentials.
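A hedged sketch of the GitHub Actions side of OIDC: the workflow must be granted permission to request an ID token, and the Databricks workspace must separately be configured to trust tokens from this repository. The job step shown is illustrative.

```yaml
# Sketch of the workflow permissions needed for OIDC (step is illustrative).
permissions:
  id-token: write   # allow the job to request an OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: databricks bundle deploy --target prod
```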

See Also: Service Principals Guide and ADR-003 for setup details.

Gold Layer

Third layer in medallion architecture containing aggregations that join bronze and silver tables to create insights built from trends across multiple data points. Optimized for analytics and reporting.

Example: prod.gold.daily_sentiment_metrics aggregates sentiment scores across time periods

See Also: Data Flow Architecture for complete medallion architecture details.


I

Inference

Process of using a trained ML model to make predictions on new data.


J

Job

Scheduled or on-demand execution of notebooks, scripts, or pipelines in Databricks.

Example: prod_register_sentiment_analysis job trains and registers sentiment analysis model.


M

Medallion Architecture

Data architecture pattern with three layers: Bronze (raw), Silver (cleaned), Gold (aggregated).
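The three layers can be illustrated with a toy transformation; real layers are Delta tables populated by pipelines, not in-memory lists, and the rows here are invented.

```python
# Toy illustration of medallion layers on invented rows.
bronze = [{"msg": " Great! "}, {"msg": ""}, {"msg": "Bad day"}]

# Silver: drop invalid rows, normalize fields.
silver = [{"msg": r["msg"].strip()} for r in bronze if r["msg"].strip()]

# Gold: aggregate for reporting.
gold = {"message_count": len(silver)}
print(gold)  # {'message_count': 2}
```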

Metastore

Top-level Unity Catalog container that stores metadata about catalogs, schemas, tables, and permissions. Can be shared across workspaces.

MLflow

Open-source platform for managing the ML lifecycle, including experiment tracking, model registry, and model serving.

MLOps

Machine Learning Operations. Practices and tools for deploying, monitoring, and maintaining ML models in production.

Model Registry

MLflow component that manages model versions, metadata, and lifecycle stages.

Example: prod.models.sentiment_analysis is a registered model.


N

Notebook

Interactive document combining code, visualizations, and markdown. Used for data exploration and development in Databricks.


O

OIDC

OpenID Connect. Authentication protocol used by GitHub Actions to authenticate to Databricks without storing credentials.


P

Pipeline (DLT)

Delta Live Tables data processing workflow that transforms data through multiple stages.

Example: sentiment-analysis-prod pipeline processes messages and generates sentiment features.

Plutchik's Wheel

Emotion classification framework with eight basic emotions, used by the sentiment analysis model.

Emotions: Joy, Trust, Anticipation, Surprise, Sadness, Anger, Fear, Disgust

Production (Prod)

Live environment serving real users and business processes. Requires highest reliability and approval gates.


R

RoBERTa

Robustly Optimized BERT Pretraining Approach. Transformer-based ML model for natural language processing.

Example: roberta-base-go-emotions model used for emotion classification.

Run (MLflow)

Single execution of an ML experiment, tracking parameters, metrics, and artifacts.


S

Sandbox

Personal development environment for individual developers, isolated from shared resources.

Example: taylorlaing_sandbox catalog.

Schema

Second-level container in Unity Catalog, organized within a catalog. Contains tables and views.

Examples: bronze, silver, gold, models

Service Principal

Non-human identity used for automation and CI/CD. Has specific permissions and audit trail separate from user accounts.

Example: ml-pipelines-dev service principal with ID 03ff99cd-a352-40bb-9d33-414c9ad9e7aa

Signature (Model)

MLflow schema definition specifying model input and output format. Required for model serving endpoints.
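Conceptually, a signature is a set of named, typed columns for inputs and outputs; the sketch below shows that shape with assumed field names, not MLflow's exact serialization format.

```python
# Illustrative shape of a model signature: typed input and output
# columns (field names are assumptions for this example).
signature = {
    "inputs": [{"name": "message_content", "type": "string"}],
    "outputs": [{"name": "sentiment_score", "type": "double"}],
}
print(signature["inputs"][0]["name"])  # message_content
```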

Silver Layer

Second layer in medallion architecture containing cleaned, validated data with model predictions. The silver layer runs ai_query inference to add predictions to cleaned data.

Example: prod.silver.sentiment_features contains cleaned message data with sentiment predictions from ai_query.

See Also: Data Flow Architecture for complete silver layer details and Model Deployment Guide for ai_query patterns.

Spark

Distributed computing framework used by Databricks for processing large datasets.

Staging

Pre-production environment that replicates production setup for final validation before deployment.

Streaming

Continuous data processing where new data is processed as it arrives, rather than in batches.


T

Target

Deployment environment defined in databricks.yml (e.g., dev, staging, prod, sandbox).

Time Travel

Delta Lake feature allowing queries of historical table versions.

Example: SELECT * FROM table VERSION AS OF 42 or RESTORE TABLE table TO VERSION AS OF 42


U

Unity Catalog

Databricks unified governance solution for data and AI assets. Provides centralized access control, auditing, and lineage.

See Also: Unity Catalog Architecture for complete catalog structure, permissions model, and governance policies.

UV

Fast Python package manager used in this project. Alternative to pip.

Usage: uv sync --dev to install dependencies.


V

Version (Model)

Specific iteration of a registered model in MLflow Model Registry. Versions are numbered sequentially (1, 2, 3, ...).

Volume

Unity Catalog object for storing unstructured data (files) with governance and access control.

Example: dev.bronze.raw_messages volume stores raw message files from S3.


W

Wheel

Python package distribution format (.whl file). Built from source code for deployment.

Example: ml_pipelines-0.1.0-py3-none-any.whl

Workspace

Databricks environment containing notebooks, jobs, clusters, and other resources. Can be dev, staging, or prod workspace.


Acronyms Quick Reference

Acronym | Full Term | Description
API | Application Programming Interface | Interface for interacting with software
CI/CD | Continuous Integration / Continuous Deployment | Automated testing and deployment
CLI | Command Line Interface | Text-based interface for commands
DAB | Databricks Asset Bundle | IaC framework for Databricks
DBU | Databricks Unit | Pricing unit for compute
DLT | Delta Live Tables | Databricks pipeline framework
IAM | Identity and Access Management | Access control system
IaC | Infrastructure as Code | Managing infrastructure via code
JSON | JavaScript Object Notation | Data interchange format
ML | Machine Learning | AI models learning from data
MLflow | Machine Learning flow | ML lifecycle platform
MLOps | Machine Learning Operations | ML deployment and operations
NLP | Natural Language Processing | AI for understanding text
OIDC | OpenID Connect | Authentication protocol
PR | Pull Request | Code review and merge request
REST | Representational State Transfer | API architectural style
S3 | Simple Storage Service | AWS object storage
SDK | Software Development Kit | Tools for software development
SQL | Structured Query Language | Database query language
TDD | Test-Driven Development | Write tests before code
UC | Unity Catalog | Databricks governance solution
UI | User Interface | Visual interface for interaction
URL | Uniform Resource Locator | Web address
UUID | Universally Unique Identifier | Unique ID for resources
YAML | YAML Ain't Markup Language | Configuration file format


Common Databricks Terms

Account Console

Databricks administrative interface for managing workspaces, service principals, and billing across an organization.

Auto-termination

Automatic shutdown of idle clusters after specified time period to save costs.

Cluster Policy

Template defining allowed cluster configurations to enforce governance and cost controls.

Data Security Mode

Cluster configuration determining isolation level and access controls.

Modes:

  • SINGLE_USER: Single user, full access

  • USER_ISOLATION: Multiple users, process isolation

  • NONE: Legacy shared mode

Notebook Workflow

Databricks job that executes notebooks in sequence or parallel.

Repos

Git integration in Databricks workspace for version-controlled development.

Secrets

Databricks key-value storage for sensitive information like API keys and passwords.

Access: dbutils.secrets.get("scope", "key")

Serverless

Databricks compute mode that automatically provisions and scales resources.

Spark UI

Web interface for monitoring and debugging Spark jobs, showing stages, tasks, and execution plans.

Workspace

Databricks collaborative environment containing notebooks, experiments, and resources.


Project-Specific Terms

Champion Model

Current production model version with best performance, designated by champion alias.

Feature Analysis

Extraction of workplace-related features (autonomy, belonging, competence, etc.) from text.

Go Emotions

Taxonomy of 27 emotion categories plus a neutral label (28 in total) from Google's GoEmotions dataset, used by the sentiment model.

Sandbox Deployment

Personal development environment deployment using the make deploy command.

Two-Stage Pipeline

Pattern for handling ai_query with dynamic output schemas. Stage 1 captures raw results, stage 2 casts to expected types.
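The stage-2 cast can be sketched in plain Python; in the actual pipeline both stages would be table definitions, and the function name here is hypothetical.

```python
# Sketch of the two-stage cast: stage 1 stores ai_query output as a raw
# string; stage 2 casts it to the expected type, tolerating bad values.
def stage2_cast(raw):
    """Cast a raw prediction to float; return None when the cast fails."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None

print(stage2_cast("0.87"))  # 0.87
print(stage2_cast("n/a"))   # None
```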

