Glossary
Overview
This glossary defines terminology, acronyms, and concepts used throughout the ML Pipelines project.
A
ai_query()
Databricks SQL function that invokes ML model serving endpoints from Delta Live Tables pipelines. Enables real-time inference on streaming data.
Example: ai_query('sentiment_analysis', message_content)
Alias (MLflow)
Named reference to a specific model version in MLflow Model Registry (e.g., champion, challenger, archive).
Artifacts (MLflow)
Files associated with an ML model run, including model binaries, configuration files, and supplementary data.
B
Bronze Layer
First layer in medallion architecture containing raw, unprocessed data ingested from source systems. No cleaning or transformation applied.
Example: dev.bronze.messages contains raw message data exactly as received.
See Also: Data Flow Architecture for complete medallion architecture details.
Bundle (DAB)
Databricks Asset Bundle - a collection of Databricks resources (jobs, pipelines, models) defined in YAML and deployed as a unit.
See Also: Configuration Reference for databricks.yml structure.
Binary Promotion
Copying a trained model binary from one environment to another without retraining. Ensures the same model tested in staging runs in production.
See Also: Model Promotion Architecture for complete model lifecycle.
C
Catalog
Top-level container in Unity Catalog that organizes schemas and tables. Provides namespace isolation between environments.
Examples: dev, staging, prod, taylorlaing_sandbox
See Also: Unity Catalog Architecture for complete catalog structure and permissions.
Champion (Model)
The current production model version, designated by the champion alias in MLflow Model Registry.
CI/CD
Continuous Integration / Continuous Deployment. Automated process of testing and deploying code changes.
Cluster
Collection of compute resources (VMs) in Databricks that execute code. Can be interactive or job-specific.
D
DAB
Databricks Asset Bundle. Infrastructure-as-code framework for deploying Databricks resources (jobs, pipelines, models) across environments.
Data Drift
Change in data distribution over time that can degrade model performance.
Example: If training data had 60% positive sentiment but production data has 80% positive, the model may perform worse.
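The 60% → 80% shift described above can be sketched as a simple proportion check in plain Python (an illustrative sketch only; the function names and the 10-point threshold are hypothetical, not project code):

```python
# Illustrative drift check: compare the positive-sentiment rate seen in
# training data against a window of recent production predictions.
# Names and the threshold are hypothetical, not from the project codebase.

def positive_rate(labels):
    """Fraction of labels equal to 'positive'."""
    return sum(1 for label in labels if label == "positive") / len(labels)

def drift_detected(train_labels, prod_labels, threshold=0.1):
    """Flag drift when the positive rate shifts by more than `threshold`."""
    return abs(positive_rate(train_labels) - positive_rate(prod_labels)) > threshold

train = ["positive"] * 60 + ["negative"] * 40   # 60% positive at training time
prod = ["positive"] * 80 + ["negative"] * 20    # 80% positive in production

print(drift_detected(train, prod))  # True: the 20-point shift exceeds 10%
```

Real drift monitoring would use a statistical test (e.g., a chi-squared or KS test) rather than a fixed threshold, but the idea is the same: compare a production distribution against the training baseline.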
DBU
Databricks Unit. Pricing unit for Databricks compute. Usage is measured in DBUs per hour.
Delta Lake
Open-source storage layer providing ACID transactions, time travel, and schema enforcement for data lakes.
Delta Live Tables (DLT)
Databricks framework for building and managing data pipelines with declarative SQL/Python. Automatically handles dependencies, quality checks, and monitoring.
E
Endpoint (Model Serving)
REST API that serves predictions from a deployed ML model. Created from registered models in Unity Catalog.
Example: sentiment_analysis endpoint serves predictions from prod.models.sentiment_analysis.
Expectation (DLT)
Data quality constraint in Delta Live Tables that validates data meets specified conditions.
Example: @dlt.expect("valid_score", "score BETWEEN 0 AND 1")
External Location
Unity Catalog object that defines credentials and permissions for accessing external storage (S3, Azure Blob, GCS).
F
Feature Engineering
Process of transforming raw data into features suitable for ML models.
Example: Converting raw text messages into sentiment scores and emotional features.
G
GitHub OIDC
OpenID Connect authentication between GitHub Actions and Databricks. Enables passwordless CI/CD without storing credentials.
See Also: Service Principals Guide and ADR-003 for setup details.
Gold Layer
Third layer in medallion architecture containing business-level aggregations built from silver (and occasionally bronze) tables, combining multiple data points into trend-level insights. Optimized for analytics and reporting.
Example: prod.gold.daily_sentiment_metrics aggregates sentiment scores across time periods
See Also: Data Flow Architecture for complete medallion architecture details.
I
Inference
Process of using a trained ML model to make predictions on new data.
J
Job
Scheduled or on-demand execution of notebooks, scripts, or pipelines in Databricks.
Example: prod_register_sentiment_analysis job trains and registers sentiment analysis model.
M
Medallion Architecture
Data architecture pattern with three layers: Bronze (raw), Silver (cleaned), Gold (aggregated).
Metastore
Top-level Unity Catalog container that stores metadata about catalogs, schemas, tables, and permissions. Can be shared across workspaces.
MLflow
Open-source platform for managing the ML lifecycle, including experiment tracking, model registry, and model serving.
MLOps
Machine Learning Operations. Practices and tools for deploying, monitoring, and maintaining ML models in production.
Model Registry
MLflow component that manages model versions, metadata, and lifecycle stages.
Example: prod.models.sentiment_analysis is a registered model.
N
Notebook
Interactive document combining code, visualizations, and markdown. Used for data exploration and development in Databricks.
O
OIDC
OpenID Connect. Authentication protocol used by GitHub Actions to authenticate to Databricks without storing credentials.
P
Pipeline (DLT)
Delta Live Tables data processing workflow that transforms data through multiple stages.
Example: sentiment-analysis-prod pipeline processes messages and generates sentiment features.
Plutchik's Wheel
Emotion classification framework with 8 basic emotions, used in the sentiment analysis model.
Emotions: Joy, Trust, Anticipation, Surprise, Sadness, Anger, Fear, Disgust
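On Plutchik's wheel the eight emotions form four opposing pairs (joy/sadness, trust/disgust, fear/anger, anticipation/surprise). A minimal lookup sketch, purely illustrative and not part of the project codebase:

```python
# Plutchik's eight basic emotions as four opposing pairs.
OPPOSITES = {
    "joy": "sadness",
    "trust": "disgust",
    "fear": "anger",
    "anticipation": "surprise",
}
# Make the mapping symmetric so either member of a pair can be looked up.
OPPOSITES.update({v: k for k, v in dict(OPPOSITES).items()})

print(OPPOSITES["joy"])    # sadness
print(OPPOSITES["anger"])  # fear
```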
Production (Prod)
Live environment serving real users and business processes. Requires highest reliability and approval gates.
R
RoBERTa
Robustly Optimized BERT Pretraining Approach. Transformer-based model for natural language processing.
Example: roberta-base-go-emotions model used for emotion classification.
Run (MLflow)
Single execution of an ML experiment, tracking parameters, metrics, and artifacts.
S
Sandbox
Personal development environment for individual developers, isolated from shared resources.
Example: taylorlaing_sandbox catalog.
Schema
Second-level container in Unity Catalog, organized within a catalog. Contains tables and views.
Examples: bronze, silver, gold, models
Service Principal
Non-human identity used for automation and CI/CD. Has specific permissions and audit trail separate from user accounts.
Example: ml-pipelines-dev service principal with ID 03ff99cd-a352-40bb-9d33-414c9ad9e7aa
Signature (Model)
MLflow schema definition specifying model input and output format. Required for model serving endpoints.
Silver Layer
Second layer in medallion architecture containing cleaned, validated data with model predictions. The silver layer runs ai_query inference to add predictions to cleaned data.
Example: prod.silver.sentiment_features contains cleaned message data with sentiment predictions from ai_query.
See Also: Data Flow Architecture for complete silver layer details and Model Deployment Guide for ai_query patterns.
Spark
Distributed computing framework used by Databricks for processing large datasets.
Staging
Pre-production environment that replicates production setup for final validation before deployment.
Streaming
Continuous data processing where new data is processed as it arrives, rather than in batches.
T
Target
Deployment environment defined in databricks.yml (e.g., dev, staging, prod, sandbox).
Time Travel
Delta Lake feature allowing queries of historical table versions.
Example: SELECT * FROM table_name VERSION AS OF 42 or RESTORE TABLE table_name TO VERSION AS OF 42
U
Unity Catalog
Databricks unified governance solution for data and AI assets. Provides centralized access control, auditing, and lineage.
See Also: Unity Catalog Architecture for complete catalog structure, permissions model, and governance policies.
UV
Fast Python package manager used in this project. Alternative to pip.
Usage: uv sync --dev to install dependencies.
V
Version (Model)
Specific iteration of a registered model in MLflow Model Registry. Versions are numbered sequentially (1, 2, 3, ...).
Volume
Unity Catalog object for storing unstructured data (files) with governance and access control.
Example: dev.bronze.raw_messages volume stores raw message files from S3.
W
Wheel
Python package distribution format (.whl file). Built from source code for deployment.
Example: ml_pipelines-0.1.0-py3-none-any.whl
Workspace
Databricks environment containing notebooks, jobs, clusters, and other resources. Can be dev, staging, or prod workspace.
Acronyms Quick Reference
API
Application Programming Interface
Interface for interacting with software
CI/CD
Continuous Integration / Continuous Deployment
Automated testing and deployment
CLI
Command Line Interface
Text-based interface for commands
DAB
Databricks Asset Bundle
IaC framework for Databricks
DBU
Databricks Unit
Pricing unit for compute
DLT
Delta Live Tables
Databricks pipeline framework
IAM
Identity and Access Management
Access control system
IaC
Infrastructure as Code
Managing infrastructure via code
JSON
JavaScript Object Notation
Data interchange format
ML
Machine Learning
AI models learning from data
MLflow
Not a strict acronym; product name
ML lifecycle platform
MLOps
Machine Learning Operations
ML deployment and operations
NLP
Natural Language Processing
AI for understanding text
OIDC
OpenID Connect
Authentication protocol
PR
Pull Request
Code review and merge request
REST
Representational State Transfer
API architectural style
S3
Simple Storage Service
AWS object storage
SDK
Software Development Kit
Tools for software development
SQL
Structured Query Language
Database query language
TDD
Test-Driven Development
Write tests before code
UC
Unity Catalog
Databricks governance solution
UI
User Interface
Visual interface for interaction
URL
Uniform Resource Locator
Web address
UUID
Universally Unique Identifier
Unique ID for resources
YAML
YAML Ain't Markup Language
Configuration file format
Common Databricks Terms
Account Console
Databricks administrative interface for managing workspaces, service principals, and billing across an organization.
Auto-termination
Automatic shutdown of idle clusters after specified time period to save costs.
Cluster Policy
Template defining allowed cluster configurations to enforce governance and cost controls.
Data Security Mode
Cluster configuration determining isolation level and access controls.
Modes:
SINGLE_USER: Single user, full access
USER_ISOLATION: Multiple users, process isolation
NONE: Legacy shared mode
Notebook Workflow
Databricks job that executes notebooks in sequence or parallel.
Repos
Git integration in Databricks workspace for version-controlled development.
Secrets
Databricks key-value storage for sensitive information like API keys and passwords.
Access: dbutils.secrets.get("scope", "key")
Serverless
Databricks compute mode that automatically provisions and scales resources.
Spark UI
Web interface for monitoring and debugging Spark jobs, showing stages, tasks, and execution plans.
Workspace
Databricks collaborative environment containing notebooks, experiments, and resources.
Project-Specific Terms
Champion Model
Current production model version with best performance, designated by champion alias.
Feature Analysis
Extraction of workplace-related features (autonomy, belonging, competence, etc.) from text.
Go Emotions
Set of 28 labels (27 emotions plus neutral) from Google's GoEmotions classification dataset, used in the sentiment model.
Sandbox Deployment
Personal development environment deployment using make deploy command.
Two-Stage Pipeline
Pattern for handling ai_query with dynamic output schemas. Stage 1 captures raw results, stage 2 casts to expected types.
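The two stages can be sketched in plain Python: stage 1 lands the model's raw string output untouched, and stage 2 parses it and casts fields to the expected types. In the real pipeline these would be two DLT tables; the record shape and field names here are hypothetical, for illustration only.

```python
import json

# Hypothetical sketch of the two-stage pattern (not the actual DLT code).

def stage1_capture(raw_response: str) -> dict:
    """Stage 1: land the raw model response without interpreting it."""
    return {"raw_result": raw_response}

def stage2_cast(row: dict) -> dict:
    """Stage 2: parse the raw payload and cast fields to expected types."""
    parsed = json.loads(row["raw_result"])
    return {
        "sentiment": str(parsed["sentiment"]),
        "score": float(parsed["score"]),  # may arrive as a string upstream
    }

raw = '{"sentiment": "positive", "score": "0.93"}'
typed = stage2_cast(stage1_capture(raw))
print(typed)  # {'sentiment': 'positive', 'score': 0.93}
```

Separating capture from casting means a schema change in the model's output breaks only stage 2, while the raw records in stage 1 remain intact for reprocessing.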
Related Documentation
Naming Conventions - Naming standards for all assets
Configuration Reference - databricks.yml and settings
CLI Commands - Makefile and Databricks CLI
Architecture Overview - System architecture documentation