Learnings from Deploying a Custom Python Model for ai_query

This document captures the key findings and best practices discovered while debugging and deploying the emoji_analysis model for use with the Databricks ai_query function. The process revealed several critical subtleties in how ai_query interacts with MLflow model serving endpoints.

Key Takeaway

The ai_query function's automatic schema inference is extremely strict. To ensure compatibility, the MLflow model must return a JSON response that is not only well-formed but also has a 100% stable and predictable schema. Any ambiguity in the output structure or data types will cause the AI_FUNCTION_MODEL_SCHEMA_PARSE_ERROR.


What Didn't Work and Why

  1. Returning a Dictionary from predict:

    • Attempt: The model's predict method returned a dictionary like {"predictions": [...]}.

    • Problem: This caused a "double-wrapping" issue. The MLflow serving layer adds its own {"predictions": [...]} wrapper, resulting in a final response like {"predictions": {"predictions": [...]}}, which ai_query cannot parse.

  2. Inconsistent Data Types (The Serialization Problem):

    • Attempt: The model returned native Python numeric types (float, int, Decimal) and, in some code paths, nested JSON objects serialized as strings.

    • Problem: ai_query's parser failed on the numeric types (REMOTE_FUNCTION_HTTP_RESULT_PARSE_ERROR) and was confused by fields that were sometimes strings and sometimes objects. The parser requires absolute consistency.

  3. Dynamic Schema Structures:

    • Attempt: Parts of the model's output had a dynamic structure. For example, a features dictionary where the keys were the names of detected features, or a sentiment dictionary with a key like "positive" or "negative".

    • Problem: MLflow's signature inference engine correctly identified these structures as having an unpredictable schema, marking them as type: object, format: any. The ai_query schema parser cannot handle the any type and fails immediately.

  4. Relying on "Lucky" Signature Inference:

    • Attempt: Using a simple, arbitrary input example to infer the model's output signature.

    • Problem: If the sample input didn't trigger all possible output structures (e.g., it produced an empty features list), the inferred signature would be incomplete. Specifically, the schema for an empty list is inferred as Any, which guarantees a downstream ai_query failure.

  5. Inconsistent Schema in Edge Cases:

    • Attempt: The model had different logic for handling messages with zero emojis versus those with one or more.

    • Problem: An early return statement in the zero-emoji case had a bug where it stringified a nested JSON field, while the main code path did not. This created an inconsistent schema depending on the input data, causing parsing failures.


What Works: Best Practices for ai_query Compatibility

  1. The predict Method MUST Return a List:

    • The Python function should return a list of result objects (e.g., [result_dict]). MLflow will correctly wrap this in the final {"predictions": [...]} JSON response.
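A minimal sketch of the return-shape rule, with the serving layer's single wrapping step simulated (field names are illustrative, not the actual model's):

```python
def predict_wrong(model_input):
    # Returning a dict causes double-wrapping once the server adds its own.
    return {"predictions": [{"emoji_count": "2"} for _ in model_input]}

def predict_right(model_input):
    # Return a LIST of per-row result objects instead.
    return [{"emoji_count": "2"} for _ in model_input]

def simulate_serving_layer(predict_fn, rows):
    # Conceptual model of what the MLflow serving layer does: it adds
    # exactly one {"predictions": [...]} wrapper around the return value.
    return {"predictions": predict_fn(rows)}

wrong = simulate_serving_layer(predict_wrong, ["hi 👋"])
right = simulate_serving_layer(predict_right, ["hi 👋"])
# wrong -> {"predictions": {"predictions": [...]}}  (ai_query cannot parse)
# right -> {"predictions": [{"emoji_count": "2"}]}
```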

  2. Ensure a 100% Stable Output Schema:

    • No Dynamic Keys: Convert any dictionaries with dynamic keys into arrays of objects with fixed keys.

      • Incorrect: {"positive": 0.9}

      • Correct: {"sentiment": "positive", "score": 0.9}

    • No Generic Objects: Ensure every object and nested object in the output has a consistent, predictable set of keys.
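A small helper sketch (function and key names hypothetical) of the dynamic-keys-to-fixed-keys conversion:

```python
def fixed_keys(features: dict) -> list:
    """Convert a dict with dynamic keys (the detected feature names) into
    an array of objects that all share a fixed, predictable set of keys."""
    return [
        {"feature_name": name, **attrs}
        for name, attrs in sorted(features.items())
    ]

dynamic = {"gratitude": {"score": "0.8"}, "urgency": {"score": "0.1"}}
stable = fixed_keys(dynamic)
# -> [{"feature_name": "gratitude", "score": "0.8"},
#     {"feature_name": "urgency", "score": "0.1"}]
```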

  3. Stringify All Primitive Data Types:

    • The most robust way to avoid data type parsing errors is to recursively traverse the entire output object and convert all primitive values (int, float, Decimal, bool, datetime) to their string representations.

    • This creates a completely unambiguous JSON payload that ai_query can reliably parse. The final casting to correct SQL types (CAST(... AS INT)) should be done in the DLT pipeline itself.
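One way to implement the recursive stringification (a sketch, not the model's actual code):

```python
from datetime import datetime
from decimal import Decimal

def stringify(value):
    """Recursively convert every primitive leaf to a string while leaving
    the dict/list structure intact, so the JSON payload is unambiguous."""
    if isinstance(value, dict):
        return {k: stringify(v) for k, v in value.items()}
    if isinstance(value, list):
        return [stringify(v) for v in value]
    if isinstance(value, bool):  # bool must be checked before int
        return str(value).lower()
    if isinstance(value, (int, float, Decimal)):
        return str(value)
    if isinstance(value, datetime):
        return value.isoformat()
    return value  # already a string (or None)

payload = stringify({"emoji_count": 2, "score": Decimal("0.9"), "flags": [True]})
# -> {"emoji_count": "2", "score": "0.9", "flags": ["true"]}
```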

  4. Guarantee a "Golden" Example for Signature Inference:

    • When logging the model, do not rely on a random sample input. Instead, programmatically construct a "golden" output example that is guaranteed to contain every possible field and nested structure, including non-empty lists for all array fields.

    • Use this golden example with mlflow.infer_signature to ensure the registered model has a complete and accurate schema.
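A sketch of such a golden example plus a completeness check (field names are illustrative, not the real 59-field schema); the validated example is what would be passed to mlflow.models.infer_signature at logging time:

```python
# Hypothetical golden output: every field present, every list non-empty,
# every primitive already stringified.
golden_output = [{
    "emoji_count": "2",
    "emojis": [{"emoji": "👍", "position": "0", "category": "gesture"}],
    "features": [{
        "feature_name": "gratitude",
        "positive_matches": [{"token": "thanks", "weight": "0.8"}],
        "negative_matches": [{"token": "late", "weight": "0.2"}],
    }],
    "sentiment_analysis": [{"sentiment": "positive", "score": "0.9"}],
}]

def assert_fully_populated(value, path="$"):
    """Fail fast if any list is empty or any leaf is not a string —
    either would make signature inference produce an Any-typed field."""
    if isinstance(value, dict):
        for k, v in value.items():
            assert_fully_populated(v, f"{path}.{k}")
    elif isinstance(value, list):
        assert value, f"empty list at {path} would be inferred as Any"
        for i, v in enumerate(value):
            assert_fully_populated(v, f"{path}[{i}]")
    else:
        assert isinstance(value, str), f"non-string leaf at {path}"

assert_fully_populated(golden_output)

# At logging time (requires mlflow; shown for context only):
# signature = mlflow.models.infer_signature(sample_input, golden_output)
```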

  5. Use a Two-Step DLT Pipeline:

    • Step 1 (Intermediate Table): Create a streaming live table that calls ai_query and lands the raw, string-heavy results. This isolates the model call.

    • Step 2 (Final Silver Table): Create the final live table by reading from the intermediate table. In this step, perform all CAST operations to convert the string fields to their proper SQL data types (DECIMAL, INT, TIMESTAMP, etc.) and use to_json() on any struct fields that need to be stored as strings.


The Successful Solution: Why It Worked

The Golden Output Construction Method

The breakthrough came from constructing a 100% complete, programmatically generated "golden" output example for MLflow signature inference. This approach eliminated all ambiguity in the schema.

What Made It Work:

  1. Complete Field Coverage: Every single field from the target silver schema (59 fields total) was explicitly included in the canonical output, including all optional fields.

  2. Populated Nested Structures: Instead of relying on empty lists or dynamic objects, every nested structure was populated with complete sample objects:

    • emojis array: Contains complete emoji match objects with all required fields

    • features array: Contains complete feature objects with populated positive_matches and negative_matches arrays

    • sentiment_analysis: Contains complete sentiment result objects

  3. Stable Schema Design: All dynamic dictionary structures were converted to arrays of objects with fixed keys:

    • features changed from {"gratitude": {...}} to [{"feature_name": "gratitude", ...}]

    • sentiment_result changed from {"positive": 0.9} to {"sentiment": "positive", "score": "0.9"}

  4. Stringified Primitives: All numeric types, booleans, and timestamps were converted to strings, ensuring JSON serialization compatibility.

The Code That Made It Work:
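A hedged reconstruction of the shape of that code — the analysis logic and field names below are stand-ins, not the actual emoji_analysis model:

```python
def analyze(message: str) -> dict:
    # Stand-in for the real emoji analysis; returns native Python types
    # and dynamic-keyed dicts, i.e. the "raw" pre-canonical form.
    return {
        "emoji_count": 2,
        "features": {"gratitude": {"score": 0.8}},
        "sentiment_result": {"positive": 0.9},
    }

def to_canonical(raw: dict) -> dict:
    # Apply both fixes: fixed-key arrays and stringified primitives.
    sentiment, score = next(iter(raw["sentiment_result"].items()))
    return {
        "emoji_count": str(raw["emoji_count"]),
        "features": [
            {"feature_name": name, "score": str(attrs["score"])}
            for name, attrs in sorted(raw["features"].items())
        ],
        "sentiment_result": {"sentiment": sentiment, "score": str(score)},
    }

def predict(model_input):
    # One result object per input row, returned as a LIST so the serving
    # layer's {"predictions": [...]} wrapper is applied exactly once.
    return [to_canonical(analyze(msg)) for msg in model_input]

predict(["thanks 🙏👍"])
```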

The Two-Step DLT Pipeline

The second critical component was the two-step DLT pipeline design:

Step 1: Intermediate Table (emoji_features_intermediate)
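A hedged sketch of what Step 1 might look like in DLT SQL — the endpoint name, source table, and column names are assumptions, not the actual pipeline code:

```sql
-- Land the raw, string-heavy ai_query result without any casting.
CREATE OR REFRESH STREAMING LIVE TABLE emoji_features_intermediate
AS SELECT
  message_id,
  message_text,
  ai_query(
    'emoji-analysis-endpoint',   -- serving endpoint name (assumed)
    message_text
  ) AS ai_result
FROM STREAM(LIVE.bronze_messages);
```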

Step 2: Final Silver Table (emoji_features)
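A corresponding sketch of Step 2 (column names and types are illustrative): all CAST operations happen here, not in the model:

```sql
CREATE OR REFRESH STREAMING LIVE TABLE emoji_features
AS SELECT
  message_id,
  CAST(ai_result.emoji_count AS INT)              AS emoji_count,
  CAST(ai_result.sentiment_score AS DECIMAL(5,4)) AS sentiment_score,
  CAST(ai_result.processed_at AS TIMESTAMP)       AS processed_at,
  to_json(ai_result.emojis)                       AS emojis_json
FROM STREAM(LIVE.emoji_features_intermediate);
```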

Why This Solution Succeeded

  1. Eliminated Schema Ambiguity: The golden output ensured every field had a concrete, unambiguous type in the MLflow signature.

  2. Robust Data Type Handling: Stringifying all primitives in the model output, then casting to proper types in the DLT pipeline, eliminated data type parsing errors.

  3. Clean Separation of Concerns: The intermediate table isolates the model call, while the final table handles all data type conversions and schema compliance.

  4. Future-Proof Design: Any changes to the model's output structure only require updating the golden example, not the entire pipeline.

This approach transformed a complex, error-prone Python UDF pipeline into a clean, maintainable SQL-based solution that leverages Databricks' native ai_query function for optimal performance and reliability.


Production DLT Streaming Pipeline Learnings

After successfully deploying the model for ai_query compatibility, we discovered additional critical insights when implementing the production DLT streaming pipeline.

Key Discovery: Schema Conflicts vs Performance Issues

Initial Misdiagnosis: When DLT streaming tables with ai_query showed zero rows scanned and written, we suspected a fundamental incompatibility between streaming and AI functions.

Actual Root Cause: The streaming pipeline was actually working (making AI calls, processing data) but failing on schema conflicts when trying to write results to existing tables.

Evidence from the Error Logs:

The failures surfaced as schema-mismatch errors at write time against the existing target tables, not as errors from the ai_query calls themselves. This revealed that:

  1. AI calls were succeeding - The pipeline was processing ai_query calls

  2. Data was being read - Bronze streaming was working

  3. Results were being generated - The AI model was returning data

  4. Schema conflicts prevented writes - Existing tables had incompatible schemas

Production Streaming Pipeline Best Practices

1. Use CREATE OR REPLACE for Schema Cleanup
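A hypothetical example of the cleanup step (catalog, schema, table, and column names assumed; this applies where the conflicting table is not owned by a running pipeline):

```sql
-- Replace the conflicting table so the next pipeline update can write
-- with the new schema instead of failing on a schema mismatch.
CREATE OR REPLACE TABLE main.default.emoji_features_raw (
  message_id STRING,
  ai_result  STRING
);
```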

2. Two-Stage Streaming Architecture

  • Stage 1: Fast AI processing → emoji_features_raw (immediate visibility)

  • Stage 2: Field extraction → emoji_features (final clean table)

This separation provides:

  • Immediate progress visibility - See AI results in Stage 1 table

  • Error isolation - Distinguish AI issues from field extraction issues

  • Performance optimization - AI calls separated from complex CAST operations

3. Optimized Pipeline Configuration
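A sketch of the kind of pipeline-settings JSON involved (pipeline name and values illustrative; pipelines.trigger.interval is a documented DLT configuration key, while batch size is bounded via the source's maxBytesPerTrigger read option):

```json
{
  "name": "emoji-features-pipeline",
  "continuous": true,
  "configuration": {
    "pipelines.trigger.interval": "30 seconds"
  }
}
```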

4. File-Based Dependency Management

DLT processes source files in alphabetical order, so numeric prefixes on file names (e.g., 01_bronze, 02_emoji_features_raw, 03_emoji_features) guarantee the intended execution order.

What Actually Works in Production

  1. ai_query + Streaming is Compatible - The combination works when schema conflicts are resolved

  2. Performance is Good - 30-50 AI requests/second with low CPU usage on model endpoint

  3. Schema Management is Critical - Use CREATE OR REPLACE to avoid conflicts

  4. Two-stage architecture provides visibility - Essential for debugging and monitoring

  5. Pipeline-level optimizations matter - Proper batching and concurrency settings are crucial

Common Troubleshooting

| Symptom | Likely Cause | Solution |
| --- | --- | --- |
| Zero rows scanned | Schema conflict | Use CREATE OR REPLACE |
| AI calls but no writes | Field extraction bottleneck | Use two-stage architecture |
| Pipeline hangs on "Running" | Streaming config too aggressive | Increase trigger intervals |
| Memory errors | Batch size too large | Reduce maxBytesPerTrigger |

This production deployment experience shows that ai_query streaming pipelines are viable and performant when properly configured, contrary to initial research suggesting incompatibility.
