databricks-ai-functions

star 159

Use Databricks built-in AI Functions (ai_classify, ai_extract, ai_summarize, ai_mask, ai_translate, ai_fix_grammar, ai_gen, ai_analyze_sentiment, ai_similarity, ai_parse_document, ai_query, ai_forecast) to add AI capabilities directly to SQL and PySpark pipelines without managing model endpoints. Also covers document parsing and building custom RAG pipelines (parse → chunk → index → query).

databricks By databricks schedule Updated 6/15/2026

name: databricks-ai-functions description: "Use Databricks built-in AI Functions (ai_classify, ai_extract, ai_summarize, ai_mask, ai_translate, ai_fix_grammar, ai_gen, ai_analyze_sentiment, ai_similarity, ai_parse_document, ai_query, ai_forecast) to add AI capabilities directly to SQL and PySpark pipelines without managing model endpoints. Also covers document parsing and building custom RAG pipelines (parse → chunk → index → query)." compatibility: Requires databricks CLI (>= v1.0.0) metadata: version: "0.1.0" parent: databricks-core

Databricks AI Functions

Official Docs: https://docs.databricks.com/large-language-models/ai-functions Individual function reference: https://docs.databricks.com/sql/language-manual/functions/

Overview

Databricks AI Functions are built-in SQL and PySpark functions that call Foundation Model APIs directly from your data pipelines — no model endpoint setup, no API keys, no boilerplate. They operate on table columns as naturally as UPPER() or LENGTH(), and are optimized for batch inference at scale.

There are three categories:

Category Functions Use when
Task-specific ai_analyze_sentiment, ai_classify, ai_extract, ai_fix_grammar, ai_gen, ai_mask, ai_similarity, ai_summarize, ai_translate, ai_parse_document The task is well-defined — prefer these always
General-purpose ai_query Complex nested JSON, custom endpoints, multimodal — last resort only
Table-valued ai_forecast Time series forecasting

Function selection rule — always prefer a task-specific function over ai_query:

Task Use this Fall back to ai_query when...
Sentiment scoring ai_analyze_sentiment Never
Fixed-label routing ai_classify (2–500 labels; add descriptions for accuracy) Never
Entity / field extraction ai_extract Never
Summarization ai_summarize Never — use max_words=0 for uncapped
Grammar correction ai_fix_grammar Never
Translation ai_translate Target language not in the supported list
PII redaction ai_mask Never
Free-form generation ai_gen Need structured JSON output
Semantic similarity ai_similarity Never
PDF / document parsing ai_parse_document Need image-level reasoning
Complex JSON / reasoning This is the intended use case for ai_query

Prerequisites

  • Databricks SQL warehouse (not Classic) or cluster with DBR 15.1+
  • DBR 15.4 ML LTS recommended for batch workloads
  • DBR 17.1+ required for ai_parse_document
  • ai_forecast requires a Pro or Serverless SQL warehouse
  • Workspace in a supported AWS/Azure region for batch AI inference
  • Models run under Apache 2.0 or LLAMA 3.3 Community License — customers are responsible for compliance

Quick Start

Classify, extract, and score sentiment from a text column in a single query:

SELECT
    ticket_id,
    ticket_text,
    ai_classify(ticket_text, ARRAY('urgent', 'not urgent', 'spam')) AS priority,
    ai_extract(ticket_text, '["product", "error_code", "date"]')     AS entities,
    ai_analyze_sentiment(ticket_text)                                 AS sentiment
FROM support_tickets;
from pyspark.sql.functions import expr

df = spark.table("support_tickets")
df = (
    df.withColumn("priority",  expr("ai_classify(ticket_text, array('urgent', 'not urgent', 'spam'))"))
      .withColumn("entities",  expr("ai_extract(ticket_text, '[\"product\", \"error_code\", \"date\"]')"))
      .withColumn("sentiment", expr("ai_analyze_sentiment(ticket_text)"))
)
# ai_extract returns a VARIANT (fields under :response). Use colon (:) notation — dot returns NULL on a VARIANT.
df.selectExpr("ticket_id", "priority", "sentiment",
              "entities:response:product::string    AS product",
              "entities:response:error_code::string AS error_code",
              "entities:response:date::string       AS date").display()

Common Patterns

Pattern 1: Text Analysis Pipeline

Chain multiple task-specific functions to enrich a text column in one pass:

SELECT
    id,
    content,
    ai_analyze_sentiment(content)               AS sentiment,
    ai_summarize(content, 30)                   AS summary,
    ai_classify(content,
        ARRAY('technical', 'billing', 'other')) AS category,
    ai_fix_grammar(content)                     AS content_clean
FROM raw_feedback;

Pattern 2: PII Redaction Before Storage

from pyspark.sql.functions import expr

df_clean = (
    spark.table("raw_messages")
    .withColumn(
        "message_safe",
        expr("ai_mask(message, array('person', 'email', 'phone', 'address'))")
    )
)
df_clean.write.format("delta").mode("append").saveAsTable("catalog.schema.messages_safe")

Pattern 3: Document Ingestion from a Unity Catalog Volume

Parse PDFs/Office docs, then enrich with task-specific functions:

from pyspark.sql.functions import expr

df = (
    spark.read.format("binaryFile")
    .load("/Volumes/catalog/schema/landing/documents/")
    .withColumn("parsed", expr("ai_parse_document(content)"))
    # ai_parse_document returns a VARIANT — navigate with the colon (:) operator, never dot.
    # Shape: { "document": { "pages": [...], "elements": [...] }, "error_status": ..., "metadata": ... }
    .selectExpr("path",
                "concat_ws('\n', transform(parsed:document:elements, e -> e:content::STRING)) AS text_blocks",
                "parsed:error_status AS parse_error")
    .filter("parse_error IS NULL")
    .withColumn("summary",  expr("ai_summarize(text_blocks, 50)"))
    .withColumn("entities", expr("ai_extract(text_blocks, '[\"date\", \"amount\", \"vendor\"]')"))
)

Pattern 4: Semantic Matching / Deduplication

-- Find near-duplicate company names
SELECT a.id, b.id, ai_similarity(a.name, b.name) AS score
FROM companies a
JOIN companies b ON a.id < b.id
WHERE ai_similarity(a.name, b.name) > 0.85;

Pattern 5: Complex JSON Extraction with ai_query (last resort)

Use only when the output schema has nested arrays or requires multi-step reasoning that no task-specific function handles:

from pyspark.sql.functions import expr, from_json, col

df = (
    spark.table("parsed_documents")
    .withColumn("ai_response", expr("""
        ai_query(
            'databricks-claude-sonnet-4',
            concat('Extract invoice as JSON with nested itens array: ', text_blocks),
            responseFormat => '{"type":"json_object"}',
            failOnError     => false
        )
    """))
    .withColumn("invoice", from_json(
        col("ai_response.response"),
        "STRUCT<numero:STRING, total:DOUBLE, "
        "itens:ARRAY<STRUCT<codigo:STRING, descricao:STRING, qtde:DOUBLE, vlrUnit:DOUBLE>>>"
    ))
)

Pattern 6: Time Series Forecasting

SELECT *
FROM ai_forecast(
    observed  => TABLE(SELECT date, sales FROM daily_sales),
    horizon   => '2026-12-31',
    time_col  => 'date',
    value_col => 'sales'
);
-- Returns: date, sales_forecast, sales_upper, sales_lower

Reference Files

  • references/1-task-functions.md — Full syntax, parameters, SQL + PySpark examples for all 9 task-specific functions (ai_analyze_sentiment, ai_classify, ai_extract, ai_fix_grammar, ai_gen, ai_mask, ai_similarity, ai_summarize, ai_translate) and ai_parse_document
  • references/2-ai-query.mdai_query complete reference: all parameters, structured output with responseFormat, multimodal files =>, UDF patterns, and error handling
  • references/3-ai-forecast.mdai_forecast parameters, single-metric, multi-group, multi-metric, and confidence interval patterns
  • references/4-document-processing-pipeline.md — End-to-end batch document processing pipeline using AI Functions in a Lakeflow Declarative Pipeline; includes config.yml centralization, function selection logic, custom RAG pipeline (parse → chunk → Vector Search), and DSPy/LangChain guidance for near-real-time variants

Common Issues

Issue Solution
ai_parse_document not found Requires DBR 17.1+. Check cluster runtime.
ai_forecast fails Requires Pro or Serverless SQL warehouse — not available on Classic or Starter.
All functions return NULL Input column is NULL. Filter with WHERE col IS NOT NULL before calling.
ai_translate fails for a language Supported: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai. Use ai_query with a multilingual model for others.
ai_classify returns unexpected labels Use clear, mutually exclusive label names. Fewer labels (2–5) produces more reliable results.
ai_query raises on some rows in a batch job Add failOnError => false — returns a STRUCT with .response and .errorMessage instead of raising.
Batch job runs slowly Use DBR 15.4 ML LTS cluster (not serverless or interactive) for optimized batch inference throughput.
Want to swap models without editing pipeline code Store all model names and prompts in config.yml — see references/4-document-processing-pipeline.md for the pattern.
Install via CLI
npx skills add https://github.com/databricks/databricks-agent-skills --skill databricks-ai-functions
Repository Details
star Stars 159
call_split Forks 49
navigation Branch main
article Path SKILL.md
More from Creator