name: dqx-define-checks description: > Create DQX quality rules (checks) for a PySpark DataFrame or Delta table. Use when the user asks to "add a DQX check", "define a data quality rule", "validate that column X is not null / unique / in a set", or wants checks expressed in YAML/JSON for storage. Covers DQRowRule, DQDatasetRule, DQForEachColRule, built-in check_funcs, filters, user_metadata, custom SQL/Python checks, and the declarative metadata form.
DQX — Define quality checks
DQX rules come in two interchangeable forms. Pick based on where the checks will live.
- DQX classes (
DQRowRule,DQDatasetRule,DQForEachColRule) — use when checks are authored in code next to the pipeline. Static typing + IDE autocomplete. - Dict / YAML / JSON metadata — use when checks are loaded from a file, workspace path, volume, or Delta table. Required for the
apply_checks_by_metadata*path.
Every check has a criticality of error (failing row quarantined) or warn (failing row passes but flagged). Default is error.
Minimal — class form
from databricks.labs.dqx import check_funcs
from databricks.labs.dqx.rule import DQRowRule, DQDatasetRule, DQForEachColRule
checks = [
# row-level: one column
DQRowRule(
name="col3_is_not_null",
criticality="warn",
check_func=check_funcs.is_not_null_and_not_empty,
column="col3",
),
# same check across many columns
*DQForEachColRule(
columns=["col1", "col2"],
criticality="error",
check_func=check_funcs.is_not_null,
).get_rules(),
# dataset-level: uniqueness across a composite key
DQDatasetRule(
criticality="error",
check_func=check_funcs.is_unique,
columns=["order_id", "line_item_id"],
),
]
Minimal — metadata form (YAML)
Load into Python via yaml.safe_load(...), then pass the resulting list[dict] to any apply_checks_by_metadata* call, or save through a storage config (see dqx-storage).
- name: col3_is_not_null
criticality: warn
check:
function: is_not_null_and_not_empty
arguments:
column: col3
- criticality: error
check:
function: is_not_null
for_each_column: [col1, col2]
- criticality: error
check:
function: is_unique
arguments:
columns: [order_id, line_item_id]
Common variants
- Filtered check — evaluate only when a predicate holds: add
filter="col1 < 3"(class) orfilter: "col1 < 3"(YAML). - Positional args —
check_func_args=[[1, 2]]; keyword args —check_func_kwargs={"allowed": [1, 2]}. - Struct / map / array element — use
F.try_element_at(...)or dotted path (col7.field1) as thecolumnvalue. - User metadata — annotate the rule with a
user_metadatadict (e.g.{"check_type": "completeness"}) that flows into the result struct. - Custom check — pass any function that returns a PySpark
Columnascheck_func. For inline SQL, use the fallback section below — only after confirming no built-in fits. - Aggregate dataset-level —
is_aggr_not_greater_than,is_aggr_not_less_than,is_aggr_equal,is_aggr_not_equal; supplyaggr_type(count,avg,stddev,percentile,count_distinct…), optionalgroup_by, andlimit. - Uniqueness dataset-level —
is_unique, withcolumns,nulls_distinct(bool), and optionalrow_filter. Not an aggregate check — noaggr_type.
Full reference: https://databrickslabs.github.io/dqx/docs/reference/quality_checks.
Fallback: custom SQL
Search check_funcs first — the built-ins cover null/empty, range, set membership, regex, referential, aggregate, uniqueness, schema, freshness, comparison, and outlier cases with typed error messages and tested edge handling. Drop down to SQL only when no built-in fits.
sql_expression— row-level SQL boolean expression. Use when one row's validity depends on its own columns.sql_query— dataset-level SQL query against{{ input_view }}. Use for cross-row aggregates, joins to reference DataFrames, or anything needingGROUP BY. Queries are validated byis_sql_query_safe()— read-onlySELECT, no DDL/DML.
# row-level: SQL expression evaluated per row
- name: amount_positive_or_refunded
criticality: error
check:
function: sql_expression
arguments:
expression: amount > 0 OR refunded = true
msg: amount must be positive unless refunded
# dataset-level: SQL query, joined back to rows via merge_columns
- name: order_total_matches_lines
criticality: error
check:
function: sql_query
arguments:
query: |
SELECT order_id,
SUM(line_amount) <> order_total AS condition
FROM {{ input_view }}
GROUP BY order_id, order_total
merge_columns: [order_id] # row-level: joins back per order_id
condition_column: condition # column in query output; true = fail
# omit merge_columns for dataset-level (one verdict applies to every row)
For the equivalent class form, use DQRowRule(check_func=check_funcs.sql_expression, check_func_kwargs={"expression": "..."}) or DQDatasetRule(check_func=check_funcs.sql_query, check_func_kwargs={...}).
Converting between forms
from databricks.labs.dqx.checks_serializer import serialize_checks, deserialize_checks
checks_metadata = serialize_checks(checks) # classes → list[dict]
checks_classes = deserialize_checks(checks_metadata) # list[dict] → classes
Validation before apply
Catch syntax errors without running the pipeline:
from databricks.labs.dqx.engine import DQEngine
status = DQEngine.validate_checks(checks) # raises / returns ValidationStatus
Do / Don't
- Do give each rule a stable, snake_case
name— it ends up in result columns and dashboards. - Do put shared rules in a YAML/JSON/Delta file and load them (see
dqx-storage) — classes are fine for a handful, metadata scales. - Don't use a regex check for null / empty / range / referential / aggregate cases — DQX has a built-in for each; see
check_funcs. - Don't reach for
sql_expression/sql_querywhen a built-in covers the case — they bypass typed error messages and security guards. Searchcheck_funcsfirst. - Don't put side effects in a custom
check_func— it must return aColumnexpression only.
Canonical docs: https://databrickslabs.github.io/dqx/docs/guide/quality_checks_definition.