name: dqx-end-to-end
description: >
  Run DQX validation end-to-end (read an input table or path, apply checks, and write valid and quarantined rows to output locations) in a single call. Use when the user asks for "apply and save", "quality-check a table and split the output", "DQX on a whole table", "save valid and invalid rows", or wants to drop DQX into a Lakeflow / workflow that runs on a table or path. Covers apply_checks_and_save_in_table, the by_metadata variant, InputConfig / OutputConfig, and incremental streaming mode.
DQX — End-to-end apply + save
One method call: read, check, write valid rows, (optionally) write quarantined rows.
from databricks.labs.dqx import check_funcs
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import DQRowRule, DQDatasetRule
from databricks.labs.dqx.config import InputConfig, OutputConfig
from databricks.sdk import WorkspaceClient
# `spark` is available in Databricks notebooks / jobs. Locally, create it with
# `from pyspark.sql import SparkSession; spark = SparkSession.builder.getOrCreate()`.
dq = DQEngine(WorkspaceClient())
checks = [
    DQRowRule(criticality="warn", check_func=check_funcs.is_not_null, column="col3"),
    DQDatasetRule(criticality="error", check_func=check_funcs.is_unique, columns=["col1", "col2"]),
]

# Split into valid + quarantine tables
dq.apply_checks_and_save_in_table(
    checks=checks,
    input_config=InputConfig(location="catalog.schema.input"),
    output_config=OutputConfig(location="catalog.schema.valid"),
    quarantine_config=OutputConfig(location="catalog.schema.quarantine"),
)

# Or: keep everything in a single annotated output table
dq.apply_checks_and_save_in_table(
    checks=checks,
    input_config=InputConfig(location="catalog.schema.input"),
    output_config=OutputConfig(location="catalog.schema.annotated"),
)

# Or: quarantine only, i.e. write just the invalid rows and skip the valid output table
dq.apply_checks_and_save_in_table(
    checks=checks,
    input_config=InputConfig(location="catalog.schema.input"),
    quarantine_config=OutputConfig(location="catalog.schema.quarantine"),
)
output_config is optional, but at least one of output_config / quarantine_config must be provided — omitting both raises InvalidParameterError. Omit output_config (with quarantine_config set) to write only the invalid rows.
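The three call shapes above reduce to a small decision rule. The following is a minimal pure-Python sketch of that rule, not DQX's implementation: `plan_outputs` is a hypothetical helper, and `ValueError` stands in for DQX's `InvalidParameterError`.

```python
from typing import Optional


def plan_outputs(output_location: Optional[str], quarantine_location: Optional[str]) -> dict:
    """Describe which rows land where for a given pair of output settings.

    Hypothetical helper for illustration only; it mirrors the rule described
    above and does not call DQX.
    """
    if output_location is None and quarantine_location is None:
        # DQX raises InvalidParameterError here; ValueError is a stand-in.
        raise ValueError("provide output_config, quarantine_config, or both")

    plan = {}
    if output_location and quarantine_location:
        # Split mode: valid and invalid rows go to separate tables.
        plan[output_location] = "valid rows"
        plan[quarantine_location] = "invalid rows"
    elif output_location:
        # Single-table mode: nothing is filtered out, rows are only annotated.
        plan[output_location] = "all rows, annotated with check results"
    else:
        # Quarantine-only mode: the valid output is skipped entirely.
        plan[quarantine_location] = "invalid rows"
    return plan
```

Calling `plan_outputs(None, "catalog.schema.quarantine")` returns a plan with a single quarantine entry, which corresponds to the third call in the example above.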
Metadata form (checks loaded from storage)
dq.apply_checks_by_metadata_and_save_in_table(
    checks=checks_metadata,  # list[dict]; see dqx-define-checks
    input_config=InputConfig(location="catalog.schema.input"),
    output_config=OutputConfig(location="catalog.schema.valid"),
    quarantine_config=OutputConfig(location="catalog.schema.quarantine"),
)
If checks is omitted and checks_location is set on a RunConfig (workspace install flow), checks are loaded automatically from that storage — see dqx-storage.
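For orientation, here is a sketch of what a `checks_metadata` list might look like when it mirrors the two class-based rules from the first example. The key layout (`criticality`, `check`, `function`, `arguments`) is an assumption based on the usual DQX metadata shape; dqx-define-checks remains the authority on the exact schema. JSON is used only because it ships with Python.

```python
import json

# Same two rules as the class-based example, expressed as metadata.
# The key layout below is assumed; verify it against dqx-define-checks.
checks_json = """
[
  {"criticality": "warn",
   "check": {"function": "is_not_null", "arguments": {"column": "col3"}}},
  {"criticality": "error",
   "check": {"function": "is_unique", "arguments": {"columns": ["col1", "col2"]}}}
]
"""

checks_metadata = json.loads(checks_json)

# Light sanity check before handing the list to DQX.
for rule in checks_metadata:
    if rule["criticality"] not in {"warn", "error"}:
        raise ValueError(f"unexpected criticality: {rule['criticality']}")
    if "function" not in rule["check"]:
        raise ValueError("each check needs a function name")
```

The resulting `checks_metadata` is a plain `list[dict]`, which is the type the `checks` argument of the metadata form expects.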
InputConfig / OutputConfig — common options
- `location`: Unity Catalog table (`catalog.schema.table`) or Unity Catalog Volume path (`/Volumes/catalog/schema/volume/...`).
- `format`: default `delta`; set when writing to a volume as parquet/json/csv.
- `options`: dict passed through to reader/writer (e.g. `{"mergeSchema": "true"}`).
- `mode`: `"append"` (default for batch) or `"overwrite"`.
- Streaming: set `input_config.is_streaming=True` to opt into incremental reads. Provide `options={"checkpointLocation": "/Volumes/.../checkpoints/xxx"}` on the output config. DQX's batch-style incremental reader uses an `AvailableNow`-style trigger by default; set a different trigger via the output config if you need continuous streaming. Full details: Applying Quality Checks.
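One way to keep checkpoint locations distinct is to derive them from the output table name. The sketch below assembles the streaming settings described above as plain dictionaries so it runs without DQX installed. `checkpoint_for`, `streaming_settings`, and the `CHECKPOINT_ROOT` path are hypothetical names chosen for illustration.

```python
CHECKPOINT_ROOT = "/Volumes/main/dq/checkpoints"  # hypothetical volume path


def checkpoint_for(output_table: str, root: str = CHECKPOINT_ROOT) -> str:
    """Derive one checkpoint directory per output table (hypothetical helper).

    Each part of the three-level name becomes a directory, so two different
    tables never map to the same checkpoint path.
    """
    return f"{root}/{output_table.replace('.', '/')}"


def streaming_settings(input_table: str, output_table: str) -> dict:
    """Collect the keyword arguments described above as plain dicts."""
    return {
        "input": {"location": input_table, "is_streaming": True},
        "output": {
            "location": output_table,
            "options": {"checkpointLocation": checkpoint_for(output_table)},
        },
    }


settings = streaming_settings("catalog.schema.input", "catalog.schema.valid")
# In a Databricks job these would feed InputConfig(**settings["input"])
# and OutputConfig(**settings["output"]).
```

Deriving the path from the table name is a design choice rather than a DQX requirement; any scheme works as long as each output gets its own directory.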
Multi-table / pattern execution
- `apply_checks_and_save_in_tables(run_configs)`: accept a list of `RunConfig` and fan out over each `(input, output, quarantine)` triple.
- `apply_checks_and_save_in_tables_for_patterns(patterns, checks_location, run_config_template=...)`: expand wildcard patterns (Python `list[str]`, e.g. `["main.product001.*", "main.product002"]`) against Unity Catalog and reuse one `RunConfig` template for every matched table. Output / quarantine names are derived from the input name + `output_table_suffix` / `quarantine_table_suffix` (defaults `_dq_output` / `_dq_quarantine`). A template with only `quarantine_config` set (no `output_config`) writes the quarantine table per match and skips the valid output. The semicolon-delimited form is a convention of the `databricks labs dqx` CLI, not the Python API.
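To make the naming rule concrete, the sketch below expands patterns against an in-memory table list and derives the output and quarantine names with the default suffixes. It is an approximation under stated assumptions: the table names are made up, `expand` is a hypothetical helper, and Python's `fnmatch` stands in for DQX's own matching against Unity Catalog, which may treat a bare schema name such as `main.product002` differently.

```python
from fnmatch import fnmatchcase

# Stand-in for a Unity Catalog listing; these table names are made up.
catalog_tables = [
    "main.product001.orders",
    "main.product001.customers",
    "main.product002.orders",
    "main.product003.orders",
]


def expand(patterns, tables, output_suffix="_dq_output", quarantine_suffix="_dq_quarantine"):
    """Return (input, output, quarantine) name triples for every matching table."""
    triples = []
    for table in tables:
        if any(fnmatchcase(table, pattern) for pattern in patterns):
            triples.append((table, table + output_suffix, table + quarantine_suffix))
    return triples


# The CLI takes one semicolon-delimited string; the Python API takes a list,
# so a CLI-style value has to be split before it is passed along.
patterns = "main.product001.*;main.product002.orders".split(";")
triples = expand(patterns, catalog_tables)
```

Each triple corresponds to one `(input, output, quarantine)` fan-out target; `main.product003.orders` is left out because no pattern matches it.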
Do / Don't
- Do set a unique `checkpointLocation` per output when using streaming. The stream's progress is tracked there, and sharing one location across pipelines can cause silent data loss.
- Do point `quarantine_config` at a separate table. Without it, failed rows are flagged in the `_errors` column but remain in the single output table. To persist only the invalid rows, set `quarantine_config` and omit `output_config`.
- Do use the same method signature for batch and streaming; the only difference is the `is_streaming` flag on `InputConfig`.
- Don't manually pre-filter `_errors.isNull()` before writing. DQX already routes error-criticality rows to `quarantine_config` when you provide it.
- Don't mix `checks=` and `checks_location` in a `RunConfig`; pick one source of truth.
Canonical docs:
- End-to-end apply patterns: https://databrickslabs.github.io/dqx/docs/guide/quality_checks_apply#applying-checks-defined-with-dqx-classes
- Config dataclasses: https://databrickslabs.github.io/dqx/docs/reference/engine
- Demos: https://databrickslabs.github.io/dqx/docs/demos