Dask and Duck Arrays

xarray can wrap any array type that implements NumPy’s array protocol, such as Dask arrays, sparse arrays, and CuPy arrays. Pandera’s xarray backend validates these duck arrays with two complementary mechanisms:

  1. Schema parameters: chunked and array_type on DataArraySchema and DataVar.

  2. Validation depth: automatic or manual control over whether data-level checks trigger computation on lazy backends.

chunked: require or forbid lazy backing

When chunked=False, the underlying data must be eager, that is, not chunked:

import numpy as np
import xarray as xr
import pandera.xarray as pa

da_eager = xr.DataArray(np.ones(10), dims="x")

pa.DataArraySchema(chunked=False).validate(da_eager)
<xarray.DataArray (x: 10)> Size: 80B
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
Dimensions without coordinates: x

When chunked=True, the underlying data must be chunked (i.e. da.chunks is not None):

da_dask = da_eager.chunk({"x": 5})

pa.DataArraySchema(chunked=True).validate(da_dask)
<xarray.DataArray (x: 10)> Size: 80B
dask.array<xarray-<this-array>, shape=(10,), dtype=float64, chunksize=(5,), chunktype=numpy.ndarray>
Dimensions without coordinates: x
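
The condition in parentheses above can be checked directly with xarray, independent of pandera. The following is a minimal sketch, assuming Dask is installed; the variable names are illustrative only.

```python
import numpy as np
import xarray as xr

# Eager, NumPy-backed array: xarray reports no chunking.
eager = xr.DataArray(np.ones(10), dims="x")
print(eager.chunks)  # None

# Dask-backed array: .chunks holds the block sizes along each dimension.
lazy = eager.chunk({"x": 5})
print(lazy.chunks)  # ((5, 5),)
```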

Passing an eager array to a chunked=True schema raises an error:

try:
    pa.DataArraySchema(chunked=True).validate(da_eager)
except pa.errors.SchemaError as exc:
    print(exc)
expected chunked (Dask) DataArray

Set chunked=None (the default) to accept either.

array_type: assert the concrete storage type

import dask.array

pa.DataArraySchema(array_type=dask.array.Array).validate(da_dask)
pa.DataArraySchema(array_type=np.ndarray).validate(da_eager)
<xarray.DataArray (x: 10)> Size: 80B
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
Dimensions without coordinates: x

When the actual type does not match:

try:
    pa.DataArraySchema(array_type=np.ndarray).validate(da_dask)
except pa.errors.SchemaError as exc:
    print(exc)
expected array type <class 'numpy.ndarray'>, got <class 'dask.array.core.Array'>
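
To see which storage type a DataArray currently wraps, you can inspect its .data attribute, which returns the wrapped array without computing it. This sketch uses only xarray and Dask. Whether pandera compares types with isinstance or by exact match is not stated here, so treat the isinstance calls as an assumption.

```python
import dask.array
import numpy as np
import xarray as xr

eager = xr.DataArray(np.ones(10), dims="x")
lazy = eager.chunk({"x": 5})

# .data exposes the wrapped duck array as-is; .values would convert to NumPy.
print(type(eager.data).__name__)  # ndarray
print(type(lazy.data).__name__)   # Array

# Assumption: an isinstance test is one plausible way to compare storage types.
is_dask = isinstance(lazy.data, dask.array.Array)
is_numpy = isinstance(lazy.data, np.ndarray)
```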

Structural checks run without .compute()

Pandera classifies every validation rule into a scope:

  • Schema scope: dtype, dims, sizes, shape, coords, attrs, encoding, name, chunked, array_type. These inspect metadata only and never trigger computation.

  • Data scope: Check objects and nullable. These need actual values.

For chunked (Dask-backed) data, schema-scope checks always run. Data-scope checks are governed by ValidationDepth.

schema = pa.DataArraySchema(
    dtype=np.float64,
    dims=("x",),
    sizes={"x": 10},
    name="values",
)

da_named = da_dask.rename("values")
schema.validate(da_named)
<xarray.DataArray 'values' (x: 10)> Size: 80B
dask.array<xarray-<this-array>, shape=(10,), dtype=float64, chunksize=(5,), chunktype=numpy.ndarray>
Dimensions without coordinates: x

No .compute() was called; only metadata was inspected.
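
The difference between metadata access and value access can be observed without pandera. The sketch below wraps a hypothetical loader function (load, with a call counter) in a Dask array, reads the same kind of metadata that schema-scope checks rely on, and only then reduces the data. It assumes Dask is installed and is not a description of pandera's internals.

```python
import dask
import dask.array
import numpy as np
import xarray as xr

calls = {"n": 0}

def load():
    # Hypothetical loader; counts how often the data is materialized.
    calls["n"] += 1
    return np.ones(10)

lazy_data = dask.array.from_delayed(
    dask.delayed(load)(), shape=(10,), dtype=np.float64
)
da_lazy = xr.DataArray(lazy_data, dims="x", name="values")

# Metadata lookups read declared properties, not the values themselves.
meta = (da_lazy.dtype, da_lazy.dims, dict(da_lazy.sizes), da_lazy.name)
calls_after_metadata = calls["n"]

# Reducing to a Python float forces the task graph to execute.
minimum = float(da_lazy.min())
calls_after_reduction = calls["n"]
```

After the metadata lookups the counter is still zero; it only increases once the reduction is converted to a float.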

Validation depth with Dask

By default, chunked arrays use SCHEMA_ONLY depth to avoid surprise computation. You can override this:

from pandera.config import ValidationDepth, config_context

schema_with_checks = pa.DataArraySchema(
    dtype=np.float64,
    dims=("x",),
    checks=pa.Check(lambda da: float(da.min()) >= 0),
)

with config_context(validation_depth=ValidationDepth.SCHEMA_AND_DATA):
    schema_with_checks.validate(da_dask)

Note

Setting SCHEMA_AND_DATA on chunked arrays will call .compute() during validation. Be mindful of memory and compute costs for large datasets.

Or set the environment variable:

export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA

See Configuration for the full resolution order.
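
The interaction between scope and depth described on this page can be summarized as a small decision function. This is an illustrative model only, not pandera's implementation: the Depth enum and the scopes_to_run function are hypothetical names, only the two depths discussed above are modeled, and the default for eager data is an assumption.

```python
from enum import Enum
from typing import Optional

class Depth(Enum):
    # Hypothetical stand-in for the two depths discussed above.
    SCHEMA_ONLY = "SCHEMA_ONLY"
    SCHEMA_AND_DATA = "SCHEMA_AND_DATA"

def scopes_to_run(is_chunked: bool, override: Optional[Depth] = None) -> set:
    """Return the check scopes that would run for one validation call."""
    if override is not None:
        # An explicit setting wins over the automatic default.
        depth = override
    elif is_chunked:
        # Chunked data defaults to metadata-only checks.
        depth = Depth.SCHEMA_ONLY
    else:
        # Assumption: eager data runs both scopes by default.
        depth = Depth.SCHEMA_AND_DATA
    scopes = {"schema"}
    if depth is Depth.SCHEMA_AND_DATA:
        scopes.add("data")
    return scopes

print(sorted(scopes_to_run(is_chunked=True)))                         # ['schema']
print(sorted(scopes_to_run(is_chunked=True, override=Depth.SCHEMA_AND_DATA)))  # ['data', 'schema']
```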

Datasets with Dask-backed variables

chunked and array_type work on DataVar inside a DatasetSchema:

ds = xr.Dataset({
    "temperature": (("x", "y"), np.random.rand(3, 4)),
    "pressure": (("x", "y"), np.random.rand(3, 4)),
}).chunk({"x": 2})

schema = pa.DatasetSchema(
    data_vars={
        "temperature": pa.DataVar(
            dtype=np.float64, dims=("x", "y"), chunked=True,
        ),
        "pressure": pa.DataVar(
            dtype=np.float64, dims=("x", "y"), chunked=True,
        ),
    },
)
schema.validate(ds)
<xarray.Dataset> Size: 192B
Dimensions:      (x: 3, y: 4)
Dimensions without coordinates: x, y
Data variables:
    temperature  (x, y) float64 96B dask.array<chunksize=(2, 4), meta=np.ndarray>
    pressure     (x, y) float64 96B dask.array<chunksize=(2, 4), meta=np.ndarray>
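
Each variable in a Dataset carries its own backing array, so a single Dataset can mix eager and Dask-backed variables, and it can be useful to check chunking per variable. A minimal sketch using only xarray and Dask, with illustrative names:

```python
import dask.array
import numpy as np
import xarray as xr

mixed = xr.Dataset({
    "temperature": (("x", "y"), dask.array.ones((3, 4), chunks=(2, 4))),
    "pressure": (("x", "y"), np.ones((3, 4))),
})

# Map each data variable to whether it is chunked.
is_chunked = {
    name: var.chunks is not None for name, var in mixed.data_vars.items()
}
print(is_chunked)  # {'temperature': True, 'pressure': False}
```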

Lazy error collection

Lazy validation (lazy=True) works with Dask arrays. Structural errors are collected without triggering computation:

import numpy as np
import xarray as xr
import pandera.xarray as pa
from pandera.config import ValidationDepth, config_context

bad_ds = xr.Dataset({
    "temperature": (("z",), np.ones(3) * -1),
}).chunk({"z": 2})

schema = pa.DatasetSchema(
    data_vars={
        "temperature": pa.DataVar(
            dtype=np.float64, dims=("x", "y"), chunked=True,
            checks=pa.Check.ge(0),
        ),
        "pressure": pa.DataVar(
            dtype=np.float64, dims=("x", "y"), chunked=True,
        ),
    },
)

try:
    schema.validate(bad_ds, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc)
{
    "SCHEMA": {
        "COLUMN_NOT_IN_DATAFRAME": [
            {
                "schema": "schema",
                "column": "pressure",
                "check": "data_var_presence",
                "error": "missing required data_var 'pressure'"
            }
        ],
        "MISMATCH_INDEX": [
            {
                "schema": "schema",
                "column": "temperature",
                "check": "dims",
                "error": "expected ndim/dims length 2 ('x', 'y'), got 1 ('z',)"
            }
        ]
    }
}

With SCHEMA_AND_DATA depth, data-level checks run as well, which triggers computation on the Dask array:

with config_context(validation_depth=ValidationDepth.SCHEMA_AND_DATA):
    try:
        schema.validate(bad_ds, lazy=True)
    except pa.errors.SchemaErrors as exc:
        print(exc)
{
    "SCHEMA": {
        "COLUMN_NOT_IN_DATAFRAME": [
            {
                "schema": "schema",
                "column": "pressure",
                "check": "data_var_presence",
                "error": "missing required data_var 'pressure'"
            }
        ],
        "MISMATCH_INDEX": [
            {
                "schema": "schema",
                "column": "temperature",
                "check": "dims",
                "error": "expected ndim/dims length 2 ('x', 'y'), got 1 ('z',)"
            }
        ]
    },
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "schema",
                "column": "temperature",
                "check": "greater_than_or_equal_to(0)",
                "error": "DataArraySchema 'temperature' failed element-wise validator number 0: greater_than_or_equal_to(0) failure cases: -1.0"
            }
        ]
    }
}

See also