Dask and Duck Arrays
xarray can wrap any array type that implements NumPy's array protocol, such as Dask arrays, sparse arrays, and CuPy arrays. Pandera's xarray backend validates these duck arrays with two complementary mechanisms:
Schema parameters: chunked and array_type on DataArraySchema and DataVar.
Validation depth: automatic or manual control over whether data-level checks trigger computation on lazy backends.
chunked: require or forbid lazy backing
When chunked=False, the data must not be chunked, so an eager NumPy-backed array passes:
import numpy as np
import xarray as xr
import pandera.xarray as pa
da_eager = xr.DataArray(np.ones(10), dims="x")
pa.DataArraySchema(chunked=False).validate(da_eager)
<xarray.DataArray (x: 10)> Size: 80B
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
Dimensions without coordinates: x
When chunked=True, the underlying data must be chunked (i.e. da.chunks is not None):
da_dask = da_eager.chunk({"x": 5})
pa.DataArraySchema(chunked=True).validate(da_dask)
<xarray.DataArray (x: 10)> Size: 80B
dask.array<xarray-<this-array>, shape=(10,), dtype=float64, chunksize=(5,), chunktype=numpy.ndarray>
Dimensions without coordinates: x
Passing an eager array to a chunked=True schema raises an error:
try:
    pa.DataArraySchema(chunked=True).validate(da_eager)
except pa.errors.SchemaError as exc:
    print(exc)
expected chunked (Dask) DataArray
Set chunked=None (the default) to accept either.
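The chunked parameter corresponds to a property that plain xarray already exposes. The following is a minimal sketch of that property, not Pandera's implementation: is_chunked is a hypothetical helper introduced only for illustration, and the example assumes dask is installed so that .chunk() works.

```python
import numpy as np
import xarray as xr


def is_chunked(da: xr.DataArray) -> bool:
    """Hypothetical helper: True when the DataArray wraps chunked data."""
    # xarray reports chunks as None for eager (in-memory) arrays and as a
    # tuple of per-dimension block sizes for chunked (e.g. Dask) arrays.
    return da.chunks is not None


da_eager = xr.DataArray(np.ones(10), dims="x")
da_dask = da_eager.chunk({"x": 5})  # requires dask

print(is_chunked(da_eager))  # False
print(is_chunked(da_dask))   # True
print(da_dask.chunks)        # ((5, 5),)
```

Reading da.chunks only inspects metadata, so this kind of test never computes the array.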
array_type: assert the concrete storage type
import dask.array
pa.DataArraySchema(array_type=dask.array.Array).validate(da_dask)
pa.DataArraySchema(array_type=np.ndarray).validate(da_eager)
<xarray.DataArray (x: 10)> Size: 80B
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
Dimensions without coordinates: x
When the actual type does not match:
try:
    pa.DataArraySchema(array_type=np.ndarray).validate(da_dask)
except pa.errors.SchemaError as exc:
    print(exc)
expected array type <class 'numpy.ndarray'>, got <class 'dask.array.core.Array'>
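The types in the error message match what xarray exposes through DataArray.data. Whether Pandera reads exactly that attribute is an assumption here, not something stated above. A sketch using only xarray, NumPy, and Dask, with a hypothetical helper name:

```python
import dask.array
import numpy as np
import xarray as xr


def has_storage_type(da: xr.DataArray, expected: type) -> bool:
    """Hypothetical helper: check the type of the wrapped array object."""
    # .data returns the wrapped duck array as-is, without converting it to
    # NumPy, so a Dask-backed array stays lazy during the check.
    return isinstance(da.data, expected)


da_eager = xr.DataArray(np.ones(10), dims="x")
da_dask = da_eager.chunk({"x": 5})  # requires dask

print(has_storage_type(da_eager, np.ndarray))        # True
print(has_storage_type(da_dask, dask.array.Array))   # True
print(has_storage_type(da_dask, np.ndarray))         # False
```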
Structural checks run without .compute()
Pandera classifies every validation rule into a scope:
Schema scope: dtype, dims, sizes, shape, coords, attrs, encoding, name, chunked, array_type. These inspect metadata only and never trigger computation.
Data scope: Check objects and nullable. These need actual values.
For chunked (Dask-backed) data, schema-scope checks always run. Data-scope
checks are governed by ValidationDepth.
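The scope split can be pictured as a filter applied before any rule runs. The sketch below is a toy model in pure Python, not Pandera's code: Scope, Depth, Rule, and run_rules are hypothetical names, and a plain dict stands in for an array.

```python
from enum import Enum
from typing import Callable, List, NamedTuple


class Scope(Enum):
    SCHEMA = "schema"  # metadata-only rules (dtype, dims, chunked, ...)
    DATA = "data"      # rules that need actual values (Check, nullable)


class Depth(Enum):
    SCHEMA_ONLY = "schema_only"
    SCHEMA_AND_DATA = "schema_and_data"


class Rule(NamedTuple):
    name: str
    scope: Scope
    passes: Callable[[dict], bool]


def run_rules(obj: dict, rules: List[Rule], depth: Depth) -> List[str]:
    """Return the names of the rules that failed at the given depth."""
    failed = []
    for rule in rules:
        # Data-scope rules are skipped unless the depth asks for them.
        if rule.scope is Scope.DATA and depth is not Depth.SCHEMA_AND_DATA:
            continue
        if not rule.passes(obj):
            failed.append(rule.name)
    return failed


# A plain dict stands in for an array: "dims" is metadata, "values" is data.
fake_array = {"dims": ("x",), "values": [1.0, -1.0]}
rules = [
    Rule("dims", Scope.SCHEMA, lambda a: a["dims"] == ("x",)),
    Rule("non_negative", Scope.DATA, lambda a: min(a["values"]) >= 0),
]

print(run_rules(fake_array, rules, Depth.SCHEMA_ONLY))      # []
print(run_rules(fake_array, rules, Depth.SCHEMA_AND_DATA))  # ['non_negative']
```

In this model the negative value goes unnoticed at SCHEMA_ONLY depth, which is the trade-off for never touching the values.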
schema = pa.DataArraySchema(
    dtype=np.float64,
    dims=("x",),
    sizes={"x": 10},
    name="values",
)
da_named = da_dask.rename("values")
schema.validate(da_named)
<xarray.DataArray 'values' (x: 10)> Size: 80B
dask.array<xarray-<this-array>, shape=(10,), dtype=float64, chunksize=(5,), chunktype=numpy.ndarray>
Dimensions without coordinates: x
No .compute() was called; only metadata was inspected.
Validation depth with Dask
By default, chunked arrays use SCHEMA_ONLY depth to avoid surprise
computation. You can override this:
from pandera.config import ValidationDepth, config_context
schema_with_checks = pa.DataArraySchema(
    dtype=np.float64,
    dims=("x",),
    checks=pa.Check(lambda da: float(da.min()) >= 0),
)
with config_context(validation_depth=ValidationDepth.SCHEMA_AND_DATA):
    schema_with_checks.validate(da_dask)
Note
Setting SCHEMA_AND_DATA on chunked arrays will call .compute() during
validation. Be mindful of memory and compute costs for large datasets.
Or set the environment variable:
export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA
See Configuration for the full resolution order.
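The compute cost of data-level checks comes from the check itself: a reduction on a chunked array stays lazy until something asks for a concrete value. A small sketch in plain xarray and Dask (no Pandera involved), mirroring the float(da.min()) expression used in the check above:

```python
import dask.array
import numpy as np
import xarray as xr

da_dask = xr.DataArray(np.ones(10), dims="x").chunk({"x": 5})  # requires dask

# The reduction only extends the task graph; nothing is computed yet.
lazy_min = da_dask.min()
print(isinstance(lazy_min.data, dask.array.Array))  # True

# Converting to a Python float needs a concrete value, so this is the
# point at which the chunks are actually computed.
min_value = float(lazy_min)
print(min_value)  # 1.0
```

For a ten-element array this is instant; for an array larger than memory, the same float() call is where the full computation would run.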
Datasets with Dask-backed variables
chunked and array_type work on DataVar inside a DatasetSchema:
ds = xr.Dataset({
    "temperature": (("x", "y"), np.random.rand(3, 4)),
    "pressure": (("x", "y"), np.random.rand(3, 4)),
}).chunk({"x": 2})
schema = pa.DatasetSchema(
    data_vars={
        "temperature": pa.DataVar(
            dtype=np.float64, dims=("x", "y"), chunked=True,
        ),
        "pressure": pa.DataVar(
            dtype=np.float64, dims=("x", "y"), chunked=True,
        ),
    },
)
schema.validate(ds)
<xarray.Dataset> Size: 192B
Dimensions: (x: 3, y: 4)
Dimensions without coordinates: x, y
Data variables:
    temperature  (x, y) float64 96B dask.array<chunksize=(2, 4), meta=np.ndarray>
    pressure     (x, y) float64 96B dask.array<chunksize=(2, 4), meta=np.ndarray>
Lazy error collection
Lazy validation (lazy=True) works with Dask arrays. Structural errors are
collected without triggering computation:
import numpy as np
import xarray as xr
import pandera.xarray as pa
from pandera.config import ValidationDepth, config_context
bad_ds = xr.Dataset({
    "temperature": (("z",), np.ones(3) * -1),
}).chunk({"z": 2})
schema = pa.DatasetSchema(
    data_vars={
        "temperature": pa.DataVar(
            dtype=np.float64, dims=("x", "y"), chunked=True,
            checks=pa.Check.ge(0),
        ),
        "pressure": pa.DataVar(
            dtype=np.float64, dims=("x", "y"), chunked=True,
        ),
    },
)
try:
    schema.validate(bad_ds, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc)
{
    "SCHEMA": {
        "COLUMN_NOT_IN_DATAFRAME": [
            {
                "schema": "schema",
                "column": "pressure",
                "check": "data_var_presence",
                "error": "missing required data_var 'pressure'"
            }
        ],
        "MISMATCH_INDEX": [
            {
                "schema": "schema",
                "column": "temperature",
                "check": "dims",
                "error": "expected ndim/dims length 2 ('x', 'y'), got 1 ('z',)"
            }
        ]
    }
}
Validating both schema- and data-level checks triggers computation on the Dask array.
with config_context(validation_depth=ValidationDepth.SCHEMA_AND_DATA):
    try:
        schema.validate(bad_ds, lazy=True)
    except pa.errors.SchemaErrors as exc:
        print(exc)
{
    "SCHEMA": {
        "COLUMN_NOT_IN_DATAFRAME": [
            {
                "schema": "schema",
                "column": "pressure",
                "check": "data_var_presence",
                "error": "missing required data_var 'pressure'"
            }
        ],
        "MISMATCH_INDEX": [
            {
                "schema": "schema",
                "column": "temperature",
                "check": "dims",
                "error": "expected ndim/dims length 2 ('x', 'y'), got 1 ('z',)"
            }
        ]
    },
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "schema",
                "column": "temperature",
                "check": "greater_than_or_equal_to(0)",
                "error": "DataArraySchema 'temperature' failed element-wise validator number 0: greater_than_or_equal_to(0) failure cases: -1.0"
            }
        ]
    }
}
See also
Configuration: validation depth resolution, disabling validation
DataArray Schemas: DataArraySchema parameters
Dataset Schemas: DatasetSchema and DataVar
Checks and Parsers: checks, parsers, lazy validation