Checks and Parsers
Checks
The same Check class used for pandas and polars
works with xarray. The xarray backends dispatch on
DataArray and Dataset.
import numpy as np
import xarray as xr
import pandera.xarray as pa
schema = pa.DataArraySchema(
dtype=np.float64,
checks=[
pa.Check(lambda da: float(da.min()) >= 0),
pa.Check(lambda da: float(da.max()) <= 100),
],
)
da = xr.DataArray(np.linspace(0, 50, 12), dims="x")
schema.validate(da)
<xarray.DataArray (x: 12)> Size: 96B
array([ 0. , 4.54545455, 9.09090909, 13.63636364, 18.18181818,
22.72727273, 27.27272727, 31.81818182, 36.36363636, 40.90909091,
45.45454545, 50. ])
Dimensions without coordinates: x
Built-in checks
All standard built-in checks work on xarray objects. On a DataArray they
operate on the array values; on a Dataset they operate across variables.
da_numeric = xr.DataArray(np.array([10, 20, 30]), dims="x")
pa.DataArraySchema(checks=pa.Check.greater_than(0)).validate(da_numeric)
pa.DataArraySchema(checks=pa.Check.less_than(100)).validate(da_numeric)
pa.DataArraySchema(checks=pa.Check.isin([10, 20, 30])).validate(da_numeric)
pa.DataArraySchema(checks=pa.Check.notin([0, -1])).validate(da_numeric)
pa.DataArraySchema(checks=pa.Check.in_range(0, 100)).validate(da_numeric)
<xarray.DataArray (x: 3)> Size: 24B
array([10, 20, 30])
Dimensions without coordinates: x
String checks also work on string-typed arrays:
da_str = xr.DataArray(np.array(["FOO_1", "BAR_2"]), dims="x")
pa.DataArraySchema(checks=pa.Check.str_matches(r"^[A-Z]+_\d+$")).validate(da_str)
pa.DataArraySchema(checks=pa.Check.str_contains("_")).validate(da_str)
pa.DataArraySchema(checks=pa.Check.str_length(min_value=3, max_value=10)).validate(da_str)
<xarray.DataArray (x: 2)> Size: 40B
array(['FOO_1', 'BAR_2'], dtype='<U5')
Dimensions without coordinates: x
Xarray-specific checks
These checks are specific to xarray's structural model:
da_3d = xr.DataArray(
np.ones((12, 5, 10)),
dims=("time", "lat", "lon"),
coords={
"time": np.arange(12, dtype=np.float64),
"lat": np.linspace(-90, 90, 5),
"lon": np.linspace(-180, 180, 10),
},
attrs={"units": "K"},
)
pa.DataArraySchema(checks=pa.Check.has_dims(("time", "lat", "lon"))).validate(da_3d)
pa.DataArraySchema(checks=pa.Check.has_coords(("time", "lat"))).validate(da_3d)
pa.DataArraySchema(checks=pa.Check.has_attrs({"units": "K"})).validate(da_3d)
pa.DataArraySchema(checks=pa.Check.ndim(3)).validate(da_3d)
pa.DataArraySchema(checks=pa.Check.dim_size("time", 12)).validate(da_3d)
pa.DataArraySchema(checks=pa.Check.is_monotonic("time")).validate(da_3d)
pa.DataArraySchema(checks=pa.Check.no_duplicates_in_coord("time")).validate(da_3d)
<xarray.DataArray (time: 12, lat: 5, lon: 10)> Size: 5kB
array([[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],
[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],
[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],
[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
...
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],
[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],
[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],
[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]])
Coordinates:
* time (time) float64 96B 0.0 1.0 2.0 3.0 4.0 ... 7.0 8.0 9.0 10.0 11.0
* lat (lat) float64 40B -90.0 -45.0 0.0 45.0 90.0
* lon (lon) float64 80B -180.0 -140.0 -100.0 -60.0 ... 100.0 140.0 180.0
Attributes:
units: K
Encoding check
da_enc = xr.DataArray(np.ones(3), dims="x")
da_enc.encoding = {"_FillValue": -999.0, "dtype": "float32"}
pa.DataArraySchema(
checks=pa.Check.has_encoding({"_FillValue": -999.0}),
).validate(da_enc)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 1., 1.])
Dimensions without coordinates: x
See Encoding Validation for the richer schema-level encoding= parameter.
CF convention checks
da_cf = xr.DataArray(
np.ones(3), dims="x",
attrs={"standard_name": "air_temperature", "units": "K"},
)
pa.DataArraySchema(
checks=[
pa.Check.cf_standard_name("air_temperature"),
pa.Check.cf_units("K"),
],
).validate(da_cf)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 1., 1.])
Dimensions without coordinates: x
Attributes:
standard_name: air_temperature
units: K
See CF Convention Checks for all available CF checks.
Note
Structural rules (dims, coords, sizes, attrs, ...) are best expressed
as schema keyword arguments: they are validated first and produce clearer
error messages. The Check.has_* helpers are useful for:
- Dataset-level checks=[...] where you need structural assertions across the whole container.
- Ad hoc validation where you don't want a full schema.
- Value-level structural checks like is_monotonic and no_duplicates_in_coord that have no schema-kwarg equivalent.
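For intuition about the value-level checks named in the note, the properties that is_monotonic and no_duplicates_in_coord are concerned with can be inspected directly on a coordinate's index in plain xarray. This is a minimal sketch, not pandera's implementation: the variable names are illustrative, and whether is_monotonic also accepts decreasing coordinates is not stated here, so the sketch only looks at the increasing case.

```python
import numpy as np
import xarray as xr

# A 1-D array with a strictly increasing, duplicate-free "time" coordinate.
da_ok = xr.DataArray(
    np.ones(4),
    dims="time",
    coords={"time": [0.0, 1.0, 2.0, 3.0]},
)

# The same data with an out-of-order, repeated coordinate value.
da_bad = da_ok.assign_coords(time=[0.0, 2.0, 2.0, 1.0])

# Dimension coordinates are backed by a pandas Index, which exposes
# monotonicity and uniqueness as properties.
ok_index = da_ok.indexes["time"]
bad_index = da_bad.indexes["time"]

ok_monotonic = bool(ok_index.is_monotonic_increasing)
ok_unique = bool(ok_index.is_unique)
bad_monotonic = bool(bad_index.is_monotonic_increasing)
bad_unique = bool(bad_index.is_unique)

print(ok_monotonic, ok_unique, bad_monotonic, bad_unique)
```

Because these properties depend on coordinate values rather than on names or shapes, they fit the Check mechanism rather than a schema keyword argument.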
Element-wise checks
element_wise=True is available but less common for N-D arrays. The
check function receives individual scalar values:
schema = pa.DataArraySchema(
checks=pa.Check(lambda x: x > 0, element_wise=True),
)
da_positive = xr.DataArray([1.0, 2.0, 3.0], dims="x")
schema.validate(da_positive)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 2., 3.])
Dimensions without coordinates: x
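To make the element-wise semantics concrete, the sketch below applies a scalar predicate to each value with NumPy and compares the result with the vectorized form. It only illustrates what "receives individual scalar values" means; it is an illustration under that assumption, not a description of how pandera dispatches element-wise checks internally. The name is_positive is hypothetical.

```python
import numpy as np

values = np.array([[1.0, 2.0], [3.0, -4.0]])

def is_positive(x):
    # Scalar predicate: called once per element, like an element-wise check.
    return x > 0

# Apply the predicate value by value, then restore the original shape.
elementwise_mask = np.array(
    [is_positive(v) for v in values.ravel()], dtype=bool
).reshape(values.shape)

# Vectorized equivalent: one comparison over the whole array.
vectorized_mask = values > 0

same_result = bool(np.array_equal(elementwise_mask, vectorized_mask))
all_pass = bool(elementwise_mask.all())
```

For large N-D arrays the vectorized form avoids a Python-level call per element, which is one reason whole-array checks are the more common choice.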
Custom checks
Write any callable that accepts a DataArray (or Dataset) and returns a
boolean or a boolean DataArray:
def is_normalized(da):
    return float(da.min()) >= 0 and float(da.max()) <= 1
schema = pa.DataArraySchema(checks=pa.Check(is_normalized))
da_norm = xr.DataArray(np.linspace(0, 1, 10), dims="x")
schema.validate(da_norm)
<xarray.DataArray (x: 10)> Size: 80B
array([0. , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
0.55555556, 0.66666667, 0.77777778, 0.88888889, 1. ])
Dimensions without coordinates: x
da_bad = xr.DataArray(np.linspace(-1, 2, 10), dims="x")
try:
    schema.validate(da_bad)
except pa.errors.SchemaError as exc:
    print(exc)
DataArraySchema 'None' failed series or dataframe validator 0: <Check is_normalized>
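The is_normalized example returns a single boolean. The other return form mentioned in this section, a boolean DataArray, might look like the following sketch. The function name in_unit_interval is hypothetical, and the snippet uses only xarray so that the mask can be inspected directly; such a function would be wrapped as pa.Check(in_unit_interval) in the same way as is_normalized.

```python
import numpy as np
import xarray as xr

def in_unit_interval(da):
    # Returns a boolean DataArray with the same shape as the input,
    # marking which values lie in [0, 1].
    return (da >= 0) & (da <= 1)

da_mixed = xr.DataArray(np.array([0.0, 0.5, 1.5]), dims="x")
mask = in_unit_interval(da_mixed)

# Reducing the mask gives the scalar form used by is_normalized.
all_ok = bool(mask.all())
n_failing = int((~mask).sum())
```

Returning the mask rather than a reduced boolean keeps the position of each failing value available to whatever consumes the result.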
Parsers
Parser objects transform the data before
checks run. This is useful for filling missing values, renaming, or other
pre-processing:
schema = pa.DataArraySchema(
parsers=[
pa.Parser(lambda da: da.fillna(0)),
pa.Parser(lambda da: da.rename("cleaned")),
],
checks=pa.Check(lambda da: float(da.min()) >= 0),
)
da_messy = xr.DataArray([1.0, np.nan, 3.0], dims="x", name="raw")
validated = schema.validate(da_messy)
print(f"name: {validated.name}, values: {validated.values}")
name: cleaned, values: [1. 0. 3.]
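The ordering described here (parsers first, checks afterwards) can be sketched without pandera. The helper run_parsers_then_checks below is hypothetical and only mirrors the described order; it is not pandera's implementation. It uses a missing-value check so that the effect of parsing before checking is visible.

```python
import numpy as np
import xarray as xr

def run_parsers_then_checks(da, parsers, checks):
    # Hypothetical helper: apply each parser in order, then evaluate the
    # checks against the transformed data.
    for parse in parsers:
        da = parse(da)
    results = [bool(check(da)) for check in checks]
    return da, results

parsers = [
    lambda da: da.fillna(0),
    lambda da: da.rename("cleaned"),
]

def no_missing(da):
    # Passes only when the array contains no NaN values.
    return bool(da.notnull().all())

da_messy = xr.DataArray([1.0, np.nan, 3.0], dims="x", name="raw")

# The raw input fails the check; the parsed output passes it.
unparsed_ok = no_missing(da_messy)
parsed, results = run_parsers_then_checks(da_messy, parsers, [no_missing])
```

Both fillna and rename return new objects, so da_messy itself is left unchanged by the parsers in this sketch.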
Lazy validation
By default, validate() raises SchemaError at the
first failure. Pass lazy=True to collect all errors and raise a single
SchemaErrors:
schema = pa.DataArraySchema(
dtype=np.float64,
dims=("x",),
name="values",
checks=pa.Check(lambda da: bool((da > 0).all())),
)
da_multi_err = xr.DataArray([-1, 2, 3], dims="x", name="wrong_name")
try:
    schema.validate(da_multi_err, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc)
{
"SCHEMA": {
"WRONG_FIELD_NAME": [
{
"schema": "values",
"column": "values",
"check": "name",
"error": "expected name 'values', got 'wrong_name'"
}
],
"WRONG_DATATYPE": [
{
"schema": "values",
"column": "values",
"check": "dtype(<class 'numpy.float64'>)",
"error": "expected dtype <class 'numpy.float64'>, got int64"
}
]
},
"DATA": {
"DATAFRAME_CHECK": [
{
"schema": "values",
"column": "values",
"check": "<lambda>",
"error": "DataArraySchema 'values' failed series or dataframe validator 0: <Check <lambda>>"
}
]
}
}
Understanding the lazy validation report
SchemaErrors attaches a nested JSON summary in
exc.message (top-level SCHEMA vs DATA, then reason codes, then
per-failure dicts with schema, column, check, and error). Use
exc.schema_errors for the full list of SchemaError
objects; element-wise check failures often expose a
DataArray mask on each error's failure_cases.
See Error reports and lazy validation for a full walkthrough, schema vs data
semantics, and how to interpret failure_cases, alongside the Lazy Validation
and Error Reports pages.
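As a small illustration of the nesting described in this section (category, then reason code, then per-failure dicts), the sketch below flattens a report into rows. It works on a trimmed copy of the report printed in the lazy validation example and assumes the report is available as a JSON string; flatten_report is a hypothetical helper, not part of pandera.

```python
import json

# A trimmed copy of the report printed in the lazy validation example.
report_json = """
{
  "SCHEMA": {
    "WRONG_FIELD_NAME": [
      {"schema": "values", "column": "values", "check": "name",
       "error": "expected name 'values', got 'wrong_name'"}
    ]
  },
  "DATA": {
    "DATAFRAME_CHECK": [
      {"schema": "values", "column": "values", "check": "<lambda>",
       "error": "DataArraySchema 'values' failed series or dataframe validator 0: <Check <lambda>>"}
    ]
  }
}
"""

def flatten_report(report):
    # Walk category -> reason code -> failure dicts and yield flat rows.
    for category, reasons in report.items():
        for reason_code, failures in reasons.items():
            for failure in failures:
                yield category, reason_code, failure["check"], failure["error"]

rows = list(flatten_report(json.loads(report_json)))
for category, reason_code, check, error in rows:
    print(f"{category}/{reason_code}: {check} -> {error}")
```

A flat list like this is convenient for logging or for grouping failures by reason code; for the underlying exception objects, exc.schema_errors remains the place to look.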
See also
- DataArray Schemas: DataArraySchema details
- Dataset Schemas: DatasetSchema details
- Data Models: class-based DataArrayModel / DatasetModel
- Encoding Validation: encoding validation (netCDF/Zarr metadata)
- Error reports and lazy validation: lazy validation reports and failure cases
- CF Convention Checks: CF convention checks
- Dask and Duck Arrays: Dask integration and validation depth
- Decorators: check_input, check_output, check_io, and check_types
- Configuration: ValidationDepth, ValidationScope, Dask, and environment variables
- Xarray: full API reference for all xarray classes
- Validating with Checks: general Check behaviour (pandas-oriented)
- Lazy Validation: detailed lazy validation docs