Checks and Parsers

Checks

The same Check class used for pandas and polars works with xarray. The xarray backends dispatch on DataArray and Dataset.

import numpy as np
import xarray as xr
import pandera.xarray as pa

schema = pa.DataArraySchema(
    dtype=np.float64,
    checks=[
        pa.Check(lambda da: float(da.min()) >= 0),
        pa.Check(lambda da: float(da.max()) <= 100),
    ],
)

da = xr.DataArray(np.linspace(0, 50, 12), dims="x")
schema.validate(da)
<xarray.DataArray (x: 12)> Size: 96B
array([ 0.        ,  4.54545455,  9.09090909, 13.63636364, 18.18181818,
       22.72727273, 27.27272727, 31.81818182, 36.36363636, 40.90909091,
       45.45454545, 50.        ])
Dimensions without coordinates: x

Built-in checks

All standard built-in checks work on xarray objects. On a DataArray they operate on the array values; on a Dataset they operate across variables.

da_numeric = xr.DataArray(np.array([10, 20, 30]), dims="x")

pa.DataArraySchema(checks=pa.Check.greater_than(0)).validate(da_numeric)
pa.DataArraySchema(checks=pa.Check.less_than(100)).validate(da_numeric)
pa.DataArraySchema(checks=pa.Check.isin([10, 20, 30])).validate(da_numeric)
pa.DataArraySchema(checks=pa.Check.notin([0, -1])).validate(da_numeric)
pa.DataArraySchema(checks=pa.Check.in_range(0, 100)).validate(da_numeric)
<xarray.DataArray (x: 3)> Size: 24B
array([10, 20, 30])
Dimensions without coordinates: x

String checks also work on string-typed arrays:

da_str = xr.DataArray(np.array(["FOO_1", "BAR_2"]), dims="x")

pa.DataArraySchema(checks=pa.Check.str_matches(r"^[A-Z]+_\d+$")).validate(da_str)
pa.DataArraySchema(checks=pa.Check.str_contains("_")).validate(da_str)
pa.DataArraySchema(checks=pa.Check.str_length(min_value=3, max_value=10)).validate(da_str)
<xarray.DataArray (x: 2)> Size: 40B
array(['FOO_1', 'BAR_2'], dtype='<U5')
Dimensions without coordinates: x

Xarray-specific checks

These checks are specific to xarray's structural model:

da_3d = xr.DataArray(
    np.ones((12, 5, 10)),
    dims=("time", "lat", "lon"),
    coords={
        "time": np.arange(12, dtype=np.float64),
        "lat": np.linspace(-90, 90, 5),
        "lon": np.linspace(-180, 180, 10),
    },
    attrs={"units": "K"},
)

pa.DataArraySchema(checks=pa.Check.has_dims(("time", "lat", "lon"))).validate(da_3d)
pa.DataArraySchema(checks=pa.Check.has_coords(("time", "lat"))).validate(da_3d)
pa.DataArraySchema(checks=pa.Check.has_attrs({"units": "K"})).validate(da_3d)
pa.DataArraySchema(checks=pa.Check.ndim(3)).validate(da_3d)
pa.DataArraySchema(checks=pa.Check.dim_size("time", 12)).validate(da_3d)
pa.DataArraySchema(checks=pa.Check.is_monotonic("time")).validate(da_3d)
pa.DataArraySchema(checks=pa.Check.no_duplicates_in_coord("time")).validate(da_3d)
<xarray.DataArray (time: 12, lat: 5, lon: 10)> Size: 5kB
array([[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],

       [[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],

       [[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],

       [[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
...
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],

       [[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],

       [[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],

       [[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]])
Coordinates:
  * time     (time) float64 96B 0.0 1.0 2.0 3.0 4.0 ... 7.0 8.0 9.0 10.0 11.0
  * lat      (lat) float64 40B -90.0 -45.0 0.0 45.0 90.0
  * lon      (lon) float64 80B -180.0 -140.0 -100.0 -60.0 ... 100.0 140.0 180.0
Attributes:
    units:    K

Encoding check

da_enc = xr.DataArray(np.ones(3), dims="x")
da_enc.encoding = {"_FillValue": -999.0, "dtype": "float32"}

pa.DataArraySchema(
    checks=pa.Check.has_encoding({"_FillValue": -999.0}),
).validate(da_enc)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 1., 1.])
Dimensions without coordinates: x

See Encoding Validation for the richer schema-level encoding= parameter.

CF convention checks

da_cf = xr.DataArray(
    np.ones(3), dims="x",
    attrs={"standard_name": "air_temperature", "units": "K"},
)

pa.DataArraySchema(
    checks=[
        pa.Check.cf_standard_name("air_temperature"),
        pa.Check.cf_units("K"),
    ],
).validate(da_cf)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 1., 1.])
Dimensions without coordinates: x
Attributes:
    standard_name:  air_temperature
    units:          K

See CF Convention Checks for all available CF checks.

Note

Structural rules (dims, coords, sizes, attrs, …) are best expressed as schema keyword arguments: they are validated first and produce clearer error messages. The Check.has_* helpers are useful for:

  • Dataset-level checks=[...] where you need structural assertions across the whole container.

  • Ad hoc validation where you don't want a full schema.

  • Value-level structural checks like is_monotonic and no_duplicates_in_coord that have no schema-kwarg equivalent.
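
As a rough illustration of what the value-level checks in the last bullet assert, the same conditions can be written in plain xarray. This is only a sketch using xarray's pandas-backed indexes, not pandera's implementation, and it assumes that is_monotonic means increasing order:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.ones((4, 2)),
    dims=("time", "lat"),
    coords={"time": [0.0, 1.0, 2.0, 3.0], "lat": [-45.0, 45.0]},
)

# Each dimension coordinate is backed by a pandas Index.
time_index = da.indexes["time"]

# Roughly what is_monotonic("time") asserts (assumption: increasing order).
time_is_monotonic = bool(time_index.is_monotonic_increasing)

# Roughly what no_duplicates_in_coord("time") asserts.
time_is_unique = bool(time_index.is_unique)

# A coordinate with an out-of-order, repeated label fails both conditions.
bad = da.assign_coords(time=[0.0, 2.0, 1.0, 1.0])
bad_index = bad.indexes["time"]
bad_is_monotonic = bool(bad_index.is_monotonic_increasing)
bad_is_unique = bool(bad_index.is_unique)
```

Both conditions depend on the coordinate values rather than on the array's shape or metadata, which is why they have no schema-kwarg counterpart.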

Element-wise checks

element_wise=True is available but less common for N-D arrays. The check function receives individual scalar values:

schema = pa.DataArraySchema(
    checks=pa.Check(lambda x: x > 0, element_wise=True),
)

da_positive = xr.DataArray([1.0, 2.0, 3.0], dims="x")
schema.validate(da_positive)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 2., 3.])
Dimensions without coordinates: x
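
To make the difference concrete, the sketch below evaluates the same predicate once per scalar and once as a single vectorized comparison, using only NumPy and xarray. It illustrates the two evaluation styles and is not how pandera dispatches checks internally; is_positive is a hypothetical helper name:

```python
import numpy as np
import xarray as xr

da = xr.DataArray([[1.0, 2.0], [3.0, -4.0]], dims=("x", "y"))


def is_positive(value):
    # Element-wise predicate: receives one scalar at a time.
    return value > 0


# Element-wise evaluation: one Python call per element.
elementwise = np.array(
    [is_positive(v) for v in da.values.ravel()]
).reshape(da.shape)

# Vectorized evaluation: a single comparison over the whole array.
vectorized = (da > 0).values

same_result = bool(np.array_equal(elementwise, vectorized))
n_failures = int((~vectorized).sum())
```

Both styles flag the same elements; the vectorized form avoids a Python-level call per element, which matters for large N-D arrays.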

Custom checks

Write any callable that accepts a DataArray (or Dataset) and returns a boolean or a boolean DataArray:

def is_normalized(da):
    return float(da.min()) >= 0 and float(da.max()) <= 1

schema = pa.DataArraySchema(checks=pa.Check(is_normalized))

da_norm = xr.DataArray(np.linspace(0, 1, 10), dims="x")
schema.validate(da_norm)
<xarray.DataArray (x: 10)> Size: 80B
array([0.        , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
       0.55555556, 0.66666667, 0.77777778, 0.88888889, 1.        ])
Dimensions without coordinates: x

da_bad = xr.DataArray(np.linspace(-1, 2, 10), dims="x")

try:
    schema.validate(da_bad)
except pa.errors.SchemaError as exc:
    print(exc)
DataArraySchema 'None' failed series or dataframe validator 0: <Check is_normalized>
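
The example above returns a single boolean. A check function can instead return a boolean DataArray, as noted at the start of this section. The sketch below shows such a function in plain xarray and how its mask locates the offending values; normalized_mask is a hypothetical name, and how pandera reports the mask is not shown here:

```python
import xarray as xr


def normalized_mask(da):
    # Returns a boolean DataArray: True where the value lies in [0, 1].
    return (da >= 0) & (da <= 1)


da = xr.DataArray([-0.5, 0.0, 0.5, 1.0, 1.5], dims="x")
mask = normalized_mask(da)

# Select the values at positions where the mask is False.
failing_values = da.values[~mask.values].tolist()
all_pass = bool(mask.all())
```

Returning a mask keeps the per-element information that a single boolean discards.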

Parsers

Parser objects transform the data before checks run. This is useful for filling missing values, renaming, or other pre-processing:

schema = pa.DataArraySchema(
    parsers=[
        pa.Parser(lambda da: da.fillna(0)),
        pa.Parser(lambda da: da.rename("cleaned")),
    ],
    checks=pa.Check(lambda da: float(da.min()) >= 0),
)

da_messy = xr.DataArray([1.0, np.nan, 3.0], dims="x", name="raw")
validated = schema.validate(da_messy)
print(f"name: {validated.name}, values: {validated.values}")
name: cleaned, values: [1. 0. 3.]
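
The ordering matters because a check sees only the parsed data. The following sketch mimics that order with plain functions; parse_then_check is a hypothetical helper for illustration and does not reflect pandera's internals:

```python
import numpy as np
import xarray as xr


def parse_then_check(da, parsers, checks):
    # Hypothetical helper: apply each parser in order, then run every check.
    for parse in parsers:
        da = parse(da)
    return da, [bool(check(da)) for check in checks]


parsers = [
    lambda da: da.fillna(0),     # replace missing values
    lambda da: da.clip(min=0),   # clamp negative values to zero
]
checks = [lambda da: float(da.min()) >= 0]

raw = xr.DataArray([-1.0, np.nan, 3.0], dims="x")

# Running the check on the raw data fails because of the -1.0.
raw_passes = bool(checks[0](raw))

# After the parsers run, the same check passes.
parsed, results = parse_then_check(raw, parsers, checks)
parsed_values = parsed.values.tolist()
```

Here the same check fails on the raw array and passes on the parsed one, so a parser can change the outcome of validation.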

Lazy validation

By default, validate() raises SchemaError at the first failure. Pass lazy=True to collect all errors and raise a single SchemaErrors:

schema = pa.DataArraySchema(
    dtype=np.float64,
    dims=("x",),
    name="values",
    checks=pa.Check(lambda da: bool((da > 0).all())),
)

da_multi_err = xr.DataArray([-1, 2, 3], dims="x", name="wrong_name")

try:
    schema.validate(da_multi_err, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc)
{
    "SCHEMA": {
        "WRONG_FIELD_NAME": [
            {
                "schema": "values",
                "column": "values",
                "check": "name",
                "error": "expected name 'values', got 'wrong_name'"
            }
        ],
        "WRONG_DATATYPE": [
            {
                "schema": "values",
                "column": "values",
                "check": "dtype(<class 'numpy.float64'>)",
                "error": "expected dtype <class 'numpy.float64'>, got int64"
            }
        ]
    },
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "values",
                "column": "values",
                "check": "<lambda>",
                "error": "DataArraySchema 'values' failed series or dataframe validator 0: <Check <lambda>>"
            }
        ]
    }
}

Understanding the lazy validation report

SchemaErrors attaches a nested JSON summary in exc.message: top-level SCHEMA and DATA keys, then reason codes, then per-failure dicts with schema, column, check, and error. Use exc.schema_errors for the full list of SchemaError objects; element-wise check failures often expose a DataArray mask on each error's failure_cases.
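
One way to work with this nesting is to flatten it into rows. The sketch below uses a dict literal copied from the report printed above and assumes exc.message is, or can be parsed into, a dict of that shape; flatten_report is a hypothetical helper, not part of pandera:

```python
# Report copied from the printed output above.
report = {
    "SCHEMA": {
        "WRONG_FIELD_NAME": [
            {
                "schema": "values",
                "column": "values",
                "check": "name",
                "error": "expected name 'values', got 'wrong_name'",
            }
        ],
        "WRONG_DATATYPE": [
            {
                "schema": "values",
                "column": "values",
                "check": "dtype(<class 'numpy.float64'>)",
                "error": "expected dtype <class 'numpy.float64'>, got int64",
            }
        ],
    },
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "values",
                "column": "values",
                "check": "<lambda>",
                "error": "DataArraySchema 'values' failed series or "
                "dataframe validator 0: <Check <lambda>>",
            }
        ]
    },
}


def flatten_report(report):
    """Yield (category, reason_code, failure) triples from the nested report."""
    for category, reasons in report.items():
        for reason_code, failures in reasons.items():
            for failure in failures:
                yield category, reason_code, failure


rows = list(flatten_report(report))
for category, reason_code, failure in rows:
    print(f"{category}/{reason_code}: {failure['check']} -> {failure['error']}")
```

Flat rows of this kind are easier to log, count by reason code, or load into a table than the nested form.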

See Error reports and lazy validation for a full walkthrough, schema vs. data semantics, and how to interpret failure_cases alongside Lazy Validation and Error Reports.

See also