DataArray Schemas
DataArraySchema validates a single
DataArray. It is the xarray counterpart of
SeriesSchema, but for arbitrary-rank
labelled arrays rather than 1-D series.
You can also express the same constraints with the declarative DataArrayModel.
Basic usage
import numpy as np
import xarray as xr
import pandera.xarray as pa
schema = pa.DataArraySchema(
dtype=np.float64,
dims=("x", "y"),
name="temperature",
)
da = xr.DataArray(
np.random.rand(3, 4),
dims=("x", "y"),
name="temperature",
)
schema.validate(da)
<xarray.DataArray 'temperature' (x: 3, y: 4)> Size: 96B
array([[0.91547261, 0.72095847, 0.18207123, 0.43649667],
[0.8363929 , 0.80296296, 0.18692264, 0.51210528],
[0.18127073, 0.72020102, 0.42334195, 0.6468169 ]])
Dimensions without coordinates: x, y
Dtype validation
The dtype is resolved through NumPy's type hierarchy. You can pass a
Python type, a NumPy dtype, or a string alias:
da_float32 = xr.DataArray(np.zeros(3, dtype=np.float32), dims="x")
pa.DataArraySchema(dtype=float).validate(da)
pa.DataArraySchema(dtype=np.float32).validate(da_float32)
pa.DataArraySchema(dtype="float32").validate(da_float32)
<xarray.DataArray (x: 3)> Size: 12B array([0., 0., 0.], dtype=float32) Dimensions without coordinates: x
If dtype is None, any dtype is accepted.
Dimension validation
dims enforces dimension names in order. None entries act as
wildcards that match any name:
pa.DataArraySchema(dims=("x", "y")).validate(da)
pa.DataArraySchema(dims=("x", None)).validate(da)
<xarray.DataArray 'temperature' (x: 3, y: 4)> Size: 96B
array([[0.91547261, 0.72095847, 0.18207123, 0.43649667],
[0.8363929 , 0.80296296, 0.18692264, 0.51210528],
[0.18127073, 0.72020102, 0.42334195, 0.6468169 ]])
Dimensions without coordinates: x, y
The tuple length also constrains the rank (ndim):
try:
    pa.DataArraySchema(dims=("x", "y", "z")).validate(da)
except pa.errors.SchemaError as exc:
    print(exc)
expected ndim/dims length 3 ('x', 'y', 'z'), got 2 ('x', 'y')
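The matching rule described in this section (names compared in order, None as a wildcard, tuple length fixing the rank) can be sketched in plain Python. This is only an illustration of the stated behaviour; dims_match is a hypothetical helper, not a pandera function, and the library's actual implementation may differ.

```python
from typing import Optional, Sequence


def dims_match(expected: Sequence[Optional[str]], actual: Sequence[str]) -> bool:
    """Return True when `actual` satisfies the `expected` dims pattern.

    Hypothetical helper for illustration only; not part of pandera.
    """
    # The pattern length fixes the rank, so the lengths must agree first.
    if len(expected) != len(actual):
        return False
    # None is a wildcard; any other entry must equal the name at that position.
    return all(e is None or e == a for e, a in zip(expected, actual))


print(dims_match(("x", "y"), ("x", "y")))       # True
print(dims_match(("x", None), ("x", "y")))      # True
print(dims_match(("y", "x"), ("x", "y")))       # False: order matters
print(dims_match(("x", "y", "z"), ("x", "y")))  # False: rank differs
```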
Sizes and shape
sizes is the idiomatic xarray way to constrain dimension lengths.
shape does the same thing positionally. They are mutually exclusive.
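To make the relationship between the two forms concrete, the sketch below converts a positional shape into the equivalent name-keyed sizes mapping. It is an illustration only: shape_to_sizes is a hypothetical helper, not part of pandera, and it assumes every dimension name is given explicitly.

```python
from typing import Dict, Optional, Sequence


def shape_to_sizes(
    dims: Sequence[str], shape: Sequence[Optional[int]]
) -> Dict[str, int]:
    """Translate a positional shape into a {dim_name: length} mapping.

    Hypothetical helper for illustration only; not part of pandera.
    """
    if len(dims) != len(shape):
        raise ValueError("dims and shape must have the same length")
    # None means "any length", so that dimension is left unconstrained.
    return {name: n for name, n in zip(dims, shape) if n is not None}


print(shape_to_sizes(("time", "lat", "lon"), (None, 180, 360)))
# {'lat': 180, 'lon': 360}
```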
da_sized = xr.DataArray(
np.zeros((12, 180, 360)),
dims=("time", "lat", "lon"),
)
pa.DataArraySchema(
dims=("time", "lat", "lon"),
sizes={"lat": 180, "lon": 360},
).validate(da_sized)
pa.DataArraySchema(
dims=("time", "lat", "lon"),
shape=(None, 180, 360),
).validate(da_sized)
<xarray.DataArray (time: 12, lat: 180, lon: 360)> Size: 6MB
array([[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
...
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]], shape=(12, 180, 360))
Dimensions without coordinates: time, lat, lon
Coordinate validation
Pass a dict[str, Coordinate] to validate coordinate arrays, or a
list[str] as shorthand for "these coordinates must exist":
da_with_coords = xr.DataArray(
np.random.rand(3, 4),
dims=("x", "y"),
coords={
"x": np.arange(3, dtype=np.float64),
"y": np.arange(4, dtype=np.float64),
"label": ("x", ["a", "b", "c"]),
},
)
schema = pa.DataArraySchema(
dims=("x", "y"),
coords={
"x": pa.Coordinate(dtype=np.float64, dimension=True),
"y": pa.Coordinate(dtype=np.float64, dimension=True),
"label": pa.Coordinate(dimension=False),
},
)
schema.validate(da_with_coords)
<xarray.DataArray (x: 3, y: 4)> Size: 96B
array([[0.99219903, 0.00650998, 0.46675993, 0.40539898],
[0.7469778 , 0.9323137 , 0.31524351, 0.42855526],
[0.88666229, 0.98565366, 0.65992507, 0.66354354]])
Coordinates:
* x (x) float64 24B 0.0 1.0 2.0
label (x) <U1 12B 'a' 'b' 'c'
* y (y) float64 32B 0.0 1.0 2.0 3.0
Coordinate is documented in detail under Dataset Schemas.
Strict coordinates
With strict_coords=True, the schema fails if the DataArray has
coordinates not listed in coords:
strict_schema = pa.DataArraySchema(
coords={"x": pa.Coordinate()},
strict_coords=True,
)
da_x_only = xr.DataArray(
np.ones(3),
dims="x",
coords={"x": np.arange(3, dtype=np.float64)},
)
strict_schema.validate(da_x_only)
<xarray.DataArray (x: 3)> Size: 24B array([1., 1., 1.]) Coordinates: * x (x) float64 24B 0.0 1.0 2.0
try:
    strict_schema.validate(da_with_coords)
except pa.errors.SchemaError as exc:
    print(exc)
unexpected coordinate 'y'
Attribute validation
attrs validates the DataArray's .attrs dict. Each value in the schema's
attrs dict determines how the corresponding attribute is checked:
Literal values: matched by equality (==).
Regex patterns: strings that start with ^ are treated as regular expressions and matched against str(actual_value) via re.fullmatch.
Callable predicates: any callable (value) -> bool is invoked with the actual attribute value; validation passes when the function returns True.
Pydantic model: pass a pydantic.BaseModel class to validate the full attrs dict using pydantic's type system.
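The three per-value modes listed above can be sketched as a single dispatch function. This is a minimal illustration of the described rules, not pandera's code: attr_matches is a hypothetical name, the order in which the modes are tried is an assumption, and the pydantic mode is left out because it applies to the whole attrs dict rather than to one value.

```python
import re
from typing import Any


def attr_matches(expected: Any, actual: Any) -> bool:
    """Check one attribute value against one schema entry.

    Hypothetical helper for illustration only; not part of pandera.
    """
    # Callable predicate: passes when the function returns a truthy value.
    if callable(expected):
        return bool(expected(actual))
    # Regex pattern: a string starting with "^" is matched in full
    # against the string form of the actual value.
    if isinstance(expected, str) and expected.startswith("^"):
        return re.fullmatch(expected, str(actual)) is not None
    # Literal value: plain equality.
    return expected == actual


print(attr_matches("K", "K"))                            # True
print(attr_matches("^(K|degC|degF)$", "degC"))           # True
print(attr_matches("^(K|degC|degF)$", "meters"))         # False
print(attr_matches(lambda v: isinstance(v, int) and v >= 2, 1))  # False
```

Because the regex mode is triggered by a leading ^, a literal string that happens to start with ^ cannot be matched by equality under this rule.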
Equality matching
da_attrs = xr.DataArray(
np.ones(3), dims="x",
attrs={"units": "K", "standard_name": "air_temperature"},
)
pa.DataArraySchema(
attrs={"units": "K", "standard_name": "air_temperature"},
).validate(da_attrs)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 1., 1.])
Dimensions without coordinates: x
Attributes:
units: K
standard_name: air_temperature
Regex matching
Use a regex pattern (starting with ^) to validate an attribute against a
set of acceptable values:
schema = pa.DataArraySchema(
attrs={"units": "^(K|degC|degF)$"},
)
da_units = xr.DataArray(
np.ones(3), dims="x",
attrs={"units": "K"},
)
schema.validate(da_units)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 1., 1.])
Dimensions without coordinates: x
Attributes:
units: K
da_bad_units = xr.DataArray(
    np.ones(3), dims="x",
    attrs={"units": "meters"},
)
try:
    schema.validate(da_bad_units)
except pa.errors.SchemaError as exc:
    print(exc)
attribute mismatch 'units': expected '^(K|degC|degF)$', got 'meters'
Callable predicates
Pass a function that receives the attribute value and returns a boolean:
schema = pa.DataArraySchema(
attrs={
"version": lambda v: isinstance(v, int) and v >= 2,
},
)
da_v3 = xr.DataArray(
np.ones(3), dims="x",
attrs={"version": 3},
)
schema.validate(da_v3)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 1., 1.])
Dimensions without coordinates: x
Attributes:
version: 3
da_v1 = xr.DataArray(
    np.ones(3), dims="x",
    attrs={"version": 1},
)
try:
    schema.validate(da_v1)
except pa.errors.SchemaError as exc:
    print(exc)
attribute mismatch 'version': expected <function <lambda> at 0x7e717f651940>, got 1
Pydantic model
For complex attribute schemas you can pass a pydantic.BaseModel
class instead of a dict. Pandera delegates validation to pydantic and
converts every pydantic error into a pandera SchemaError, so error
collection during lazy validation works seamlessly:
from pydantic import BaseModel, Field as PydanticField

class ArrayAttrs(BaseModel):
    units: str
    standard_name: str
    version: int = PydanticField(ge=2)

schema = pa.DataArraySchema(attrs=ArrayAttrs)
da_ok = xr.DataArray(
np.ones(3), dims="x",
attrs={"units": "K", "standard_name": "air_temperature", "version": 3},
)
schema.validate(da_ok)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 1., 1.])
Dimensions without coordinates: x
Attributes:
units: K
standard_name: air_temperature
version: 3
When validation fails, the error messages surface the pydantic error details:
da_bad = xr.DataArray(
    np.ones(3), dims="x",
    attrs={"units": "K", "version": 1},  # version < 2, standard_name missing
)
try:
    schema.validate(da_bad, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc)
{
"SCHEMA": {
"SCHEMA_COMPONENT_CHECK": [
{
"schema": "schema",
"column": null,
"check": "attrs",
"error": "standard_name: Field required [type=missing]"
},
{
"schema": "schema",
"column": null,
"check": "attrs",
"error": "version: Input should be greater than or equal to 2 [type=greater_than_equal]"
}
]
}
}
All four modes also work on
DatasetSchema; see
Dataset Schemas.
Strict attributes
With strict_attrs=True, extra attributes cause a validation error.
When attrs is a pydantic model class, the set of allowed keys is derived
from the model's fields.
da_extra = xr.DataArray(
np.ones(3), dims="x",
attrs={"units": "K", "extra_key": 42},
)
try:
    pa.DataArraySchema(
        attrs={"units": "K"}, strict_attrs=True
    ).validate(da_extra)
except pa.errors.SchemaError as exc:
    print(exc)
unexpected attribute 'extra_key'
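Conceptually, the strict check is a set difference between the keys present on the object and the keys the schema declares. The sketch below shows that idea for a dict-based attrs schema; unexpected_keys is a hypothetical helper, not a pandera function, and the sorted return order is a choice made here for readability.

```python
from typing import Any, List, Mapping


def unexpected_keys(allowed: Mapping[str, Any], actual: Mapping[str, Any]) -> List[str]:
    """List keys present in `actual` that the schema does not declare.

    Hypothetical helper for illustration only; not part of pandera.
    """
    # Anything in the object but not in the schema counts as "extra".
    return sorted(set(actual) - set(allowed))


print(unexpected_keys({"units": "K"}, {"units": "K", "extra_key": 42}))
# ['extra_key']
```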
Name validation
named_da = xr.DataArray(np.ones(3), dims="x", name="temperature")
pa.DataArraySchema(name="temperature").validate(named_da)
<xarray.DataArray 'temperature' (x: 3)> Size: 24B array([1., 1., 1.]) Dimensions without coordinates: x
The DataArray's .name must match exactly.
try:
    pa.DataArraySchema(name="pressure").validate(named_da)
except pa.errors.SchemaError as exc:
    print(exc)
expected name 'pressure', got 'temperature'
Null values
By default nullable=False, so any NaN or null value raises a
SchemaError. Set nullable=True to allow them:
da_with_nan = xr.DataArray([1.0, np.nan, 3.0], dims="x")
pa.DataArraySchema(dtype=float, nullable=True).validate(da_with_nan)
<xarray.DataArray (x: 3)> Size: 24B array([ 1., nan, 3.]) Dimensions without coordinates: x
try:
    pa.DataArraySchema(dtype=float, nullable=False).validate(da_with_nan)
except pa.errors.SchemaError as exc:
    print(exc)
non-nullable DataArray contains null values
Coercing dtypes
When coerce=True, the DataArray is cast to dtype before validation:
schema = pa.DataArraySchema(dtype=np.float32, coerce=True)
da_int = xr.DataArray(np.array([1, 2, 3]), dims="x")
validated = schema.validate(da_int)
print(f"original: {da_int.dtype} -> coerced: {validated.dtype}")
original: int64 -> coerced: float32
Encoding validation
The encoding parameter validates the DataArray's .encoding dict, which is
populated when reading from netCDF or Zarr:
da_encoded = xr.DataArray(np.ones(3), dims="x")
da_encoded.encoding = {"_FillValue": -999.0, "dtype": "float32"}
pa.DataArraySchema(
encoding={"_FillValue": -999.0, "dtype": "^float.*"},
).validate(da_encoded)
<xarray.DataArray (x: 3)> Size: 24B array([1., 1., 1.]) Dimensions without coordinates: x
Encoding supports the same matching modes as attrs (equality, regex,
callable) plus pydantic models. See Encoding Validation for full details.
Chunked / array type
Control whether the underlying storage is lazy (Dask) or eager (NumPy):
pa.DataArraySchema(chunked=True) # must be Dask-backed
pa.DataArraySchema(chunked=False) # must be eager
pa.DataArraySchema(array_type=np.ndarray) # must be a numpy array
See Dask and Duck Arrays for comprehensive Dask integration
documentation, and Configuration for how chunked interacts
with validation depth.
Data-level checks
Use Check for value-level assertions:
schema = pa.DataArraySchema(
dtype=np.float64,
checks=[
pa.Check(lambda da: float(da.min()) >= 0),
pa.Check(lambda da: float(da.max()) <= 100),
],
)
da_checked = xr.DataArray(np.linspace(0, 50, 10), dims="x")
schema.validate(da_checked)
<xarray.DataArray (x: 10)> Size: 80B
array([ 0. , 5.55555556, 11.11111111, 16.66666667, 22.22222222,
27.77777778, 33.33333333, 38.88888889, 44.44444444, 50. ])
Dimensions without coordinates: x
See Checks and Parsers for built-in check helpers and details on how checks interact with lazy / chunked data.
Parsers
Parser objects run before checks and can
transform the array:
schema = pa.DataArraySchema(
parsers=pa.Parser(lambda da: da.fillna(0)),
nullable=False,
)
da_nulls = xr.DataArray([1.0, np.nan, 3.0], dims="x")
validated = schema.validate(da_nulls)
validated
<xarray.DataArray (x: 3)> Size: 24B array([1., 0., 3.]) Dimensions without coordinates: x
Validation options
schema.validate(da) accepts several keyword arguments:
lazy: collect all failures into SchemaErrors instead of raising on the first one.
head / tail / sample: subsample along the first dimension before running heavy checks.
inplace: if True, coercion mutates the original object.
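The difference between eager and lazy validation can be pictured with a small stand-alone sketch. Everything here is hypothetical and independent of pandera: SingleFailure and CollectedFailures merely stand in for the roles that SchemaError and SchemaErrors play, and run_checks is an illustrative function, not the library's validation loop.

```python
from typing import Any, Callable, List, Tuple

NamedCheck = Tuple[str, Callable[[Any], bool]]


class SingleFailure(Exception):
    """Stand-in for an error raised on the first failing check."""


class CollectedFailures(Exception):
    """Stand-in for an error that carries every failing check."""

    def __init__(self, failures: List[str]):
        super().__init__("; ".join(failures))
        self.failures = failures


def run_checks(value: Any, checks: List[NamedCheck], lazy: bool = False) -> Any:
    failures: List[str] = []
    for name, predicate in checks:
        if predicate(value):
            continue
        if not lazy:
            # Eager mode: stop at the first problem.
            raise SingleFailure(name)
        # Lazy mode: remember the problem and keep going.
        failures.append(name)
    if failures:
        raise CollectedFailures(failures)
    return value


checks = [("is positive", lambda v: v > 0), ("is even", lambda v: v % 2 == 0)]
try:
    run_checks(-3, checks, lazy=True)
except CollectedFailures as exc:
    print(exc.failures)  # ['is positive', 'is even']
```

With lazy=False the same call would raise SingleFailure for "is positive" and never evaluate the second check.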
schema = pa.DataArraySchema(
dtype=np.float64,
dims=("x",),
name="values",
checks=pa.Check(lambda da: bool((da > 0).all())),
)
da_bad = xr.DataArray([-1, 2, 3], dims="x", name="wrong_name")
try:
    schema.validate(da_bad, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc)
{
"SCHEMA": {
"WRONG_FIELD_NAME": [
{
"schema": "values",
"column": "values",
"check": "name",
"error": "expected name 'values', got 'wrong_name'"
}
],
"WRONG_DATATYPE": [
{
"schema": "values",
"column": "values",
"check": "dtype(<class 'numpy.float64'>)",
"error": "expected dtype <class 'numpy.float64'>, got int64"
}
]
},
"DATA": {
"DATAFRAME_CHECK": [
{
"schema": "values",
"column": "values",
"check": "<lambda>",
"error": "DataArraySchema 'values' failed series or dataframe validator 0: <Check <lambda>>"
}
]
}
}
See also
Dataset Schemas: DatasetSchema for multi-variable data
Data Models: class-based DataArrayModel
Checks and Parsers: checks, parsers, lazy validation
Encoding Validation: encoding validation (netCDF/Zarr metadata)
Dask and Duck Arrays: Dask integration, chunked, and array_type
CF Convention Checks: CF convention checks
Decorators: check_input, check_output, check_io, and check_types
Configuration: ValidationDepth, ValidationScope, Dask, environment variables
Xarray: full API reference for all xarray classes