DataArray Schemas¶
DataArraySchema validates a single
DataArray. It is the xarray counterpart of
SeriesSchema — but for arbitrary-rank
labelled arrays rather than 1-D series.
You can also express the same constraints with the declarative DataArrayModel.
Basic usage¶
import numpy as np
import xarray as xr
import pandera.xarray as pa
schema = pa.DataArraySchema(
    dtype=np.float64,
    dims=("x", "y"),
    name="temperature",
)
da = xr.DataArray(
    np.random.rand(3, 4),
    dims=("x", "y"),
    name="temperature",
)
schema.validate(da)
<xarray.DataArray 'temperature' (x: 3, y: 4)> Size: 96B
array([[0.45409096, 0.06849032, 0.74974471, 0.86469529],
[0.0184841 , 0.31766111, 0.41231095, 0.45194414],
[0.64956686, 0.28907032, 0.0755333 , 0.63358133]])
Dimensions without coordinates: x, y
Dtype validation¶
The dtype is resolved through NumPy’s type hierarchy. You can pass a
Python type, a NumPy dtype, or a string alias:
da_float32 = xr.DataArray(np.zeros(3, dtype=np.float32), dims="x")
pa.DataArraySchema(dtype=float).validate(da)
pa.DataArraySchema(dtype=np.float32).validate(da_float32)
pa.DataArraySchema(dtype="float32").validate(da_float32)
<xarray.DataArray (x: 3)> Size: 12B
array([0., 0., 0.], dtype=float32)
Dimensions without coordinates: x
If dtype is None, any dtype is accepted.
Dimension validation¶
dims enforces dimension names in order. None entries act as
wildcards that match any name:
pa.DataArraySchema(dims=("x", "y")).validate(da)
pa.DataArraySchema(dims=("x", None)).validate(da)
<xarray.DataArray 'temperature' (x: 3, y: 4)> Size: 96B
array([[0.45409096, 0.06849032, 0.74974471, 0.86469529],
[0.0184841 , 0.31766111, 0.41231095, 0.45194414],
[0.64956686, 0.28907032, 0.0755333 , 0.63358133]])
Dimensions without coordinates: x, y
The tuple length also constrains the rank (ndim).
try:
    pa.DataArraySchema(dims=("x", "y", "z")).validate(da)
except pa.errors.SchemaError as exc:
    print(exc)
expected ndim/dims length 3 ('x', 'y', 'z'), got 2 ('x', 'y')
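The matching rule described in this section can be sketched in plain Python. This is only an illustration of the documented behaviour, not pandera's implementation; `dims_match` is a hypothetical helper, and the wording of the name-mismatch message is an assumption.

```python
from typing import Optional, Sequence, Tuple


def dims_match(
    expected: Sequence[Optional[str]],
    actual: Sequence[str],
) -> Tuple[bool, str]:
    """Hypothetical helper mirroring the rule described above."""
    # The tuple length constrains the rank, so compare lengths first.
    if len(expected) != len(actual):
        return False, (
            f"expected ndim/dims length {len(expected)} {tuple(expected)}, "
            f"got {len(actual)} {tuple(actual)}"
        )
    # Compare names position by position; None matches any name.
    for position, (want, got) in enumerate(zip(expected, actual)):
        if want is not None and want != got:
            return False, f"dimension {position}: expected {want!r}, got {got!r}"
    return True, ""


print(dims_match(("x", None), ("x", "y")))      # wildcard in second position
print(dims_match(("y", "x"), ("x", "y")))       # same names, wrong order
print(dims_match(("x", "y", "z"), ("x", "y")))  # rank mismatch
```

Because names are compared by position, a schema with dims=("y", "x") would not be expected to accept an array whose dimensions are ("x", "y").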
Sizes and shape¶
sizes is the idiomatic xarray way to constrain dimension lengths.
shape does the same thing positionally. They are mutually exclusive.
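To make the relationship between the two parameters concrete, here is a small plain-Python sketch that converts a positional shape into the equivalent name-keyed mapping. `shape_to_sizes` is a hypothetical helper, not part of pandera, and it assumes that a None entry in shape leaves that dimension unconstrained.

```python
from typing import Dict, Optional, Sequence


def shape_to_sizes(
    dims: Sequence[str],
    shape: Sequence[Optional[int]],
) -> Dict[str, int]:
    """Hypothetical helper: turn a positional shape into a name-keyed mapping."""
    if len(dims) != len(shape):
        raise ValueError("dims and shape must have the same length")
    # Skip None entries, which are assumed to leave the dimension unconstrained.
    return {name: size for name, size in zip(dims, shape) if size is not None}


sizes = shape_to_sizes(("time", "lat", "lon"), (None, 180, 360))
print(sizes)  # {'lat': 180, 'lon': 360}
```

The name-keyed form does not depend on dimension order, which is one reason to prefer sizes when the order of dims may vary.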
da_sized = xr.DataArray(
    np.zeros((12, 180, 360)),
    dims=("time", "lat", "lon"),
)
pa.DataArraySchema(
    dims=("time", "lat", "lon"),
    sizes={"lat": 180, "lon": 360},
).validate(da_sized)
pa.DataArraySchema(
    dims=("time", "lat", "lon"),
    shape=(None, 180, 360),
).validate(da_sized)
<xarray.DataArray (time: 12, lat: 180, lon: 360)> Size: 6MB
array([[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
...
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]], shape=(12, 180, 360))
Dimensions without coordinates: time, lat, lon
Coordinate validation¶
Pass a dict[str, Coordinate] to validate coordinate arrays, or a
list[str] as shorthand for “these coordinates must exist”:
da_with_coords = xr.DataArray(
    np.random.rand(3, 4),
    dims=("x", "y"),
    coords={
        "x": np.arange(3, dtype=np.float64),
        "y": np.arange(4, dtype=np.float64),
        "label": ("x", ["a", "b", "c"]),
    },
)
schema = pa.DataArraySchema(
    dims=("x", "y"),
    coords={
        "x": pa.Coordinate(dtype=np.float64, dimension=True),
        "y": pa.Coordinate(dtype=np.float64, dimension=True),
        "label": pa.Coordinate(dimension=False),
    },
)
schema.validate(da_with_coords)
<xarray.DataArray (x: 3, y: 4)> Size: 96B
array([[0.65039948, 0.2457349 , 0.17806292, 0.43681034],
[0.35006807, 0.30821404, 0.65653312, 0.82977431],
[0.5189375 , 0.22705908, 0.88068866, 0.89023312]])
Coordinates:
* x (x) float64 24B 0.0 1.0 2.0
label (x) <U1 12B 'a' 'b' 'c'
* y (y) float64 32B 0.0 1.0 2.0 3.0
Coordinate is documented in detail under Dataset Schemas.
Strict coordinates¶
With strict_coords=True, the schema fails if the DataArray has
coordinates not listed in coords:
strict_schema = pa.DataArraySchema(
    coords={"x": pa.Coordinate()},
    strict_coords=True,
)
da_x_only = xr.DataArray(
    np.ones(3),
    dims="x",
    coords={"x": np.arange(3, dtype=np.float64)},
)
strict_schema.validate(da_x_only)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 1., 1.])
Coordinates:
* x (x) float64 24B 0.0 1.0 2.0
try:
    strict_schema.validate(da_with_coords)
except pa.errors.SchemaError as exc:
    print(exc)
unexpected coordinate 'y'
Attribute validation¶
attrs validates the DataArray’s .attrs dict. Each value in the schema’s
attrs dict determines how the corresponding attribute is checked:
Literal values: matched by equality (==).
Regex patterns: strings that start with ^ are treated as regular expressions and matched against str(actual_value) via re.fullmatch.
Callable predicates: any callable (value) -> bool is invoked with the actual attribute value; validation passes when the function returns True.
Pydantic model: pass a pydantic.BaseModel class to validate the full attrs dict using pydantic's type system.
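The three per-key modes can be sketched as a small dispatch function using only the standard library. This illustrates the rules listed above and is not pandera's code: `attr_matches` is a hypothetical name, and the order in which the modes are tested is an assumption.

```python
import re
from typing import Any


def attr_matches(expected: Any, actual: Any) -> bool:
    """Hypothetical helper illustrating the three per-key matching modes."""
    # Callable predicate: validation passes when it returns a truthy value.
    if callable(expected):
        return bool(expected(actual))
    # Strings starting with "^" are regexes, full-matched against str(actual).
    if isinstance(expected, str) and expected.startswith("^"):
        return re.fullmatch(expected, str(actual)) is not None
    # Anything else is a literal compared with ==.
    return expected == actual


print(attr_matches("K", "K"))                                     # literal
print(attr_matches("^(K|degC|degF)$", "degC"))                    # regex
print(attr_matches("^deg", "degC"))                               # fullmatch, not search
print(attr_matches(lambda v: isinstance(v, int) and v >= 2, 3))   # predicate
```

Note that re.fullmatch requires the pattern to cover the entire string, so "^deg" does not match "degC"; a pattern such as "^deg.*" would.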
Equality matching¶
da_attrs = xr.DataArray(
    np.ones(3), dims="x",
    attrs={"units": "K", "standard_name": "air_temperature"},
)
pa.DataArraySchema(
    attrs={"units": "K", "standard_name": "air_temperature"},
).validate(da_attrs)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 1., 1.])
Dimensions without coordinates: x
Attributes:
units: K
standard_name: air_temperature
Regex matching¶
Use a regex pattern (starting with ^) to validate an attribute against a
set of acceptable values:
schema = pa.DataArraySchema(
    attrs={"units": "^(K|degC|degF)$"},
)
da_units = xr.DataArray(
    np.ones(3), dims="x",
    attrs={"units": "K"},
)
schema.validate(da_units)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 1., 1.])
Dimensions without coordinates: x
Attributes:
units: K
da_bad_units = xr.DataArray(
    np.ones(3), dims="x",
    attrs={"units": "meters"},
)
try:
    schema.validate(da_bad_units)
except pa.errors.SchemaError as exc:
    print(exc)
attribute mismatch 'units': expected '^(K|degC|degF)$', got 'meters'
Callable predicates¶
Pass a function that receives the attribute value and returns a boolean:
schema = pa.DataArraySchema(
    attrs={
        "version": lambda v: isinstance(v, int) and v >= 2,
    },
)
da_v3 = xr.DataArray(
    np.ones(3), dims="x",
    attrs={"version": 3},
)
schema.validate(da_v3)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 1., 1.])
Dimensions without coordinates: x
Attributes:
version: 3
da_v1 = xr.DataArray(
    np.ones(3), dims="x",
    attrs={"version": 1},
)
try:
    schema.validate(da_v1)
except pa.errors.SchemaError as exc:
    print(exc)
attribute mismatch 'version': expected <function <lambda> at 0x79fb4e507d80>, got 1
Pydantic model¶
For complex attribute schemas you can pass a pydantic.BaseModel
class instead of a dict. Pandera delegates validation to pydantic and
converts every pydantic error into a pandera SchemaError, so error
collection during lazy validation works seamlessly:
from pydantic import BaseModel, Field as PydanticField
class ArrayAttrs(BaseModel):
    units: str
    standard_name: str
    version: int = PydanticField(ge=2)
schema = pa.DataArraySchema(attrs=ArrayAttrs)
da_ok = xr.DataArray(
    np.ones(3), dims="x",
    attrs={"units": "K", "standard_name": "air_temperature", "version": 3},
)
schema.validate(da_ok)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 1., 1.])
Dimensions without coordinates: x
Attributes:
units: K
standard_name: air_temperature
version: 3
When validation fails, the error messages surface the pydantic error details:
da_bad = xr.DataArray(
    np.ones(3), dims="x",
    attrs={"units": "K", "version": 1},  # version < 2, standard_name missing
)
try:
    schema.validate(da_bad, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc)
{
"SCHEMA": {
"SCHEMA_COMPONENT_CHECK": [
{
"schema": "schema",
"column": null,
"check": "attrs",
"error": "standard_name: Field required [type=missing]"
},
{
"schema": "schema",
"column": null,
"check": "attrs",
"error": "version: Input should be greater than or equal to 2 [type=greater_than_equal]"
}
]
}
}
All four modes also work on
DatasetSchema — see
Dataset Schemas.
Strict attributes¶
With strict_attrs=True, extra attributes cause a validation error.
When attrs is a pydantic model class, the set of allowed keys is derived
from the model’s fields.
da_extra = xr.DataArray(
    np.ones(3), dims="x",
    attrs={"units": "K", "extra_key": 42},
)
try:
    pa.DataArraySchema(
        attrs={"units": "K"}, strict_attrs=True
    ).validate(da_extra)
except pa.errors.SchemaError as exc:
    print(exc)
unexpected attribute 'extra_key'
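Conceptually, a strict check of this kind is a comparison between the keys present on the object and the keys the schema declares. The plain-Python sketch below illustrates that idea only; `unexpected_keys` is a hypothetical helper and does not reflect how pandera implements the check.

```python
from typing import Any, Iterable, List, Mapping


def unexpected_keys(actual: Mapping[str, Any], allowed: Iterable[str]) -> List[str]:
    """Hypothetical helper: keys present on the object but not declared."""
    allowed_set = set(allowed)
    # Keep the original key order so the result is deterministic.
    return [key for key in actual if key not in allowed_set]


attrs = {"units": "K", "extra_key": 42}
print(unexpected_keys(attrs, {"units"}))  # ['extra_key']
```

The same comparison applies to strict_coords, with coordinate names in place of attribute keys.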
Name validation¶
named_da = xr.DataArray(np.ones(3), dims="x", name="temperature")
pa.DataArraySchema(name="temperature").validate(named_da)
<xarray.DataArray 'temperature' (x: 3)> Size: 24B
array([1., 1., 1.])
Dimensions without coordinates: x
The DataArray’s .name must match exactly.
try:
    pa.DataArraySchema(name="pressure").validate(named_da)
except pa.errors.SchemaError as exc:
    print(exc)
expected name 'pressure', got 'temperature'
Null values¶
By default nullable=False — any NaN or null value raises a
SchemaError. Set nullable=True to allow them:
da_with_nan = xr.DataArray([1.0, np.nan, 3.0], dims="x")
pa.DataArraySchema(dtype=float, nullable=True).validate(da_with_nan)
<xarray.DataArray (x: 3)> Size: 24B
array([ 1., nan,  3.])
Dimensions without coordinates: x
try:
    pa.DataArraySchema(dtype=float, nullable=False).validate(da_with_nan)
except pa.errors.SchemaError as exc:
    print(exc)
non-nullable DataArray contains null values
Coercing dtypes¶
When coerce=True, the DataArray is cast to dtype before validation:
schema = pa.DataArraySchema(dtype=np.float32, coerce=True)
da_int = xr.DataArray(np.array([1, 2, 3]), dims="x")
validated = schema.validate(da_int)
print(f"original: {da_int.dtype} -> coerced: {validated.dtype}")
original: int64 -> coerced: float32
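Casting to a narrower dtype can change values, so it is worth checking that the target dtype can represent your data before relying on coerce=True. The sketch below uses only the standard library to round-trip a number through a 32-bit float; `roundtrip_float32` is a hypothetical helper used for illustration, and it says nothing about how pandera performs the cast internally.

```python
import struct


def roundtrip_float32(value: float) -> float:
    """Round a Python number to the nearest float32 and back (stdlib only)."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


big = 2**24 + 1  # 16777217, the smallest positive integer float32 cannot represent
print(roundtrip_float32(big))       # 16777216.0
print(roundtrip_float32(3) == 3.0)  # True
```

Small integers such as 1, 2, 3 survive the cast unchanged, which is why the example above coerces without loss.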
Encoding validation¶
The encoding parameter validates the DataArray’s .encoding dict, which is
populated when reading from netCDF or Zarr:
da_encoded = xr.DataArray(np.ones(3), dims="x")
da_encoded.encoding = {"_FillValue": -999.0, "dtype": "float32"}
pa.DataArraySchema(
    encoding={"_FillValue": -999.0, "dtype": "^float.*"},
).validate(da_encoded)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 1., 1.])
Dimensions without coordinates: x
Encoding supports the same matching modes as attrs (equality, regex,
callable) plus pydantic models. See Encoding Validation for full details.
Chunked / array type¶
Control whether the underlying storage is lazy (Dask) or eager (NumPy):
pa.DataArraySchema(chunked=True) # must be Dask-backed
pa.DataArraySchema(chunked=False) # must be eager
pa.DataArraySchema(array_type=np.ndarray) # must be a numpy array
See Dask and Duck Arrays for comprehensive Dask integration
documentation, and Configuration for how chunked interacts
with validation depth.
Data-level checks¶
Use Check for value-level assertions:
schema = pa.DataArraySchema(
    dtype=np.float64,
    checks=[
        pa.Check(lambda da: float(da.min()) >= 0),
        pa.Check(lambda da: float(da.max()) <= 100),
    ],
)
da_checked = xr.DataArray(np.linspace(0, 50, 10), dims="x")
schema.validate(da_checked)
<xarray.DataArray (x: 10)> Size: 80B
array([ 0. , 5.55555556, 11.11111111, 16.66666667, 22.22222222,
27.77777778, 33.33333333, 38.88888889, 44.44444444, 50. ])
Dimensions without coordinates: x
See Checks and Parsers for built-in check helpers and details on how checks interact with lazy / chunked data.
Parsers¶
Parser objects run before checks and can
transform the array:
schema = pa.DataArraySchema(
    parsers=pa.Parser(lambda da: da.fillna(0)),
    nullable=False,
)
da_nulls = xr.DataArray([1.0, np.nan, 3.0], dims="x")
validated = schema.validate(da_nulls)
validated
<xarray.DataArray (x: 3)> Size: 24B
array([1., 0., 3.])
Dimensions without coordinates: x
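The ordering matters: because parsers run first, the null check sees the already-filled array and passes. A toy pipeline over a plain list shows the same parse-then-check order. All names here (`run_pipeline`, `fill_nan`, `no_nan`) are hypothetical, and the sketch is not pandera's validation pipeline.

```python
import math
from typing import Callable, List, Sequence

Values = List[float]


def run_pipeline(
    data: Values,
    parsers: Sequence[Callable[[Values], Values]],
    checks: Sequence[Callable[[Values], bool]],
) -> Values:
    """Hypothetical pipeline: apply every parser, then evaluate every check."""
    for parse in parsers:
        data = parse(data)
    for index, check in enumerate(checks):
        if not check(data):
            raise ValueError(f"check {index} failed")
    return data


def fill_nan(values: Values) -> Values:
    # Replace NaN with 0.0, analogous to da.fillna(0).
    return [0.0 if math.isnan(v) else v for v in values]


def no_nan(values: Values) -> bool:
    # Passes only when no NaN remains.
    return not any(math.isnan(v) for v in values)


print(run_pipeline([1.0, float("nan"), 3.0], [fill_nan], [no_nan]))  # [1.0, 0.0, 3.0]
```

Without the parser, the same check would fail on the raw input.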
Validation options¶
schema.validate(da) accepts several keyword arguments:
lazy: collect all failures into SchemaErrors instead of raising on the first one.
head / tail / sample: subsample along the first dimension before running heavy checks.
inplace: if True, coercion mutates the original object.
schema = pa.DataArraySchema(
    dtype=np.float64,
    dims=("x",),
    name="values",
    checks=pa.Check(lambda da: bool((da > 0).all())),
)
da_bad = xr.DataArray([-1, 2, 3], dims="x", name="wrong_name")
try:
    schema.validate(da_bad, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc)
{
"SCHEMA": {
"WRONG_FIELD_NAME": [
{
"schema": "values",
"column": "values",
"check": "name",
"error": "expected name 'values', got 'wrong_name'"
}
],
"WRONG_DATATYPE": [
{
"schema": "values",
"column": "values",
"check": "dtype(<class 'numpy.float64'>)",
"error": "expected dtype <class 'numpy.float64'>, got int64"
}
]
},
"DATA": {
"DATAFRAME_CHECK": [
{
"schema": "values",
"column": "values",
"check": "<lambda>",
"error": "DataArraySchema 'values' failed series or dataframe validator 0: <Check <lambda>>"
}
]
}
}
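The difference between eager and lazy validation is the usual fail-fast versus collect-all pattern. The following plain-Python sketch shows that pattern under stated assumptions: `run_checks` and `CollectedErrors` are hypothetical stand-ins, not pandera classes, and a plain dict replaces the DataArray.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A check returns None on success or an error message on failure.
CheckFn = Callable[[dict], Optional[str]]


class CollectedErrors(Exception):
    """Hypothetical stand-in for an exception that carries every failure."""

    def __init__(self, failures: List[str]) -> None:
        super().__init__("; ".join(failures))
        self.failures = failures


def run_checks(
    obj: dict,
    checks: Sequence[Tuple[str, CheckFn]],
    lazy: bool = False,
) -> dict:
    failures: List[str] = []
    for name, check in checks:
        message = check(obj)
        if message is None:
            continue
        if not lazy:
            # Eager mode: stop at the first failure.
            raise ValueError(f"{name}: {message}")
        failures.append(f"{name}: {message}")
    if failures:
        # Lazy mode: report everything that went wrong at once.
        raise CollectedErrors(failures)
    return obj


def check_name(obj: dict) -> Optional[str]:
    return None if obj["name"] == "values" else f"got {obj['name']!r}"


def check_dtype(obj: dict) -> Optional[str]:
    return None if obj["dtype"] == "float64" else f"got {obj['dtype']}"


record = {"name": "wrong_name", "dtype": "int64"}
checks = [("name", check_name), ("dtype", check_dtype)]

try:
    run_checks(record, checks, lazy=True)
except CollectedErrors as exc:
    print(exc.failures)  # both failures are reported together
```

In eager mode the same input would stop at the name failure, and the dtype problem would only surface on a later run.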
See also¶
Dataset Schemas: DatasetSchema for multi-variable data
Data Models: class-based DataArrayModel
Checks and Parsers: checks, parsers, lazy validation
Encoding Validation: encoding validation (netCDF/Zarr metadata)
Dask and Duck Arrays: Dask integration, chunked, and array_type
CF Convention Checks: CF convention checks
Decorators: check_input, check_output, check_io, and check_types
Configuration: ValidationDepth, ValidationScope, Dask, environment variables
Xarray: full API reference for all xarray classes