Dataset Schemas
DatasetSchema validates an xarray Dataset, a dict-like container of aligned
DataArray objects. It is the xarray counterpart of
DataFrameSchema: each data variable
corresponds to a Column, and shared coordinates correspond to an Index.
You can also express the same constraints with the declarative DatasetModel.
Basic usage
import numpy as np
import xarray as xr
import pandera.xarray as pa
schema = pa.DatasetSchema(
    data_vars={
        "temperature": pa.DataVar(dtype=np.float64, dims=("x", "y")),
        "pressure": pa.DataVar(dtype=np.float64, dims=("x", "y")),
    },
    coords={"x": pa.Coordinate(dtype=np.float64)},
)
ds = xr.Dataset(
    {
        "temperature": (("x", "y"), np.random.rand(3, 4)),
        "pressure": (("x", "y"), np.random.rand(3, 4)),
    },
    coords={"x": np.arange(3, dtype=np.float64)},
)
schema.validate(ds)
<xarray.Dataset> Size: 216B
Dimensions: (x: 3, y: 4)
Coordinates:
* x (x) float64 24B 0.0 1.0 2.0
Dimensions without coordinates: y
Data variables:
temperature (x, y) float64 96B 0.03566 0.1425 0.5963 ... 0.3645 0.3388
pressure (x, y) float64 96B 0.6806 0.19 0.5897 ... 0.007568 0.6615
DataVar
DataVar describes one variable inside a
Dataset. It carries the same structural constraints as DataArraySchema
(dtype, dims, sizes, shape, coords, attrs, checks, parsers,
coerce, nullable, chunked, array_type, strict_coords, strict_attrs)
plus dataset-only options.
Required variables
By default every DataVar is required. Set required=False to make it
optional:
schema = pa.DatasetSchema(
    data_vars={
        "temperature": pa.DataVar(dtype=np.float64, dims=("x",)),
        "humidity": pa.DataVar(dtype=np.float64, dims=("x",), required=False),
    },
)
ds_no_humidity = xr.Dataset({"temperature": (("x",), np.ones(3))})
schema.validate(ds_no_humidity)
<xarray.Dataset> Size: 24B
Dimensions: (x: 3)
Dimensions without coordinates: x
Data variables:
temperature (x) float64 24B 1.0 1.0 1.0
Default values
When required=False, you can specify a default to fill in missing
variables during validation:
schema = pa.DatasetSchema(
    data_vars={
        "temperature": pa.DataVar(dtype=np.float64, dims=("x",)),
        "humidity": pa.DataVar(
            dtype=np.float64, dims=("x",), required=False, default=0.0
        ),
    },
)
validated = schema.validate(ds_no_humidity)
validated
<xarray.Dataset> Size: 48B
Dimensions: (x: 3)
Dimensions without coordinates: x
Data variables:
temperature (x) float64 24B 1.0 1.0 1.0
humidity (x) float64 24B 0.0 0.0 0.0
Aliases
If the logical schema name differs from the variable's actual name in the
dataset, set alias to the actual name:
schema = pa.DatasetSchema(
    data_vars={
        "temp": pa.DataVar(dtype=np.float64, alias="temp_kelvin"),
    },
)
ds_alias = xr.Dataset({"temp_kelvin": (("x",), np.ones(3))})
schema.validate(ds_alias)
<xarray.Dataset> Size: 24B
Dimensions: (x: 3)
Dimensions without coordinates: x
Data variables:
temp_kelvin (x) float64 24B 1.0 1.0 1.0
Alignment constraints
aligned_with and broadcastable_with express grid relationships to other
data variables:
schema = pa.DatasetSchema(
    data_vars={
        "temperature": pa.DataVar(dtype=np.float64, dims=("x", "y")),
        "pressure": pa.DataVar(
            dtype=np.float64,
            dims=("x", "y"),
            aligned_with=("temperature",),
        ),
        "elevation": pa.DataVar(
            dtype=np.float64,
            dims=("x",),
            broadcastable_with=("temperature",),
        ),
    },
)
ds_aligned = xr.Dataset(
    {
        "temperature": (("x", "y"), np.random.rand(3, 4)),
        "pressure": (("x", "y"), np.random.rand(3, 4)),
        "elevation": (("x",), np.ones(3)),
    },
)
schema.validate(ds_aligned)
<xarray.Dataset> Size: 216B
Dimensions: (x: 3, y: 4)
Dimensions without coordinates: x, y
Data variables:
temperature (x, y) float64 96B 0.7616 0.3642 0.3539 ... 0.2062 0.05332
pressure (x, y) float64 96B 0.01328 0.1262 0.03881 ... 0.5773 0.7626
elevation (x) float64 24B 1.0 1.0 1.0
Using DataArraySchema directly
Instead of DataVar, you can pass a DataArraySchema as the spec for a
variable. This reuses a schema you've already defined:
temp_schema = pa.DataArraySchema(dtype=np.float64, dims=("x", "y"))
schema = pa.DatasetSchema(
    data_vars={
        "temperature": temp_schema,
        "pressure": pa.DataVar(dtype=np.float64, dims=("x", "y")),
    },
)
schema.validate(ds)
<xarray.Dataset> Size: 216B
Dimensions: (x: 3, y: 4)
Coordinates:
* x (x) float64 24B 0.0 1.0 2.0
Dimensions without coordinates: y
Data variables:
temperature (x, y) float64 96B 0.03566 0.1425 0.5963 ... 0.3645 0.3388
pressure (x, y) float64 96B 0.6806 0.19 0.5897 ... 0.007568 0.6615
None as a placeholder
None means "variable must exist, but no value-level checks":
schema = pa.DatasetSchema(data_vars={"temperature": None})
schema.validate(ds)
<xarray.Dataset> Size: 216B
Dimensions: (x: 3, y: 4)
Coordinates:
* x (x) float64 24B 0.0 1.0 2.0
Dimensions without coordinates: y
Data variables:
temperature (x, y) float64 96B 0.03566 0.1425 0.5963 ... 0.3645 0.3388
pressure (x, y) float64 96B 0.6806 0.19 0.5897 ... 0.007568 0.6615
Coordinate
Coordinate validates an individual
coordinate array. Use it inside the coords dict on DataArraySchema or
DatasetSchema.
Dimension vs auxiliary coordinates
In xarray, a dimension coordinate is a 1-D coordinate whose name matches a dimension name (used for label-based indexing). An auxiliary coordinate does not match any dimension name and can be multi-dimensional.
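The rule can be sketched without xarray at all. The snippet below is only an illustration: it assumes each coordinate is described by a hypothetical mapping from coordinate name to the tuple of dimensions it spans, and the helper name is_dimension_coordinate is invented for this example.

```python
def is_dimension_coordinate(name: str, dims: tuple) -> bool:
    """A coordinate is a dimension coordinate when it is 1-D along its own name."""
    return dims == (name,)


# Hypothetical coordinate layout: coordinate name -> dimensions it spans.
coord_dims = {
    "x": ("x",),          # 1-D and named after its dimension
    "label": ("x",),      # 1-D, but "label" is not a dimension name
    "lon2d": ("x", "y"),  # multi-dimensional, so it can only be auxiliary
}

kinds = {
    name: "dimension" if is_dimension_coordinate(name, dims) else "auxiliary"
    for name, dims in coord_dims.items()
}
print(kinds)
```

In the schema, the same distinction is what dimension=True and dimension=False express for "x" and "label".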
ds_with_aux = xr.Dataset(
    {"a": (("x", "y"), np.random.rand(3, 4))},
    coords={
        "x": np.arange(3, dtype=np.float64),
        "label": ("x", ["site_a", "site_b", "site_c"]),
    },
)
schema = pa.DatasetSchema(
    data_vars={"a": pa.DataVar(dtype=float, dims=("x", "y"))},
    coords={
        "x": pa.Coordinate(dtype=np.float64, dimension=True),
        "label": pa.Coordinate(dtype=str, dimension=False),
    },
)
schema.validate(ds_with_aux)
<xarray.Dataset> Size: 192B
Dimensions: (x: 3, y: 4)
Coordinates:
* x (x) float64 24B 0.0 1.0 2.0
label (x) <U6 72B 'site_a' 'site_b' 'site_c'
Dimensions without coordinates: y
Data variables:
a (x, y) float64 96B 0.007235 0.5729 0.6044 ... 0.9755 0.6934 0.2191
Indexed coordinates
An indexed coordinate has an associated xarray Index and can be used
with .sel(). indexed=True requires this; indexed=False forbids it.
schema = pa.DatasetSchema(
    data_vars={"a": pa.DataVar(dtype=float, dims=("x",))},
    coords={
        "x": pa.Coordinate(dtype=np.float64, dimension=True, indexed=True),
    },
)
ds_indexed = xr.Dataset(
    {"a": (("x",), np.ones(3))},
    coords={"x": np.arange(3, dtype=np.float64)},
)
schema.validate(ds_indexed)
<xarray.Dataset> Size: 48B
Dimensions: (x: 3)
Coordinates:
* x (x) float64 24B 0.0 1.0 2.0
Data variables:
a (x) float64 24B 1.0 1.0 1.0
Checks on coordinates
Coordinates are DataArray objects, so you can attach checks:
schema = pa.DatasetSchema(
    data_vars={"a": pa.DataVar(dtype=float, dims=("lat",))},
    coords={
        "lat": pa.Coordinate(
            dtype=np.float64,
            checks=pa.Check(lambda c: float(c.min()) >= -90),
        ),
    },
)
ds_lat = xr.Dataset(
    {"a": (("lat",), np.ones(5))},
    coords={"lat": np.linspace(-45, 45, 5)},
)
schema.validate(ds_lat)
<xarray.Dataset> Size: 80B
Dimensions: (lat: 5)
Coordinates:
* lat (lat) float64 40B -45.0 -22.5 0.0 22.5 45.0
Data variables:
a (lat) float64 40B 1.0 1.0 1.0 1.0 1.0
Dimensions and sizes
Dataset-level dims and sizes constrain the overall dimension structure,
independent of individual DataVar specs:
schema = pa.DatasetSchema(
    data_vars={
        "temperature": pa.DataVar(dtype=float, dims=("x", "y")),
    },
    dims=("x", "y"),
    sizes={"x": 3, "y": 4},
)
ds_sized = xr.Dataset(
    {"temperature": (("x", "y"), np.random.rand(3, 4))},
)
schema.validate(ds_sized)
<xarray.Dataset> Size: 96B
Dimensions: (x: 3, y: 4)
Dimensions without coordinates: x, y
Data variables:
temperature (x, y) float64 96B 0.3644 0.1446 0.6516 ... 0.3306 0.4599 0.754
Attributes
attrs validates the Dataset's .attrs dict. Each value in the schema's
attrs dict determines how the corresponding attribute is checked:
- Literal values: matched by equality (==).
- Regex patterns: strings that start with ^ are treated as regular expressions and matched against str(actual_value) via re.fullmatch.
- Callable predicates: any callable (value) -> bool is invoked with the actual attribute value; validation passes when the function returns True.
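The three dict-based modes can be pictured as a small dispatch on the type of the expected value. The following is a minimal sketch of those rules as described here, not pandera's implementation; the function name attr_matches and the order of the branches are assumptions made for illustration.

```python
import re
from typing import Any


def attr_matches(expected: Any, actual: Any) -> bool:
    """Illustrative dispatch over the three dict-based matching modes."""
    if callable(expected):
        # Callable predicate: call it with the actual attribute value.
        # Treating any truthy result as a pass is an assumption of this sketch.
        return bool(expected(actual))
    if isinstance(expected, str) and expected.startswith("^"):
        # Regex pattern: matched against str(actual) with re.fullmatch.
        return re.fullmatch(expected, str(actual)) is not None
    # Literal value: plain equality.
    return expected == actual


print(attr_matches("reanalysis", "reanalysis"))
print(attr_matches("^(K|degC|degF)$", "meters"))
print(attr_matches(lambda v: isinstance(v, int) and v >= 2, 3))
```

Because the regex branch converts the actual value with str(), a pattern can also be applied to non-string attributes such as integers.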
Equality matching
schema = pa.DatasetSchema(
    data_vars={"temperature": pa.DataVar(dtype=float)},
    attrs={"source": "reanalysis"},
)
ds_attrs = xr.Dataset(
    {"temperature": (("x",), np.ones(3))},
    attrs={"source": "reanalysis"},
)
schema.validate(ds_attrs)
<xarray.Dataset> Size: 24B
Dimensions: (x: 3)
Dimensions without coordinates: x
Data variables:
temperature (x) float64 24B 1.0 1.0 1.0
Attributes:
source: reanalysis
Regex matching
Use a regex pattern (starting with ^) to validate an attribute against a
set of acceptable values:
schema = pa.DatasetSchema(
    data_vars={"temperature": pa.DataVar(dtype=float)},
    attrs={"units": "^(K|degC|degF)$"},
)
ds_units = xr.Dataset(
    {"temperature": (("x",), np.ones(3))},
    attrs={"units": "K"},
)
schema.validate(ds_units)
<xarray.Dataset> Size: 24B
Dimensions: (x: 3)
Dimensions without coordinates: x
Data variables:
temperature (x) float64 24B 1.0 1.0 1.0
Attributes:
units: K
ds_bad_units = xr.Dataset(
    {"temperature": (("x",), np.ones(3))},
    attrs={"units": "meters"},
)
try:
    schema.validate(ds_bad_units)
except pa.errors.SchemaError as exc:
    print(exc)
dataset attribute 'units': expected '^(K|degC|degF)$', got 'meters'
Callable predicates
Pass a function that receives the attribute value and returns a boolean:
schema = pa.DatasetSchema(
    data_vars={"temperature": pa.DataVar(dtype=float)},
    attrs={
        "version": lambda v: isinstance(v, int) and v >= 2,
    },
)
ds_v3 = xr.Dataset(
    {"temperature": (("x",), np.ones(3))},
    attrs={"version": 3},
)
schema.validate(ds_v3)
<xarray.Dataset> Size: 24B
Dimensions: (x: 3)
Dimensions without coordinates: x
Data variables:
temperature (x) float64 24B 1.0 1.0 1.0
Attributes:
version: 3
ds_v1 = xr.Dataset(
    {"temperature": (("x",), np.ones(3))},
    attrs={"version": 1},
)
try:
    schema.validate(ds_v1)
except pa.errors.SchemaError as exc:
    print(exc)
dataset attribute 'version': expected <function <lambda> at 0x72adec0307c0>, got 1
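The message above embeds the repr of the lambda, which says little about the rule that failed. Assuming the message is built from the callable's repr, as the output suggests, a named function may make the failure easier to read. The name version_at_least_2 is chosen for this example only.

```python
def version_at_least_2(value) -> bool:
    """Predicate for the 'version' attribute: an integer no smaller than 2."""
    return isinstance(value, int) and value >= 2


anonymous = lambda v: isinstance(v, int) and v >= 2

# A function's repr includes its name, so a named predicate describes itself.
print(repr(anonymous))           # contains "<lambda>"
print(repr(version_at_least_2))  # contains "version_at_least_2"
```

The named predicate can then be used in place of the lambda, for example attrs={"version": version_at_least_2}.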
Pydantic model
For complex attribute schemas you can pass a pydantic.BaseModel
class instead of a dict. Pandera delegates validation to pydantic and
converts every pydantic error into a pandera SchemaError, so error
collection during lazy validation works seamlessly:
from pydantic import BaseModel, Field as PydanticField
class DatasetAttrs(BaseModel):
    source: str
    version: int = PydanticField(ge=2)
    units: str
schema = pa.DatasetSchema(
    data_vars={"temperature": pa.DataVar(dtype=float)},
    attrs=DatasetAttrs,
)
ds_ok = xr.Dataset(
    {"temperature": (("x",), np.ones(3))},
    attrs={"source": "ERA5", "version": 5, "units": "K"},
)
schema.validate(ds_ok)
<xarray.Dataset> Size: 24B
Dimensions: (x: 3)
Dimensions without coordinates: x
Data variables:
temperature (x) float64 24B 1.0 1.0 1.0
Attributes:
source: ERA5
version: 5
units: K
When validation fails, the error messages surface the pydantic error details:
ds_bad = xr.Dataset(
    {"temperature": (("x",), np.ones(3))},
    attrs={"source": "ERA5", "version": 1},  # version < 2, units missing
)
try:
    schema.validate(ds_bad, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc)
{
"SCHEMA": {
"SCHEMA_COMPONENT_CHECK": [
{
"schema": "schema",
"column": null,
"check": "attrs",
"error": "dataset version: Input should be greater than or equal to 2 [type=greater_than_equal]"
},
{
"schema": "schema",
"column": null,
"check": "attrs",
"error": "dataset units: Field required [type=missing]"
}
]
}
}
All four modes (equality, regex, callable, pydantic) also work on
DataArraySchema; see
DataArray Schemas.
Strict mode
- strict=True: fail if the dataset has data variables not listed in data_vars.
- strict="filter": drop unlisted variables and return the filtered dataset.
- strict=False (default): extra variables are allowed.
schema = pa.DatasetSchema(
    data_vars={"temperature": pa.DataVar(dtype=float)},
    strict=True,
)
ds_extra = xr.Dataset({
    "temperature": (("x",), np.ones(3)),
    "extra": (("x",), np.zeros(3)),
})
try:
    schema.validate(ds_extra)
except pa.errors.SchemaError as exc:
    print(exc)
unexpected data variables: ['extra']
filter_schema = pa.DatasetSchema(
    data_vars={"temperature": pa.DataVar(dtype=float)},
    strict="filter",
)
filtered = filter_schema.validate(ds_extra)
print(list(filtered.data_vars))
['temperature']
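The three strict settings can be summarized as a decision over the set of variable names. This is a plain-Python sketch of the behavior described in this section, under the assumption that only names matter; apply_strict is a hypothetical helper, and it raises ValueError rather than a pandera error.

```python
def apply_strict(present, declared, strict):
    """Return the variable names kept under the given strict setting."""
    extra = [name for name in present if name not in declared]
    if strict is True and extra:
        # strict=True: any undeclared variable is an error.
        raise ValueError(f"unexpected data variables: {extra}")
    if strict == "filter":
        # strict="filter": silently drop undeclared variables.
        return [name for name in present if name in declared]
    # strict=False: extra variables pass through untouched.
    return list(present)


present = ["temperature", "extra"]
declared = {"temperature"}

print(apply_strict(present, declared, strict=False))
print(apply_strict(present, declared, strict="filter"))
try:
    apply_strict(present, declared, strict=True)
except ValueError as exc:
    print(exc)
```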
Strict coordinates and attributes
strict_coords and strict_attrs work the same way at the coordinate and
attribute level:
schema = pa.DatasetSchema(
    data_vars={"a": pa.DataVar(dtype=float, dims=("x",))},
    coords={"x": pa.Coordinate()},
    strict_coords=True,
)
ds_one_coord = xr.Dataset(
    {"a": (("x",), np.ones(3))},
    coords={"x": np.arange(3, dtype=np.float64)},
)
schema.validate(ds_one_coord)
<xarray.Dataset> Size: 48B
Dimensions: (x: 3)
Coordinates:
* x (x) float64 24B 0.0 1.0 2.0
Data variables:
a (x) float64 24B 1.0 1.0 1.0
Encoding validation
Encoding can be validated at two levels in a DatasetSchema:
- Per-variable: DataVar(encoding=...) validates ds[var].encoding
- Dataset-level: DatasetSchema(encoding=...) validates ds.encoding
Both support dict-based matching (equality, regex, callable) and pydantic models. See Encoding Validation for full details and examples.
Dataset-level checks
Checks on the DatasetSchema receive the entire Dataset:
schema = pa.DatasetSchema(
    data_vars={
        "a": pa.DataVar(dtype=float, dims=("x",)),
        "b": pa.DataVar(dtype=float, dims=("x",)),
    },
    checks=pa.Check(lambda ds: bool((ds["a"] < ds["b"]).all())),
)
ds_ordered = xr.Dataset({
    "a": (("x",), [1.0, 2.0, 3.0]),
    "b": (("x",), [4.0, 5.0, 6.0]),
})
schema.validate(ds_ordered)
<xarray.Dataset> Size: 48B
Dimensions: (x: 3)
Dimensions without coordinates: x
Data variables:
a (x) float64 24B 1.0 2.0 3.0
b (x) float64 24B 4.0 5.0 6.0
Lazy validation
Pass lazy=True to collect all errors into a single
SchemaErrors:
schema = pa.DatasetSchema(
    data_vars={
        "temperature": pa.DataVar(dtype=np.float64, dims=("x",)),
    },
    strict=True,
)
ds_bad = xr.Dataset({
    "temperature": (("y",), np.ones(3)),
    "extra_var": (("x",), np.zeros(3)),
})
try:
    schema.validate(ds_bad, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc)
{
"SCHEMA": {
"COLUMN_NOT_IN_SCHEMA": [
{
"schema": "schema",
"column": null,
"check": "strict_data_vars",
"error": "unexpected data variables: ['extra_var']"
}
],
"MISMATCH_INDEX": [
{
"schema": "schema",
"column": "temperature",
"check": "dims",
"error": "dim position 0: expected 'x', got 'y'"
}
]
}
}
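The difference between the default behavior and lazy=True can be sketched in plain Python. This is a conceptual illustration only, assuming validation is a sequence of independent checks; ValidationFailure and run_checks are stand-ins invented for the example and are not pandera classes.

```python
class ValidationFailure(Exception):
    """Stand-in error type for this sketch; carries one or more check names."""

    def __init__(self, messages):
        super().__init__("; ".join(messages))
        self.messages = list(messages)


def run_checks(checks, data, lazy=False):
    """Run (name, predicate) pairs against data, eagerly or lazily."""
    failures = []
    for name, predicate in checks:
        if predicate(data):
            continue
        if not lazy:
            # Eager mode: stop at the first failing check.
            raise ValidationFailure([name])
        # Lazy mode: remember the failure and keep going.
        failures.append(name)
    if failures:
        raise ValidationFailure(failures)
    return data


checks = [
    ("dims", lambda d: d["dims"] == ("x",)),
    ("strict_data_vars", lambda d: set(d["vars"]) <= {"temperature"}),
]
bad = {"dims": ("y",), "vars": ["temperature", "extra_var"]}

try:
    run_checks(checks, bad, lazy=True)
except ValidationFailure as exc:
    print(exc.messages)
```

With lazy=False the same input would surface only the first failing check, which is why lazy mode is convenient when fixing several problems in one pass.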
See also
- DataArray Schemas: single-array DataArraySchema validation
- Data Models: class-based DatasetModel
- Checks and Parsers: checks, parsers, lazy validation
- Encoding Validation: encoding validation (per-variable and dataset-level)
- Dask and Duck Arrays: Dask integration, chunked, and array_type
- CF Convention Checks: CF convention checks
- Decorators: check_input, check_output, check_io, and check_types
- Configuration: ValidationDepth, ValidationScope, Dask, environment variables
- Xarray: full API reference for all xarray classes