Dataset Schemas

DatasetSchema validates a Dataset, a dict-like container of aligned DataArray objects. It is the xarray counterpart of DataFrameSchema: each data variable corresponds to a Column, and shared coordinates correspond to an Index.

You can also express the same constraints with the declarative DatasetModel.

Basic usage

import numpy as np
import xarray as xr
import pandera.xarray as pa

schema = pa.DatasetSchema(
    data_vars={
        "temperature": pa.DataVar(dtype=np.float64, dims=("x", "y")),
        "pressure": pa.DataVar(dtype=np.float64, dims=("x", "y")),
    },
    coords={"x": pa.Coordinate(dtype=np.float64)},
)

ds = xr.Dataset(
    {
        "temperature": (("x", "y"), np.random.rand(3, 4)),
        "pressure": (("x", "y"), np.random.rand(3, 4)),
    },
    coords={"x": np.arange(3, dtype=np.float64)},
)
schema.validate(ds)
<xarray.Dataset> Size: 216B
Dimensions:      (x: 3, y: 4)
Coordinates:
  * x            (x) float64 24B 0.0 1.0 2.0
Dimensions without coordinates: y
Data variables:
    temperature  (x, y) float64 96B 0.03566 0.1425 0.5963 ... 0.3645 0.3388
    pressure     (x, y) float64 96B 0.6806 0.19 0.5897 ... 0.007568 0.6615

DataVar

DataVar describes one variable inside a Dataset. It carries the same structural constraints as DataArraySchema (dtype, dims, sizes, shape, coords, attrs, checks, parsers, coerce, nullable, chunked, array_type, strict_coords, strict_attrs) plus dataset-only options.

Required variables

By default every DataVar is required. Set required=False to make it optional:

schema = pa.DatasetSchema(
    data_vars={
        "temperature": pa.DataVar(dtype=np.float64, dims=("x",)),
        "humidity": pa.DataVar(dtype=np.float64, dims=("x",), required=False),
    },
)

ds_no_humidity = xr.Dataset({"temperature": (("x",), np.ones(3))})
schema.validate(ds_no_humidity)
<xarray.Dataset> Size: 24B
Dimensions:      (x: 3)
Dimensions without coordinates: x
Data variables:
    temperature  (x) float64 24B 1.0 1.0 1.0

Default values

When required=False, you can specify a default to fill in missing variables during validation:

schema = pa.DatasetSchema(
    data_vars={
        "temperature": pa.DataVar(dtype=np.float64, dims=("x",)),
        "humidity": pa.DataVar(
            dtype=np.float64, dims=("x",), required=False, default=0.0
        ),
    },
)

validated = schema.validate(ds_no_humidity)
validated
<xarray.Dataset> Size: 48B
Dimensions:      (x: 3)
Dimensions without coordinates: x
Data variables:
    temperature  (x) float64 24B 1.0 1.0 1.0
    humidity     (x) float64 24B 0.0 0.0 0.0

Aliases

If the logical name used in the schema differs from the variable's actual name in the dataset, set alias to the actual name:

schema = pa.DatasetSchema(
    data_vars={
        "temp": pa.DataVar(dtype=np.float64, alias="temp_kelvin"),
    },
)

ds_alias = xr.Dataset({"temp_kelvin": (("x",), np.ones(3))})
schema.validate(ds_alias)
<xarray.Dataset> Size: 24B
Dimensions:      (x: 3)
Dimensions without coordinates: x
Data variables:
    temp_kelvin  (x) float64 24B 1.0 1.0 1.0

Alignment constraints

aligned_with and broadcastable_with express grid relationships to other data variables:

schema = pa.DatasetSchema(
    data_vars={
        "temperature": pa.DataVar(dtype=np.float64, dims=("x", "y")),
        "pressure": pa.DataVar(
            dtype=np.float64,
            dims=("x", "y"),
            aligned_with=("temperature",),
        ),
        "elevation": pa.DataVar(
            dtype=np.float64,
            dims=("x",),
            broadcastable_with=("temperature",),
        ),
    },
)

ds_aligned = xr.Dataset(
    {
        "temperature": (("x", "y"), np.random.rand(3, 4)),
        "pressure": (("x", "y"), np.random.rand(3, 4)),
        "elevation": (("x",), np.ones(3)),
    },
)
schema.validate(ds_aligned)
<xarray.Dataset> Size: 216B
Dimensions:      (x: 3, y: 4)
Dimensions without coordinates: x, y
Data variables:
    temperature  (x, y) float64 96B 0.7616 0.3642 0.3539 ... 0.2062 0.05332
    pressure     (x, y) float64 96B 0.01328 0.1262 0.03881 ... 0.5773 0.7626
    elevation    (x) float64 24B 1.0 1.0 1.0
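
The exact rules behind aligned_with and broadcastable_with are not spelled out on this page. As a rough mental model only, the plain-Python sketch below assumes name-based semantics like xarray's: "aligned" is taken to mean the same dimension names with the same sizes, and "broadcastable" to mean that every shared dimension name has the same size. The helpers same_grid and broadcastable are hypothetical and not part of pandera.

```python
from typing import Mapping


def same_grid(a: Mapping[str, int], b: Mapping[str, int]) -> bool:
    """True when both variables span the same dimensions with the same sizes."""
    return dict(a) == dict(b)


def broadcastable(a: Mapping[str, int], b: Mapping[str, int]) -> bool:
    """True when every dimension name shared by both has the same size."""
    return all(a[dim] == b[dim] for dim in a.keys() & b.keys())


# Dimension name -> size, mirroring ds_aligned above.
temperature = {"x": 3, "y": 4}
pressure = {"x": 3, "y": 4}
elevation = {"x": 3}
# A made-up variable whose "x" length conflicts with temperature.
short_elevation = {"x": 2}

print(same_grid(pressure, temperature))             # same dims and sizes
print(broadcastable(elevation, temperature))        # shared "x" has size 3 in both
print(broadcastable(short_elevation, temperature))  # "x" is 2 vs 3
```

Under these assumptions the first two calls return True and the last returns False. Whether pandera also compares coordinate labels is not covered by this sketch.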

Using DataArraySchema directly

Instead of DataVar, you can pass a DataArraySchema as the spec for a variable. This reuses a schema you've already defined:

temp_schema = pa.DataArraySchema(dtype=np.float64, dims=("x", "y"))

schema = pa.DatasetSchema(
    data_vars={
        "temperature": temp_schema,
        "pressure": pa.DataVar(dtype=np.float64, dims=("x", "y")),
    },
)
schema.validate(ds)
<xarray.Dataset> Size: 216B
Dimensions:      (x: 3, y: 4)
Coordinates:
  * x            (x) float64 24B 0.0 1.0 2.0
Dimensions without coordinates: y
Data variables:
    temperature  (x, y) float64 96B 0.03566 0.1425 0.5963 ... 0.3645 0.3388
    pressure     (x, y) float64 96B 0.6806 0.19 0.5897 ... 0.007568 0.6615

None as a placeholder

None means "variable must exist, but no value-level checks":

schema = pa.DatasetSchema(data_vars={"temperature": None})
schema.validate(ds)
<xarray.Dataset> Size: 216B
Dimensions:      (x: 3, y: 4)
Coordinates:
  * x            (x) float64 24B 0.0 1.0 2.0
Dimensions without coordinates: y
Data variables:
    temperature  (x, y) float64 96B 0.03566 0.1425 0.5963 ... 0.3645 0.3388
    pressure     (x, y) float64 96B 0.6806 0.19 0.5897 ... 0.007568 0.6615

Coordinate

Coordinate validates an individual coordinate array. Use it inside the coords dict on DataArraySchema or DatasetSchema.

Dimension vs auxiliary coordinates

In xarray, a dimension coordinate is a 1-D coordinate whose name matches a dimension name (used for label-based indexing). An auxiliary coordinate does not match any dimension name and can be multi-dimensional.

ds_with_aux = xr.Dataset(
    {"a": (("x", "y"), np.random.rand(3, 4))},
    coords={
        "x": np.arange(3, dtype=np.float64),
        "label": ("x", ["site_a", "site_b", "site_c"]),
    },
)

schema = pa.DatasetSchema(
    data_vars={"a": pa.DataVar(dtype=float, dims=("x", "y"))},
    coords={
        "x": pa.Coordinate(dtype=np.float64, dimension=True),
        "label": pa.Coordinate(dtype=str, dimension=False),
    },
)
schema.validate(ds_with_aux)
<xarray.Dataset> Size: 192B
Dimensions:  (x: 3, y: 4)
Coordinates:
  * x        (x) float64 24B 0.0 1.0 2.0
    label    (x) <U6 72B 'site_a' 'site_b' 'site_c'
Dimensions without coordinates: y
Data variables:
    a        (x, y) float64 96B 0.007235 0.5729 0.6044 ... 0.9755 0.6934 0.2191
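
The naming rule that separates the two kinds of coordinate can be written down in a few lines. This is a plain-Python sketch of the definition given above, not xarray's or pandera's implementation; coordinate_kind is a hypothetical helper, and lon2d is a made-up two-dimensional coordinate added for contrast.

```python
from typing import Dict, Tuple


def coordinate_kind(name: str, coord_dims: Tuple[str, ...]) -> str:
    """Classify a coordinate from its name and the dims it spans."""
    # A dimension coordinate is 1-D and named after the dimension it labels.
    if coord_dims == (name,):
        return "dimension"
    # Anything else (a different name, or more than one dim) is auxiliary.
    return "auxiliary"


# Coordinate name -> dims it spans, mirroring ds_with_aux above
# plus the made-up 2-D coordinate "lon2d".
coords: Dict[str, Tuple[str, ...]] = {
    "x": ("x",),
    "label": ("x",),
    "lon2d": ("x", "y"),
}
kinds = {name: coordinate_kind(name, dims) for name, dims in coords.items()}
print(kinds)
```

Here x is classified as a dimension coordinate, while label and lon2d are auxiliary: label lies along x but has a different name, and lon2d spans two dimensions.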

Indexed coordinates

An indexed coordinate has an associated xarray Index and can be used with .sel(). indexed=True requires this; indexed=False forbids it.

schema = pa.DatasetSchema(
    data_vars={"a": pa.DataVar(dtype=float, dims=("x",))},
    coords={
        "x": pa.Coordinate(dtype=np.float64, dimension=True, indexed=True),
    },
)

ds_indexed = xr.Dataset(
    {"a": (("x",), np.ones(3))},
    coords={"x": np.arange(3, dtype=np.float64)},
)
schema.validate(ds_indexed)
<xarray.Dataset> Size: 48B
Dimensions:  (x: 3)
Coordinates:
  * x        (x) float64 24B 0.0 1.0 2.0
Data variables:
    a        (x) float64 24B 1.0 1.0 1.0

Checks on coordinates

Coordinates are DataArray objects, so you can attach checks:

schema = pa.DatasetSchema(
    data_vars={"a": pa.DataVar(dtype=float, dims=("lat",))},
    coords={
        "lat": pa.Coordinate(
            dtype=np.float64,
            checks=pa.Check(lambda c: float(c.min()) >= -90),
        ),
    },
)

ds_lat = xr.Dataset(
    {"a": (("lat",), np.ones(5))},
    coords={"lat": np.linspace(-45, 45, 5)},
)
schema.validate(ds_lat)
<xarray.Dataset> Size: 80B
Dimensions:  (lat: 5)
Coordinates:
  * lat      (lat) float64 40B -45.0 -22.5 0.0 22.5 45.0
Data variables:
    a        (lat) float64 40B 1.0 1.0 1.0 1.0 1.0

Dimensions and sizes

Dataset-level dims and sizes constrain the overall dimension structure, independent of individual DataVar specs:

schema = pa.DatasetSchema(
    data_vars={
        "temperature": pa.DataVar(dtype=float, dims=("x", "y")),
    },
    dims=("x", "y"),
    sizes={"x": 3, "y": 4},
)

ds_sized = xr.Dataset(
    {"temperature": (("x", "y"), np.random.rand(3, 4))},
)
schema.validate(ds_sized)
<xarray.Dataset> Size: 96B
Dimensions:      (x: 3, y: 4)
Dimensions without coordinates: x, y
Data variables:
    temperature  (x, y) float64 96B 0.3644 0.1446 0.6516 ... 0.3306 0.4599 0.754

Attributes

attrs validates the Dataset's .attrs dict. Each value in the schema's attrs dict determines how the corresponding attribute is checked:

  • Literal values: matched by equality (==).

  • Regex patterns: strings that start with ^ are treated as regular expressions and matched against str(actual_value) via re.fullmatch.

  • Callable predicates: any callable (value) -> bool is invoked with the actual attribute value; validation passes when the function returns True.
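
The three rules can be pictured as a small dispatch function. The sketch below is a plain-Python illustration of the behaviour described in the list, not pandera's actual code; attr_matches is a hypothetical helper, and the order in which the cases are tested is an assumption.

```python
import re
from typing import Any


def attr_matches(expected: Any, actual: Any) -> bool:
    """Hypothetical helper mirroring the three matching rules listed above."""
    # Callable predicate: call it with the actual attribute value.
    if callable(expected):
        return bool(expected(actual))
    # Regex pattern: a string starting with "^" must match all of str(actual).
    if isinstance(expected, str) and expected.startswith("^"):
        return re.fullmatch(expected, str(actual)) is not None
    # Literal value: plain equality.
    return expected == actual


print(attr_matches("reanalysis", "reanalysis"))                        # literal
print(attr_matches("^(K|degC|degF)$", "meters"))                       # regex
print(attr_matches(lambda v: isinstance(v, int) and v >= 2, 3))        # callable
print(attr_matches(r"^v\d+", "v12-beta"))                              # partial match
```

Because re.fullmatch must consume the whole string, the last call returns False even though the pattern matches a prefix of "v12-beta".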

Equality matching

schema = pa.DatasetSchema(
    data_vars={"temperature": pa.DataVar(dtype=float)},
    attrs={"source": "reanalysis"},
)

ds_attrs = xr.Dataset(
    {"temperature": (("x",), np.ones(3))},
    attrs={"source": "reanalysis"},
)
schema.validate(ds_attrs)
<xarray.Dataset> Size: 24B
Dimensions:      (x: 3)
Dimensions without coordinates: x
Data variables:
    temperature  (x) float64 24B 1.0 1.0 1.0
Attributes:
    source:   reanalysis

Regex matching

Use a regex pattern (starting with ^) to validate an attribute against a set of acceptable values:

schema = pa.DatasetSchema(
    data_vars={"temperature": pa.DataVar(dtype=float)},
    attrs={"units": "^(K|degC|degF)$"},
)

ds_units = xr.Dataset(
    {"temperature": (("x",), np.ones(3))},
    attrs={"units": "K"},
)
schema.validate(ds_units)
<xarray.Dataset> Size: 24B
Dimensions:      (x: 3)
Dimensions without coordinates: x
Data variables:
    temperature  (x) float64 24B 1.0 1.0 1.0
Attributes:
    units:    K

ds_bad_units = xr.Dataset(
    {"temperature": (("x",), np.ones(3))},
    attrs={"units": "meters"},
)

try:
    schema.validate(ds_bad_units)
except pa.errors.SchemaError as exc:
    print(exc)
dataset attribute 'units': expected '^(K|degC|degF)$', got 'meters'

Callable predicates

Pass a function that receives the attribute value and returns a boolean:

schema = pa.DatasetSchema(
    data_vars={"temperature": pa.DataVar(dtype=float)},
    attrs={
        "version": lambda v: isinstance(v, int) and v >= 2,
    },
)

ds_v3 = xr.Dataset(
    {"temperature": (("x",), np.ones(3))},
    attrs={"version": 3},
)
schema.validate(ds_v3)
<xarray.Dataset> Size: 24B
Dimensions:      (x: 3)
Dimensions without coordinates: x
Data variables:
    temperature  (x) float64 24B 1.0 1.0 1.0
Attributes:
    version:  3

ds_v1 = xr.Dataset(
    {"temperature": (("x",), np.ones(3))},
    attrs={"version": 1},
)

try:
    schema.validate(ds_v1)
except pa.errors.SchemaError as exc:
    print(exc)
dataset attribute 'version': expected <function <lambda> at 0x72adec0307c0>, got 1
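
The failure message above shows the predicate as an opaque lambda. Assuming the message is built from the callable's repr, as that output suggests, a named function may make failures easier to read. The name is_supported_version below is made up for illustration.

```python
def is_supported_version(value) -> bool:
    """Accept integer versions of 2 or later."""
    return isinstance(value, int) and value >= 2


# The same predicate written as a lambda, for comparison.
anonymous = lambda v: isinstance(v, int) and v >= 2

# A lambda's repr only says "<lambda>"; a named function's repr carries its name.
print(repr(anonymous))
print(repr(is_supported_version))
```

Passing is_supported_version as the "version" entry in attrs would behave like the lambda, since both implement the same test.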

Pydantic model

For complex attribute schemas you can pass a pydantic.BaseModel class instead of a dict. Pandera delegates validation to pydantic and converts every pydantic error into a pandera SchemaError, so error collection during lazy validation works seamlessly:

from pydantic import BaseModel, Field as PydanticField

class DatasetAttrs(BaseModel):
    source: str
    version: int = PydanticField(ge=2)
    units: str

schema = pa.DatasetSchema(
    data_vars={"temperature": pa.DataVar(dtype=float)},
    attrs=DatasetAttrs,
)

ds_ok = xr.Dataset(
    {"temperature": (("x",), np.ones(3))},
    attrs={"source": "ERA5", "version": 5, "units": "K"},
)
schema.validate(ds_ok)
<xarray.Dataset> Size: 24B
Dimensions:      (x: 3)
Dimensions without coordinates: x
Data variables:
    temperature  (x) float64 24B 1.0 1.0 1.0
Attributes:
    source:   ERA5
    version:  5
    units:    K

When validation fails, the error messages surface the pydantic error details:

ds_bad = xr.Dataset(
    {"temperature": (("x",), np.ones(3))},
    attrs={"source": "ERA5", "version": 1},  # version < 2, units missing
)

try:
    schema.validate(ds_bad, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc)
{
    "SCHEMA": {
        "SCHEMA_COMPONENT_CHECK": [
            {
                "schema": "schema",
                "column": null,
                "check": "attrs",
                "error": "dataset version: Input should be greater than or equal to 2 [type=greater_than_equal]"
            },
            {
                "schema": "schema",
                "column": null,
                "check": "attrs",
                "error": "dataset units: Field required [type=missing]"
            }
        ]
    }
}

All four modes (equality, regex, callable, pydantic) also work on DataArraySchema; see DataArray Schemas.

Strict mode

  • strict=True: fail if the dataset has data variables not listed in data_vars.

  • strict="filter": drop unlisted variables and return the filtered dataset.

  • strict=False (default): extra variables are allowed.

schema = pa.DatasetSchema(
    data_vars={"temperature": pa.DataVar(dtype=float)},
    strict=True,
)

ds_extra = xr.Dataset({
    "temperature": (("x",), np.ones(3)),
    "extra": (("x",), np.zeros(3)),
})

try:
    schema.validate(ds_extra)
except pa.errors.SchemaError as exc:
    print(exc)
unexpected data variables: ['extra']

filter_schema = pa.DatasetSchema(
    data_vars={"temperature": pa.DataVar(dtype=float)},
    strict="filter",
)

filtered = filter_schema.validate(ds_extra)
print(list(filtered.data_vars))
['temperature']

Strict coordinates and attributes

strict_coords and strict_attrs work the same way at the coordinate and attribute level:

schema = pa.DatasetSchema(
    data_vars={"a": pa.DataVar(dtype=float, dims=("x",))},
    coords={"x": pa.Coordinate()},
    strict_coords=True,
)

ds_one_coord = xr.Dataset(
    {"a": (("x",), np.ones(3))},
    coords={"x": np.arange(3, dtype=np.float64)},
)
schema.validate(ds_one_coord)
<xarray.Dataset> Size: 48B
Dimensions:  (x: 3)
Coordinates:
  * x        (x) float64 24B 0.0 1.0 2.0
Data variables:
    a        (x) float64 24B 1.0 1.0 1.0

Encoding validation

Encoding can be validated at two levels in a DatasetSchema:

  • Per-variable: DataVar(encoding=...) validates ds[var].encoding

  • Dataset-level: DatasetSchema(encoding=...) validates ds.encoding

Both support dict-based matching (equality, regex, callable) and pydantic models. See Encoding Validation for full details and examples.

Dataset-level checks

Checks on the DatasetSchema receive the entire Dataset:

schema = pa.DatasetSchema(
    data_vars={
        "a": pa.DataVar(dtype=float, dims=("x",)),
        "b": pa.DataVar(dtype=float, dims=("x",)),
    },
    checks=pa.Check(lambda ds: bool((ds["a"] < ds["b"]).all())),
)

ds_ordered = xr.Dataset({
    "a": (("x",), [1.0, 2.0, 3.0]),
    "b": (("x",), [4.0, 5.0, 6.0]),
})
schema.validate(ds_ordered)
<xarray.Dataset> Size: 48B
Dimensions:  (x: 3)
Dimensions without coordinates: x
Data variables:
    a        (x) float64 24B 1.0 2.0 3.0
    b        (x) float64 24B 4.0 5.0 6.0

Lazy validation

Pass lazy=True to collect all errors into a single SchemaErrors:

schema = pa.DatasetSchema(
    data_vars={
        "temperature": pa.DataVar(dtype=np.float64, dims=("x",)),
    },
    strict=True,
)

ds_bad = xr.Dataset({
    "temperature": (("y",), np.ones(3)),
    "extra_var": (("x",), np.zeros(3)),
})

try:
    schema.validate(ds_bad, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc)
{
    "SCHEMA": {
        "COLUMN_NOT_IN_SCHEMA": [
            {
                "schema": "schema",
                "column": null,
                "check": "strict_data_vars",
                "error": "unexpected data variables: ['extra_var']"
            }
        ],
        "MISMATCH_INDEX": [
            {
                "schema": "schema",
                "column": "temperature",
                "check": "dims",
                "error": "dim position 0: expected 'x', got 'y'"
            }
        ]
    }
}

See also