Encoding Validation

When xarray reads data from netCDF or Zarr, each variable and the dataset itself carry an .encoding dict that describes how values were serialized on disk: fill values, scale factors, compression settings, and more. Pandera lets you validate these encoding dicts at three levels:

Level                     Schema parameter               Validated against
Per-variable (DataArray)  DataArraySchema(encoding=...)  da.encoding
Per-variable (Dataset)    DataVar(encoding=...)          ds[var].encoding
Dataset-level             DatasetSchema(encoding=...)    ds.encoding

DataArray encoding

The encoding parameter on DataArraySchema validates the DataArray’s .encoding dict.

Dict-based validation

Values in the dict are matched using the same rules as attrs:

  • Literal values: matched by equality.

  • Regex patterns: strings starting with ^ are matched via re.fullmatch.

  • Callable predicates: (value) -> bool.

import numpy as np
import xarray as xr
import pandera.xarray as pa

da = xr.DataArray(np.arange(6, dtype="float64"), dims="x")
da.encoding = {
    "_FillValue": -999.0,
    "dtype": "float32",
    "scale_factor": 0.01,
}

schema = pa.DataArraySchema(
    encoding={
        "_FillValue": -999.0,
        "dtype": "^float.*",
        "scale_factor": lambda v: 0 < v < 1,
    },
)
schema.validate(da)
<xarray.DataArray (x: 6)> Size: 48B
array([0., 1., 2., 3., 4., 5.])
Dimensions without coordinates: x

When a key is missing or a value doesn’t match, validation fails. With lazy=True, the failures are collected into a single SchemaErrors exception:

da_bad = xr.DataArray(np.ones(3), dims="x")
da_bad.encoding = {"dtype": "int32"}

try:
    pa.DataArraySchema(
        encoding={"dtype": "^float.*"},
    ).validate(da_bad, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc)
{
    "SCHEMA": {
        "SCHEMA_COMPONENT_CHECK": [
            {
                "schema": "schema",
                "column": null,
                "check": "encoding",
                "error": "encoding 'dtype': expected '^float.*', got 'int32'"
            }
        ]
    }
}
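
The three matching rules and the failure report above can be approximated in plain Python. This is only a sketch of the described behaviour, not pandera's implementation: value_matches and check_encoding are hypothetical helpers, the order in which the rules are tried is an assumption, and the wording of the "missing" message is invented for illustration.

```python
import re

_MISSING = object()  # sentinel, so a stored None is not mistaken for an absent key


def value_matches(expected, actual):
    """Return True if `actual` satisfies `expected` under the three rules."""
    if callable(expected):
        # Callable predicate: (value) -> bool
        return bool(expected(actual))
    if isinstance(expected, str) and expected.startswith("^"):
        # Regex pattern: re.fullmatch requires the whole string to match
        return isinstance(actual, str) and re.fullmatch(expected, actual) is not None
    # Literal value: plain equality
    return expected == actual


def check_encoding(expected, encoding):
    """Collect one message per failing key instead of stopping at the first."""
    errors = []
    for key, want in expected.items():
        got = encoding.get(key, _MISSING)
        if got is _MISSING:
            errors.append(f"encoding {key!r}: missing")  # hypothetical wording
        elif not value_matches(want, got):
            shown = "predicate" if callable(want) else repr(want)
            errors.append(f"encoding {key!r}: expected {shown}, got {got!r}")
    return errors


rules = {
    "_FillValue": -999.0,
    "dtype": "^float.*",
    "scale_factor": lambda v: 0 < v < 1,
}
good = {"_FillValue": -999.0, "dtype": "float32", "scale_factor": 0.01}
bad = {"dtype": "int32"}

print(check_encoding(rules, good))
for message in check_encoding(rules, bad):
    print(message)
```

Because re.fullmatch must consume the entire string, a pattern such as "^float" on its own would not match "float32"; the trailing ".*" in "^float.*" is what allows the suffix.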

Pydantic model

For structured validation, pass a pydantic.BaseModel subclass (the class itself, not an instance). Pandera delegates to pydantic and converts each pydantic error into a pandera SchemaError:

from pydantic import BaseModel, Field as PydanticField

class NetCDFEncoding(BaseModel):
    dtype: str
    complevel: int = PydanticField(ge=1, le=9)
    scale_factor: float

da_enc = xr.DataArray(np.ones(3), dims="x")
da_enc.encoding = {
    "dtype": "float32",
    "complevel": 4,
    "scale_factor": 0.01,
}

pa.DataArraySchema(encoding=NetCDFEncoding).validate(da_enc)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 1., 1.])
Dimensions without coordinates: x

When pydantic validation fails, each pydantic error is reported as its own entry:

da_bad_enc = xr.DataArray(np.ones(3), dims="x")
da_bad_enc.encoding = {
    "dtype": "float32",
    "complevel": 99,
    "scale_factor": "not_a_number",
}

try:
    pa.DataArraySchema(encoding=NetCDFEncoding).validate(
        da_bad_enc, lazy=True,
    )
except pa.errors.SchemaErrors as exc:
    print(exc)
{
    "SCHEMA": {
        "SCHEMA_COMPONENT_CHECK": [
            {
                "schema": "schema",
                "column": null,
                "check": "encoding",
                "error": "complevel: Input should be less than or equal to 9 [type=less_than_equal]"
            },
            {
                "schema": "schema",
                "column": null,
                "check": "encoding",
                "error": "scale_factor: Input should be a valid number, unable to parse string as a number [type=float_parsing]"
            }
        ]
    }
}
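
To see where messages like these come from, the sketch below calls pydantic directly on a plain encoding dict and flattens each error into a "field: message [type=...]" string. It assumes pydantic v2 (model_validate); encoding_errors is a hypothetical helper, and the way pandera actually converts pydantic errors may differ.

```python
from pydantic import BaseModel, Field, ValidationError


class NetCDFEncoding(BaseModel):
    dtype: str
    complevel: int = Field(ge=1, le=9)
    scale_factor: float


def encoding_errors(model, encoding):
    """Validate a plain encoding dict; return one message per pydantic error."""
    try:
        model.model_validate(encoding)  # pydantic v2 API
    except ValidationError as exc:
        return [
            # err["loc"] is a tuple path to the field, err["type"] the error code
            f"{'.'.join(str(part) for part in err['loc'])}: "
            f"{err['msg']} [type={err['type']}]"
            for err in exc.errors()
        ]
    return []


messages = encoding_errors(
    NetCDFEncoding,
    {"dtype": "float32", "complevel": 99, "scale_factor": "not_a_number"},
)
for message in messages:
    print(message)
```

A model like this documents the expected encoding in one place and lets pydantic handle range and type constraints that would otherwise need hand-written predicates.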

Per-variable encoding in a Dataset

Use DataVar with encoding=... to validate per-variable encoding within a DatasetSchema. This validates against ds[var_name].encoding:

ds = xr.Dataset({
    "temperature": (("x",), np.arange(4.0)),
    "pressure": (("x",), np.ones(4)),
})
ds["temperature"].encoding = {
    "_FillValue": -999.0,
    "scale_factor": 0.01,
}
ds["pressure"].encoding = {
    "_FillValue": -999.0,
    "zlib": True,
}

schema = pa.DatasetSchema(
    data_vars={
        "temperature": pa.DataVar(
            dims=("x",),
            encoding={
                "_FillValue": -999.0,
                "scale_factor": lambda v: 0 < v < 1,
            },
        ),
        "pressure": pa.DataVar(
            dims=("x",),
            encoding={
                "_FillValue": -999.0,
                "zlib": True,
            },
        ),
    },
)
schema.validate(ds)
<xarray.Dataset> Size: 64B
Dimensions:      (x: 4)
Dimensions without coordinates: x
Data variables:
    temperature  (x) float64 32B 0.0 1.0 2.0 3.0
    pressure     (x) float64 32B 1.0 1.0 1.0 1.0

Pydantic models also work on DataVar(encoding=...):

class VarEncoding(BaseModel):
    scale_factor: float
    dtype: str

ds2 = xr.Dataset({"temp": (("x",), np.ones(3))})
ds2["temp"].encoding = {"scale_factor": 0.01, "dtype": "float32"}

schema = pa.DatasetSchema(
    data_vars={
        "temp": pa.DataVar(dims=("x",), encoding=VarEncoding),
    },
)
schema.validate(ds2)
<xarray.Dataset> Size: 24B
Dimensions:  (x: 3)
Dimensions without coordinates: x
Data variables:
    temp     (x) float64 24B 1.0 1.0 1.0

Dataset-level encoding

The encoding parameter on DatasetSchema validates ds.encoding, the dataset-level encoding metadata. Common keys include unlimited_dims and source:

ds_enc = xr.Dataset({"temp": (("x",), np.ones(3))})
ds_enc.encoding = {"unlimited_dims": ["x"]}

schema = pa.DatasetSchema(
    data_vars={"temp": pa.DataVar(dims=("x",))},
    encoding={"unlimited_dims": ["x"]},
)
schema.validate(ds_enc)
<xarray.Dataset> Size: 24B
Dimensions:  (x: 3)
Dimensions without coordinates: x
Data variables:
    temp     (x) float64 24B 1.0 1.0 1.0

Dataset-level encoding also supports pydantic models:

class DatasetEncoding(BaseModel):
    unlimited_dims: list[str]

schema = pa.DatasetSchema(
    data_vars={"temp": pa.DataVar(dims=("x",))},
    encoding=DatasetEncoding,
)
schema.validate(ds_enc)
<xarray.Dataset> Size: 24B
Dimensions:  (x: 3)
Dimensions without coordinates: x
Data variables:
    temp     (x) float64 24B 1.0 1.0 1.0

Combining per-variable and dataset-level encoding

Per-variable and dataset-level encoding are validated independently and can be used together:

ds_full = xr.Dataset({"temp": (("x",), np.arange(4.0))})
ds_full.encoding = {"unlimited_dims": ["x"]}
ds_full["temp"].encoding = {"_FillValue": -999.0}

schema = pa.DatasetSchema(
    data_vars={
        "temp": pa.DataVar(
            dims=("x",),
            encoding={"_FillValue": -999.0},
        ),
    },
    encoding={"unlimited_dims": ["x"]},
)
schema.validate(ds_full)
<xarray.Dataset> Size: 32B
Dimensions:  (x: 4)
Dimensions without coordinates: x
Data variables:
    temp     (x) float64 32B 0.0 1.0 2.0 3.0

Check.has_encoding(): check-based alternative

For ad hoc validation or dataset-level checks, use Check.has_encoding():

da = xr.DataArray(np.ones(3), dims="x")
da.encoding = {"_FillValue": -999.0, "dtype": "float32"}

pa.DataArraySchema(
    checks=pa.Check.has_encoding({"_FillValue": -999.0}),
).validate(da)
<xarray.DataArray (x: 3)> Size: 24B
array([1., 1., 1.])
Dimensions without coordinates: x

Note

Prefer the schema-level encoding= parameter over Check.has_encoding() when you know the expected encoding up front. It provides richer error messages and supports regex, callable, and pydantic matching modes.

Encoding is schema-scope

Encoding validation is classified as schema scope: it runs even under ValidationDepth.SCHEMA_ONLY and never triggers .compute() on Dask-backed arrays. This makes it safe for lazy pipelines.
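
The reason a metadata-only check is safe for lazy data can be illustrated without Dask or pandera. In the sketch below, LazyArray is a hypothetical stand-in for a Dask-backed array and validate_encoding_only is a hypothetical helper; neither is part of any real library. The point is only that the check reads the encoding dict and never materializes values.

```python
class LazyArray:
    """Hypothetical stand-in for a Dask-backed array (not a real Dask or xarray type)."""

    def __init__(self, encoding):
        self.encoding = encoding  # plain metadata dict, available without computing
        self.compute_calls = 0    # counts how often the values were materialized

    def compute(self):
        self.compute_calls += 1
        return [0.0, 1.0, 2.0]


def validate_encoding_only(arr, expected):
    """Compare metadata by equality without ever calling arr.compute()."""
    return [
        f"encoding {key!r}: expected {want!r}, got {arr.encoding.get(key)!r}"
        for key, want in expected.items()
        if arr.encoding.get(key) != want
    ]


lazy = LazyArray({"_FillValue": -999.0, "dtype": "float32"})
errors = validate_encoding_only(lazy, {"_FillValue": -999.0})
print(errors, lazy.compute_calls)
```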

See also