Encoding ValidationΒΆ
When xarray reads data from netCDF or Zarr, each variable and the dataset
itself carry an .encoding dict that describes how values were serialized on
disk β fill values, scale factors, compression settings, and more. Pandera
lets you validate these encoding dicts at three levels:
Level |
Schema parameter |
Validated against |
|---|---|---|
Per-variable (DataArray) |
|
|
Per-variable (Dataset) |
|
|
Dataset-level |
|
|
DataArray encodingΒΆ
The encoding parameter on DataArraySchema
validates the DataArrayβs .encoding dict.
Dict-based validationΒΆ
Values in the dict are matched using the same rules as attrs:
Literal values β matched by equality.
Regex patterns β strings starting with
^are matched viare.fullmatch.Callable predicates β
(value) -> bool.
import numpy as np
import xarray as xr
import pandera.xarray as pa
da = xr.DataArray(np.arange(6, dtype="float64"), dims="x")
da.encoding = {
"_FillValue": -999.0,
"dtype": "float32",
"scale_factor": 0.01,
}
schema = pa.DataArraySchema(
encoding={
"_FillValue": -999.0,
"dtype": "^float.*",
"scale_factor": lambda v: 0 < v < 1,
},
)
schema.validate(da)
<xarray.DataArray (x: 6)> Size: 48B array([0., 1., 2., 3., 4., 5.]) Dimensions without coordinates: x
When a key is missing or a value doesnβt match:
da_bad = xr.DataArray(np.ones(3), dims="x")
da_bad.encoding = {"dtype": "int32"}
try:
pa.DataArraySchema(
encoding={"dtype": "^float.*"},
).validate(da_bad, lazy=True)
except pa.errors.SchemaErrors as exc:
print(exc)
{
"SCHEMA": {
"SCHEMA_COMPONENT_CHECK": [
{
"schema": "schema",
"column": null,
"check": "encoding",
"error": "encoding 'dtype': expected '^float.*', got 'int32'"
}
]
}
}
Pydantic modelΒΆ
For structured validation, pass a pydantic.BaseModel class.
Pandera delegates to pydantic and converts each pydantic error into a
pandera SchemaError:
from pydantic import BaseModel, Field as PydanticField
class NetCDFEncoding(BaseModel):
dtype: str
complevel: int = PydanticField(ge=1, le=9)
scale_factor: float
da_enc = xr.DataArray(np.ones(3), dims="x")
da_enc.encoding = {
"dtype": "float32",
"complevel": 4,
"scale_factor": 0.01,
}
pa.DataArraySchema(encoding=NetCDFEncoding).validate(da_enc)
<xarray.DataArray (x: 3)> Size: 24B array([1., 1., 1.]) Dimensions without coordinates: x
When pydantic validation fails:
da_bad_enc = xr.DataArray(np.ones(3), dims="x")
da_bad_enc.encoding = {
"dtype": "float32",
"complevel": 99,
"scale_factor": "not_a_number",
}
try:
pa.DataArraySchema(encoding=NetCDFEncoding).validate(
da_bad_enc, lazy=True,
)
except pa.errors.SchemaErrors as exc:
print(exc)
{
"SCHEMA": {
"SCHEMA_COMPONENT_CHECK": [
{
"schema": "schema",
"column": null,
"check": "encoding",
"error": "complevel: Input should be less than or equal to 9 [type=less_than_equal]"
},
{
"schema": "schema",
"column": null,
"check": "encoding",
"error": "scale_factor: Input should be a valid number, unable to parse string as a number [type=float_parsing]"
}
]
}
}
Per-variable encoding in a DatasetΒΆ
Use DataVar with encoding=... to
validate per-variable encoding within a
DatasetSchema. This validates against
ds[var_name].encoding:
ds = xr.Dataset({
"temperature": (("x",), np.arange(4.0)),
"pressure": (("x",), np.ones(4)),
})
ds["temperature"].encoding = {
"_FillValue": -999.0,
"scale_factor": 0.01,
}
ds["pressure"].encoding = {
"_FillValue": -999.0,
"zlib": True,
}
schema = pa.DatasetSchema(
data_vars={
"temperature": pa.DataVar(
dims=("x",),
encoding={
"_FillValue": -999.0,
"scale_factor": lambda v: 0 < v < 1,
},
),
"pressure": pa.DataVar(
dims=("x",),
encoding={
"_FillValue": -999.0,
"zlib": True,
},
),
},
)
schema.validate(ds)
<xarray.Dataset> Size: 64B
Dimensions: (x: 4)
Dimensions without coordinates: x
Data variables:
temperature (x) float64 32B 0.0 1.0 2.0 3.0
pressure (x) float64 32B 1.0 1.0 1.0 1.0Pydantic models also work on DataVar(encoding=...):
class VarEncoding(BaseModel):
scale_factor: float
dtype: str
ds2 = xr.Dataset({"temp": (("x",), np.ones(3))})
ds2["temp"].encoding = {"scale_factor": 0.01, "dtype": "float32"}
schema = pa.DatasetSchema(
data_vars={
"temp": pa.DataVar(dims=("x",), encoding=VarEncoding),
},
)
schema.validate(ds2)
<xarray.Dataset> Size: 24B
Dimensions: (x: 3)
Dimensions without coordinates: x
Data variables:
temp (x) float64 24B 1.0 1.0 1.0Dataset-level encodingΒΆ
The encoding parameter on
DatasetSchema validates
ds.encoding β the dataset-level encoding metadata. Common keys include
unlimited_dims and source:
ds_enc = xr.Dataset({"temp": (("x",), np.ones(3))})
ds_enc.encoding = {"unlimited_dims": ["x"]}
schema = pa.DatasetSchema(
data_vars={"temp": pa.DataVar(dims=("x",))},
encoding={"unlimited_dims": ["x"]},
)
schema.validate(ds_enc)
<xarray.Dataset> Size: 24B
Dimensions: (x: 3)
Dimensions without coordinates: x
Data variables:
temp (x) float64 24B 1.0 1.0 1.0Dataset-level encoding also supports pydantic models:
class DatasetEncoding(BaseModel):
unlimited_dims: list[str]
schema = pa.DatasetSchema(
data_vars={"temp": pa.DataVar(dims=("x",))},
encoding=DatasetEncoding,
)
schema.validate(ds_enc)
<xarray.Dataset> Size: 24B
Dimensions: (x: 3)
Dimensions without coordinates: x
Data variables:
temp (x) float64 24B 1.0 1.0 1.0Combining per-variable and dataset-level encodingΒΆ
Per-variable and dataset-level encoding are validated independently and can be used together:
ds_full = xr.Dataset({"temp": (("x",), np.arange(4.0))})
ds_full.encoding = {"unlimited_dims": ["x"]}
ds_full["temp"].encoding = {"_FillValue": -999.0}
schema = pa.DatasetSchema(
data_vars={
"temp": pa.DataVar(
dims=("x",),
encoding={"_FillValue": -999.0},
),
},
encoding={"unlimited_dims": ["x"]},
)
schema.validate(ds_full)
<xarray.Dataset> Size: 32B
Dimensions: (x: 4)
Dimensions without coordinates: x
Data variables:
temp (x) float64 32B 0.0 1.0 2.0 3.0Check.has_encoding() β check-based alternativeΒΆ
For ad hoc validation or dataset-level checks, use
has_encoding():
da = xr.DataArray(np.ones(3), dims="x")
da.encoding = {"_FillValue": -999.0, "dtype": "float32"}
pa.DataArraySchema(
checks=pa.Check.has_encoding({"_FillValue": -999.0}),
).validate(da)
<xarray.DataArray (x: 3)> Size: 24B array([1., 1., 1.]) Dimensions without coordinates: x
Note
The schema-level encoding= parameter is preferred over Check.has_encoding()
when you know the expected encoding upfront. It provides richer error messages
and supports regex, callable, and pydantic matching modes.
Encoding is schema-scopeΒΆ
Encoding validation is classified as schema scope β it runs even under
ValidationDepth.SCHEMA_ONLY and never triggers .compute() on Dask-backed
arrays. This makes it safe for lazy pipelines.
See alsoΒΆ
DataArray Schemas β
DataArraySchemadetailsDataset Schemas β
DatasetSchemaandDataVarDask and Duck Arrays β Dask integration and validation depth
Configuration β validation depth and scope
xarray I/O docs β how xarray populates
.encoding