Error reports and lazy validation
Pandera’s xarray backend raises the same exception types as other backends:
SchemaError when validation stops at the first problem,
and SchemaErrors when you pass lazy=True to collect every
failure in one pass. The consolidated summary in SchemaErrors follows the same
error-report design described in Error Reports and Lazy Validation,
adapted to labelled N-dimensional arrays.
This page shows schema failures (structure, dtype, dims, metadata) versus
data failures (checks, nullability when it is treated as data-scope), how to
read lazy=True output, and what failure_cases look like for
DataArray validation.
Eager validation: SchemaError
By default, validate() raises as soon as a rule fails. Structural problems
(wrong dtype, dims, name, missing coordinates, and so on) and data-level
failures (failing Check, nulls when nullable=False)
both use SchemaError, but they correspond to different reason codes
(see SchemaErrorReason).
Schema-level failure
import numpy as np
import xarray as xr
import pandera.xarray as pa

schema = pa.DataArraySchema(
    dtype=np.float64,
    dims=("time", "lat"),
    name="temperature",
)
da = xr.DataArray(
    np.zeros((3, 4)),
    dims=("time", "lon"),  # wrong dim name
    name="temperature",
)
try:
    schema.validate(da)
except pa.errors.SchemaError as exc:
    print("reason:", exc.reason_code)
    print("check:", exc.check)
    print("message:", exc)
reason: SchemaErrorReason.MISMATCH_INDEX
check: dims
message: dim position 1: expected 'lat', got 'lon'
Here reason_code is typically MISMATCH_INDEX for dimension/shape/coord
mismatches (historical name shared with pandas), or WRONG_DATATYPE,
WRONG_FIELD_NAME, etc., for other structural issues.
Data-level failure (Check)
schema = pa.DataArraySchema(
    dtype=np.float64,
    dims=("x",),
    checks=pa.Check.in_range(0, 1),
)
da = xr.DataArray([0.0, 2.0, 0.5], dims=("x",))
try:
    schema.validate(da)
except pa.errors.SchemaError as exc:
    print("reason:", exc.reason_code)
    print("check:", exc.check)
reason: SchemaErrorReason.DATAFRAME_CHECK
check: <Check in_range: in_range(0, 1)>
For element-wise checks, exc.failure_cases is often a DataArray
of the same shape as the data, with NaN where values passed and the failing
values kept where the check failed (a masked view of the original array).
try:
    schema.validate(da)
except pa.errors.SchemaError as exc:
    print(exc.failure_cases)
<xarray.DataArray (x: 3)> Size: 24B
array([nan, 2., nan])
Dimensions without coordinates: x
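To make the masked layout concrete, the following sketch reproduces it with plain NumPy instead of pandera. It assumes only the NaN-where-passed convention described above; the names values, passed, and masked are illustrative and not part of the pandera API.

```python
import numpy as np

# Same data as the example above; the check is in_range(0, 1).
values = np.array([0.0, 2.0, 0.5])
passed = (values >= 0) & (values <= 1)

# Assumed layout of failure_cases: NaN where the check passed,
# the original value where it failed.
masked = np.where(passed, np.nan, values)

# Positions and values of the failures, recovered from the mask alone.
failing_idx = np.flatnonzero(~np.isnan(masked))
failing_values = masked[failing_idx]

print(failing_idx.tolist())     # [1]
print(failing_values.tolist())  # [2.0]
```

With a real DataArray, xarray's dropna along the dimension should perform the same reduction while keeping coordinate labels. Note that a NaN-based mask cannot distinguish a passing element from a failing element whose own value is NaN.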
Lazy validation: SchemaErrors and lazy=True
Pass lazy=True to run all applicable checks and raise a single
SchemaErrors. Its string form is JSON, and the
.message attribute holds the same structured dict.
import json

schema = pa.DataArraySchema(
    dtype=np.float64,
    dims=("x",),
    name="values",
    checks=pa.Check.ge(0),
)
da = xr.DataArray([-1.0, 2.0, 3.0], dims="x", name="wrong_name")
try:
    schema.validate(da, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(json.dumps(exc.message, indent=2))
{
  "SCHEMA": {
    "WRONG_FIELD_NAME": [
      {
        "schema": "values",
        "column": "values",
        "check": "name",
        "error": "expected name 'values', got 'wrong_name'"
      }
    ]
  },
  "DATA": {
    "DATAFRAME_CHECK": [
      {
        "schema": "values",
        "column": "values",
        "check": "greater_than_or_equal_to(0)",
        "error": "DataArraySchema 'values' failed element-wise validator number 0: greater_than_or_equal_to(0) failure cases: -1.0"
      }
    ]
  }
}
How to read exc.message
The summary is a nested dict:
Top level: SCHEMA vs DATA (names of ErrorCategory).
    SCHEMA: structural validation (dtype, dims, coords, attrs, encoding, name, chunked/array_type, strict flags, etc.).
    DATA: data-scope rules such as user Check failures and certain nullable / duplicate semantics, depending on reason code.
Second level: reason codes as strings, e.g. WRONG_FIELD_NAME, WRONG_DATATYPE, MISMATCH_INDEX, DATAFRAME_CHECK (check failures; name shared with pandas), SERIES_CONTAINS_NULLS, etc.
Entries: each failure is a small dict with:
    schema: the schema’s name (or a sensible label).
    column: the data variable or coordinate key involved (for a standalone DataArray, this matches the array name when set).
    check: structural check id (e.g. "name", "dims") or the check repr (e.g. "greater_than_or_equal_to(0)").
    error: short string describing the failure.
DATAFRAME_CHECK appears for generic check pipeline failures on non-dataframe
objects; treat it as “this Check failed,” not as a pandas-only concept.
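The three levels can be walked with ordinary dict iteration. The sketch below works on a literal copy of the report shown earlier, so it runs without pandera; iter_failures is a hypothetical helper, not part of the library, and it assumes the category / reason code / entry nesting described above.

```python
from typing import Iterator

# Report copied from the lazy-validation example above.
report = {
    "SCHEMA": {
        "WRONG_FIELD_NAME": [
            {
                "schema": "values",
                "column": "values",
                "check": "name",
                "error": "expected name 'values', got 'wrong_name'",
            }
        ]
    },
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "values",
                "column": "values",
                "check": "greater_than_or_equal_to(0)",
                "error": (
                    "DataArraySchema 'values' failed element-wise validator number 0: "
                    "greater_than_or_equal_to(0) failure cases: -1.0"
                ),
            }
        ]
    },
}


def iter_failures(message: dict) -> Iterator[dict]:
    """Yield one flat record per failure (hypothetical helper, not pandera API)."""
    for category, by_reason in message.items():
        for reason, entries in by_reason.items():
            for entry in entries:
                # Carry the two outer keys along with the entry's own fields.
                yield {"category": category, "reason": reason, **entry}


rows = list(iter_failures(report))
for row in rows:
    print(f'{row["category"]}/{row["reason"]}: {row["column"]} [{row["check"]}]')
```

Flat records like these are convenient for logging or for loading into a table, since every row carries its category and reason code.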
error_counts
try:
    schema.validate(da, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc.error_counts)
{'WRONG_FIELD_NAME': 1, 'DATAFRAME_CHECK': 1}
This maps each reason code to how many errors of that type were collected.
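Because error_counts is keyed by reason code only, totals per category have to be derived from the nested report. The sketch below does that on a trimmed literal copy of the report from the lazy example. In this example the per-reason tally matches the error_counts output; treating the two as equivalent in general is an assumption, not a documented guarantee.

```python
from collections import Counter

# Trimmed copy of the report from the lazy example (only the fields used here).
message = {
    "SCHEMA": {"WRONG_FIELD_NAME": [{"check": "name"}]},
    "DATA": {"DATAFRAME_CHECK": [{"check": "greater_than_or_equal_to(0)"}]},
}

by_reason = Counter()
by_category = Counter()
for category, reasons in message.items():
    for reason, entries in reasons.items():
        # Each entry in the list is one collected failure.
        by_reason[reason] += len(entries)
        by_category[category] += len(entries)

print(dict(by_reason))    # {'WRONG_FIELD_NAME': 1, 'DATAFRAME_CHECK': 1}
print(dict(by_category))  # {'SCHEMA': 1, 'DATA': 1}
```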
schema_errors vs failure_cases on SchemaErrors
exc.schema_errors: list of SchemaError instances in collection order. Use this for programmatic access: each item has reason_code, check, failure_cases, message, schema, and data (data may be cleared in lazy mode for large objects, so rely on the error fields).
exc.failure_cases: a simple list of strings, one human-readable summary per collected error (aligned with schema_errors). Handy for logging; for masks and coordinates, inspect each SchemaError in schema_errors.
try:
    schema.validate(da, lazy=True)
except pa.errors.SchemaErrors as exc:
    for err in exc.schema_errors:
        print(err.reason_code.name, "->", type(err.failure_cases).__name__)
WRONG_FIELD_NAME -> str
DATAFRAME_CHECK -> DataArray
For a failing element-wise check, err.failure_cases is often a
DataArray with coordinates preserved so you can locate bad
points in label space.
Datasets
DatasetSchema validation aggregates errors from
dataset-level rules and from each data_vars / coords slice. Lazy reports
use the same SCHEMA / DATA grouping; column identifies the data
variable or coordinate name that failed.
import json

ds_schema = pa.DatasetSchema(
    data_vars={
        "a": pa.DataVar(dtype=np.float64, dims=("x",)),
        "b": pa.DataVar(dtype=np.float64, dims=("x",)),
    },
    dims=("x",),
    sizes={"x": 3},
)
ds = xr.Dataset(
    {
        "a": ("x", [1.0, 2.0, 3.0]),
        "b": ("x", np.array([1, 2, 3], dtype=np.int64)),  # schema expects float64
    },
)
try:
    ds_schema.validate(ds, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(json.dumps(exc.message, indent=2))
{
  "SCHEMA": {
    "WRONG_DATATYPE": [
      {
        "schema": "schema",
        "column": "b",
        "check": "dtype(<class 'numpy.float64'>)",
        "error": "expected dtype <class 'numpy.float64'>, got int64"
      }
    ]
  }
}
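Since column names the variable that failed, a Dataset report can be regrouped per variable. The sketch below uses a literal copy of the report above and needs no pandera install; reasons_by_column and clean are illustrative names, and the list of expected variables ("a", "b") is taken from the example schema.

```python
from collections import defaultdict

# Report copied from the Dataset example above.
message = {
    "SCHEMA": {
        "WRONG_DATATYPE": [
            {
                "schema": "schema",
                "column": "b",
                "check": "dtype(<class 'numpy.float64'>)",
                "error": "expected dtype <class 'numpy.float64'>, got int64",
            }
        ]
    }
}

# Map each variable or coordinate name to the reason codes recorded against it.
reasons_by_column = defaultdict(list)
for reasons in message.values():
    for reason, entries in reasons.items():
        for entry in entries:
            reasons_by_column[entry["column"]].append(reason)

print(dict(reasons_by_column))  # {'b': ['WRONG_DATATYPE']}

# Variables from the schema with no recorded failure.
clean = [name for name in ("a", "b") if name not in reasons_by_column]
print(clean)  # ['a']
```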
Validation depth and what appears in the report
The global ValidationDepth setting controls which scopes run (e.g. SCHEMA_ONLY,
the default for chunked arrays, skips user checks). Errors from skipped scopes
do not appear in SchemaErrors. See the Configuration page and the
Dask and Duck Arrays page.
With SCHEMA_ONLY, a value that only fails a user Check may pass
validation (no exception):
from pandera.config import ValidationDepth, config_context

schema = pa.DataArraySchema(
    dims=("x",),
    name="a",
    checks=pa.Check.ge(0),
)
da = xr.DataArray([-1.0], dims="x", name="a")
with config_context(validation_depth=ValidationDepth.SCHEMA_ONLY):
    out = schema.validate(da)
    print("SCHEMA_ONLY: validation returned", type(out).__name__)
SCHEMA_ONLY: validation returned DataArray
With the default depth (SCHEMA_AND_DATA for eager arrays), the same data
raises and the failure shows up under DATA:
try:
    schema.validate(da, lazy=True)
except pa.errors.SchemaErrors as exc:
    print("error_counts:", exc.error_counts)
error_counts: {'DATAFRAME_CHECK': 1}
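The relationship between depth and report contents can be modelled as a simple filter. This is a toy model, not pandera's implementation: Depth is a stand-in enum, SCOPES is an assumed mapping, and visible_errors is a hypothetical helper. DATA_ONLY is not discussed on this page and is included on the assumption that it mirrors SCHEMA_ONLY.

```python
from enum import Enum


class Depth(Enum):
    """Stand-in for pandera's ValidationDepth (illustration only)."""

    SCHEMA_ONLY = "SCHEMA_ONLY"
    DATA_ONLY = "DATA_ONLY"
    SCHEMA_AND_DATA = "SCHEMA_AND_DATA"


# Assumed mapping from depth to the report categories whose rules run.
SCOPES = {
    Depth.SCHEMA_ONLY: {"SCHEMA"},
    Depth.DATA_ONLY: {"DATA"},
    Depth.SCHEMA_AND_DATA: {"SCHEMA", "DATA"},
}


def visible_errors(all_errors: dict, depth: Depth) -> dict:
    """Keep only the categories whose scope runs at the given depth."""
    return {cat: errs for cat, errs in all_errors.items() if cat in SCOPES[depth]}


# The single failure from the example above lives in the DATA scope.
all_errors = {"DATA": {"DATAFRAME_CHECK": 1}}

print(visible_errors(all_errors, Depth.SCHEMA_ONLY))      # {}
print(visible_errors(all_errors, Depth.SCHEMA_AND_DATA))  # {'DATA': {'DATAFRAME_CHECK': 1}}
```

An empty result under SCHEMA_ONLY corresponds to the case where validation returns the array without raising.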
See also
Error Reports: error reports for pandas / PySpark
Lazy Validation: lazy validation concepts (pandas-oriented)
Checks and Parsers: checks, parsers, and lazy validation on xarray
Configuration: ValidationDepth and environment variables
Dask and Duck Arrays: Dask-backed arrays and default schema-only data checks