DataTree Validation

An xarray DataTree is a hierarchical tree of datasets that organises related multi-dimensional data under a single structure. Pandera validates trees with DataTreeSchema (imperative API) and DataTreeModel (declarative API).

DataTreeSchema

Basic usage

DataTreeSchema validates node-level attributes and child nodes. Each child schema can be a DatasetSchema or another DataTreeSchema for recursive nesting.

import numpy as np
import xarray as xr
import pandera.xarray as pa

schema = pa.DataTreeSchema(
    attrs={"conventions": "CF-1.8"},
    children={
        "surface": pa.DatasetSchema(
            data_vars={
                "temperature": pa.DataVar(dtype=np.float64, dims=("x",)),
            },
        ),
        "upper": pa.DatasetSchema(
            data_vars={
                "wind": pa.DataVar(dtype=np.float64, dims=("x",)),
            },
        ),
    },
)

dt = xr.DataTree.from_dict({
    "/": xr.Dataset(attrs={"conventions": "CF-1.8"}),
    "/surface": xr.Dataset(
        {"temperature": (("x",), np.ones(3, dtype=np.float64))},
        coords={"x": np.arange(3, dtype=np.float64)},
    ),
    "/upper": xr.Dataset(
        {"wind": (("x",), np.ones(3, dtype=np.float64))},
        coords={"x": np.arange(3, dtype=np.float64)},
    ),
})
schema.validate(dt)
<xarray.DataTree>
Group: /
│   Attributes:
│       conventions:  CF-1.8
├── Group: /surface
│       Dimensions:      (x: 3)
│       Coordinates:
│         * x            (x) float64 24B 0.0 1.0 2.0
│       Data variables:
│           temperature  (x) float64 24B 1.0 1.0 1.0
└── Group: /upper
        Dimensions:  (x: 3)
        Coordinates:
          * x        (x) float64 24B 0.0 1.0 2.0
        Data variables:
            wind     (x) float64 24B 1.0 1.0 1.0

Path-based children

Children can reference nested nodes using /-separated paths, just like xr.DataTree.from_dict():

nested_dt = xr.DataTree.from_dict({
    "/": xr.Dataset(attrs={"conventions": "CF-1.8"}),
    "/surface": xr.Dataset(
        {"temperature": (("x",), np.ones(3, dtype=np.float64))},
        coords={"x": np.arange(3, dtype=np.float64)},
    ),
    "/surface/diagnostics": xr.Dataset(
        {"rmse": (("x",), np.ones(3, dtype=np.float64))},
        coords={"x": np.arange(3, dtype=np.float64)},
    ),
})

schema = pa.DataTreeSchema(
    children={
        "surface/diagnostics": pa.DatasetSchema(
            data_vars={"rmse": pa.DataVar(dtype=np.float64)},
        ),
    },
)
schema.validate(nested_dt)
<xarray.DataTree>
Group: /
│   Attributes:
│       conventions:  CF-1.8
└── Group: /surface
    │   Dimensions:      (x: 3)
    │   Coordinates:
    │     * x            (x) float64 24B 0.0 1.0 2.0
    │   Data variables:
    │       temperature  (x) float64 24B 1.0 1.0 1.0
    └── Group: /surface/diagnostics
            Dimensions:  (x: 3)
            Data variables:
                rmse     (x) float64 24B 1.0 1.0 1.0
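
To make the path lookup concrete, here is a minimal pure-Python sketch that resolves a /-separated path against nested dicts standing in for tree nodes. It illustrates the idea only and is not pandera's or xarray's implementation; the function name resolve_path and the dict layout ("children", "data_vars") are assumptions made for this example.

```python
def resolve_path(tree: dict, path: str) -> dict:
    """Follow a /-separated path through nested 'children' dicts."""
    node = tree
    for part in path.strip("/").split("/"):
        if not part:
            # An empty path (or "/") refers to the root node itself.
            continue
        children = node.get("children", {})
        if part not in children:
            raise KeyError(f"no node at {path!r}: missing {part!r}")
        node = children[part]
    return node


# Hypothetical stand-in for the nested tree used above.
tree = {
    "attrs": {"conventions": "CF-1.8"},
    "children": {
        "surface": {
            "data_vars": ["temperature"],
            "children": {
                "diagnostics": {"data_vars": ["rmse"], "children": {}},
            },
        },
    },
}

diagnostics = resolve_path(tree, "surface/diagnostics")
print(diagnostics["data_vars"])  # prints ['rmse']
```

A path that names a missing node raises KeyError in this sketch, which mirrors the general expectation that a declared child must exist in the tree.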

Root node dataset

Use the dataset parameter to validate the dataset attached to the root node:

schema = pa.DataTreeSchema(
    dataset=pa.DatasetSchema(attrs={"conventions": "CF-1.8"}),
    children={
        "surface": pa.DatasetSchema(
            data_vars={
                "temperature": pa.DataVar(dtype=np.float64, dims=("x",)),
            },
        ),
    },
)
schema.validate(dt)
<xarray.DataTree>
Group: /
│   Attributes:
│       conventions:  CF-1.8
├── Group: /surface
│       Dimensions:      (x: 3)
│       Coordinates:
│         * x            (x) float64 24B 0.0 1.0 2.0
│       Data variables:
│           temperature  (x) float64 24B 1.0 1.0 1.0
└── Group: /upper
        Dimensions:  (x: 3)
        Coordinates:
          * x        (x) float64 24B 0.0 1.0 2.0
        Data variables:
            wind     (x) float64 24B 1.0 1.0 1.0

Strict mode

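When strict=True, any child node that the schema does not declare raises a validation error. The first schema below declares both children of dt and passes; the second omits upper and fails: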

schema = pa.DataTreeSchema(
    children={
        "surface": pa.DatasetSchema(),
        "upper": pa.DatasetSchema(),
    },
    strict=True,
)
schema.validate(dt)
<xarray.DataTree>
Group: /
│   Attributes:
│       conventions:  CF-1.8
├── Group: /surface
│       Dimensions:      (x: 3)
│       Coordinates:
│         * x            (x) float64 24B 0.0 1.0 2.0
│       Data variables:
│           temperature  (x) float64 24B 1.0 1.0 1.0
└── Group: /upper
        Dimensions:  (x: 3)
        Coordinates:
          * x        (x) float64 24B 0.0 1.0 2.0
        Data variables:
            wind     (x) float64 24B 1.0 1.0 1.0

strict_schema = pa.DataTreeSchema(
    children={"surface": pa.DatasetSchema()},
    strict=True,
)

try:
    strict_schema.validate(dt)
except pa.errors.SchemaError as exc:
    print(exc)
unexpected child node 'upper'
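
The rule behind strict mode amounts to a set difference between the child names present in the tree and those declared in the schema. The following pure-Python sketch shows that logic under stated assumptions: check_strict and unexpected_children are hypothetical helpers, the error type and message are made up for illustration, and none of this is pandera's actual code.

```python
def unexpected_children(declared, actual):
    """Return the child names present in the tree but not declared."""
    return sorted(set(actual) - set(declared))


def check_strict(declared, actual, strict=False):
    """Raise if strict is set and the tree has undeclared children."""
    extra = unexpected_children(declared, actual)
    if strict and extra:
        raise ValueError(f"unexpected child nodes: {extra}")
    return extra


tree_children = ["surface", "upper"]

# Both children declared: nothing unexpected, even in strict mode.
check_strict(["surface", "upper"], tree_children, strict=True)

# 'upper' is undeclared: tolerated by default, rejected when strict.
print(check_strict(["surface"], tree_children))  # prints ['upper']
try:
    check_strict(["surface"], tree_children, strict=True)
except ValueError as exc:
    print(exc)
```

Without strict, the sketch only reports the extra names, which matches the behaviour shown earlier where schemas that declare a subset of the children still validate.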

Nested DataTreeSchema

Children can themselves be DataTreeSchema instances for deep validation:

schema = pa.DataTreeSchema(
    attrs={"conventions": "CF-1.8"},
    children={
        "surface": pa.DataTreeSchema(
            dataset=pa.DatasetSchema(
                data_vars={
                    "temperature": pa.DataVar(dtype=np.float64, dims=("x",)),
                },
            ),
            children={
                "diagnostics": pa.DatasetSchema(
                    data_vars={"rmse": pa.DataVar(dtype=np.float64)},
                ),
            },
        ),
    },
)
schema.validate(nested_dt)
<xarray.DataTree>
Group: /
│   Attributes:
│       conventions:  CF-1.8
└── Group: /surface
    │   Dimensions:      (x: 3)
    │   Coordinates:
    │     * x            (x) float64 24B 0.0 1.0 2.0
    │   Data variables:
    │       temperature  (x) float64 24B 1.0 1.0 1.0
    └── Group: /surface/diagnostics
            Dimensions:  (x: 3)
            Data variables:
                rmse     (x) float64 24B 1.0 1.0 1.0
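
Nested schemas and path-based children address the same nodes in two different shapes. As a rough illustration, the sketch below flattens a nested specification into /-separated paths using plain dicts. The helper flatten_children and the dict layout are assumptions made for this example; pandera may represent and traverse nested schemas differently.

```python
def flatten_children(spec, prefix=""):
    """Map every nested child to its /-separated path from the root."""
    paths = {}
    for name, child in spec.get("children", {}).items():
        path = f"{prefix}/{name}" if prefix else name
        paths[path] = child.get("data_vars", [])
        # Recurse so that grandchildren get paths like 'a/b'.
        paths.update(flatten_children(child, path))
    return paths


# Hypothetical stand-in for the nested schema above.
nested_spec = {
    "children": {
        "surface": {
            "data_vars": ["temperature"],
            "children": {
                "diagnostics": {"data_vars": ["rmse"]},
            },
        },
    },
}

flat = flatten_children(nested_spec)
print(flat)
# prints {'surface': ['temperature'], 'surface/diagnostics': ['rmse']}
```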

DataTreeModel

Basic usage

DataTreeModel uses class attributes annotated with DatasetModel subclasses to declare child node schemas:

from pandera.typing.xarray import Coordinate

class SurfaceModel(pa.DatasetModel):
    temperature: np.float64 = pa.Field(dims=("x",))
    x: Coordinate[np.float64]

class UpperModel(pa.DatasetModel):
    wind: np.float64 = pa.Field(dims=("x",))
    x: Coordinate[np.float64]

class ClimateTree(pa.DataTreeModel):
    surface: SurfaceModel
    upper: UpperModel

    class Config:
        strict = True

ClimateTree.validate(dt)
<xarray.DataTree>
Group: /
│   Attributes:
│       conventions:  CF-1.8
├── Group: /surface
│       Dimensions:      (x: 3)
│       Coordinates:
│         * x            (x) float64 24B 0.0 1.0 2.0
│       Data variables:
│           temperature  (x) float64 24B 1.0 1.0 1.0
└── Group: /upper
        Dimensions:  (x: 3)
        Coordinates:
          * x        (x) float64 24B 0.0 1.0 2.0
        Data variables:
            wind     (x) float64 24B 1.0 1.0 1.0

Config options

The nested Config class of a DataTreeModel (DataTreeConfig) accepts three options: strict, attrs, and name.

Field name access

Accessing a child attribute on the model class returns the node name as a string:

print(ClimateTree.surface)
print(ClimateTree.upper)
surface
upper

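to_schema() and validate()

to_schema() converts the model into a DataTreeSchema, while validate() validates a tree against the model directly:
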
schema = ClimateTree.to_schema()
print(type(schema))

ClimateTree.validate(dt)
<class 'pandera.api.xarray.container.DataTreeSchema'>
<xarray.DataTree>
Group: /
│   Attributes:
│       conventions:  CF-1.8
├── Group: /surface
│       Dimensions:      (x: 3)
│       Coordinates:
│         * x            (x) float64 24B 0.0 1.0 2.0
│       Data variables:
│           temperature  (x) float64 24B 1.0 1.0 1.0
└── Group: /upper
        Dimensions:  (x: 3)
        Coordinates:
          * x        (x) float64 24B 0.0 1.0 2.0
        Data variables:
            wind     (x) float64 24B 1.0 1.0 1.0

@check_types with DataTree

Use DataTree[Model] from pandera.typing.xarray with the @check_types decorator:

from pandera.typing.xarray import DataTree

@pa.check_types
def process_tree(tree: DataTree[ClimateTree]) -> DataTree[ClimateTree]:
    return tree

process_tree(dt)
<xarray.DataTree>
Group: /
│   Attributes:
│       conventions:  CF-1.8
├── Group: /surface
│       Dimensions:      (x: 3)
│       Coordinates:
│         * x            (x) float64 24B 0.0 1.0 2.0
│       Data variables:
│           temperature  (x) float64 24B 1.0 1.0 1.0
└── Group: /upper
        Dimensions:  (x: 3)
        Coordinates:
          * x        (x) float64 24B 0.0 1.0 2.0
        Data variables:
            wind     (x) float64 24B 1.0 1.0 1.0
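
The general pattern behind annotation-driven validation can be sketched in plain Python: a decorator reads the function signature, validates each argument against its annotation, and then validates the return value. This is only an illustration of the pattern, not pandera's implementation of check_types. The names check_annotated and HasSurface are hypothetical, and the sketch assumes that an annotation takes part only if it exposes a callable validate attribute.

```python
import functools
import inspect


def check_annotated(func):
    """Validate arguments and the return value against their annotations."""
    signature = inspect.signature(func)

    def run_validator(annotation, value):
        # Annotations without a callable `validate` are ignored.
        validate = getattr(annotation, "validate", None)
        if callable(validate):
            validate(value)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            run_validator(signature.parameters[name].annotation, value)
        result = func(*args, **kwargs)
        run_validator(signature.return_annotation, result)
        return result

    return wrapper


class HasSurface:
    """Hypothetical stand-in for a model: requires a 'surface' key."""

    @classmethod
    def validate(cls, tree):
        if "surface" not in tree:
            raise ValueError("missing child node 'surface'")
        return tree


@check_annotated
def process_tree(tree: HasSurface) -> HasSurface:
    return tree


process_tree({"surface": {}})  # passes the input and output checks
```

Calling process_tree with a dict that lacks "surface" raises ValueError before the function body runs, which is the main benefit of validating at the function boundary.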

See also