DataFrame Models¶

new in 0.5.0

pandera provides a class-based API that’s heavily inspired by pydantic. In contrast to the object-based API, you can define dataframe models in much the same way you’d define pydantic models.

DataFrameModels are annotated with the pandera.typing module using the standard typing syntax. Models can be explicitly converted to a DataFrameSchema or used to validate a DataFrame directly.

Note

Due to current limitations in the pandas library (see discussion here), pandera annotations are only used for run-time validation and have limited support for static type checkers like mypy. See the Mypy Integration for more details.

Basic Usage¶

import pandas as pd
import pandera.pandas as pa
from pandera.typing.pandas import Index, DataFrame, Series


class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)

class OutputSchema(InputSchema):
    revenue: Series[float]

@pa.check_types
def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    return df.assign(revenue=100.0)


df = pd.DataFrame({
    "year": ["2001", "2002", "2003"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
})

transform(df)

invalid_df = pd.DataFrame({
    "year": ["2001", "2002", "1999"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
})

try:
    transform(invalid_df)
except pa.errors.SchemaError as exc:
    print(exc)

Column 'year' failed element-wise validator number 0: greater_than(2000) failure cases: 1999

As you can see in the examples above, you can define a schema by sub-classing DataFrameModel and defining column/index fields as class attributes. The check_types() decorator is required to perform validation of the dataframe at run-time.

The Field() class is used to define the schema specification for a column or index.

Note that Fields apply to both Column and Index objects, exposing the built-in Checks via keyword arguments.

(New in 0.6.2) When you access a class attribute defined on the schema, it will return the name of the column used in the validated pd.DataFrame. In the example above, this will simply be the string "year".

print(f"Column name for 'year' is {InputSchema.year}\n")
print(df.loc[:, [InputSchema.year, "day"]])

Column name for 'year' is year

   year  day
0  2001  200
1  2002  156
2  2003  365

Using Data Types directly for Column Type Annotations¶

new in 0.15.0

For conciseness, you can also use type annotations for columns without using the Series generic. These class attributes will be interpreted as Column objects under the hood.

class InputSchema(pa.DataFrameModel):
    year: int = pa.Field(gt=2000, coerce=True)
    month: int = pa.Field(ge=1, le=12, coerce=True)
    day: int = pa.Field(ge=0, le=365, coerce=True)

Reusing Field objects¶

To define reusable Field definitions, you need to use functools.partial. This makes sure that each field attribute is bound to a unique Field instance.

from functools import partial
from pandera.pandas import DataFrameModel, Field

NormalizedField = partial(Field, ge=0, le=1)

class SchemaWithReusedFields(DataFrameModel):
    xnorm: float = NormalizedField()
    ynorm: float = NormalizedField()
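
The reason for using partial can be illustrated without pandera. The sketch below uses a hypothetical FakeField class (a stand-in, not pandera's Field) to show that each call to a partial constructs a fresh object, whereas one pre-built object assigned to several attributes is a single shared instance:

```python
from functools import partial


class FakeField:
    """Hypothetical stand-in for a field object; not pandera's Field."""

    def __init__(self, ge=None, le=None):
        self.ge = ge
        self.le = le


# A partial is a factory: each call constructs a new FakeField.
make_normalized = partial(FakeField, ge=0, le=1)
xnorm = make_normalized()
ynorm = make_normalized()

# A pre-built object assigned twice is the same instance under two names.
shared = FakeField(ge=0, le=1)
xshared = yshared = shared

print(xnorm is ynorm)      # False
print(xshared is yshared)  # True
```

With the factory, anything attached to one field object later cannot leak into the other.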

Validate on Initialization¶

new in 0.8.0

Pandera provides an interface for validating dataframes on initialization. This API uses the pandera.typing.pandas.DataFrame generic type to validate against the DataFrameModel type variable on initialization:

import pandas as pd
import pandera.pandas as pa

from pandera.typing import DataFrame, Series


class Schema(pa.DataFrameModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})

DataFrame[Schema](
    {
        'state': ['NY','FL','GA','CA'],
        'city': ['New York', 'Miami', 'Atlanta', 'San Francisco'],
        'price': [8, 12, 10, 16],
    }
)

	state	city	price
0	NY	New York	8
1	FL	Miami	12
2	GA	Atlanta	10
3	CA	San Francisco	16

Refer to Supported DataFrame Libraries to see how this syntax applies to other supported dataframe types.

GeoPandas `GeoDataFrameModel`¶

For geopandas.GeoDataFrame workflows, use import pandera.geopandas as pg (the module includes the full pandera.pandas API) and subclass GeoDataFrameModel instead of DataFrameModel when you need validate() (and example(), empty()) to return a GeoDataFrame, preserving active geometry and CRS metadata after validation. For the object-based API, use GeoDataFrameSchema. Field definitions, checks, parsers, and Config work the same as for DataFrameModel; see Data Validation with GeoPandas and the GeoPandas API reference.

Converting to DataFrameSchema¶

You can easily convert a DataFrameModel class into a DataFrameSchema:

print(InputSchema.to_schema())

<Schema DataFrameSchema(
    columns={
        'year': <Schema Column(name=year, type=DataType(int64))>
        'month': <Schema Column(name=month, type=DataType(int64))>
        'day': <Schema Column(name=day, type=DataType(int64))>
    },
    checks=[],
    parsers=[],
    coerce=False,
    dtype=None,
    index=None,
    strict=False,
    name=InputSchema,
    ordered=False,
    unique_column_names=False,
    metadata=None, 
    add_missing_columns=False
)>

You can also use the validate() method to validate dataframes:

print(InputSchema.validate(df))

   year  month  day
0  2001      3  200
1  2002      6  156
2  2003     12  365

Or you can use the DataFrameModel() class directly to validate dataframes, which is syntactic sugar that simply delegates to the validate() method.

print(InputSchema(df))

   year  month  day
0  2001      3  200
1  2002      6  156
2  2003     12  365

Validate Against Multiple Schemas¶

new in 0.14.0

The built-in typing.Union type is supported for multiple DataFrame schemas.

from typing import Union
import pandas as pd
import pandera.pandas as pa
from pandera.typing import DataFrame, Series

class OnlyZeroesSchema(pa.DataFrameModel):
    a: Series[int] = pa.Field(eq=0)

class OnlyOnesSchema(pa.DataFrameModel):
    a: Series[int] = pa.Field(eq=1)

@pa.check_types
def return_zeros_or_ones(
    df: Union[DataFrame[OnlyZeroesSchema], DataFrame[OnlyOnesSchema]]
) -> Union[DataFrame[OnlyZeroesSchema], DataFrame[OnlyOnesSchema]]:
    return df

# passes
return_zeros_or_ones(pd.DataFrame({"a": [0, 0]}))
return_zeros_or_ones(pd.DataFrame({"a": [1, 1]}))

# fails
try:
    return_zeros_or_ones(pd.DataFrame({"a": [0, 2]}))
except pa.errors.SchemaErrors as exc:
    print(exc)

{
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "OnlyZeroesSchema",
                "column": "a",
                "check": "equal_to(0)",
                "error": "Column 'a' failed element-wise validator number 0: equal_to(0) failure cases: 2"
            },
            {
                "schema": "OnlyOnesSchema",
                "column": "a",
                "check": "equal_to(1)",
                "error": "Column 'a' failed element-wise validator number 0: equal_to(1) failure cases: 0, 2"
            }
        ]
    }
}

Note that when a Union mixes DataFrame schemas with built-in types, pandera does not check the built-in types. Use pydantic to check and/or coerce any built-in types.

import pandas as pd
from typing import Union
import pandera.pandas as pa
from pandera.typing import DataFrame, Series

class OnlyZeroesSchema(pa.DataFrameModel):
    a: Series[int] = pa.Field(eq=0)


@pa.check_types
def df_and_int_types(
    val: Union[DataFrame[OnlyZeroesSchema], int]
) -> Union[DataFrame[OnlyZeroesSchema], int]:
    return val


df_and_int_types(pd.DataFrame({"a": [0, 0]}))
int_val = df_and_int_types(5)
str_val = df_and_int_types("5")

no_pydantic_report = f"No Pydantic: {isinstance(int_val, int)}, {isinstance(str_val, int)}"


@pa.check_types(with_pydantic=True)
def df_and_int_types_with_pydantic(
    val: Union[DataFrame[OnlyZeroesSchema], int]
) -> Union[DataFrame[OnlyZeroesSchema], int]:
    return val


df_and_int_types_with_pydantic(pd.DataFrame({"a": [0, 0]}))
int_val_w_pyd = df_and_int_types_with_pydantic(5)
str_val_w_pyd = df_and_int_types_with_pydantic("5")

pydantic_report = f"With Pydantic: {isinstance(int_val_w_pyd, int)}, {isinstance(str_val_w_pyd, int)}"

print(no_pydantic_report)
print(pydantic_report)

No Pydantic: True, False
With Pydantic: True, True

Excluded attributes¶

Class variables which begin with an underscore will be automatically excluded from the model. Config is also a reserved name. However, aliases can be used to circumvent these limitations.

Supported dtypes¶

Any dtypes supported by pandera can be used as type parameters for Series and Index. There are, however, a couple of gotchas.

Important

You can learn more about how data type validation works in Data Type Validation.

Dtype aliases¶

import pandera.pandas as pa
from pandera.typing import Series, String

class Schema(pa.DataFrameModel):
    a: Series[String]

Type Vs instance¶

You must give a type, not an instance.

✅ Good:

import pandas as pd

class Schema(pa.DataFrameModel):
    a: Series[pd.StringDtype]

❌ Bad:

Note

This is only applicable for pandas versions < 2.0.0. In pandas >= 2.0.0, pd.StringDtype() will produce a type.

class Schema(pa.DataFrameModel):
    a: Series[pd.StringDtype()]

Parametrized dtypes¶

Pandas supports a couple of parametrized dtypes. As of pandas 1.2.0:

Kind of Data	Data Type	Parameters
tz-aware datetime	`DatetimeTZDtype`	`unit`, `tz`
Categorical	`CategoricalDtype`	`categories`, `ordered`
period	`PeriodDtype`	`freq`
sparse	`SparseDtype`	`dtype`, `fill_value`
intervals	`IntervalDtype`	`subtype`

Annotated¶

Parameters can be given via typing.Annotated. It requires python >= 3.9 or typing_extensions, which is already a requirement of Pandera. Unfortunately typing.Annotated has not been backported to python 3.6.

✅ Good:

try:
    from typing import Annotated  # python 3.9+
except ImportError:
    from typing_extensions import Annotated

class Schema(pa.DataFrameModel):
    col: Series[Annotated[pd.DatetimeTZDtype, "ns", "est"]]

Furthermore, you must pass all parameters in the order defined in the dtype’s constructor (see table).

❌ Bad:

class Schema(pa.DataFrameModel):
    col: Series[Annotated[pd.DatetimeTZDtype, "utc"]]

Schema.to_schema()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[14], line 4
class Schema(pa.DataFrameModel):
   col: Series[Annotated[pd.DatetimeTZDtype, "utc"]]

----> 4 Schema.to_schema()

File ~/checkouts/readthedocs.org/user_builds/pandera/checkouts/stable/pandera/api/dataframe/model.py:351, in DataFrameModel.to_schema(cls)
@classmethod
def to_schema(cls) -> TSchema:
   """Create :class:`~pandera.DataFrameSchema` from the :class:`.DataFrameModel`."""
--> 351     return cls.__schema__

File ~/checkouts/readthedocs.org/user_builds/pandera/checkouts/stable/pandera/api/dataframe/model.py:193, in _SchemaDescriptor.__get__(self, obj, cls)
   kwargs = {}
try:
--> 193     MODEL_CACHE[(cls, thread_id)] = cls.build_schema_(**kwargs)
except NotImplementedError as e:
   # Raise AttributeError to signal that this attribute is not available
   # for abstract/incomplete model classes. This allows introspection tools
   # like pydoc to continue without crashing.
   raise AttributeError(
       f"'{cls.__name__}' does not implement build_schema_() and cannot "
       f"generate a schema. To be able to generate a schema, subclass the"
       "DataFrameModel for a specific backend."
   ) from e

File ~/checkouts/readthedocs.org/user_builds/pandera/checkouts/stable/pandera/api/pandas/model.py:91, in DataFrameModel.build_schema_(cls, **kwargs)
@classmethod
def build_schema_(cls, **kwargs) -> DataFrameSchema:
   multiindex_kwargs = {
       name[len("multiindex_") :]: value
       for name, value in vars(cls.__config__).items()
       if name.startswith("multiindex_")
   }
---> 91     columns, index = cls._build_columns_index(
       cls.__fields__,
       cls.__checks__,
       cls.__parsers__,
       **multiindex_kwargs,
   )
   return DataFrameSchema(
       columns,
       index=index,
   (...)    102         **kwargs,
   )

File ~/checkouts/readthedocs.org/user_builds/pandera/checkouts/stable/pandera/api/pandas/model.py:142, in DataFrameModel._build_columns_index(cls, fields, checks, parsers, **multiindex_kwargs)
# ``Annotated`` may carry only a FieldInfo (e.g.
# ``Annotated[str, pa.Field(...)]``) without any dtype
# parameters. In that case, use the annotated type as-is.
if _dtype_metadata(annotation):
--> 142     dtype_kwargs = get_dtype_kwargs(annotation)
   dtype = annotation.arg(**dtype_kwargs)  # type: ignore
else:

File ~/checkouts/readthedocs.org/user_builds/pandera/checkouts/stable/pandera/api/dataframe/model.py:102, in get_dtype_kwargs(annotation)
dtype_arg_names = list(sig.parameters.keys())
if len(dtype_params) != len(dtype_arg_names):
--> 102     raise TypeError(
       f"Annotation '{annotation.arg.__name__}' requires "  # type: ignore
       + f"all positional arguments {dtype_arg_names}."
   )
return dict(zip(dtype_arg_names, dtype_params))

TypeError: Annotation 'DatetimeTZDtype' requires all positional arguments ['unit', 'tz'].
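
The positional pairing behind this error can be sketched with the standard library alone. The code below is an approximation modelled on the get_dtype_kwargs frame shown in the traceback, not pandera's actual implementation; FakeTZDtype and dtype_kwargs_from are hypothetical names introduced for illustration.

```python
import inspect
from typing import Annotated, get_args


class FakeTZDtype:
    """Hypothetical stand-in for a parametrized dtype such as pd.DatetimeTZDtype."""

    def __init__(self, unit, tz):
        self.unit = unit
        self.tz = tz


def dtype_kwargs_from(annotation):
    """Pair Annotated metadata with the dtype constructor's parameter names, in order."""
    dtype_cls, *params = get_args(annotation)
    names = list(inspect.signature(dtype_cls).parameters)
    if len(params) != len(names):
        raise TypeError(
            f"Annotation '{dtype_cls.__name__}' requires "
            f"all positional arguments {names}."
        )
    # zip() pairs by position, which is why the order of parameters matters.
    return dict(zip(names, params))


print(dtype_kwargs_from(Annotated[FakeTZDtype, "ns", "est"]))
# {'unit': 'ns', 'tz': 'est'}

try:
    dtype_kwargs_from(Annotated[FakeTZDtype, "utc"])
except TypeError as exc:
    print(exc)
```

Because the values are matched to names purely by position, supplying them in a different order would assign them to the wrong parameters without raising this error.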

Field¶

✅ Good:

class SchemaFieldDatetimeTZDtype(pa.DataFrameModel):
    col: Series[pd.DatetimeTZDtype] = pa.Field(
        dtype_kwargs={"unit": "ns", "tz": "EST"}
    )

You cannot use both typing.Annotated and dtype_kwargs.

❌ Bad:

class SchemaFieldDatetimeTZDtype(pa.DataFrameModel):
    col: Series[Annotated[pd.DatetimeTZDtype, "ns", "est"]] = pa.Field(
        dtype_kwargs={"unit": "ns", "tz": "EST"}
    )

Schema.to_schema()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[16], line 6
   col: Series[Annotated[pd.DatetimeTZDtype, "ns", "est"]] = pa.Field(
       dtype_kwargs={"unit": "ns", "tz": "EST"}
   )

----> 6 Schema.to_schema()

File ~/checkouts/readthedocs.org/user_builds/pandera/checkouts/stable/pandera/api/dataframe/model.py:351, in DataFrameModel.to_schema(cls)
@classmethod
def to_schema(cls) -> TSchema:
   """Create :class:`~pandera.DataFrameSchema` from the :class:`.DataFrameModel`."""
--> 351     return cls.__schema__

File ~/checkouts/readthedocs.org/user_builds/pandera/checkouts/stable/pandera/api/dataframe/model.py:193, in _SchemaDescriptor.__get__(self, obj, cls)
   kwargs = {}
try:
--> 193     MODEL_CACHE[(cls, thread_id)] = cls.build_schema_(**kwargs)
except NotImplementedError as e:
   # Raise AttributeError to signal that this attribute is not available
   # for abstract/incomplete model classes. This allows introspection tools
   # like pydoc to continue without crashing.
   raise AttributeError(
       f"'{cls.__name__}' does not implement build_schema_() and cannot "
       f"generate a schema. To be able to generate a schema, subclass the"
       "DataFrameModel for a specific backend."
   ) from e

File ~/checkouts/readthedocs.org/user_builds/pandera/checkouts/stable/pandera/api/pandas/model.py:91, in DataFrameModel.build_schema_(cls, **kwargs)
@classmethod
def build_schema_(cls, **kwargs) -> DataFrameSchema:
   multiindex_kwargs = {
       name[len("multiindex_") :]: value
       for name, value in vars(cls.__config__).items()
       if name.startswith("multiindex_")
   }
---> 91     columns, index = cls._build_columns_index(
       cls.__fields__,
       cls.__checks__,
       cls.__parsers__,
       **multiindex_kwargs,
   )
   return DataFrameSchema(
       columns,
       index=index,
   (...)    102         **kwargs,
   )

File ~/checkouts/readthedocs.org/user_builds/pandera/checkouts/stable/pandera/api/pandas/model.py:142, in DataFrameModel._build_columns_index(cls, fields, checks, parsers, **multiindex_kwargs)
# ``Annotated`` may carry only a FieldInfo (e.g.
# ``Annotated[str, pa.Field(...)]``) without any dtype
# parameters. In that case, use the annotated type as-is.
if _dtype_metadata(annotation):
--> 142     dtype_kwargs = get_dtype_kwargs(annotation)
   dtype = annotation.arg(**dtype_kwargs)  # type: ignore
else:

File ~/checkouts/readthedocs.org/user_builds/pandera/checkouts/stable/pandera/api/dataframe/model.py:102, in get_dtype_kwargs(annotation)
dtype_arg_names = list(sig.parameters.keys())
if len(dtype_params) != len(dtype_arg_names):
--> 102     raise TypeError(
       f"Annotation '{annotation.arg.__name__}' requires "  # type: ignore
       + f"all positional arguments {dtype_arg_names}."
   )
return dict(zip(dtype_arg_names, dtype_params))

TypeError: Annotation 'DatetimeTZDtype' requires all positional arguments ['unit', 'tz'].

Embedding `Field` metadata in `Annotated`¶

You can also embed a Field() directly inside typing.Annotated to attach field-level metadata — such as description, title, unique, checks (ge, le, isin, etc.), or custom metadata — without having to provide an explicit = pa.Field(...) assignment. This is useful when you want the type annotation itself to fully describe the column.

from typing import Annotated

import pandas as pd
import pandera.pandas as pa


class Schema(pa.DataFrameModel):
    name: Annotated[str, pa.Field(description="Name of the person")]
    age: Annotated[int, pa.Field(ge=0, description="Age of the person")]
    month: Annotated[int, pa.Field(ge=1, le=12)]
    identifier: Annotated[int, pa.Field(unique=True, title="Identifier")]


schema = Schema.to_schema()
schema.columns["name"].description

'Name of the person'

schema.columns["month"].checks

[<Check greater_than_or_equal_to: greater_than_or_equal_to(1)>,
 <Check less_than_or_equal_to: less_than_or_equal_to(12)>]

When the annotation also carries dtype parameters (e.g. Annotated[pd.DatetimeTZDtype, "ns", "est"]), you can still append a Field at the end:

class SchemaWithDtypeParamsAndField(pa.DataFrameModel):
    ts: Annotated[
        pd.DatetimeTZDtype, "ns", "est", pa.Field(description="Timestamp")
    ]

If both an embedded Field and an explicit assignment are provided, the explicit assignment takes precedence:

class SchemaExplicitWins(pa.DataFrameModel):
    value: Annotated[int, pa.Field(description="from annotated")] = (
        pa.Field(description="from assignment")
    )


SchemaExplicitWins.to_schema().columns["value"].description

'from assignment'
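
The precedence rule can be pictured with a standard-library sketch. This is not how pandera is implemented; Meta, Model and resolve are hypothetical names, and the example only shows one plausible way to read metadata out of Annotated and let an explicit class-level assignment win.

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class Meta:
    """Hypothetical stand-in for field metadata."""

    description: str


class Model:
    label: Annotated[str, Meta("from annotated")]
    value: Annotated[int, Meta("from annotated")] = Meta("from assignment")


def resolve(cls):
    """Return the effective Meta per attribute; an explicit assignment takes precedence."""
    resolved = {}
    for attr, hint in get_type_hints(cls, include_extras=True).items():
        embedded = None
        if get_origin(hint) is Annotated:
            # get_args(hint)[0] is the underlying type; the rest is metadata.
            embedded = next(
                (m for m in get_args(hint)[1:] if isinstance(m, Meta)), None
            )
        # Fall back to the embedded metadata only if nothing was assigned.
        resolved[attr] = getattr(cls, attr, embedded)
    return resolved


effective = resolve(Model)
print(effective["label"].description)  # from annotated
print(effective["value"].description)  # from assignment
```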

Required Columns¶

By default all columns specified in the schema are required, meaning that if a column is missing in the input DataFrame an exception will be thrown. If you want to make a column optional, annotate it with typing.Optional.

from typing import Optional

import pandas as pd
import pandera.pandas as pa
from pandera.typing import Series


class Schema(pa.DataFrameModel):
    a: Series[str]
    b: Optional[Series[int]]

df = pd.DataFrame({"a": ["2001", "2002", "2003"]})
Schema.validate(df)

	a
0	2001
1	2002
2	2003

Optional means that a field may be absent from the input DataFrame. It does not add the field during validation. To add missing fields, use add_missing_columns=True in the model Config with required fields that specify a default value or nullable=True.

Schema Inheritance¶

You can also use inheritance to build schemas on top of a base schema.

class BaseSchema(pa.DataFrameModel):
    year: Series[str]

class FinalSchema(BaseSchema):
    year: Series[int] = pa.Field(ge=2000, coerce=True)  # overwrite the base type
    passengers: Series[int]
    idx: Index[int] = pa.Field(ge=0)

df = pd.DataFrame({
    "year": ["2000", "2001", "2002"],
})

@pa.check_types
def transform(df: DataFrame[BaseSchema]) -> DataFrame[FinalSchema]:
    return (
        df.assign(passengers=[61000, 50000, 45000])
        .set_index(pd.Index([1, 2, 3]))
        .astype({"year": int})
    )

transform(df)

	year	passengers
1	2000	61000
2	2001	50000
3	2002	45000

Multiple Inheritance¶

Multiple inheritance is also supported, making it easy to compose schemas from reusable “mixin” models without repeating yourself. For this to work, each of the base classes must inherit from DataFrameModel, otherwise their fields are not collected:

class A(pa.DataFrameModel):
    a: Series[int]

class B(pa.DataFrameModel):
    b: Series[int]

class C(A, B):
    c: Series[int]

C.to_schema()

<Schema DataFrameSchema(columns={'b': <Schema Column(name=b, type=DataType(int64))>, 'a': <Schema Column(name=a, type=DataType(int64))>, 'c': <Schema Column(name=c, type=DataType(int64))>}, checks=[], parsers=[], index=None, dtype=None, coerce=False, strict=False, name=C, ordered=False, unique=None, report_duplicates=all, unique_column_names=False, add_missing_columns=False, title=None, description=None, metadata=None, drop_invalid_rows=False)>

If a base class does not inherit from DataFrameModel, its fields are not recognized and a KeyError is raised when calling to_schema():

class A:  # not a DataFrameModel!
    a: Series[int]

class C(pa.DataFrameModel, A):
    c: Series[int]

C.to_schema()  # raises KeyError: 'a'

Config¶

Schema-wide options can be controlled via the Config class on the DataFrameModel subclass. The full set of options can be found in the BaseConfig class.

class Schema(pa.DataFrameModel):

    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)

    class Config:
        name = "BaseSchema"
        strict = True
        coerce = True
        foo = "bar"  # Interpreted as dataframe check
        baz = ...    # Interpreted as a dataframe check with no additional arguments

It is not required for the Config to subclass BaseConfig but it must be named ‘Config’.

See Registered Custom Checks with the Class-based API for details on using registered dataframe checks.

MultiIndex¶

The MultiIndex capabilities are also supported with the class-based API:

import pandera.pandas as pa
from pandera.typing import Index, Series

class MultiIndexSchema(pa.DataFrameModel):

    year: Index[int] = pa.Field(gt=2000, coerce=True)
    month: Index[int] = pa.Field(ge=1, le=12, coerce=True)
    passengers: Series[int]

    class Config:
        # provide multi index options in the config
        multiindex_name = "time"
        multiindex_strict = True
        multiindex_coerce = True

index = MultiIndexSchema.to_schema().index
print(index)

<Schema MultiIndex(
    indexes=[
        <Schema Index(name=year, type=DataType(int64))>
        <Schema Index(name=month, type=DataType(int64))>
    ]
    coerce=True,
    strict=True,
    name=time,
    ordered=True
)>

from pprint import pprint

pprint({name: col.checks for name, col in index.columns.items()})

{'month': [<Check greater_than_or_equal_to: greater_than_or_equal_to(1)>,
           <Check less_than_or_equal_to: less_than_or_equal_to(12)>],
 'year': [<Check greater_than: greater_than(2000)>]}

Multiple Index annotations are automatically converted into a MultiIndex. MultiIndex options are given in the Config.

Index Name¶

Use check_name to validate the index name of a single-index dataframe:

import pandas as pd
import pandera.pandas as pa
from pandera.typing import Index, Series

class Schema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    passengers: Series[int]
    idx: Index[int] = pa.Field(ge=0, check_name=True)

df = pd.DataFrame({
    "year": [2001, 2002, 2003],
    "passengers": [61000, 50000, 45000],
})

try:
    Schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)

Expected <class 'pandas.core.series.Series'> to have name 'idx', found 'None'

The default check_name value of None translates to True for columns and multi-index.

Custom Checks¶

Unlike the object-based API, custom checks can be specified as class methods.

Column/Index checks¶

import pandera.pandas as pa
from pandera.typing import Index, Series

class CustomCheckSchema(pa.DataFrameModel):

    a: Series[int] = pa.Field(gt=0, coerce=True)
    abc: Series[int]
    idx: Index[str]

    @pa.check("a", name="foobar")
    def custom_check(cls, a: Series[int]) -> Series[bool]:
        return a < 100

    @pa.check("^a", regex=True, name="foobar")
    def custom_check_regex(cls, a: Series[int]) -> Series[bool]:
        return a > 0

    @pa.check("idx")
    def check_idx(cls, idx: Index[str]) -> Series[bool]:
        return idx.str.contains("dog")

Note

You can supply the keyword arguments of the Check class initializer to get the flexibility of groupby checks.
As in pydantic, the classmethod() decorator is added behind the scenes if omitted.
You may still need to add the @classmethod decorator after the check() decorator if your static type checker or linter complains.
Since checks are class methods, the first argument they receive is a DataFrameModel subclass, not an instance of a model.

from typing import Dict

class GroupbyCheckSchema(pa.DataFrameModel):

    value: Series[int] = pa.Field(gt=0, coerce=True)
    group: Series[str] = pa.Field(isin=["A", "B"])

    @pa.check("value", groupby="group", regex=True, name="check_means")
    def check_groupby(cls, grouped_value: Dict[str, Series[int]]) -> bool:
        return grouped_value["A"].mean() < grouped_value["B"].mean()

df = pd.DataFrame({
    "value": [100, 110, 120, 10, 11, 12],
    "group": list("AAABBB"),
})

try:
    print(GroupbyCheckSchema.validate(df))
except pa.errors.SchemaError as exc:
    print(exc)

Column 'value' failed series or dataframe validator 1: <Check check_means>

DataFrame Checks¶

You can also define dataframe-level checks, similar to the object-based API, using the dataframe_check() decorator:

import pandas as pd
import pandera.pandas as pa
from pandera.typing import Index, Series

class DataFrameCheckSchema(pa.DataFrameModel):

    col1: Series[int] = pa.Field(gt=0, coerce=True)
    col2: Series[float] = pa.Field(gt=0, coerce=True)
    col3: Series[float] = pa.Field(lt=0, coerce=True)

    @pa.dataframe_check
    def product_is_negative(cls, df: pd.DataFrame) -> Series[bool]:
        return df["col1"] * df["col2"] * df["col3"] < 0

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [5, 6, 7],
    "col3": [-1, -2, -3],
})

DataFrameCheckSchema.validate(df)

	col1	col2	col3
0	1	5.0	-1.0
1	2	6.0	-2.0
2	3	7.0	-3.0

Inheritance¶

Custom checks are inherited and can therefore be overridden by the subclass.

import pandas as pd
import pandera.pandas as pa
from pandera.typing import Index, Series

class Parent(pa.DataFrameModel):

    a: Series[int] = pa.Field(coerce=True)

    @pa.check("a", name="foobar")
    def check_a(cls, a: Series[int]) -> Series[bool]:
        return a < 100


class Child(Parent):

    a: Series[int] = pa.Field(coerce=False)

    @pa.check("a", name="foobar")
    def check_a(cls, a: Series[int]) -> Series[bool]:
        return a > 100

is_a_coerce = Child.to_schema().columns["a"].coerce
print(f"coerce: {is_a_coerce}")

coerce: False

df = pd.DataFrame({"a": [1, 2, 3]})

try:
    Child.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)

Column 'a' failed element-wise validator number 0: <Check foobar> failure cases: 1, 2, 3

Aliases¶

DataFrameModel supports columns which are not valid Python variable names via the alias argument of Field.

Checks must reference the aliased names.

import pandera.pandas as pa
import pandas as pd

class Schema(pa.DataFrameModel):
    col_2020: pa.typing.Series[int] = pa.Field(alias=2020)
    idx: pa.typing.Index[int] = pa.Field(alias="_idx", check_name=True)

    @pa.check(2020)
    def int_column_lt_100(cls, series):
        return series < 100


df = pd.DataFrame({2020: [99]}, index=[0])
df.index.name = "_idx"

print(Schema.validate(df))

      2020
_idx      
0       99

(New in 0.6.2) The alias is respected when using the class attribute to get the underlying pd.DataFrame column name or index level name.

print(Schema.col_2020)

Very similar to the example above, you can also use the variable name directly within the class scope, and it will respect the alias.

Note

To access a variable from the class scope, you need to make it a class attribute, and therefore assign it a default Field.

import pandera.pandas as pa
import pandas as pd

class Schema(pa.DataFrameModel):
    a: pa.typing.Series[int] = pa.Field()
    col_2020: pa.typing.Series[int] = pa.Field(alias=2020)

    @pa.check(col_2020)
    def int_column_lt_100(cls, series):
        return series < 100

    @pa.check(a)
    def int_column_gt_100(cls, series):
        return series > 100


df = pd.DataFrame({2020: [99], "a": [101]})
print(Schema.validate(df))

   2020    a
0    99  101

Manipulating DataFrame Models post-definition¶

One caveat of using inheritance to build schemas on top of each other is that a child class has no clear way to, for example, remove fields or update them without completely overriding previous settings. This is because inheritance is strictly additive.

DataFrameSchema objects do have these options though, as described in DataFrameSchema Transformations, which you can leverage by overriding your DataFrame Model’s to_schema() method.

DataFrame Models are for the most part just a proxy for the DataFrameSchema API; calling validate() simply delegates to the validate method of the DataFrameSchema returned by to_schema(). As such, any changes made to the schema inside to_schema() will propagate cleanly.

As an example, the following class hierarchy cannot drop the fields b and c from Baz by restructuring the base classes without convoluting the inheritance tree. Instead, we can remove them like this:

import pandera.pandas as pa
import pandas as pd

class Foo(pa.DataFrameModel):
    a: pa.typing.Series[int]
    b: pa.typing.Series[int]

class Bar(pa.DataFrameModel):
    c: pa.typing.Series[int]
    d: pa.typing.Series[int]

class Baz(Foo, Bar):

    @classmethod
    def to_schema(cls) -> pa.DataFrameSchema:
        schema = super().to_schema()
        return schema.remove_columns(["b", "c"])

df = pd.DataFrame({"a": [99], "d": [101]})
print(Baz.validate(df))

    a    d
0  99  101

Note

There are drawbacks to manipulating schema shape in this way:

Static code analysis has no way to figure out what fields have been removed/updated from the class definitions and inheritance hierarchy.
Any children of classes which have overridden to_schema might experience surprising behavior – if a child of Baz tries to define a field b or c again, it will lose it in its to_schema call because Baz’s to_schema will always be executed after any child’s class body has already been fully assembled.
