Schema Models

new in 0.5.0

pandera provides a class-based API that’s heavily inspired by pydantic. In contrast to the object-based API, you can define schema models in much the same way you’d define pydantic models.

Schema Models are annotated with the pandera.typing module using the standard typing syntax. Models can be explicitly converted to a DataFrameSchema or used to validate a DataFrame directly.

Note

Due to current limitations in the pandas library, pandera annotations are only used for run-time validation and cannot be leveraged by static type checkers like mypy. See the discussion here for more details.

Basic Usage

import pandas as pd
import pandera as pa
from pandera.typing import Index, DataFrame, Series


class InputSchema(pa.SchemaModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)

class OutputSchema(InputSchema):
    revenue: Series[float]

@pa.check_types
def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    return df.assign(revenue=100.0)


df = pd.DataFrame({
    "year": ["2001", "2002", "2003"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
})

transform(df)

invalid_df = pd.DataFrame({
    "year": ["2001", "2002", "1999"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
})
transform(invalid_df)
Traceback (most recent call last):
...
pandera.errors.SchemaError: <Schema Column: 'year' type=<class 'int'>> failed element-wise validator 0:
<Check greater_than: greater_than(2000)>
failure cases:
   index  failure_case
0      2          1999

As the example above shows, you can define a schema by subclassing SchemaModel and defining column/index fields as class attributes. The check_types() decorator is required to perform run-time validation of the dataframe.

Note that Fields apply to both Column and Index objects, exposing the built-in Checks via keyword arguments.

Converting to DataFrameSchema

You can easily convert a SchemaModel class into a DataFrameSchema:

print(InputSchema.to_schema())
DataFrameSchema(
    columns={
        "year": "<Schema Column: 'year' type=<class 'int'>>",
        "month": "<Schema Column: 'month' type=<class 'int'>>",
        "day": "<Schema Column: 'day' type=<class 'int'>>"
    },
    checks=[],
    index=None,
    coerce=False,
    strict=False
)

Or use the validate() method to validate dataframes:

print(InputSchema.validate(df))
   year  month  day
0  2001      3  200
1  2002      6  156
2  2003     12  365

Supported dtypes

Any dtypes supported by pandera can be used as type parameters for Series and Index. There are, however, a couple of gotchas:

  1. The enumeration PandasDtype is not directly supported because the type parameter of a typing.Generic cannot be an enumeration [1]. Instead, you can use the pandera.typing counterparts: pandera.typing.Category, pandera.typing.Float32, …

Good:

import pandera as pa
from pandera.typing import Series, String

class Schema(pa.SchemaModel):
    a: Series[String]

Bad:

class Schema(pa.SchemaModel):
    a: Series[pa.PandasDtype.String]
Traceback (most recent call last):
...
AttributeError: type object 'Generic' has no attribute 'value'

  2. You must give a type, not an instance.

Good:

import pandas as pd

class Schema(pa.SchemaModel):
    a: Series[pd.StringDtype]

Bad:

class Schema(pa.SchemaModel):
    a: Series[pd.StringDtype()]
Traceback (most recent call last):
...
TypeError: Parameters to generic types must be types. Got StringDtype.

Required Columns

By default, all columns specified in the schema are required: if a column is missing from the input DataFrame, a SchemaError is raised. To make a column optional, annotate it with typing.Optional.

from typing import Optional

import pandas as pd
import pandera as pa
from pandera.typing import Series


class Schema(pa.SchemaModel):
    a: Series[str]
    b: Optional[Series[int]]


df = pd.DataFrame({"a": ["2001", "2002", "2003"]})
Schema.validate(df)

Schema Inheritance

You can also use inheritance to build schemas on top of a base schema.

class BaseSchema(pa.SchemaModel):
    year: Series[str]

class FinalSchema(BaseSchema):
    year: Series[int] = pa.Field(ge=2000, coerce=True)  # overwrite the base type
    passengers: Series[int]
    idx: Index[int] = pa.Field(ge=0)

df = pd.DataFrame({
    "year": ["2000", "2001", "2002"],
})

@pa.check_types
def transform(df: DataFrame[BaseSchema]) -> DataFrame[FinalSchema]:
    return (
        df.assign(passengers=[61000, 50000, 45000])
        .set_index(pd.Index([1, 2, 3]))
        .astype({"year": int})
    )

print(transform(df))
   year  passengers
1  2000       61000
2  2001       50000
3  2002       45000

Config

Schema-wide options can be controlled via the Config class on the SchemaModel subclass. The full set of options can be found in the BaseConfig class.

class Schema(pa.SchemaModel):

    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)

    class Config:
        name = "BaseSchema"
        strict = True
        coerce = True
        foo = "bar"  # not a valid option, ignored

The Config class is not required to subclass BaseConfig, but it must be named ‘Config’.

MultiIndex

The MultiIndex capabilities are also supported with the class-based API:

import pandera as pa
from pandera.typing import Index, Series

class MultiIndexSchema(pa.SchemaModel):

    year: Index[int] = pa.Field(gt=2000, coerce=True)
    month: Index[int] = pa.Field(ge=1, le=12, coerce=True)
    passengers: Series[int]

    class Config:
        # provide multi index options in the config
        multiindex_name = "time"
        multiindex_strict = True
        multiindex_coerce = True

index = MultiIndexSchema.to_schema().index
print(index)
MultiIndex(
    columns={
        "year": "<Schema Column: 'year' type=<class 'int'>>",
        "month": "<Schema Column: 'month' type=<class 'int'>>"
    },
    checks=[],
    index=None,
    coerce=True,
    strict=True
)
from pprint import pprint

pprint({name: col.checks for name, col in index.columns.items()})
{'month': [<Check greater_than_or_equal_to: greater_than_or_equal_to(1)>,
           <Check less_than_or_equal_to: less_than_or_equal_to(12)>],
 'year': [<Check greater_than: greater_than(2000)>]}

Multiple Index annotations are automatically converted into a MultiIndex. MultiIndex options are given in the Config.

Custom Checks

Unlike the object-based API, custom checks can be specified as class methods.

Column/Index checks

import pandera as pa
from pandera.typing import Index, Series

class CustomCheckSchema(pa.SchemaModel):

    a: Series[int] = pa.Field(gt=0, coerce=True)
    abc: Series[int]
    idx: Index[str]

    @pa.check("a", name="foobar")
    def custom_check(cls, a: Series[int]) -> Series[bool]:
        return a < 100

    @pa.check("^a", regex=True, name="foobar")
    def custom_check_regex(cls, a: Series[int]) -> Series[bool]:
        return a > 0

    @pa.check("idx")
    def check_idx(cls, idx: Index[int]) -> Series[bool]:
        return idx.str.contains("dog")

Note

  • You can supply the keyword arguments of the Check class initializer to get the flexibility of groupby checks.

  • Similarly to pydantic, the classmethod() decorator is added behind the scenes if omitted.

  • You may still need to add the @classmethod decorator after the check() decorator if your static type checker or linter complains.

  • Since checks are class methods, the first argument they receive is the SchemaModel subclass, not an instance of a model.

from typing import Dict

class GroupbyCheckSchema(pa.SchemaModel):

    value: Series[int] = pa.Field(gt=0, coerce=True)
    group: Series[str] = pa.Field(isin=["A", "B"])

    @pa.check("value", groupby="group", regex=True, name="check_means")
    def check_groupby(cls, grouped_value: Dict[str, Series[int]]) -> bool:
        return grouped_value["A"].mean() < grouped_value["B"].mean()

df = pd.DataFrame({
    "value": [100, 110, 120, 10, 11, 12],
    "group": list("AAABBB"),
})

print(GroupbyCheckSchema.validate(df))
Traceback (most recent call last):
...
pandera.errors.SchemaError: <Schema Column: 'value' type=<class 'int'>> failed series validator 1:
<Check check_means>

DataFrame Checks

You can also define dataframe-level checks, similar to the object-based API, using the dataframe_check() decorator:

import pandas as pd
import pandera as pa
from pandera.typing import Index, Series

class DataFrameCheckSchema(pa.SchemaModel):

    col1: Series[int] = pa.Field(gt=0, coerce=True)
    col2: Series[float] = pa.Field(gt=0, coerce=True)
    col3: Series[float] = pa.Field(lt=0, coerce=True)

    @pa.dataframe_check
    def product_is_negative(cls, df: pd.DataFrame) -> Series[bool]:
        return df["col1"] * df["col2"] * df["col3"] < 0

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [5, 6, 7],
    "col3": [-1, -2, -3],
})

DataFrameCheckSchema.validate(df)

Inheritance

Custom checks are inherited and can therefore be overridden by a subclass.

import pandas as pd
import pandera as pa
from pandera.typing import Index, Series

class Parent(pa.SchemaModel):

    a: Series[int] = pa.Field(coerce=True)

    @pa.check("a", name="foobar")
    def check_a(cls, a: Series[int]) -> Series[bool]:
        return a < 100


class Child(Parent):

    a: Series[int] = pa.Field(coerce=False)

    @pa.check("a", name="foobar")
    def check_a(cls, a: Series[int]) -> Series[bool]:
        return a > 100

is_a_coerce = Child.to_schema().columns["a"].coerce
print(f"coerce: {is_a_coerce}")
coerce: False
df = pd.DataFrame({"a": [1, 2, 3]})
print(Child.validate(df))
Traceback (most recent call last):
...
pandera.errors.SchemaError: <Schema Column: 'a' type=<class 'int'>> failed element-wise validator 0:
<Check foobar>
failure cases:
    index  failure_case
0      0             1
1      1             2
2      2             3

Aliases

SchemaModel supports columns whose names are not valid Python identifiers via the alias argument of Field.

Checks must reference the aliased names.

import pandera as pa
import pandas as pd

class Schema(pa.SchemaModel):
    col_2020: pa.typing.Series[int] = pa.Field(alias=2020)
    idx: pa.typing.Index[int] = pa.Field(alias="_idx", check_name=True)

    @pa.check(2020)
    def int_column_lt_100(cls, series):
        return series < 100


df = pd.DataFrame({2020: [99]}, index=[0])
df.index.name = "_idx"

print(Schema.validate(df))
      2020
_idx
0       99

Footnotes

1

It is actually possible to use a PandasDtype by wrapping it in typing.Literal, e.g. Series[Literal[PandasDtype.Category]]. pandera.typing defines aliases to reduce this boilerplate.