Pandera Data Types#

new in 0.7.0


Pandera defines its own interface for data types in order to abstract the specifics of dataframe-like data structures in the Python ecosystem, such as Apache Spark, Apache Arrow and xarray.


In the following sections, Pandera Data Type refers to a pandera.dtypes.DataType object, whereas native data type refers to a data type used by a third-party library that Pandera supports (e.g. pandas).

Most of the time, this distinction is transparent to end users, since pandera columns and indexes accept native data types. However, it is possible to extend the pandera interface by:

  • modifying the data type check performed during schema validation.

  • modifying the behavior of the coerce argument for DataFrameSchema.

  • adding your own custom data types.

DataType basics#

All pandera data types inherit from pandera.dtypes.DataType and must be hashable.
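To make the hashability requirement concrete, here is a minimal standard-library sketch. It assumes behavior similar to a frozen dataclass, which is how this page later describes pandera.dtypes.immutable(). The name ToyCategory is hypothetical and is not a pandera class.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyCategory:
    """Hypothetical stand-in for a parametrized data type (not a pandera class)."""

    categories: Tuple[str, ...] = ()
    ordered: bool = False


a = ToyCategory(("x", "y"))
b = ToyCategory(("x", "y"))

# Equal field values give equal objects with equal hashes,
# so instances can serve as dictionary keys or set members.
registry = {a: "category"}
same_key = registry[b]

try:
    a.ordered = True  # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because the fields take part in equality and hashing, two instances with the same parameters are interchangeable as lookup keys, while instances with different parameters stay distinct.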

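A data type implements three key methods:

  • pandera.dtypes.DataType.check(), which validates that data types are equivalent.

  • pandera.dtypes.DataType.coerce(), which coerces a data container (e.g. pandas.Series) to the data type.

  • The dunder method __str__(), which should output the native alias. For example, str(pandera.Float64) == "float64".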

For pandera’s validation methods to be aware of a data type, it has to be registered with the targeted engine via pandera.engines.engine.Engine.register_dtype(). An engine is in charge of mapping a pandera DataType with a native data type counterpart belonging to a third-party library. The mapping can be queried with pandera.engines.engine.Engine.dtype().

As of pandera 0.7.0, only the pandas Engine is supported.
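The registration and lookup roles of an engine can be pictured with a toy registry. The sketch below uses only the standard library and is not pandera's implementation: ToyEngine, ToyBool and ToyLiteralBool are hypothetical names, and the real Engine does more (for example, dispatching on parametrized types).

```python
from typing import Any, Dict, Iterable


class ToyEngine:
    """Hypothetical registry mapping native representations to data type classes."""

    _registry: Dict[Any, type] = {}

    @classmethod
    def register_dtype(cls, equivalents: Iterable[Any] = ()):
        """Return a class decorator that registers the class under each equivalent."""

        def decorator(dtype_cls: type) -> type:
            for equivalent in equivalents:
                cls._registry[equivalent] = dtype_cls
            return dtype_cls

        return decorator

    @classmethod
    def dtype(cls, key: Any):
        """Look up a native representation and return a data type instance."""
        try:
            return cls._registry[key]()
        except KeyError:
            raise TypeError(f"data type {key!r} not understood") from None


@ToyEngine.register_dtype(equivalents=["boolean", bool])
class ToyBool:
    def __str__(self) -> str:
        return "boolean"


before = type(ToyEngine.dtype("boolean")).__name__


# Registering another class under the same keys replaces the lookup result.
@ToyEngine.register_dtype(equivalents=["boolean", bool])
class ToyLiteralBool(ToyBool):
    pass


after = type(ToyEngine.dtype("boolean")).__name__
```

In this toy model, a key that was never registered raises TypeError, and a class registered under only some of the keys would leave the remaining keys pointing at the earlier class.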


Let’s extend pandas.BooleanDtype coercion to handle the string literals "True" and "False".

import pandas as pd
import pandera as pa
from pandera import dtypes
from pandera.engines import pandas_engine

@pandas_engine.Engine.register_dtype  # step 1
@dtypes.immutable  # step 2
class LiteralBool(pandas_engine.BOOL):  # step 3
    def coerce(self, series: pd.Series) -> pd.Series:
        """Coerce a pandas.Series to the boolean type."""
        if pd.api.types.is_string_dtype(series):
            series = series.replace({"True": 1, "False": 0})
        return series.astype("boolean")

data = pd.Series(["True", "False"], name="literal_bools")

# step 4
print(
    pa.SeriesSchema(LiteralBool(), coerce=True, name="literal_bools")
    .validate(data)
    .dtype
)

The example above performs the following steps:

  1. Register the data type with the pandas engine.

  2. pandera.dtypes.immutable() creates an immutable (and hashable) dataclass().

  3. Inherit pandera.engines.pandas_engine.BOOL, which is the pandera representation of pandas.BooleanDtype. This is not mandatory but it makes our life easier by having already implemented all the required methods.

  4. Check that our new data type can coerce the string literals.

So far we did not override the default behavior:

import pandera as pa

pa.SeriesSchema("boolean", coerce=True).validate(data)
Traceback (most recent call last):
pandera.errors.SchemaError: Error while coercing 'literal_bools' to type boolean: Need to pass bool-like values

To completely replace the default BOOL, we need to supply all the equivalent representations to register_dtype(). Behind the scenes, when pa.SeriesSchema("boolean") is called, the corresponding pandera data type is looked up using pandera.engines.engine.Engine.dtype().

print(f"before: {pandas_engine.Engine.dtype('boolean').__class__}")


@pandas_engine.Engine.register_dtype(
    equivalents=["boolean", pd.BooleanDtype, pd.BooleanDtype()],
)
@dtypes.immutable
class LiteralBool(pandas_engine.BOOL):
    def coerce(self, series: pd.Series) -> pd.Series:
        """Coerce a pandas.Series to the boolean type."""
        if pd.api.types.is_string_dtype(series):
            series = series.replace({"True": 1, "False": 0})
        return series.astype("boolean")


print(f"after: {pandas_engine.Engine.dtype('boolean').__class__}")

# every registered equivalent now resolves to LiteralBool
for dtype in ["boolean", pd.BooleanDtype, pd.BooleanDtype()]:
    pa.SeriesSchema(dtype, coerce=True).validate(data)

before: <class 'pandera.engines.pandas_engine.BOOL'>
after: <class 'LiteralBool'>


For convenience, we specified both pd.BooleanDtype and pd.BooleanDtype() as equivalents. That gives us more flexibility in what pandera schemas can recognize (see the final for loop above).

Parametrized data types#

Some data types can be parametrized. One common example is pandas.CategoricalDtype.

The equivalents argument of register_dtype() does not handle this situation. Instead, register_dtype() automatically registers a classmethod() with the signature from_parametrized_dtype(cls, equivalent: ...) if the decorated DataType defines one. The equivalent argument must be type-annotated, because the annotation is used to dispatch the input of dtype() to the appropriate from_parametrized_dtype class method.

For example, here is a snippet from pandera.engines.pandas_engine.Category:

from typing import Union

import pandas as pd
from pandera import dtypes

# defined inside the body of the Category class
@classmethod
def from_parametrized_dtype(
    cls, cat: Union[dtypes.Category, pd.CategoricalDtype]
):
    """Convert a categorical to
    a Pandera :class:`pandera.dtypes.pandas_engine.Category`."""
    return cls(categories=cat.categories, ordered=cat.ordered)  # type: ignore


The dispatch mechanism relies on functools.singledispatch(). Unlike the built-in implementation, typing.Union is recognized.
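One way to make functools.singledispatch() honor a typing.Union annotation is to unpack the union and register the function once per member type. The sketch below illustrates that idea with the standard library only. It is not pandera's code, and register_union, describe and _describe_number are hypothetical names. Python 3.11 and later also accept union annotations in singledispatch directly.

```python
import functools
from typing import Union, get_args, get_origin, get_type_hints


def register_union(dispatcher, func):
    """Register func once for each member of its first annotated parameter's type."""
    hints = get_type_hints(func)
    hints.pop("return", None)
    # Annotations keep definition order, so this is the first annotated parameter.
    annotation = next(iter(hints.values()))
    if get_origin(annotation) is Union:
        members = get_args(annotation)
    else:
        members = (annotation,)
    for member in members:
        dispatcher.register(member, func)
    return func


@functools.singledispatch
def describe(value) -> str:
    """Fallback used when no registered type matches."""
    return "unknown"


def _describe_number(value: Union[int, float]) -> str:
    return "number"


register_union(describe, _describe_number)

results = [describe(1), describe(2.5), describe("a")]
```

Each member of the union ends up as its own entry in the dispatcher, so int and float inputs reach the same function while other types fall through to the default.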

Defining the coerce_value method#

For pandera data types to report coercion errors correctly, they need to know how to coerce an individual value into the specified type.

All pandas data types are supported: numpy-based data types use the underlying numpy dtype to coerce an individual value. The pandas-native data types like CategoricalDtype and BooleanDtype are also supported.
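To see why a per-value coercion function helps with error reporting, consider the following standard-library sketch. It is an illustration under stated assumptions rather than pandera's implementation: find_failure_cases and coerce_bool are hypothetical helpers, and the set of accepted bool-like values is chosen only for the example.

```python
from typing import Any, Callable, Iterable, List, Tuple


def coerce_bool(value: Any) -> bool:
    """Hypothetical per-value coercion that accepts only bool-like inputs."""
    bool_like = (True, False, 1, 0)  # assumption made for this example
    if value not in bool_like:
        raise TypeError(f"value {value!r} cannot be coerced to type bool")
    return bool(value)


def find_failure_cases(
    values: Iterable[Any], coerce_value: Callable[[Any], Any]
) -> List[Tuple[int, Any]]:
    """Return (position, value) pairs for every value that fails to coerce."""
    failures = []
    for position, value in enumerate(values):
        try:
            coerce_value(value)
        except (TypeError, ValueError):
            failures.append((position, value))
    return failures


failures = find_failure_cases(["True", True, 0, "no"], coerce_bool)
```

When coercing a whole container fails, applying the per-value function element by element pinpoints which entries were responsible, which is the information an error report needs.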

As an example of a special-cased coerce_value implementation, see pandera.engines.pandas_engine.Category.coerce_value():

    def coerce_value(self, value: Any) -> Any:
        """Coerce a value to a particular type."""
        if value not in self.categories:  # type: ignore
            raise TypeError(
                f"value {value} cannot be coerced to type {self.type}"
            )
        return value

And pandera.engines.pandas_engine.BOOL.coerce_value():

    def coerce_value(self, value: Any) -> Any:
        """Coerce a value to the specified boolean type."""
        if value not in self._bool_like:
            raise TypeError(
                f"value {value} cannot be coerced to type {self.type}"
            )
        return super().coerce_value(value)