Mypy

new in 0.8.0

Pandera integrates with mypy to provide static type-linting of dataframes, relying on pandas-stubs for typing information.

pip install pandera[mypy]

Then enable the plugin in your mypy.ini or setup.cfg file:

[mypy]
plugins = pandera.mypy

Note

Mypy static type-linting is supported only for pandas dataframes.

Warning

This functionality is experimental 🧪. Since the pandas-stubs type stub annotations don’t always match the official pandas effort to support type annotations, installing the pandera[mypy] extra may yield false positives in your pandas code, many of which are documented in the tests/mypy/modules directory of the pandera repository.

We encourage you to file an issue if you find any false positives or negatives being reported by mypy. A list of such issues is kept on the pandera issue tracker. We’ll most likely have to escalate these to the official pandas-stubs issue tracker.

Also, be aware that the latest pandas-stubs versions only support Python 3.8+. If you are using Python 3.7, installing this package will not raise an error, but pip will install an older version of pandas-stubs with outdated type annotations.

In the example below, we define a few schemas to see how type-linting with pandera works.

from typing import cast

import pandas as pd

import pandera as pa
from pandera.typing import DataFrame, Series


class Schema(pa.DataFrameModel):
    id: Series[int]
    name: Series[str]


class SchemaOut(pa.DataFrameModel):
    age: Series[int]


class AnotherSchema(pa.DataFrameModel):
    id: Series[int]
    first_name: Series[str]

The mypy linter will complain if the output type of the function body doesn’t match the function’s return signature.

def fn(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay


def fn_pipe_incorrect_type(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[AnotherSchema])  # mypy error
    # error: Argument 1 to "pipe" of "NDFrame" has incompatible type "Type[DataFrame[Any]]";  # noqa
    # expected "Union[Callable[..., DataFrame[SchemaOut]], Tuple[Callable[..., DataFrame[SchemaOut]], str]]"  [arg-type]  # noqa


def fn_assign_copy(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30)  # mypy error
    # error: Incompatible return value type (got "pandas.core.frame.DataFrame",
    # expected "pandera.typing.pandas.DataFrame[SchemaOut]")  [return-value]

It’ll also complain if the input type doesn’t match the expected input type. Note that we’re using the pandera.typing.pandas.DataFrame generic type to define dataframes that are validated against the DataFrameModel type variable on initialization.

schema_df = DataFrame[Schema]({"id": [1], "name": ["foo"]})
pandas_df = pd.DataFrame({"id": [1], "name": ["foo"]})
another_df = DataFrame[AnotherSchema]({"id": [1], "first_name": ["foo"]})


fn(schema_df)  # mypy okay

fn(pandas_df)  # mypy error
# error: Argument 1 to "fn" has incompatible type "pandas.core.frame.DataFrame";  # noqa
# expected "pandera.typing.pandas.DataFrame[Schema]"  [arg-type]

fn(another_df)  # mypy error
# error: Argument 1 to "fn" has incompatible type "DataFrame[AnotherSchema]";
# expected "DataFrame[Schema]"  [arg-type]

To make mypy happy with respect to the return type, you can either initialize a dataframe of the expected type:

def fn_pipe_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay

Note

If you use the approach above with the check_types() decorator, pandera will do its best not to validate the dataframe twice if it’s already been initialized with the DataFrame[Schema](**data) syntax.

Or use typing.cast() to indicate to mypy that the return value of the function is of the correct type.

def fn_cast_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return cast(DataFrame[SchemaOut], df.assign(age=30))  # mypy okay
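
Keep in mind that typing.cast() only informs the type checker: at runtime it returns its argument unchanged and performs no validation. The following is a minimal standard-library sketch of that behavior; the variable name value is purely illustrative and not part of pandera.

```python
from typing import cast

# cast() is a no-op at runtime: it hands back the very object it was given.
# A type checker will nevertheless treat `value` as an int from here on.
value = cast(int, "not an int")

print(type(value).__name__)  # str
```

For this reason, a cast to DataFrame[SchemaOut] says nothing about the actual contents of the dataframe, so it is best paired with runtime validation when the contents matter.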

Limitations

An important caveat to static type-linting with pandera dataframe types is that, since pandas dataframes are mutable objects, there’s no way for mypy to know whether a mutated instance of a DataFrameModel-typed dataframe has the correct contents. Fortunately, we can simply rely on the check_types() decorator to verify that the output dataframe is valid.

Consider the examples below:

def fn_pipe_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay


def fn_cast_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return cast(DataFrame[SchemaOut], df.assign(age=30))  # mypy okay


@pa.check_types
def fn_mutate_inplace(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    out = df.assign(age=30).pipe(DataFrame[SchemaOut])
    out.drop(["age"], axis=1, inplace=True)
    return out  # okay for mypy, pandera raises error


@pa.check_types
def fn_assign_and_get_index(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(foo=30).iloc[:3]  # mypy error

Even though the outputs of these functions are incorrect, mypy doesn’t always catch the error during static type-linting. At runtime, however, pandera will raise a SchemaError or a SchemaErrors exception, depending on whether you’re doing lazy validation or not.



@pa.check_types
def fn_cast_dataframe_invalid(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return cast(