Integrations
Pydantic
new in 0.8.0
SchemaModel is fully compatible with pydantic.
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
import pydantic
class SimpleSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True)

class PydanticModel(pydantic.BaseModel):
    x: int
    df: DataFrame[SimpleSchema]
valid_df = pd.DataFrame({"str_col": ["hello", "world"]})
PydanticModel(x=1, df=valid_df)
invalid_df = pd.DataFrame({"str_col": ["hello", "hello"]})
PydanticModel(x=1, df=invalid_df)
Traceback (most recent call last):
...
ValidationError: 1 validation error for PydanticModel
df
series 'str_col' contains duplicate values:
1 hello
Name: str_col, dtype: object (type=value_error)
Other pandera components are also compatible with pydantic:
Mypy
new in 0.8.0
Pandera integrates with mypy out of the box to provide static type-linting of dataframes, relying on pandas-stubs for typing information.
In the example below, we define a few schemas to see how type-linting with pandera works.
from typing import cast
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
class Schema(pa.SchemaModel):
    id: Series[int]
    name: Series[str]

class SchemaOut(pa.SchemaModel):
    age: Series[int]

class AnotherSchema(pa.SchemaModel):
    id: Series[int]
    first_name: Series[str]
The mypy linter will complain if the output type of the function body doesn’t match the function’s return signature.
def fn(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay

def fn_pipe_incorrect_type(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[AnotherSchema])  # mypy error
    # error: Argument 1 to "pipe" of "NDFrame" has incompatible type "Type[DataFrame[Any]]";  # noqa
    # expected "Union[Callable[..., DataFrame[SchemaOut]], Tuple[Callable[..., DataFrame[SchemaOut]], str]]"  [arg-type]  # noqa

def fn_assign_copy(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30)  # mypy error
    # error: Incompatible return value type (got "pandas.core.frame.DataFrame",
    # expected "pandera.typing.pandas.DataFrame[SchemaOut]")  [return-value]
Mypy will also complain if the type of an argument doesn’t match the function’s declared input type. Note that we’re using the pandera.typing.pandas.DataFrame generic type to define dataframes that are validated against the SchemaModel type variable on initialization.
schema_df = DataFrame[Schema]({"id": [1], "name": ["foo"]})
pandas_df = pd.DataFrame({"id": [1], "name": ["foo"]})
another_df = DataFrame[AnotherSchema]({"id": [1], "first_name": ["foo"]})
fn(schema_df) # mypy okay
fn(pandas_df) # mypy error
# error: Argument 1 to "fn" has incompatible type "pandas.core.frame.DataFrame"; # noqa
# expected "pandera.typing.pandas.DataFrame[Schema]" [arg-type]
fn(another_df) # mypy error
# error: Argument 1 to "fn" has incompatible type "DataFrame[AnotherSchema]";
# expected "DataFrame[Schema]" [arg-type]
To make mypy happy with respect to the return type, you can either initialize a dataframe of the expected type:
def fn_pipe_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay
Note
If you use the approach above with the check_types() decorator, pandera will do its best not to validate the dataframe twice if it’s already been initialized with the DataFrame[Schema](**data) syntax.
Or use typing.cast() to indicate to mypy that the return value of the function is of the correct type.
def fn_cast_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return cast(DataFrame[SchemaOut], df.assign(age=30))  # mypy okay
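Keep in mind that typing.cast() only informs the type checker: at runtime it returns its argument unchanged and performs no validation or conversion. A minimal standard-library sketch of that behavior follows; the as_int_list helper is hypothetical and not part of pandera.

```python
from typing import List, cast


def as_int_list(values: object) -> List[int]:
    # Hypothetical helper: cast() tells the type checker to treat
    # `values` as List[int], but it does not inspect or convert it.
    return cast(List[int], values)


raw = ["not", "ints"]
result = as_int_list(raw)

# The very same object comes back, strings and all.
print(result is raw)
print(result)
```

For this reason, a cast-based return type is only as trustworthy as the runtime validation that accompanies it, for example the check_types() decorator.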
Limitations
An important caveat to static type-linting with pandera dataframe types is that, since pandas dataframes are mutable objects, there’s no way for mypy to know whether a mutated instance of a SchemaModel-typed dataframe has the correct contents. Fortunately, we can simply rely on the check_types() decorator to verify that the output dataframe is valid.
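The mutability caveat can be seen with plain pandas. In this sketch (the variable names are illustrative), an in-place operation changes the columns of an existing object, while any static type attached to the variable stays whatever it was declared as.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "age": [30, 30]})
same_object = df  # a second reference to the same dataframe

# An in-place drop mutates the existing object and returns None,
# so there is no new value whose type a checker could inspect.
returned = df.drop(columns=["age"], inplace=True)

print(returned)
print(list(same_object.columns))
```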
Consider the examples below:
def fn_pipe_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay

def fn_cast_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return cast(DataFrame[SchemaOut], df.assign(age=30))  # mypy okay

@pa.check_types
def fn_mutate_inplace(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    out = df.assign(age=30).pipe(DataFrame[SchemaOut])
    out.drop(["age"], axis=1, inplace=True)
    return out  # okay for mypy, pandera raises error

@pa.check_types
def fn_assign_and_get_index(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(foo=30).iloc[:3]  # okay for mypy, pandera raises error
Even though the outputs of these functions are incorrect, mypy doesn’t catch the error during static type-linting, but pandera will raise a SchemaError or SchemaErrors exception at runtime, depending on whether you’re doing lazy validation or not.
@pa.check_types
def fn_cast_dataframe_invalid(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return cast(
        DataFrame[SchemaOut], df
    )  # okay for mypy, pandera raises error