DataFrame Models¶
new in 0.5.0
pandera provides a class-based API that’s heavily inspired by pydantic. In contrast to the object-based API, you can define dataframe models in much the same way you’d define pydantic models.
DataFrameModels are annotated with the pandera.typing module using the standard typing syntax. Models can be explicitly converted to a DataFrameSchema or used to validate a DataFrame directly.
Note
Due to current limitations in the pandas library (see discussion here), pandera annotations are only used for run-time validation and have limited support for static type checkers like mypy.
See the Mypy Integration for more details.
Basic Usage¶
import pandas as pd
import pandera as pa
from pandera.typing import Index, DataFrame, Series
class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)

class OutputSchema(InputSchema):
    revenue: Series[float]

@pa.check_types
def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    return df.assign(revenue=100.0)
df = pd.DataFrame({
    "year": ["2001", "2002", "2003"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
})
transform(df)

invalid_df = pd.DataFrame({
    "year": ["2001", "2002", "1999"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
})
try:
    transform(invalid_df)
except pa.errors.SchemaError as exc:
    print(exc)
error in check_types decorator of function 'transform': Column 'year' failed element-wise validator number 0: greater_than(2000) failure cases: 1999
As you can see in the examples above, you can define a schema by sub-classing DataFrameModel and defining column/index fields as class attributes. The check_types() decorator is required to perform validation of the dataframe at run-time.
Note that Fields apply to both Column and Index objects, exposing the built-in Checks via key-word arguments.
(New in 0.6.2) When you access a class attribute defined on the schema, it will return the name of the column used in the validated pd.DataFrame. In the example above, this will simply be the string "year".
print(f"Column name for 'year' is {InputSchema.year}\n")
print(df.loc[:, [InputSchema.year, "day"]])
Column name for 'year' is year
year day
0 2001 200
1 2002 156
2 2003 365
Using Data Types directly for Column Type Annotations¶
new in 0.15.0
For conciseness, you can also use type annotations for columns without using the Series generic. These class attributes will be interpreted as Column objects under the hood.
class InputSchema(pa.DataFrameModel):
    year: int = pa.Field(gt=2000, coerce=True)
    month: int = pa.Field(ge=1, le=12, coerce=True)
    day: int = pa.Field(ge=0, le=365, coerce=True)
Reusing Field objects¶
To define reusable Field definitions, you need to use functools.partial. This makes sure that each field attribute is bound to a unique Field instance.
from functools import partial
from pandera import DataFrameModel, Field
NormalizedField = partial(Field, ge=0, le=1)
class SchemaWithReusedFields(DataFrameModel):
    xnorm: float = NormalizedField()
    ynorm: float = NormalizedField()
Validate on Initialization¶
new in 0.8.0
Pandera provides an interface for validating dataframes on initialization. This API uses the pandera.typing.pandas.DataFrame generic type to validate against the DataFrameModel type variable on initialization:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
class Schema(pa.DataFrameModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})

DataFrame[Schema](
    {
        'state': ['NY','FL','GA','CA'],
        'city': ['New York', 'Miami', 'Atlanta', 'San Francisco'],
        'price': [8, 12, 10, 16],
    }
)
 | state | city | price |
---|---|---|---|
0 | NY | New York | 8 |
1 | FL | Miami | 12 |
2 | GA | Atlanta | 10 |
3 | CA | San Francisco | 16 |
Refer to Supported DataFrame Libraries to see how this syntax applies to other supported dataframe types.
Converting to DataFrameSchema¶
You can easily convert a DataFrameModel class into a DataFrameSchema:
print(InputSchema.to_schema())
<Schema DataFrameSchema(
columns={
'year': <Schema Column(name=year, type=DataType(int64))>
'month': <Schema Column(name=month, type=DataType(int64))>
'day': <Schema Column(name=day, type=DataType(int64))>
},
checks=[],
parsers=[],
coerce=False,
dtype=None,
index=None,
strict=False,
name=InputSchema,
ordered=False,
unique_column_names=False,
metadata=None,
add_missing_columns=False
)>
You can also use the validate() method to validate dataframes:
print(InputSchema.validate(df))
year month day
0 2001 3 200
1 2002 6 156
2 2003 12 365
Or you can use the DataFrameModel() class directly to validate dataframes, which is syntactic sugar that simply delegates to the validate() method.
print(InputSchema(df))
year month day
0 2001 3 200
1 2002 6 156
2 2003 12 365
Validate Against Multiple Schemas¶
new in 0.14.0
The built-in typing.Union type is supported for multiple DataFrame schemas.
from typing import Union
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
class OnlyZeroesSchema(pa.DataFrameModel):
    a: Series[int] = pa.Field(eq=0)

class OnlyOnesSchema(pa.DataFrameModel):
    a: Series[int] = pa.Field(eq=1)

@pa.check_types
def return_zeros_or_ones(
    df: Union[DataFrame[OnlyZeroesSchema], DataFrame[OnlyOnesSchema]]
) -> Union[DataFrame[OnlyZeroesSchema], DataFrame[OnlyOnesSchema]]:
    return df

# passes
return_zeros_or_ones(pd.DataFrame({"a": [0, 0]}))
return_zeros_or_ones(pd.DataFrame({"a": [1, 1]}))

# fails
try:
    return_zeros_or_ones(pd.DataFrame({"a": [0, 2]}))
except pa.errors.SchemaErrors as exc:
    print(exc)
{
"DATA": {
"INVALID_TYPE": [
{
"schema": "OnlyOnesSchema",
"column": "OnlyZeroesSchema",
"check": "equal_to(0)",
"error": "error in check_types decorator of function 'return_zeros_or_ones': Column 'a' failed element-wise validator number 0: equal_to(0) failure cases: 2"
},
{
"schema": "OnlyOnesSchema",
"column": "OnlyOnesSchema",
"check": "equal_to(1)",
"error": "error in check_types decorator of function 'return_zeros_or_ones': Column 'a' failed element-wise validator number 0: equal_to(1) failure cases: 0, 2"
}
]
}
}
Note that mixtures of DataFrame
schemas and built-in types will ignore checking built-in types
with pandera. Pydantic should be used to check and/or coerce any built-in types.
import pandas as pd
from typing import Union
import pandera as pa
from pandera.typing import DataFrame, Series
class OnlyZeroesSchema(pa.DataFrameModel):
    a: Series[int] = pa.Field(eq=0)

@pa.check_types
def df_and_int_types(
    val: Union[DataFrame[OnlyZeroesSchema], int]
) -> Union[DataFrame[OnlyZeroesSchema], int]:
    return val

df_and_int_types(pd.DataFrame({"a": [0, 0]}))
int_val = df_and_int_types(5)
str_val = df_and_int_types("5")
no_pydantic_report = f"No Pydantic: {isinstance(int_val, int)}, {isinstance(str_val, int)}"

@pa.check_types(with_pydantic=True)
def df_and_int_types_with_pydantic(
    val: Union[DataFrame[OnlyZeroesSchema], int]
) -> Union[DataFrame[OnlyZeroesSchema], int]:
    return val
df_and_int_types_with_pydantic(pd.DataFrame({"a": [0, 0]}))
int_val_w_pyd = df_and_int_types_with_pydantic(5)
str_val_w_pyd = df_and_int_types_with_pydantic("5")
pydantic_report = f"With Pydantic: {isinstance(int_val_w_pyd, int)}, {isinstance(str_val_w_pyd, int)}"
print(no_pydantic_report)
print(pydantic_report)
No Pydantic: True, False
With Pydantic: True, True
Excluded attributes¶
Class variables which begin with an underscore will be automatically excluded from the model. Config is also a reserved name. However, aliases can be used to circumvent these limitations.
Supported dtypes¶
Any dtypes supported by pandera can be used as type parameters for Series and Index. There are, however, a couple of gotchas.
Important
You can learn more about how data type validation works in Data Type Validation.
Dtype aliases¶
import pandera as pa
from pandera.typing import Series, String
class Schema(pa.DataFrameModel):
    a: Series[String]
Type Vs instance¶
You must give a type, not an instance.
✅ Good:
import pandas as pd
class Schema(pa.DataFrameModel):
    a: Series[pd.StringDtype]
❌ Bad:
Note
This is only applicable for pandas versions < 2.0.0. In pandas > 2.0.0, pd.StringDtype() will produce a type.
class Schema(pa.DataFrameModel):
    a: Series[pd.StringDtype()]
Parametrized dtypes¶
Pandas supports a couple of parametrized dtypes. As of pandas 1.2.0:
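Kind of Data | Data Type | Parameters |
---|---|---|
tz-aware datetime | DatetimeTZDtype | unit, tz |
Categorical | CategoricalDtype | categories, ordered |
period | PeriodDtype | freq |
sparse | SparseDtype | dtype, fill_value |
intervals | IntervalDtype | subtype |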
Annotated¶
Parameters can be given via typing.Annotated. It requires python >= 3.9 or typing_extensions, which is already a requirement of Pandera. Unfortunately typing.Annotated has not been backported to python 3.6.
✅ Good:
try:
    from typing import Annotated  # python 3.9+
except ImportError:
    from typing_extensions import Annotated

class Schema(pa.DataFrameModel):
    col: Series[Annotated[pd.DatetimeTZDtype, "ns", "est"]]
Furthermore, you must pass all parameters in the order defined in the dtype’s constructor (see table).
❌ Bad:
class Schema(pa.DataFrameModel):
    col: Series[Annotated[pd.DatetimeTZDtype, "utc"]]

Schema.to_schema()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[14], line 4
1 class Schema(pa.DataFrameModel):
2 col: Series[Annotated[pd.DatetimeTZDtype, "utc"]]
----> 4 Schema.to_schema()
File ~/checkouts/readthedocs.org/user_builds/pandera/conda/latest/lib/python3.12/site-packages/pandera/api/dataframe/model.py:262, in DataFrameModel.to_schema(cls)
248 if cls.__config__ is not None:
249 kwargs = {
250 "dtype": cls.__config__.dtype,
251 "coerce": cls.__config__.coerce,
(...)
260 "drop_invalid_rows": cls.__config__.drop_invalid_rows,
261 }
--> 262 cls.__schema__ = cls.build_schema_(**kwargs)
263 if cls not in MODEL_CACHE:
264 MODEL_CACHE[cls] = cls.__schema__ # type: ignore
File ~/checkouts/readthedocs.org/user_builds/pandera/conda/latest/lib/python3.12/site-packages/pandera/api/pandas/model.py:39, in DataFrameModel.build_schema_(cls, **kwargs)
32 @classmethod
33 def build_schema_(cls, **kwargs) -> DataFrameSchema:
34 multiindex_kwargs = {
35 name[len("multiindex_") :]: value
36 for name, value in vars(cls.__config__).items()
37 if name.startswith("multiindex_")
38 }
---> 39 columns, index = cls._build_columns_index(
40 cls.__fields__,
41 cls.__checks__,
42 cls.__parsers__,
43 **multiindex_kwargs,
44 )
45 return DataFrameSchema(
46 columns,
47 index=index,
(...)
50 **kwargs,
51 )
File ~/checkouts/readthedocs.org/user_builds/pandera/conda/latest/lib/python3.12/site-packages/pandera/api/pandas/model.py:85, in DataFrameModel._build_columns_index(cls, fields, checks, parsers, **multiindex_kwargs)
79 if field.dtype_kwargs:
80 raise TypeError(
81 "Cannot specify redundant 'dtype_kwargs' "
82 + f"for {annotation.raw_annotation}."
83 + "\n Usage Tip: Drop 'typing.Annotated'."
84 )
---> 85 dtype_kwargs = get_dtype_kwargs(annotation)
86 dtype = annotation.arg(**dtype_kwargs) # type: ignore
87 elif annotation.default_dtype:
File ~/checkouts/readthedocs.org/user_builds/pandera/conda/latest/lib/python3.12/site-packages/pandera/api/dataframe/model.py:73, in get_dtype_kwargs(annotation)
71 dtype_arg_names = list(sig.parameters.keys())
72 if len(annotation.metadata) != len(dtype_arg_names): # type: ignore
---> 73 raise TypeError(
74 f"Annotation '{annotation.arg.__name__}' requires " # type: ignore
75 + f"all positional arguments {dtype_arg_names}."
76 )
77 return dict(zip(dtype_arg_names, annotation.metadata))
TypeError: Annotation 'DatetimeTZDtype' requires all positional arguments ['unit', 'tz'].
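The traceback above shows the annotation metadata being zipped with the dtype constructor's parameter names, which is why every parameter must be supplied in order. The standard-library sketch below imitates that pairing. `FakeTZDtype` and `dtype_kwargs_from` are hypothetical stand-ins written for this illustration, not pandera or pandas APIs, and the sketch assumes python >= 3.9.

```python
import inspect
from typing import Annotated, get_args


class FakeTZDtype:
    """Hypothetical stand-in for a parametrized dtype; not a pandas class."""

    def __init__(self, unit, tz):
        self.unit = unit
        self.tz = tz


def dtype_kwargs_from(annotation):
    """Pair Annotated metadata with constructor parameter names, in order."""
    dtype_cls, *metadata = get_args(annotation)
    param_names = list(inspect.signature(dtype_cls).parameters)
    if len(metadata) != len(param_names):
        raise TypeError(
            f"Annotation '{dtype_cls.__name__}' requires "
            f"all positional arguments {param_names}."
        )
    return dict(zip(param_names, metadata))


# All parameters given, in constructor order: the pairing succeeds.
good = dtype_kwargs_from(Annotated[FakeTZDtype, "ns", "est"])

# One parameter missing: the counts differ and a TypeError is raised.
try:
    dtype_kwargs_from(Annotated[FakeTZDtype, "utc"])
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(good, error_message)
```

Because the pairing is purely positional, swapping the order of the metadata would silently assign values to the wrong parameters, so the order in the table matters.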
Field¶
✅ Good:
class SchemaFieldDatetimeTZDtype(pa.DataFrameModel):
    col: Series[pd.DatetimeTZDtype] = pa.Field(
        dtype_kwargs={"unit": "ns", "tz": "EST"}
    )

You cannot use both typing.Annotated and dtype_kwargs.
❌ Bad:
class SchemaFieldDatetimeTZDtype(pa.DataFrameModel):
    col: Series[Annotated[pd.DatetimeTZDtype, "ns", "est"]] = pa.Field(
        dtype_kwargs={"unit": "ns", "tz": "EST"}
    )

SchemaFieldDatetimeTZDtype.to_schema()
Building the schema raises a TypeError that reports the redundant 'dtype_kwargs' and carries the usage tip to drop 'typing.Annotated'.
Required Columns¶
By default all columns specified in the schema are required, meaning that if a column is missing in the input DataFrame an exception will be thrown. If you want to make a column optional, annotate it with typing.Optional.
from typing import Optional
import pandas as pd
import pandera as pa
from pandera.typing import Series
class Schema(pa.DataFrameModel):
    a: Series[str]
    b: Optional[Series[int]]

df = pd.DataFrame({"a": ["2001", "2002", "2003"]})
Schema.validate(df)
 | a |
---|---|
0 | 2001 |
1 | 2002 |
2 | 2003 |
Schema Inheritance¶
You can also use inheritance to build schemas on top of a base schema.
class BaseSchema(pa.DataFrameModel):
    year: Series[str]

class FinalSchema(BaseSchema):
    year: Series[int] = pa.Field(ge=2000, coerce=True)  # overwrite the base type
    passengers: Series[int]
    idx: Index[int] = pa.Field(ge=0)

df = pd.DataFrame({
    "year": ["2000", "2001", "2002"],
})

@pa.check_types
def transform(df: DataFrame[BaseSchema]) -> DataFrame[FinalSchema]:
    return (
        df.assign(passengers=[61000, 50000, 45000])
        .set_index(pd.Index([1, 2, 3]))
        .astype({"year": int})
    )

transform(df)
 | year | passengers |
---|---|---|
1 | 2000 | 61000 |
2 | 2001 | 50000 |
3 | 2002 | 45000 |
Config¶
Schema-wide options can be controlled via the Config class on the DataFrameModel subclass. The full set of options can be found in the BaseConfig class.
class Schema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)

    class Config:
        name = "BaseSchema"
        strict = True
        coerce = True
        foo = "bar"  # Interpreted as dataframe check
        baz = ...  # Interpreted as a dataframe check with no additional arguments
It is not required for the Config to subclass BaseConfig, but it must be named ‘Config’.
See Registered Custom Checks with the Class-based API for details on using registered dataframe checks.
MultiIndex¶
The MultiIndex capabilities are also supported with the class-based API:
import pandera as pa
from pandera.typing import Index, Series
class MultiIndexSchema(pa.DataFrameModel):
    year: Index[int] = pa.Field(gt=2000, coerce=True)
    month: Index[int] = pa.Field(ge=1, le=12, coerce=True)
    passengers: Series[int]

    class Config:
        # provide multi index options in the config
        multiindex_name = "time"
        multiindex_strict = True
        multiindex_coerce = True

index = MultiIndexSchema.to_schema().index
print(index)
<Schema MultiIndex(
indexes=[
<Schema Index(name=year, type=DataType(int64))>
<Schema Index(name=month, type=DataType(int64))>
]
coerce=True,
strict=True,
name=time,
ordered=True
)>
from pprint import pprint
pprint({name: col.checks for name, col in index.columns.items()})
{'month': [<Check greater_than_or_equal_to: greater_than_or_equal_to(1)>,
<Check less_than_or_equal_to: less_than_or_equal_to(12)>],
'year': [<Check greater_than: greater_than(2000)>]}
Multiple Index annotations are automatically converted into a MultiIndex. MultiIndex options are given in the Config.
Index Name¶
Use check_name to validate the index name of a single-index dataframe:
import pandas as pd
import pandera as pa
from pandera.typing import Index, Series
class Schema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    passengers: Series[int]
    idx: Index[int] = pa.Field(ge=0, check_name=True)

df = pd.DataFrame({
    "year": [2001, 2002, 2003],
    "passengers": [61000, 50000, 45000],
})

try:
    Schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)
Expected <class 'pandas.core.series.Series'> to have name 'idx', found 'None'
The default check_name value of None translates to True for columns and multi-index.
Custom Checks¶
Unlike the object-based API, custom checks can be specified as class methods.
Column/Index checks¶
import pandera as pa
from pandera.typing import Index, Series
class CustomCheckSchema(pa.DataFrameModel):
    a: Series[int] = pa.Field(gt=0, coerce=True)
    abc: Series[int]
    idx: Index[str]

    @pa.check("a", name="foobar")
    def custom_check(cls, a: Series[int]) -> Series[bool]:
        return a < 100

    @pa.check("^a", regex=True, name="foobar")
    def custom_check_regex(cls, a: Series[int]) -> Series[bool]:
        return a > 0

    @pa.check("idx")
    def check_idx(cls, idx: Index[str]) -> Series[bool]:
        return idx.str.contains("dog")
Note
You can supply the key-word arguments of the Check class initializer to get the flexibility of groupby checks.
Similarly to pydantic, the classmethod() decorator is added behind the scenes if omitted.
You still may need to add the @classmethod decorator after the check() decorator if your static-type checker or linter complains.
Since checks are class methods, the first argument value they receive is a DataFrameModel subclass, not an instance of a model.
from typing import Dict
class GroupbyCheckSchema(pa.DataFrameModel):
    value: Series[int] = pa.Field(gt=0, coerce=True)
    group: Series[str] = pa.Field(isin=["A", "B"])

    @pa.check("value", groupby="group", regex=True, name="check_means")
    def check_groupby(cls, grouped_value: Dict[str, Series[int]]) -> bool:
        return grouped_value["A"].mean() < grouped_value["B"].mean()

df = pd.DataFrame({
    "value": [100, 110, 120, 10, 11, 12],
    "group": list("AAABBB"),
})

try:
    print(GroupbyCheckSchema.validate(df))
except pa.errors.SchemaError as exc:
    print(exc)
Column 'value' failed series or dataframe validator 1: <Check check_means>
DataFrame Checks¶
You can also define dataframe-level checks, similar to the object-based API, using the dataframe_check() decorator:
import pandas as pd
import pandera as pa
from pandera.typing import Index, Series
class DataFrameCheckSchema(pa.DataFrameModel):
    col1: Series[int] = pa.Field(gt=0, coerce=True)
    col2: Series[float] = pa.Field(gt=0, coerce=True)
    col3: Series[float] = pa.Field(lt=0, coerce=True)

    @pa.dataframe_check
    def product_is_negative(cls, df: pd.DataFrame) -> Series[bool]:
        return df["col1"] * df["col2"] * df["col3"] < 0

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [5, 6, 7],
    "col3": [-1, -2, -3],
})

DataFrameCheckSchema.validate(df)
 | col1 | col2 | col3 |
---|---|---|---|
0 | 1 | 5.0 | -1.0 |
1 | 2 | 6.0 | -2.0 |
2 | 3 | 7.0 | -3.0 |
Inheritance¶
The custom checks are inherited and therefore can be overwritten by the subclass.
import pandas as pd
import pandera as pa
from pandera.typing import Index, Series
class Parent(pa.DataFrameModel):
    a: Series[int] = pa.Field(coerce=True)

    @pa.check("a", name="foobar")
    def check_a(cls, a: Series[int]) -> Series[bool]:
        return a < 100

class Child(Parent):
    a: Series[int] = pa.Field(coerce=False)

    @pa.check("a", name="foobar")
    def check_a(cls, a: Series[int]) -> Series[bool]:
        return a > 100

is_a_coerce = Child.to_schema().columns["a"].coerce
print(f"coerce: {is_a_coerce}")
coerce: False
df = pd.DataFrame({"a": [1, 2, 3]})

try:
    Child.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)
Column 'a' failed element-wise validator number 0: <Check foobar> failure cases: 1, 2, 3
Aliases¶
DataFrameModel supports columns which are not valid python variable names via the argument alias of Field.
Checks must reference the aliased names.
import pandera as pa
import pandas as pd
class Schema(pa.DataFrameModel):
    col_2020: pa.typing.Series[int] = pa.Field(alias=2020)
    idx: pa.typing.Index[int] = pa.Field(alias="_idx", check_name=True)

    @pa.check(2020)
    def int_column_lt_100(cls, series):
        return series < 100

df = pd.DataFrame({2020: [99]}, index=[0])
df.index.name = "_idx"

print(Schema.validate(df))
2020
_idx
0 99
(New in 0.6.2) The alias is respected when using the class attribute to get the underlying pd.DataFrame column name or index level name.
print(Schema.col_2020)
2020
Very similar to the example above, you can also use the variable name directly within the class scope, and it will respect the alias.
Note
To access a variable from the class scope, you need to make it a class attribute, and therefore assign it a default Field.
import pandera as pa
import pandas as pd
class Schema(pa.DataFrameModel):
    a: pa.typing.Series[int] = pa.Field()
    col_2020: pa.typing.Series[int] = pa.Field(alias=2020)

    @pa.check(col_2020)
    def int_column_lt_100(cls, series):
        return series < 100

    @pa.check(a)
    def int_column_gt_100(cls, series):
        return series > 100

df = pd.DataFrame({2020: [99], "a": [101]})
print(Schema.validate(df))
2020 a
0 99 101
Manipulating DataFrame Models post-definition¶
One caveat of using inheritance to build schemas on top of each other is that a child class has no clear way to, for example, remove fields or update them without completely overriding previous settings. This is because inheritance is strictly additive.
DataFrameSchema objects do have these options though, as described in DataFrameSchema Transformations, which you can leverage by overriding your DataFrame Model’s to_schema() method.
DataFrame Models are for the most part just a proxy for the DataFrameSchema API; calling validate() simply redirects to the validate method of the DataFrameSchema returned by to_schema. As such, any updates to the schema that took place in there will propagate cleanly.
As an example, in the following class hierarchy the fields b and c cannot be factored out of Baz into a base class without completely convoluting the inheritance tree. So, we can get rid of them like this:
import pandera as pa
import pandas as pd
class Foo(pa.DataFrameModel):
    a: pa.typing.Series[int]
    b: pa.typing.Series[int]

class Bar(pa.DataFrameModel):
    c: pa.typing.Series[int]
    d: pa.typing.Series[int]

class Baz(Foo, Bar):

    @classmethod
    def to_schema(cls) -> pa.DataFrameSchema:
        schema = super().to_schema()
        return schema.remove_columns(["b", "c"])

df = pd.DataFrame({"a": [99], "d": [101]})
print(Baz.validate(df))
a d
0 99 101
Note
There are drawbacks to manipulating schema shape in this way:
Static code analysis has no way to figure out what fields have been removed/updated from the class definitions and inheritance hierarchy.
Any children of classes which have overridden to_schema might experience surprising behavior: if a child of Baz tries to define a field b or c again, it will lose it in its to_schema call because Baz’s to_schema will always be executed after any child’s class body has already been fully assembled.