A Statistical Data Testing Toolkit#
A data validation library for scientists, engineers, and analysts seeking correctness.
pandera provides a flexible and expressive API for performing data
validation on dataframe-like objects to make data processing pipelines more
readable and robust.
Dataframes contain information that pandera explicitly validates at runtime.
This is useful in production-critical data pipelines or reproducible research
settings. With pandera, you can:
Define a schema once and use it to validate different dataframe types including pandas, dask, modin, and pyspark.pandas.
Check the types and properties of columns in a pd.DataFrame or values in a pd.Series.
Perform more complex statistical validation like hypothesis testing.
Seamlessly integrate with existing data analysis/processing pipelines via function decorators.
Define schema models with the class-based API with pydantic-style syntax and validate dataframes using the typing syntax.
Synthesize data from schema objects for property-based testing with pandas data structures.
Lazily validate dataframes so that all validation rules are executed before raising an error.
Integrate with a rich ecosystem of python tools like pydantic, fastapi, and mypy.
Install#
Install with pip
:
pip install pandera
Or conda
:
conda install -c conda-forge pandera
Extras#
Installing additional functionality:
pip install pandera[hypotheses] # hypothesis checks
pip install pandera[io] # yaml/script schema io utilities
pip install pandera[strategies] # data synthesis strategies
pip install pandera[mypy] # enable static type-linting of pandas
pip install pandera[fastapi] # fastapi integration
pip install pandera[dask] # validate dask dataframes
pip install pandera[pyspark] # validate pyspark dataframes
pip install pandera[modin] # validate modin dataframes
pip install pandera[modin-ray] # validate modin dataframes with ray
pip install pandera[modin-dask] # validate modin dataframes with dask
pip install pandera[geopandas] # validate geopandas geodataframes
conda install -c conda-forge pandera-hypotheses # hypothesis checks
conda install -c conda-forge pandera-io # yaml/script schema io utilities
conda install -c conda-forge pandera-strategies # data synthesis strategies
conda install -c conda-forge pandera-mypy # enable static type-linting of pandas
conda install -c conda-forge pandera-fastapi # fastapi integration
conda install -c conda-forge pandera-dask # validate dask dataframes
conda install -c conda-forge pandera-pyspark # validate pyspark dataframes
conda install -c conda-forge pandera-modin # validate modin dataframes
conda install -c conda-forge pandera-modin-ray # validate modin dataframes with ray
conda install -c conda-forge pandera-modin-dask # validate modin dataframes with dask
conda install -c conda-forge pandera-geopandas # validate geopandas geodataframes
Quick Start#
import pandas as pd
import pandera as pa
# data to validate
df = pd.DataFrame({
"column1": [1, 4, 0, 10, 9],
"column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
"column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})
# define schema
schema = pa.DataFrameSchema({
"column1": pa.Column(int, checks=pa.Check.le(10)),
"column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
"column3": pa.Column(str, checks=[
pa.Check.str_startswith("value_"),
# define custom checks as functions that take a series as input and
# output a boolean or boolean Series
pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
]),
})
validated_df = schema(df)
print(validated_df)
column1 column2 column3
0 1 -1.3 value_1
1 4 -1.4 value_2
2 0 -2.9 value_3
3 10 -10.1 value_2
4 9 -20.4 value_1
You can pass the built-in python types that are supported by
pandas, or strings representing the
legal pandas datatypes,
or pandera’s DataType
:
schema = pa.DataFrameSchema({
# built-in python types
"int_column": pa.Column(int),
"float_column": pa.Column(float),
"str_column": pa.Column(str),
# pandas dtype string aliases
"int_column2": pa.Column("int64"),
"float_column2": pa.Column("float64"),
# pandas > 1.0.0 supports the native "string" type
"str_column2": pa.Column("str"),
# pandera DataType
"int_column3": pa.Column(pa.Int),
"float_column3": pa.Column(pa.Float),
"str_column3": pa.Column(pa.String),
})
For more details on data types, see DataType.
Schema Model#
pandera
also provides an alternative API for expressing schemas inspired
by dataclasses and
pydantic. The equivalent
SchemaModel
for the above
DataFrameSchema
would be:
from pandera.typing import Series
class Schema(pa.SchemaModel):
column1: Series[int] = pa.Field(le=10)
column2: Series[float] = pa.Field(lt=-1.2)
column3: Series[str] = pa.Field(str_startswith="value_")
@pa.check("column3")
def column_3_check(cls, series: Series[str]) -> Series[bool]:
"""Check that column3 values have two elements after being split with '_'"""
return series.str.split("_", expand=True).shape[1] == 2
Schema.validate(df)
Informative Errors#
If the dataframe does not pass validation checks, pandera
provides
useful error messages. An error
argument can also be supplied to
Check
for custom error messages.
In the case that a validation Check
is violated:
import pandas as pd
from pandera import Column, DataFrameSchema, Int, Check
simple_schema = DataFrameSchema({
"column1": Column(
Int, Check(lambda x: 0 <= x <= 10, element_wise=True,
error="range checker [0, 10]"))
})
# validation rule violated
fail_check_df = pd.DataFrame({
"column1": [-20, 5, 10, 30],
})
simple_schema(fail_check_df)
Traceback (most recent call last):
...
SchemaError: <Schema Column: 'column1' type=<class 'int'>> failed element-wise validator 0:
<Check <lambda>: range checker [0, 10]>
failure cases:
index failure_case
0 0 -20
1 3 30
And in the case of a mis-specified column name:
# column name mis-specified
wrong_column_df = pd.DataFrame({
"foo": ["bar"] * 10,
"baz": [1] * 10
})
simple_schema.validate(wrong_column_df)
Traceback (most recent call last):
...
pandera.SchemaError: column 'column1' not in dataframe
foo baz
0 bar 1
1 bar 1
2 bar 1
3 bar 1
4 bar 1
Contributing#
All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.
A detailed overview on how to contribute can be found in the contributing guide on GitHub.
Issues#
Submit issues, feature requests, or bug fixes on GitHub.
Need Help?#
There are many ways of getting help with your questions. You can ask a question on the GitHub Discussions page or reach out to the maintainers and the pandera community on Discord.
DataFrame Schemas#
The DataFrameSchema
class enables the specification of a schema
that verifies the columns and index of a pandas DataFrame
object.
The DataFrameSchema
object consists of Column
s and an Index
.
import pandera as pa
from pandera import Column, DataFrameSchema, Check, Index
schema = DataFrameSchema(
{
"column1": Column(int),
"column2": Column(float, Check(lambda s: s < -1.2)),
# you can provide a list of validators
"column3": Column(str, [
Check(lambda s: s.str.startswith("value")),
Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
]),
},
index=Index(int),
strict=True,
coerce=True,
)
You can refer to Schema Models to see how to define dataframe schemas using the alternative pydantic/dataclass-style syntax.
Column Validation#
A Column
must specify the properties of a
column in a dataframe object. It can be optionally verified for its data type,
null values or
duplicate values. The column can be coerced into the specified type, and the
required parameter allows control over whether or not the column is allowed to
be missing.
Similarly to pandas, the data type can be specified as:
a string alias, as long as it is recognized by pandas.
a python type: int, float, bool, str.
a pandas extension type: it can be an instance (e.g. pd.CategoricalDtype(["a", "b"])) or a class (e.g. pandas.CategoricalDtype) if it can be initialized with default values.
a pandera DataType: it can also be an instance or a class.
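For instance, here is a short sketch showing columns declared with each of these specifications; the column names are only illustrative:
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    # string alias recognized by pandas
    "col_string_alias": pa.Column("float64"),
    # built-in python type
    "col_python_type": pa.Column(float),
    # pandas extension type instance
    "col_extension_type": pa.Column(pd.CategoricalDtype(["a", "b"])),
    # pandera DataType class
    "col_pandera_dtype": pa.Column(pa.Float64),
})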
Column checks allow for the DataFrame’s values to be
checked against a user-provided function. Check
objects also support
grouping by a different column so that the user can make
assertions about subsets of the column of interest.
Column Hypotheses enable you to perform statistical hypothesis tests on a DataFrame in either wide or tidy format. See Hypothesis Testing for more details.
Null Values in Columns#
By default, SeriesSchema/Column objects assume that values are not
nullable. In order to accept null values, you need to explicitly specify
nullable=True
, or else you’ll get an error.
import numpy as np
import pandas as pd
import pandera as pa
from pandera import Check, Column, DataFrameSchema
df = pd.DataFrame({"column1": [5, 1, np.nan]})
non_null_schema = DataFrameSchema({
"column1": Column(float, Check(lambda x: x > 0))
})
non_null_schema.validate(df)
Traceback (most recent call last):
...
SchemaError: non-nullable series contains null values: {2: nan}
null_schema = DataFrameSchema({
"column1": Column(float, Check(lambda x: x > 0), nullable=True)
})
print(null_schema.validate(df))
column1
0 5.0
1 1.0
2 NaN
Coercing Types on Columns#
If you specify Column(dtype, ..., coerce=True)
as part of the
DataFrameSchema definition, calling schema.validate
will first
coerce the column into the specified dtype
before applying validation
checks.
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema
df = pd.DataFrame({"column1": [1, 2, 3]})
schema = DataFrameSchema({"column1": Column(str, coerce=True)})
validated_df = schema.validate(df)
assert isinstance(validated_df.column1.iloc[0], str)
Note
Note the special case of integer columns not supporting nan
values. In this case, schema.validate will complain if coerce == True
and null values are allowed in the column.
df = pd.DataFrame({"column1": [1., 2., 3, np.nan]})
schema = DataFrameSchema({
"column1": Column(int, coerce=True, nullable=True)
})
validated_df = schema.validate(df)
Traceback (most recent call last):
...
pandera.errors.SchemaError: Error while coercing 'column1' to type int64: Cannot convert non-finite values (NA or inf) to integer
The best way to handle this case is to simply specify the column as a
Float
or Object
.
schema_object = DataFrameSchema({
"column1": Column(object, coerce=True, nullable=True)
})
schema_float = DataFrameSchema({
"column1": Column(float, coerce=True, nullable=True)
})
print(schema_object.validate(df).dtypes)
print(schema_float.validate(df).dtypes)
column1 object
dtype: object
column1 float64
dtype: object
If you want to coerce all of the columns specified in the
DataFrameSchema
, you can specify the coerce
argument with
DataFrameSchema(..., coerce=True)
.
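For example, a minimal sketch of schema-level coercion (the column names and data are illustrative):
import pandas as pd
import pandera as pa

# coerce=True at the schema level coerces every declared column before checks run
schema = pa.DataFrameSchema(
    {
        "column1": pa.Column(int),
        "column2": pa.Column(str),
    },
    coerce=True,
)

df = pd.DataFrame({"column1": ["1", "2"], "column2": [3, 4]})
# column1 is coerced to int64 and column2 to str (object dtype)
print(schema.validate(df).dtypes)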
Required Columns#
By default all columns specified in the schema are required, meaning
that if a column is missing in the input DataFrame an exception will be
thrown. If you want to make a column optional, specify required=False
in the column constructor:
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema
df = pd.DataFrame({"column2": ["hello", "pandera"]})
schema = DataFrameSchema({
"column1": Column(int, required=False),
"column2": Column(str)
})
validated_df = schema.validate(df)
print(validated_df)
column2
0 hello
1 pandera
Since required=True
by default, missing columns would raise an error:
schema = DataFrameSchema({
"column1": Column(int),
"column2": Column(str),
})
schema.validate(df)
Traceback (most recent call last):
...
pandera.SchemaError: column 'column1' not in dataframe
column2
0 hello
1 pandera
Stand-alone Column Validation#
In addition to being used in the context of a DataFrameSchema
, Column
objects can also be used to validate columns in a dataframe on its own:
import pandas as pd
import pandera as pa
df = pd.DataFrame({
"column1": [1, 2, 3],
"column2": ["a", "b", "c"],
})
column1_schema = pa.Column(int, name="column1")
column2_schema = pa.Column(str, name="column2")
# pass the dataframe as an argument to the Column object callable
df = column1_schema(df)
validated_df = column2_schema(df)
# or explicitly use the validate method
df = column1_schema.validate(df)
validated_df = column2_schema.validate(df)
# use the DataFrame.pipe method to validate two columns
validated_df = df.pipe(column1_schema).pipe(column2_schema)
For multi-column use cases, the DataFrameSchema
is still recommended, but
if you have one or a small number of columns to verify, using Column
objects by themselves is appropriate.
Column Regex Pattern Matching#
In the case that your dataframe has multiple columns that share common
statistical properties, you might want to specify a regex pattern that matches
a set of meaningfully grouped columns that have str
names.
import numpy as np
import pandas as pd
import pandera as pa
categories = ["A", "B", "C"]
np.random.seed(100)
dataframe = pd.DataFrame({
"cat_var_1": np.random.choice(categories, size=100),
"cat_var_2": np.random.choice(categories, size=100),
"num_var_1": np.random.uniform(0, 10, size=100),
"num_var_2": np.random.uniform(20, 30, size=100),
})
schema = pa.DataFrameSchema({
"num_var_.+": pa.Column(
float,
checks=pa.Check.greater_than_or_equal_to(0),
regex=True,
),
"cat_var_.+": pa.Column(
pa.Category,
checks=pa.Check.isin(categories),
coerce=True,
regex=True,
),
})
print(schema.validate(dataframe).head())
cat_var_1 cat_var_2 num_var_1 num_var_2
0 A A 6.804147 24.743304
1 A C 3.684308 22.774633
2 A C 5.911288 28.416588
3 C A 4.790627 21.951250
4 C B 4.504166 28.563142
You can also regex pattern match on pd.MultiIndex
columns:
np.random.seed(100)
dataframe = pd.DataFrame({
("cat_var_1", "y1"): np.random.choice(categories, size=100),
("cat_var_2", "y2"): np.random.choice(categories, size=100),
("num_var_1", "x1"): np.random.uniform(0, 10, size=100),
("num_var_2", "x2"): np.random.uniform(0, 10, size=100),
})
schema = pa.DataFrameSchema({
("num_var_.+", "x.+"): pa.Column(
float,
checks=pa.Check.greater_than_or_equal_to(0),
regex=True,
),
("cat_var_.+", "y.+"): pa.Column(
pa.Category,
checks=pa.Check.isin(categories),
coerce=True,
regex=True,
),
})
print(schema.validate(dataframe).head())
cat_var_1 cat_var_2 num_var_1 num_var_2
y1 y2 x1 x2
0 A A 6.804147 4.743304
1 A C 3.684308 2.774633
2 A C 5.911288 8.416588
3 C A 4.790627 1.951250
4 C B 4.504166 8.563142
Handling Dataframe Columns not in the Schema#
By default, columns that aren’t specified in the schema aren’t checked.
If you want to check that the DataFrame only contains columns in the
schema, specify strict=True
:
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema
schema = DataFrameSchema(
{"column1": Column(int)},
strict=True)
df = pd.DataFrame({"column2": [1, 2, 3]})
schema.validate(df)
Traceback (most recent call last):
...
SchemaError: column 'column2' not in DataFrameSchema {'column1': <Schema Column: 'None' type=DataType(int64)>}
Alternatively, if your DataFrame contains columns that are not in the schema,
and you would like these to be dropped on validation,
you can specify strict='filter'
.
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema
df = pd.DataFrame({"column1": ["drop", "me"],"column2": ["keep", "me"]})
schema = DataFrameSchema({"column2": Column(str)}, strict='filter')
validated_df = schema.validate(df)
print(validated_df)
column2
0 keep
1 me
Validating the order of the columns#
For some applications the order of the columns is important. For example:
If you want to use selection by position instead of the more common selection by label.
Machine learning: Many ML libraries will cast a Dataframe to numpy arrays, for which order becomes crucial.
To validate the order of the Dataframe columns, specify ordered=True
:
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema(
columns={"a": pa.Column(int), "b": pa.Column(int)}, ordered=True
)
df = pd.DataFrame({"b": [1], "a": [1]})
print(schema.validate(df))
Traceback (most recent call last):
...
SchemaError: column 'b' out-of-order
Validating the joint uniqueness of columns#
In some cases you might want to ensure that a group of columns are unique:
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema(
columns={col: pa.Column(int) for col in ["a", "b", "c"]},
unique=["a", "c"],
)
df = pd.DataFrame.from_records([
{"a": 1, "b": 2, "c": 3},
{"a": 1, "b": 2, "c": 3},
])
schema.validate(df)
Traceback (most recent call last):
...
SchemaError: columns '('a', 'c')' not unique:
column index failure_case
0 a 0 1
1 a 1 1
2 c 0 3
3 c 1 3
To control how unique errors are reported, the report_duplicates argument accepts:
exclude_first: (default) report all duplicates except the first occurrence
exclude_last: report all duplicates except the last occurrence
all: report all duplicates
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema(
columns={col: pa.Column(int) for col in ["a", "b", "c"]},
unique=["a", "c"],
report_duplicates = "exclude_first",
)
df = pd.DataFrame.from_records([
{"a": 1, "b": 2, "c": 3},
{"a": 1, "b": 2, "c": 3},
])
schema.validate(df)
Traceback (most recent call last):
...
SchemaError: columns '('a', 'c')' not unique:
column index failure_case
0 a 1 1
1 c 1 3
Index Validation#
You can also specify an Index
in the DataFrameSchema
.
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema, Index, Check
schema = DataFrameSchema(
columns={"a": Column(int)},
index=Index(
str,
Check(lambda x: x.str.startswith("index_"))))
df = pd.DataFrame(
data={"a": [1, 2, 3]},
index=["index_1", "index_2", "index_3"])
print(schema.validate(df))
a
index_1 1
index_2 2
index_3 3
In the case that the DataFrame index doesn’t pass the Check:
df = pd.DataFrame(
data={"a": [1, 2, 3]},
index=["foo1", "foo2", "foo3"])
schema.validate(df)
Traceback (most recent call last):
...
SchemaError: <Schema Index> failed element-wise validator 0:
<lambda>
failure cases:
index count
failure_case
foo1 [0] 1
foo2 [1] 1
foo3 [2] 1
MultiIndex Validation#
pandera
also supports multi-index column and index validation.
MultiIndex Columns#
Specifying multi-index columns follows the pandas
syntax of specifying
tuples for each level in the index hierarchy:
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema, Index
schema = DataFrameSchema({
("foo", "bar"): Column(int),
("foo", "baz"): Column(str)
})
df = pd.DataFrame({
("foo", "bar"): [1, 2, 3],
("foo", "baz"): ["a", "b", "c"],
})
print(schema.validate(df))
foo
bar baz
0 1 a
1 2 b
2 3 c
MultiIndex Indexes#
The MultiIndex
class allows you to define multi-index
indexes by composing a list of pandera.Index
objects.
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema, Index, MultiIndex, Check
schema = DataFrameSchema(
columns={"column1": Column(int)},
index=MultiIndex([
Index(str,
Check(lambda s: s.isin(["foo", "bar"])),
name="index0"),
Index(int, name="index1"),
])
)
df = pd.DataFrame(
data={"column1": [1, 2, 3]},
index=pd.MultiIndex.from_arrays(
[["foo", "bar", "foo"], [0, 1,2 ]],
names=["index0", "index1"]
)
)
print(schema.validate(df))
column1
index0 index1
foo 0 1
bar 1 2
foo 2 3
Get Pandas Data Types#
Pandas provides a dtype parameter for casting a dataframe to a specific dtype
schema. DataFrameSchema
provides
a dtypes
property which returns a
dictionary whose keys are column names and values are DataType
.
Some examples of where this can be provided to pandas are:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema(
columns={
"column1": pa.Column(int),
"column2": pa.Column(pa.Category),
"column3": pa.Column(bool)
},
)
df = (
pd.DataFrame.from_dict(
{
"a": {"column1": 1, "column2": "valueA", "column3": True},
"b": {"column1": 1, "column2": "valueB", "column3": True},
},
orient="index",
)
.astype({col: str(dtype) for col, dtype in schema.dtypes.items()})
.sort_index(axis=1)
)
print(schema.validate(df))
column1 column2 column3
a 1 valueA True
b 1 valueB True
DataFrameSchema Transformations#
Once you’ve defined a schema, you can then make modifications to it, both on the schema level – such as adding or removing columns and setting or resetting the index – or on the column level – such as changing the data type or checks.
This is useful for re-using schema objects in a data pipeline when additional computation has been done on a dataframe, where the column objects may have changed or perhaps where additional checks may be required.
import pandas as pd
import pandera as pa
data = pd.DataFrame({"col1": range(1, 6)})
schema = pa.DataFrameSchema(
columns={"col1": pa.Column(int, pa.Check(lambda s: s >= 0))},
strict=True)
transformed_schema = schema.add_columns({
"col2": pa.Column(str, pa.Check(lambda s: s == "value")),
"col3": pa.Column(float, pa.Check(lambda x: x == 0.0)),
})
# validate original data
data = schema.validate(data)
# transformation
transformed_data = data.assign(col2="value", col3=0.0)
# validate transformed data
print(transformed_schema.validate(transformed_data))
col1 col2 col3
0 1 value 0.0
1 2 value 0.0
2 3 value 0.0
3 4 value 0.0
4 5 value 0.0
Similarly, if you want dropped columns to be explicitly validated in a data pipeline:
import pandera as pa
schema = pa.DataFrameSchema(
columns={
"col1": pa.Column(int, pa.Check(lambda s: s >= 0)),
"col2": pa.Column(str, pa.Check(lambda x: x <= 0)),
"col3": pa.Column(object, pa.Check(lambda x: x == 0)),
},
strict=True,
)
new_schema = schema.remove_columns(["col2", "col3"])
print(new_schema)
<Schema DataFrameSchema(
columns={
'col1': <Schema Column(name=col1, type=DataType(int64))>
},
checks=[],
coerce=False,
dtype=None,
index=None,
strict=True
name=None,
ordered=False,
unique_column_names=False
)>
If during the course of a data pipeline one of your columns is moved into the
index, you can simply update the initial input schema using the
set_index()
method to create a schema for
the pipeline output.
import pandera as pa
from pandera import Column, DataFrameSchema, Check, Index
schema = DataFrameSchema(
{
"column1": Column(int),
"column2": Column(float)
},
index=Index(int, name = "column3"),
strict=True,
coerce=True,
)
print(schema.set_index(["column1"], append = True))
<Schema DataFrameSchema(
columns={
'column2': <Schema Column(name=column2, type=DataType(float64))>
},
checks=[],
coerce=True,
dtype=None,
index=<Schema MultiIndex(
indexes=[
<Schema Index(name=column3, type=DataType(int64))>
<Schema Index(name=column1, type=DataType(int64))>
]
coerce=False,
strict=False,
name=None,
ordered=True
)>,
strict=True
name=None,
ordered=False,
unique_column_names=False
)>
The available methods for altering the schema are:
add_columns(), remove_columns(), update_columns(), rename_columns(), set_index(), and reset_index().
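As a brief sketch, two of the methods not shown above can be chained on the schema defined in the previous example to rename a column and move the index back into the columns:
# rename "column1" and turn the "column3" index back into a regular column
updated_schema = (
    schema
    .rename_columns({"column1": "col1"})
    .reset_index()
)
# the resulting schema has columns "col1", "column2", and "column3", and no index
print(updated_schema)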
Schema Models#
new in 0.5.0
pandera
provides a class-based API that’s heavily inspired by
pydantic. In contrast to the
object-based API, you can define schema models in
much the same way you’d define pydantic
models.
Schema Models are annotated with the pandera.typing
module using the standard
typing syntax. Models can be
explicitly converted to a DataFrameSchema
or used to validate a
DataFrame
directly.
Note
Due to current limitations in the pandas library (see discussion
here),
pandera
annotations are only used for run-time validation and cannot be
leveraged by static-type checkers like mypy. See the
discussion here
for more details.
Basic Usage#
import pandas as pd
import pandera as pa
from pandera.typing import Index, DataFrame, Series
class InputSchema(pa.SchemaModel):
year: Series[int] = pa.Field(gt=2000, coerce=True)
month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
class OutputSchema(InputSchema):
revenue: Series[float]
@pa.check_types
def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
return df.assign(revenue=100.0)
df = pd.DataFrame({
"year": ["2001", "2002", "2003"],
"month": ["3", "6", "12"],
"day": ["200", "156", "365"],
})
transform(df)
invalid_df = pd.DataFrame({
"year": ["2001", "2002", "1999"],
"month": ["3", "6", "12"],
"day": ["200", "156", "365"],
})
transform(invalid_df)
Traceback (most recent call last):
...
pandera.errors.SchemaError: <Schema Column: 'year' type=DataType(int64)> failed element-wise validator 0:
<Check greater_than: greater_than(2000)>
failure cases:
index failure_case
0 2 1999
As you can see in the example above, you can define a schema by sub-classing
SchemaModel
and defining column/index fields as class attributes.
The check_types()
decorator is required to perform validation of the dataframe at
run-time.
Note that Field
s apply to both
Column
and Index
objects, exposing the built-in Check
s via key-word arguments.
(New in 0.6.2) When you access a class attribute defined on the schema, it will return the name of the column used in the validated pd.DataFrame. In the example above, this will simply be the string “year”.
print(f"Column name for 'year' is {InputSchema.year}\n")
print(df.loc[:, [InputSchema.year, "day"]])
Column name for 'year' is year
year day
0 2001 200
1 2002 156
2 2003 365
Validate on Initialization#
new in 0.8.0
Pandera provides an interface for validating dataframes on initialization.
This API uses the pandera.typing.pandas.DataFrame generic type
to validate data against the SchemaModel type variable on initialization:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
class Schema(pa.SchemaModel):
state: Series[str]
city: Series[str]
price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})
df = DataFrame[Schema](
{
'state': ['NY','FL','GA','CA'],
'city': ['New York', 'Miami', 'Atlanta', 'San Francisco'],
'price': [8, 12, 10, 16],
}
)
print(df)
state city price
0 NY New York 8
1 FL Miami 12
2 GA Atlanta 10
3 CA San Francisco 16
Refer to Supported DataFrame Libraries to see how this syntax applies to other supported dataframe types.
Converting to DataFrameSchema#
You can easily convert a SchemaModel
class into a
DataFrameSchema
:
print(InputSchema.to_schema())
<Schema DataFrameSchema(
columns={
'year': <Schema Column(name=year, type=DataType(int64))>
'month': <Schema Column(name=month, type=DataType(int64))>
'day': <Schema Column(name=day, type=DataType(int64))>
},
checks=[],
coerce=False,
dtype=None,
index=None,
strict=False
name=InputSchema,
ordered=False,
unique_column_names=False
)>
You can also use the validate()
method to
validate dataframes:
print(InputSchema.validate(df))
year month day
0 2001 3 200
1 2002 6 156
2 2003 12 365
Or you can use the SchemaModel()
class directly to
validate dataframes, which is syntactic sugar that simply delegates to the
validate()
method.
print(InputSchema(df))
year month day
0 2001 3 200
1 2002 6 156
2 2003 12 365
Excluded attributes#
Class variables which begin with an underscore will be automatically excluded from the model. Config is also a reserved name. However, aliases can be used to circumvent these limitations.
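A minimal sketch of the underscore rule (the attribute names here are made up):
import pandera as pa
from pandera.typing import Series

class ExampleSchema(pa.SchemaModel):
    a: Series[int]
    # starts with an underscore, so it is excluded from the model
    _internal: Series[int]

# only "a" ends up in the generated schema
print(list(ExampleSchema.to_schema().columns))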
Supported dtypes#
Any dtypes supported by pandera
can be used as type parameters for
Series
and Index
. There are,
however, a couple of gotchas.
Dtype aliases#
import pandera as pa
from pandera.typing import Series, String
class Schema(pa.SchemaModel):
a: Series[String]
Type Vs instance#
You must give a type, not an instance.
✔ Good:
import pandas as pd
class Schema(pa.SchemaModel):
a: Series[pd.StringDtype]
✘ Bad:
class Schema(pa.SchemaModel):
a: Series[pd.StringDtype()]
Traceback (most recent call last):
...
TypeError: Parameters to generic types must be types. Got string[python].
Parametrized dtypes#
Pandas supports a couple of parametrized dtypes. As of pandas 1.2.0:
Kind of Data | Data Type | Parameters
---|---|---
tz-aware datetime | DatetimeTZDtype | unit, tz
Categorical | CategoricalDtype | categories, ordered
period | PeriodDtype | freq
sparse | SparseDtype | dtype, fill_value
intervals | IntervalDtype | subtype
Annotated#
Parameters can be given via typing.Annotated
. It requires python >= 3.9 or
typing_extensions, which is already a
requirement of Pandera. Unfortunately typing.Annotated
has not been backported
to python 3.6.
✔ Good:
try:
from typing import Annotated # python 3.9+
except ImportError:
from typing_extensions import Annotated
class Schema(pa.SchemaModel):
col: Series[Annotated[pd.DatetimeTZDtype, "ns", "est"]]
Furthermore, you must pass all parameters in the order defined in the dtype’s constructor (see table).
✘ Bad:
class Schema(pa.SchemaModel):
col: Series[Annotated[pd.DatetimeTZDtype, "utc"]]
Schema.to_schema()
Traceback (most recent call last):
...
TypeError: Annotation 'DatetimeTZDtype' requires all positional arguments ['unit', 'tz'].
Field#
✔ Good:
class SchemaFieldDatetimeTZDtype(pa.SchemaModel):
col: Series[pd.DatetimeTZDtype] = pa.Field(dtype_kwargs={"unit": "ns", "tz": "EST"})
You cannot use both typing.Annotated
and dtype_kwargs
.
✘ Bad:
class SchemaFieldDatetimeTZDtype(pa.SchemaModel):
col: Series[Annotated[pd.DatetimeTZDtype, "ns", "est"]] = pa.Field(dtype_kwargs={"unit": "ns", "tz": "EST"})
SchemaFieldDatetimeTZDtype.to_schema()
Traceback (most recent call last):
...
TypeError: Cannot specify redundant 'dtype_kwargs' for pandera.typing.Series[typing_extensions.Annotated[pandas.core.dtypes.dtypes.DatetimeTZDtype, 'ns', 'est']].
Usage Tip: Drop 'typing.Annotated'.
Required Columns#
By default all columns specified in the schema are required, meaning
that if a column is missing in the input DataFrame an exception will be
thrown. If you want to make a column optional, annotate it with typing.Optional
.
from typing import Optional
import pandas as pd
import pandera as pa
from pandera.typing import Series
class Schema(pa.SchemaModel):
a: Series[str]
b: Optional[Series[int]]
df = pd.DataFrame({"a": ["2001", "2002", "2003"]})
Schema.validate(df)
Schema Inheritance#
You can also use inheritance to build schemas on top of a base schema.
class BaseSchema(pa.SchemaModel):
year: Series[str]
class FinalSchema(BaseSchema):
year: Series[int] = pa.Field(ge=2000, coerce=True) # overwrite the base type
passengers: Series[int]
idx: Index[int] = pa.Field(ge=0)
df = pd.DataFrame({
"year": ["2000", "2001", "2002"],
})
@pa.check_types
def transform(df: DataFrame[BaseSchema]) -> DataFrame[FinalSchema]:
return (
df.assign(passengers=[61000, 50000, 45000])
.set_index(pd.Index([1, 2, 3]))
.astype({"year": int})
)
print(transform(df))
year passengers
1 2000 61000
2 2001 50000
3 2002 45000
Config#
Schema-wide options can be controlled via the Config
class on the SchemaModel
subclass. The full set of options can be found in the BaseConfig
class.
class Schema(pa.SchemaModel):
year: Series[int] = pa.Field(gt=2000, coerce=True)
month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
class Config:
name = "BaseSchema"
strict = True
coerce = True
foo = "bar" # Interpreted as dataframe check
It is not required for the Config
to subclass BaseConfig
but
it must be named ‘Config’.
See Registered Custom Checks with the Class-based API for details on using registered dataframe checks.
MultiIndex#
The MultiIndex
capabilities are also supported with
the class-based API:
import pandera as pa
from pandera.typing import Index, Series
class MultiIndexSchema(pa.SchemaModel):
year: Index[int] = pa.Field(gt=2000, coerce=True)
month: Index[int] = pa.Field(ge=1, le=12, coerce=True)
passengers: Series[int]
class Config:
# provide multi index options in the config
multiindex_name = "time"
multiindex_strict = True
multiindex_coerce = True
index = MultiIndexSchema.to_schema().index
print(index)
<Schema MultiIndex(
indexes=[
<Schema Index(name=year, type=DataType(int64))>
<Schema Index(name=month, type=DataType(int64))>
]
coerce=True,
strict=True,
name=time,
ordered=True
)>
from pprint import pprint
pprint({name: col.checks for name, col in index.columns.items()})
{'month': [<Check greater_than_or_equal_to: greater_than_or_equal_to(1)>,
<Check less_than_or_equal_to: less_than_or_equal_to(12)>],
'year': [<Check greater_than: greater_than(2000)>]}
Multiple Index
annotations are automatically converted into a
MultiIndex
. MultiIndex options are given in the
Config.
Index Name#
Use check_name
to validate the index name of a single-index dataframe:
import pandas as pd
import pandera as pa
from pandera.typing import Index, Series
class Schema(pa.SchemaModel):
year: Series[int] = pa.Field(gt=2000, coerce=True)
passengers: Series[int]
idx: Index[int] = pa.Field(ge=0, check_name=True)
df = pd.DataFrame({
"year": [2001, 2002, 2003],
"passengers": [61000, 50000, 45000],
})
Schema.validate(df)
Traceback (most recent call last):
...
pandera.errors.SchemaError: Expected <class 'pandera.schema_components.Index'> to have name 'idx', found 'None'
The check_name argument's default value of None translates to True for columns and multi-index components.
Custom Checks#
Unlike the object-based API, custom checks can be specified as class methods.
Column/Index checks#
import pandera as pa
from pandera.typing import Index, Series
class CustomCheckSchema(pa.SchemaModel):
a: Series[int] = pa.Field(gt=0, coerce=True)
abc: Series[int]
idx: Index[str]
@pa.check("a", name="foobar")
def custom_check(cls, a: Series[int]) -> Series[bool]:
return a < 100
@pa.check("^a", regex=True, name="foobar")
def custom_check_regex(cls, a: Series[int]) -> Series[bool]:
return a > 0
@pa.check("idx")
def check_idx(cls, idx: Index[int]) -> Series[bool]:
return idx.str.contains("dog")
Note
You can supply the key-word arguments of the Check class initializer to get the flexibility of groupby checks.
Similarly to pydantic, the classmethod() decorator is added behind the scenes if omitted.
You still may need to add the @classmethod decorator after the check() decorator if your static-type checker or linter complains.
Since checks are class methods, the first argument value they receive is a SchemaModel subclass, not an instance of a model.
from typing import Dict
import pandas as pd
class GroupbyCheckSchema(pa.SchemaModel):
value: Series[int] = pa.Field(gt=0, coerce=True)
group: Series[str] = pa.Field(isin=["A", "B"])
@pa.check("value", groupby="group", regex=True, name="check_means")
def check_groupby(cls, grouped_value: Dict[str, Series[int]]) -> bool:
return grouped_value["A"].mean() < grouped_value["B"].mean()
df = pd.DataFrame({
"value": [100, 110, 120, 10, 11, 12],
"group": list("AAABBB"),
})
print(GroupbyCheckSchema.validate(df))
Traceback (most recent call last):
...
pandera.errors.SchemaError: <Schema Column: 'value' type=DataType(int64)> failed series validator 1:
<Check check_means>
DataFrame Checks#
You can also define dataframe-level checks, similar to the
object-based API, using the
dataframe_check()
decorator:
import pandas as pd
import pandera as pa
from pandera.typing import Index, Series
class DataFrameCheckSchema(pa.SchemaModel):
col1: Series[int] = pa.Field(gt=0, coerce=True)
col2: Series[float] = pa.Field(gt=0, coerce=True)
col3: Series[float] = pa.Field(lt=0, coerce=True)
@pa.dataframe_check
def product_is_negative(cls, df: pd.DataFrame) -> Series[bool]:
return df["col1"] * df["col2"] * df["col3"] < 0
df = pd.DataFrame({
"col1": [1, 2, 3],
"col2": [5, 6, 7],
"col3": [-1, -2, -3],
})
DataFrameCheckSchema.validate(df)
Inheritance#
The custom checks are inherited and therefore can be overwritten by the subclass.
import pandas as pd
import pandera as pa
from pandera.typing import Index, Series
class Parent(pa.SchemaModel):
a: Series[int] = pa.Field(coerce=True)
@pa.check("a", name="foobar")
def check_a(cls, a: Series[int]) -> Series[bool]:
return a < 100
class Child(Parent):
a: Series[int] = pa.Field(coerce=False)
@pa.check("a", name="foobar")
def check_a(cls, a: Series[int]) -> Series[bool]:
return a > 100
is_a_coerce = Child.to_schema().columns["a"].coerce
print(f"coerce: {is_a_coerce}")
coerce: False
df = pd.DataFrame({"a": [1, 2, 3]})
print(Child.validate(df))
Traceback (most recent call last):
...
pandera.errors.SchemaError: <Schema Column: 'a' type=DataType(int64)> failed element-wise validator 0:
<Check foobar>
failure cases:
index failure_case
0 0 1
1 1 2
2 2 3
Aliases#
SchemaModel
supports columns which are not valid python variable names via the argument
alias of Field
.
Checks must reference the aliased names.
import pandera as pa
import pandas as pd
class Schema(pa.SchemaModel):
col_2020: pa.typing.Series[int] = pa.Field(alias=2020)
idx: pa.typing.Index[int] = pa.Field(alias="_idx", check_name=True)
@pa.check(2020)
def int_column_lt_100(cls, series):
return series < 100
df = pd.DataFrame({2020: [99]}, index=[0])
df.index.name = "_idx"
print(Schema.validate(df))
2020
_idx
0 99
(New in 0.6.2) The alias is respected when using the class attribute to get the underlying pd.DataFrame column name or index level name.
print(Schema.col_2020)
2020
Very similar to the example above, you can also use the variable name directly within the class scope, and it will respect the alias.
Note
To access a variable from the class scope, you need to make it a class attribute,
and therefore assign it a default Field
.
import pandera as pa
import pandas as pd
class Schema(pa.SchemaModel):
a: pa.typing.Series[int] = pa.Field()
col_2020: pa.typing.Series[int] = pa.Field(alias=2020)
@pa.check(col_2020)
def int_column_lt_100(cls, series):
return series < 100
@pa.check(a)
def int_column_gt_100(cls, series):
return series > 100
df = pd.DataFrame({2020: [99], "a": [101]})
print(Schema.validate(df))
2020 a
0 99 101
Series Schemas#
The SeriesSchema class allows for the validation of pandas Series objects,
and is very similar to the columns and indexes described in DataFrameSchemas.
import pandas as pd
import pandera as pa
# specify multiple validators
schema = pa.SeriesSchema(
str,
checks=[
pa.Check(lambda s: s.str.startswith("foo")),
pa.Check(lambda s: s.str.endswith("bar")),
pa.Check(lambda x: len(x) > 3, element_wise=True)
],
nullable=False,
unique=False,
name="my_series")
validated_series = schema.validate(
pd.Series(["foobar", "foobar", "foobar"], name="my_series"))
print(validated_series)
0 foobar
1 foobar
2 foobar
Name: my_series, dtype: object
Checks#
Checking column properties#
Check
objects accept a function as a required argument, which is
expected to take a pa.Series
input and output a boolean
or a Series
of boolean values. For the check to pass, all of the elements in the boolean
series must evaluate to True
, for example:
import pandas as pd
import pandera as pa
check_lt_10 = pa.Check(lambda s: s <= 10)
schema = pa.DataFrameSchema({"column1": pa.Column(int, check_lt_10)})
schema.validate(pd.DataFrame({"column1": range(10)}))
Multiple checks can be applied to a column:
schema = pa.DataFrameSchema({
"column2": pa.Column(str, [
pa.Check(lambda s: s.str.startswith("value")),
pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
]),
})
Built-in Checks#
For common validation tasks, built-in checks are available in pandera
.
import pandera as pa
from pandera import Column, Check, DataFrameSchema
schema = DataFrameSchema({
"small_values": Column(float, Check.less_than(100)),
"one_to_three": Column(int, Check.isin([1, 2, 3])),
"phone_number": Column(str, Check.str_matches(r'^[a-z0-9-]+$')),
})
See the Check
API reference for a complete list of built-in checks.
Vectorized vs. Element-wise Checks#
By default, Check
objects operate on pd.Series
objects. If you want to make atomic checks for each element in the Column, then
you can provide the element_wise=True
keyword argument:
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema({
"a": pa.Column(
int,
checks=[
# a vectorized check that returns a bool
pa.Check(lambda s: s.mean() > 5, element_wise=False),
# a vectorized check that returns a boolean series
pa.Check(lambda s: s > 0, element_wise=False),
# an element-wise check that returns a bool
pa.Check(lambda x: x > 0, element_wise=True),
]
),
})
df = pd.DataFrame({"a": [4, 4, 5, 6, 6, 7, 8, 9]})
schema.validate(df)
element_wise == False
by default so that you can take advantage of the
speed gains provided by the pd.Series
API by writing vectorized
checks.
Handling Null Values#
By default, pandera drops null values before passing the object to be
validated into the check function. For Series objects null elements are
dropped (this also applies to columns), and for DataFrame objects, rows
with any null value are dropped.
If you want to check the properties of a pandas data structure while preserving
null values, specify Check(..., ignore_na=False)
when defining a check.
Note that this is different from the nullable
argument in Column
objects, which simply checks for null values in a column.
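Here is a minimal sketch of the difference (the column name, data, and check are illustrative):
import numpy as np
import pandas as pd
import pandera as pa

df = pd.DataFrame({"column1": [1.0, 2.0, np.nan]})

# by default the null value is dropped before the check runs, so this passes
pa.DataFrameSchema({
    "column1": pa.Column(float, pa.Check(lambda s: s > 0), nullable=True)
}).validate(df)

# with ignore_na=False the check also sees the NaN, which fails `s > 0`
strict_schema = pa.DataFrameSchema({
    "column1": pa.Column(
        float, pa.Check(lambda s: s > 0, ignore_na=False), nullable=True
    )
})
# strict_schema.validate(df)  # raises SchemaError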
Column Check Groups#
Column
checks support grouping by a different column so that you
can make assertions about subsets of the column of interest. This
changes the function signature of the Check
function so that its
input is a dict where keys are the group names and values are subsets of the
series being validated.
Specifying groupby
as a column name, list of column names, or
callable changes the expected signature of the Check
function argument to:
Callable[[Dict[Any, pd.Series]], Union[bool, pd.Series]]
where the dict keys are the discrete keys in the groupby
columns.
In the example below we define a DataFrameSchema
with column checks
for height_in_feet
using a single column, multiple columns, and a more
complex groupby function that creates a new column age_less_than_15
on the
fly.
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema({
"height_in_feet": pa.Column(
float, [
# groupby as a single column
pa.Check(
lambda g: g[False].mean() > 6,
groupby="age_less_than_20"),
# define multiple groupby columns
pa.Check(
lambda g: g[(True, "F")].sum() == 9.1,
groupby=["age_less_than_20", "sex"]),
# groupby as a callable with signature:
# (DataFrame) -> DataFrameGroupBy
pa.Check(
lambda g: g[(False, "M")].median() == 6.75,
groupby=lambda df: (
df.assign(age_less_than_15=lambda d: d["age"] < 15)
.groupby(["age_less_than_15", "sex"]))),
]),
"age": pa.Column(int, pa.Check(lambda s: s > 0)),
"age_less_than_20": pa.Column(bool),
"sex": pa.Column(str, pa.Check(lambda s: s.isin(["M", "F"])))
})
df = (
pd.DataFrame({
"height_in_feet": [6.5, 7, 6.1, 5.1, 4],
"age": [25, 30, 21, 18, 13],
"sex": ["M", "M", "F", "F", "F"]
})
.assign(age_less_than_20=lambda x: x["age"] < 20)
)
schema.validate(df)
Wide Checks#
pandera
is primarily designed to operate on long-form data (commonly known
as tidy data), where each row
is an observation and each column is an attribute associated with an
observation.
However, pandera
also supports checks on wide-form data to operate across
columns in a DataFrame
. For example, if you want to make assertions about
height
across two groups, the tidy dataset and schema might look like this:
import pandas as pd
import pandera as pa
df = pd.DataFrame({
"height": [5.6, 6.4, 4.0, 7.1],
"group": ["A", "B", "A", "B"],
})
schema = pa.DataFrameSchema({
"height": pa.Column(
float,
pa.Check(lambda g: g["A"].mean() < g["B"].mean(), groupby="group")
),
"group": pa.Column(str)
})
schema.validate(df)
Whereas the equivalent wide-form schema would look like this:
df = pd.DataFrame({
"height_A": [5.6, 4.0],
"height_B": [6.4, 7.1],
})
schema = pa.DataFrameSchema(
columns={
"height_A": pa.Column(float),
"height_B": pa.Column(float),
},
# define checks at the DataFrameSchema-level
checks=pa.Check(
lambda df: df["height_A"].mean() < df["height_B"].mean()
)
)
schema.validate(df)
You can see that when checks are supplied to the DataFrameSchema
checks
key-word argument, the check function should expect a pandas DataFrame
and
should return a bool
, a Series
of booleans, or a DataFrame
of
boolean values.
Raise UserWarning on Check Failure#
In some cases, you might want to raise a UserWarning
and continue execution
of your program. The Check
and Hypothesis
classes and their built-in
methods support the keyword argument raise_warning
, which is False
by default. If set to True
, the check will raise a UserWarning
instead
of raising a SchemaError
exception.
Note
Use this feature carefully! If the check is for informational purposes and
not critical for data integrity, then use raise_warning=True. However,
if the assumptions expressed in a Check are necessary conditions for
considering your data valid, do not set this option to True.
One scenario where you’d want to do this would be in a data pipeline that does some preprocessing, checks for normality in certain columns, and writes the resulting dataset to a table. In this case, you want to see if your normality assumptions are not fulfilled by certain columns, but you still want the resulting table for further analysis.
import warnings
import numpy as np
import pandas as pd
import pandera as pa
from scipy.stats import normaltest
np.random.seed(1000)
df = pd.DataFrame({
"var1": np.random.normal(loc=0, scale=1, size=1000),
"var2": np.random.uniform(low=0, high=10, size=1000),
})
normal_check = pa.Hypothesis(
test=normaltest,
samples="normal_variable",
# null hypothesis: sample comes from a normal distribution. The
# relationship function checks if we cannot reject the null hypothesis,
# i.e. the p-value is greater or equal to alpha.
relationship=lambda stat, pvalue, alpha=0.05: pvalue >= alpha,
error="normality test",
raise_warning=True,
)
schema = pa.DataFrameSchema(
columns={
"var1": pa.Column(checks=normal_check),
"var2": pa.Column(checks=normal_check),
}
)
# catch and print warnings
with warnings.catch_warnings(record=True) as caught_warnings:
warnings.simplefilter("always")
validated_df = schema(df)
for warning in caught_warnings:
print(warning.message)
<Schema Column(name=var2, type=None)> failed series or dataframe validator 0:
<Check _hypothesis_check: normality test>
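The same keyword also works on a plain Check. A minimal sketch (the column and threshold are illustrative):
import warnings
import pandas as pd
import pandera as pa

warn_schema = pa.DataFrameSchema({
    "column1": pa.Column(int, pa.Check.less_than(5, raise_warning=True))
})

with warnings.catch_warnings(record=True) as caught_warnings:
    warnings.simplefilter("always")
    # the value 10 violates the check, emitting a UserWarning instead of raising
    validated_df = warn_schema.validate(pd.DataFrame({"column1": [1, 10]}))
for warning in caught_warnings:
    print(warning.message)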
Registering Custom Checks#
pandera
now offers an interface to register custom checks functions so
that they’re available in the Check
namespace. See
the extensions document for more information.
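To give a flavor of that interface, here is a short sketch based on the register_check_method decorator; the check name and statistics are illustrative, so refer to the extensions document for the authoritative API:
import pandas as pd
import pandera as pa
import pandera.extensions as extensions

# register a custom check so it becomes available in the Check namespace
@extensions.register_check_method(statistics=["min_value", "max_value"])
def is_between(pandas_obj, *, min_value, max_value):
    return (min_value <= pandas_obj) & (pandas_obj <= max_value)

schema = pa.DataFrameSchema({
    "col": pa.Column(int, pa.Check.is_between(min_value=1, max_value=10))
})
schema.validate(pd.DataFrame({"col": [5, 7]}))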
Hypothesis Testing#
pandera
enables you to perform statistical hypothesis tests on your data.
Note
The hypothesis feature requires a pandera installation with the hypotheses
extra. See the installation instructions for more details.
Overview#
The Hypothesis
class defines built in methods,
which can be called as in this example of a two-sample t-test:
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema, Check, Hypothesis
from scipy import stats
df = (
pd.DataFrame({
"height_in_feet": [6.5, 7, 6.1, 5.1, 4],
"sex": ["M", "M", "F", "F", "F"]
})
)
schema = DataFrameSchema({
"height_in_feet": Column(
float, [
Hypothesis.two_sample_ttest(
sample1="M",
sample2="F",
groupby="sex",
relationship="greater_than",
alpha=0.05,
equal_var=True),
]),
"sex": Column(str)
})
schema.validate(df)
Traceback (most recent call last):
...
pandera.SchemaError: <Schema Column: 'height_in_feet' type=float64> failed series validator 0: hypothesis_check: failed two sample ttest between 'M' and 'F'
You can also define custom hypotheses by passing in functions to the
test
and relationship
arguments.
The test
function takes as input one or multiple array-like objects
and should return a stat
, which is the test statistic, and pvalue
for
assessing statistical significance. It also takes key-word arguments supplied
by the test_kwargs
dict when initializing a Hypothesis
object.
The relationship
function should take all of the outputs of test
as
positional arguments, in addition to key-word arguments supplied by the
relationship_kwargs
dict.
Here’s an implementation of the two-sample t-test that uses the scipy implementation:
def two_sample_ttest(array1, array2):
# the "height_in_feet" series is first grouped by "sex" and then
# passed into the custom `test` function as two separate arrays in the
# order specified in the `samples` argument.
return stats.ttest_ind(array1, array2)
def null_relationship(stat, pvalue, alpha=0.01):
return pvalue / 2 >= alpha
schema = DataFrameSchema({
"height_in_feet": Column(
float, [
Hypothesis(
test=two_sample_ttest,
samples=["M", "F"],
groupby="sex",
relationship=null_relationship,
relationship_kwargs={"alpha": 0.05}
)
]),
"sex": Column(str, checks=Check.isin(["M", "F"]))
})
schema.validate(df)
Wide Hypotheses#
pandera
is primarily designed to operate on long-form data (commonly known
as tidy data), where each row
is an observation and columns are attributes associated with the observation.
However, pandera
also supports hypothesis testing on wide-form data to
operate across columns in a DataFrame
.
For example, if you want to make assertions about height
across two groups,
the tidy dataset and schema might look like this:
import pandas as pd
import pandera as pa
from pandera import Check, DataFrameSchema, Column, Hypothesis
df = pd.DataFrame({
"height": [5.6, 7.5, 4.0, 7.9],
"group": ["A", "B", "A", "B"],
})
schema = DataFrameSchema({
"height": Column(
float, Hypothesis.two_sample_ttest(
"A", "B",
groupby="group",
relationship="less_than",
alpha=0.05
)
),
"group": Column(str, Check(lambda s: s.isin(["A", "B"])))
})
schema.validate(df)
The equivalent wide-form schema would look like this:
import pandas as pd
import pandera as pa
from pandera import DataFrameSchema, Column, Hypothesis
df = pd.DataFrame({
"height_A": [5.6, 4.0],
"height_B": [7.5, 7.9],
})
schema = DataFrameSchema(
columns={
"height_A": Column(Float),
"height_B": Column(Float),
},
# define checks at the DataFrameSchema-level
checks=Hypothesis.two_sample_ttest(
"height_A", "height_B",
relationship="less_than",
alpha=0.05
)
)
schema.validate(df)
Pandera Data Types#
new in 0.7.0
Motivations#
Pandera defines its own interface for data types in order to abstract the specifics of dataframe-like data structures in the python ecosystem, such as Apache Spark, Apache Arrow and xarray.
Note
In the following section Pandera Data Type
refers to a
pandera.dtypes.DataType
object whereas native data type
refers
to data types used by third-party libraries that Pandera supports (e.g. pandas).
Most of the time, it is transparent to end users since pandera columns and indexes accept native data types. However, it is possible to extend the pandera interface by:
modifying the data type check performed during schema validation.
modifying the behavior of the coerce argument for DataFrameSchema.
adding your own custom data types.
DataType basics#
All pandera data types inherit from pandera.dtypes.DataType
and must
be hashable.
A data type implements three key methods:
pandera.dtypes.DataType.check(), which validates that data types are equivalent.
pandera.dtypes.DataType.coerce(), which coerces a data container (e.g. pandas.Series) to the data type.
The dunder method __str__(), which should output the native alias. For example, str(pandera.Float64) == "float64".
For pandera’s validation methods to be aware of a data type, it has to be
registered with the targeted engine via pandera.engines.engine.Engine.register_dtype()
.
An engine is in charge of mapping a pandera DataType
with a native data type counterpart belonging to a third-party library. The mapping
can be queried with pandera.engines.engine.Engine.dtype()
.
As of pandera 0.7.0
, only the pandas Engine
is supported.
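For example, you can query the pandas engine for the pandera data type that a native type maps to; this is a small sketch, and the printed representation may vary across versions and platforms:
from pandera.engines import pandas_engine

# look up the pandera data type registered for the built-in int type
print(pandas_engine.Engine.dtype(int))  # typically the pandera int64 data type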
Example#
Let’s extend pandas.BooleanDtype
coercion to handle the string
literals "True"
and "False"
.
import pandas as pd
import pandera as pa
from pandera import dtypes
from pandera.engines import pandas_engine
@pandas_engine.Engine.register_dtype # step 1
@dtypes.immutable # step 2
class LiteralBool(pandas_engine.BOOL): # step 3
def coerce(self, series: pd.Series) -> pd.Series:
"""Coerce a pandas.Series to boolean types."""
if pd.api.types.is_string_dtype(series):
series = series.replace({"True": 1, "False": 0})
return series.astype("boolean")
data = pd.Series(["True", "False"], name="literal_bools")
# step 4
print(
pa.SeriesSchema(LiteralBool(), coerce=True, name="literal_bools")
.validate(data)
.dtype
)
boolean
The example above performs the following steps:
Register the data type with the pandas engine.
pandera.dtypes.immutable() creates an immutable (and hashable) dataclass().
Inherit pandera.engines.pandas_engine.BOOL, which is the pandera representation of pandas.BooleanDtype. This is not mandatory but it makes our life easier by having already implemented all the required methods.
Check that our new data type can coerce the string literals.
So far we did not override the default behavior:
import pandera as pa
pa.SeriesSchema("boolean", coerce=True).validate(data)
Traceback (most recent call last):
...
pandera.errors.SchemaError: Error while coercing 'literal_bools' to type boolean: Need to pass bool-like values
To completely replace the default BOOL
,
we need to supply all the equivalent representations to
register_dtype()
. Behind the scenes, when
pa.SeriesSchema("boolean")
is called the corresponding pandera data type
is looked up using pandera.engines.engine.Engine.dtype()
.
print(f"before: {pandas_engine.Engine.dtype('boolean').__class__}")
@pandas_engine.Engine.register_dtype(
equivalents=["boolean", pd.BooleanDtype, pd.BooleanDtype()],
)
@dtypes.immutable
class LiteralBool(pandas_engine.BOOL):
def coerce(self, series: pd.Series) -> pd.Series:
"""Coerce a pandas.Series to boolean types."""
if pd.api.types.is_string_dtype(series):
series = series.replace({"True": 1, "False": 0})
return series.astype("boolean")
print(f"after: {pandas_engine.Engine.dtype('boolean').__class__}")
for dtype in ["boolean", pd.BooleanDtype, pd.BooleanDtype()]:
pa.SeriesSchema(dtype, coerce=True).validate(data)
before: <class 'pandera.engines.pandas_engine.BOOL'>
after: <class 'LiteralBool'>
Note
For convenience, we specified both pd.BooleanDtype
and
pd.BooleanDtype()
as equivalents. That gives us more flexibility in
what pandera schemas can recognize (see last for-loop above).
Parametrized data types#
Some data types can be parametrized. One common example is
pandas.CategoricalDtype
.
The equivalents
argument of
register_dtype()
does not handle
this situation but will automatically register a classmethod()
with
signature from_parametrized_dtype(cls, equivalent:...)
if the decorated
DataType
defines it. The equivalent
argument must
be type-annotated because it is leveraged to dispatch the input of
dtype
to the appropriate
from_parametrized_dtype
class method.
For example, here is a snippet from pandera.engines.pandas_engine.Category
:
import pandas as pd
from pandera import dtypes
@classmethod
def from_parametrized_dtype(
cls, cat: Union[dtypes.Category, pd.CategoricalDtype]
):
"""Convert a categorical to
a Pandera :class:`pandera.dtypes.pandas_engine.Category`."""
return cls(categories=cat.categories, ordered=cat.ordered) # type: ignore
Note
The dispatch mechanism relies on functools.singledispatch()
.
Unlike the built-in implementation, typing.Union
is recognized.
Defining the coerce_value method#
For pandera data types to correctly report coercion errors, they need to know how to coerce an individual value into the specified type.
All pandas
data types are supported: numpy
-based datatypes use the
underlying numpy dtype to coerce an individual value. The pandas
-native
datatypes like CategoricalDtype
and BooleanDtype
are also supported.
As an example of a special-cased coerce_value
implementation, see the
source code for pandera.engines.pandas_engine.Category.coerce_value()
:
def coerce_value(self, value: Any) -> Any:
"""Coerce an value to a particular type."""
if value not in self.categories: # type: ignore
raise TypeError(
f"value {value} cannot be coerced to type {self.type}"
)
return value
Logical data types#
Taking inspiration from the visions project, pandera provides an interface for defining logical data types.
Physical types represent the actual, underlying representation of the data.
e.g.: Int8
, Float32
, String
, etc., whereas logical types represent the
abstracted understanding of that data. e.g.: IPs
, URLs
, paths
, etc.
Validating a logical data type consists of validating the supporting physical data type (see Motivations) and a check on actual values. For example, an IP address data type would validate that:
The data container type is a String.
The actual values are well-formed addresses.
Non-native pandas dtypes can also be wrapped in a numpy.object_ and verified
using the data, since the object dtype alone is not enough to verify
correctness. An example is the standard decimal.Decimal class, which can be
validated via the pandera DataType Decimal.
To implement a logical data type, you just need to implement the method
pandera.dtypes.DataType.check()
and make use of the data_container
argument to
perform checks on the values of the data.
For example, you can create an IPAddress
datatype that inherits from the numpy string
physical type, thereby storing the values as strings, and checks whether the values actually
match an IP address regular expression.
import re
from typing import Optional, Iterable, Union

import pandas as pd
import pandera as pa
from pandera import dtypes
from pandera.engines import pandas_engine
@pandas_engine.Engine.register_dtype
@dtypes.immutable
class IPAddress(pandas_engine.NpString):
def check(
self,
pandera_dtype: dtypes.DataType,
data_container: Optional[pd.Series] = None,
) -> Union[bool, Iterable[bool]]:
# ensure that the data container's data type is a string,
# using the parent class's check implementation
correct_type = super().check(pandera_dtype)
if not correct_type:
return correct_type
# ensure the values match the IP address regular expression
exp = re.compile(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")
return data_container.map(lambda x: exp.match(x) is not None)
def __str__(self) -> str:
return str(self.__class__.__name__)
def __repr__(self) -> str:
return f"DataType({self})"
schema = pa.DataFrameSchema(columns={"ips": pa.Column(IPAddress)})
schema.validate(pd.DataFrame({"ips": ["0.0.0.0", "0.0.0.1", "0.0.0.a"]}))
Traceback (most recent call last):
...
pandera.errors.SchemaError: expected series 'ips' to have type IPAddress:
failure cases:
index failure_case
0 2 0.0.0.a
Decorators for Pipeline Integration#
If you have an existing data pipeline that uses pandas data structures,
you can use the check_input()
and check_output()
decorators
to easily check function arguments or returned variables from existing
functions.
Check Input#
Validates input pandas DataFrame/Series before entering the wrapped function.
import pandas as pd
import pandera as pa
from pandera import DataFrameSchema, Column, Check, check_input
df = pd.DataFrame({
"column1": [1, 4, 0, 10, 9],
"column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
})
in_schema = DataFrameSchema({
"column1": Column(int,
Check(lambda x: 0 <= x <= 10, element_wise=True)),
"column2": Column(float, Check(lambda x: x < -1.2)),
})
# by default, check_input assumes that the first argument is
# dataframe/series.
@check_input(in_schema)
def preprocessor(dataframe):
dataframe["column3"] = dataframe["column1"] + dataframe["column2"]
return dataframe
preprocessed_df = preprocessor(df)
print(preprocessed_df)
column1 column2 column3
0 1 -1.3 -0.3
1 4 -1.4 2.6
2 0 -2.9 -2.9
3 10 -10.1 -0.1
4 9 -20.4 -11.4
You can also provide the argument name as a string
@check_input(in_schema, "dataframe")
def preprocessor(dataframe):
...
Or an integer representing the index in the positional arguments.
@check_input(in_schema, 1)
def preprocessor(foo, dataframe):
...
Check Output#
The same as check_input
, but this decorator checks the output
DataFrame/Series of the decorated function.
import pandas as pd
import pandera as pa
from pandera import DataFrameSchema, Column, Check, check_output
preprocessed_df = pd.DataFrame({
"column1": [1, 4, 0, 10, 9],
})
# assert that all elements in "column1" are zero
out_schema = DataFrameSchema({
"column1": Column(int, Check(lambda x: x == 0))
})
# by default, check_output assumes that the pandas DataFrame/Series is the only output
@check_output(out_schema)
def zero_column_1(df):
df["column1"] = 0
return df
# you can also specify the index of the output to validate if the output is list-like
@check_output(out_schema, 1)
def zero_column_1_arg(df):
df["column1"] = 0
return "foobar", df
# or the key containing the data structure to verify if the output is dict-like
@check_output(out_schema, "out_df")
def zero_column_1_dict(df):
df["column1"] = 0
return {"out_df": df, "out_str": "foobar"}
# for more complex outputs, you can specify a function
@check_output(out_schema, lambda x: x[1]["out_df"])
def zero_column_1_custom(df):
df["column1"] = 0
return ("foobar", {"out_df": df})
zero_column_1(preprocessed_df)
zero_column_1_arg(preprocessed_df)
zero_column_1_dict(preprocessed_df)
zero_column_1_custom(preprocessed_df)
Check IO#
For convenience, you can also use the check_io()
decorator where you can specify input and output schemas more concisely:
import pandas as pd
import pandera as pa
from pandera import DataFrameSchema, Column, Check, check_io
df = pd.DataFrame({
"column1": [1, 4, 0, 10, 9],
"column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
})
in_schema = DataFrameSchema({
"column1": Column(int),
"column2": Column(float),
})
out_schema = in_schema.add_columns({"column3": Column(float)})
@pa.check_io(df1=in_schema, df2=in_schema, out=out_schema)
def preprocessor(df1, df2):
return (df1 + df2).assign(column3=lambda x: x.column1 + x.column2)
preprocessed_df = preprocessor(df, df)
print(preprocessed_df)
column1 column2 column3
0 2 -2.6 -0.6
1 8 -2.8 5.2
2 0 -5.8 -5.8
3 20 -20.2 -0.2
4 18 -40.8 -22.8
Decorate Functions and Coroutines#
All pandera decorators work on synchronous as well as asynchronous code, on both bound and unbound functions/coroutines. For example, one can use the same decorators on:
sync/async functions
sync/async methods
sync/async class methods
sync/async static methods
All decorators work on sync/async regular/class/static methods of metaclasses as well.
import pandera as pa
from pandera.typing import DataFrame, Series
class Schema(pa.SchemaModel):
col1: Series[int]
class Config:
strict = True
@pa.check_types
async def coroutine(df: DataFrame[Schema]) -> DataFrame[Schema]:
return df
@pa.check_types
async def function(df: DataFrame[Schema]) -> DataFrame[Schema]:
return df
class SomeClass:
@pa.check_output(Schema.to_schema())
async def regular_coroutine(self, df) -> DataFrame[Schema]:
return df
@classmethod
@pa.check_input(Schema.to_schema(), "df")
async def class_coroutine(cls, df):
return Schema.validate(df)
@staticmethod
@pa.check_io(df=Schema.to_schema(), out=Schema.to_schema())
def static_method(df):
return df
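As a minimal usage sketch (assuming the definitions above), awaiting the decorated coroutine validates its input and output dataframes just like a regular decorated function:
import asyncio

import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3]})

# running the coroutine triggers validation against Schema on the way in and out
validated = asyncio.run(coroutine(df))
print(validated)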
Schema Inference#
New in version 0.4.0
With simple use cases, writing a schema definition manually is pretty straight-forward with pandera. However, it can get tedious to do this with dataframes that have many columns of various data types.
To help you handle these cases, the infer_schema()
function enables
you to quickly infer a draft schema from a pandas dataframe or series. Below
is a simple example:
import pandas as pd
import pandera as pa
from pandera import Check, Column, DataFrameSchema
df = pd.DataFrame({
"column1": [5, 10, 20],
"column2": ["a", "b", "c"],
"column3": pd.to_datetime(["2010", "2011", "2012"]),
})
schema = pa.infer_schema(df)
print(schema)
<Schema DataFrameSchema(
columns={
'column1': <Schema Column(name=column1, type=DataType(int64))>
'column2': <Schema Column(name=column2, type=DataType(object))>
'column3': <Schema Column(name=column3, type=DataType(datetime64[ns]))>
},
checks=[],
coerce=True,
dtype=None,
index=<Schema Index(name=None, type=DataType(int64))>,
strict=False
name=None,
ordered=False,
unique_column_names=False
)>
These inferred schemas are rough drafts that shouldn’t be used for validation without modification. You can modify the inferred schema to obtain the schema definition that you’re satisfied with.
For DataFrameSchema objects, the following methods create modified copies of the schema (see the sketch after this list for an example):
add_columns()
remove_columns()
update_column()
For SeriesSchema objects:
set_checks()
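For example, here is a minimal sketch (assuming the inferred schema above) that tightens one column's checks and adds an optional column before using the schema for validation:
# refine the rough draft produced by infer_schema
refined_schema = schema.update_column(
    "column1", checks=pa.Check.in_range(0, 50)
).add_columns({"column4": pa.Column(float, required=False)})

print(refined_schema.columns["column1"].checks)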
The section below describes two workflows for persisting and modifying an inferred schema.
Schema Persistence#
The schema persistence feature requires a pandera installation with the io
extension. See the installation instructions for more
details.
There are a few ways of persisting schemas, inferred or otherwise: as a python script, as yaml, or as json.
Write to a Python script#
You can write your schema to a python script with to_script()
:
# supply a file-like object, Path, or str to write to a file. If not
# specified, to_script will output the code as a string.
schema_script = schema.to_script()
print(schema_script)
from pandas import Timestamp
from pandera import DataFrameSchema, Column, Check, Index, MultiIndex
schema = DataFrameSchema(
columns={
"column1": Column(
dtype=pandera.engines.numpy_engine.Int64,
checks=[
Check.greater_than_or_equal_to(min_value=5.0),
Check.less_than_or_equal_to(max_value=20.0),
],
nullable=False,
unique=False,
coerce=False,
required=True,
regex=False,
description=None,
title=None,
),
"column2": Column(
dtype=pandera.engines.numpy_engine.Object,
checks=None,
nullable=False,
unique=False,
coerce=False,
required=True,
regex=False,
description=None,
title=None,
),
"column3": Column(
dtype=pandera.engines.pandas_engine.DateTime,
checks=[
Check.greater_than_or_equal_to(
min_value=Timestamp("2010-01-01 00:00:00")
),
Check.less_than_or_equal_to(
max_value=Timestamp("2012-01-01 00:00:00")
),
],
nullable=False,
unique=False,
coerce=False,
required=True,
regex=False,
description=None,
title=None,
),
},
index=Index(
dtype=pandera.engines.numpy_engine.Int64,
checks=[
Check.greater_than_or_equal_to(min_value=0.0),
Check.less_than_or_equal_to(max_value=2.0),
],
nullable=False,
coerce=False,
name=None,
description=None,
title=None,
),
coerce=True,
strict=False,
name=None,
)
As a python script, you can iterate on an inferred schema and use it to validate data once you are satisfied with your schema definition.
Write to YAML#
You can also write the schema object to a yaml file with to_yaml(), and you can then read it back into memory with from_yaml(). The to_yaml() and from_yaml() methods are convenience wrappers for this functionality.
# supply a file-like object, Path, or str to write to a file. If not
# specified, to_yaml will output a yaml string.
yaml_schema = schema.to_yaml()
print(yaml_schema.replace(f"{pa.__version__}", "{PANDERA_VERSION}"))
schema_type: dataframe
version: {PANDERA_VERSION}
columns:
column1:
title: null
description: null
dtype: int64
nullable: false
checks:
greater_than_or_equal_to: 5.0
less_than_or_equal_to: 20.0
unique: false
coerce: false
required: true
regex: false
column2:
title: null
description: null
dtype: object
nullable: false
checks: null
unique: false
coerce: false
required: true
regex: false
column3:
title: null
description: null
dtype: datetime64[ns]
nullable: false
checks:
greater_than_or_equal_to: '2010-01-01 00:00:00'
less_than_or_equal_to: '2012-01-01 00:00:00'
unique: false
coerce: false
required: true
regex: false
checks: null
index:
- title: null
description: null
dtype: int64
nullable: false
checks:
greater_than_or_equal_to: 0.0
less_than_or_equal_to: 2.0
name: null
unique: false
coerce: false
coerce: true
strict: false
unique: null
ordered: false
You can edit this yaml file to modify the schema. For example, you can specify new column names under the columns key, and the respective values map onto key-word arguments in the Column class.
Note
Currently, only built-in Check
methods are supported under the
checks
key.
Write to JSON#
Finally, you can also write the schema object to a json file with to_json(), and you can then read it back into memory with from_json(). The to_json() and from_json() methods are convenience wrappers for this functionality.
# supply a file-like object, Path, or str to write to a file. If not
# specified, to_json will output a json string.
json_schema = schema.to_json(indent=4)
print(json_schema.replace(f"{pa.__version__}", "{PANDERA_VERSION}"))
{
"schema_type": "dataframe",
"version": "{PANDERA_VERSION}",
"columns": {
"column1": {
"title": null,
"description": null,
"dtype": "int64",
"nullable": false,
"checks": {
"greater_than_or_equal_to": 5.0,
"less_than_or_equal_to": 20.0
},
"unique": false,
"coerce": false,
"required": true,
"regex": false
},
"column2": {
"title": null,
"description": null,
"dtype": "object",
"nullable": false,
"checks": null,
"unique": false,
"coerce": false,
"required": true,
"regex": false
},
"column3": {
"title": null,
"description": null,
"dtype": "datetime64[ns]",
"nullable": false,
"checks": {
"greater_than_or_equal_to": "2010-01-01 00:00:00",
"less_than_or_equal_to": "2012-01-01 00:00:00"
},
"unique": false,
"coerce": false,
"required": true,
"regex": false
}
},
"checks": null,
"index": [
{
"title": null,
"description": null,
"dtype": "int64",
"nullable": false,
"checks": {
"greater_than_or_equal_to": 0.0,
"less_than_or_equal_to": 2.0
},
"name": null,
"unique": false,
"coerce": false
}
],
"coerce": true,
"strict": false,
"unique": null,
"ordered": false
}
You can edit this json file to update the schema as needed, and then load
it back into a pandera schema object with from_json().
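As a minimal round-trip sketch (assuming the schema object above), you can write the schema to a json file and load it back with from_json():
import tempfile

from pandera import io

# write the schema to a json file, then load it back into a schema object
with tempfile.TemporaryDirectory() as tmpdir:
    json_path = f"{tmpdir}/schema.json"
    schema.to_json(json_path)
    schema_from_json = io.from_json(json_path)

print(schema_from_json)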
Lazy Validation#
New in version 0.4.0
By default, when you call the validate
method on schema or schema component
objects, a SchemaError
is raised as soon as one of the
assumptions specified in the schema is falsified. For example, for a
DataFrameSchema
object, the following situations will raise an
exception:
a column specified in the schema is not present in the dataframe.
if strict=True, a column in the dataframe is not specified in the schema.
the data type does not match.
if coerce=True, the dataframe column cannot be coerced into the specified data type.
the Check specified in one of the columns returns False or a boolean series containing at least one False value.
For example:
import pandas as pd
import pandera as pa
from pandera import Check, Column, DataFrameSchema
df = pd.DataFrame({"column": ["a", "b", "c"]})
schema = pa.DataFrameSchema({"column": Column(int)})
schema.validate(df)
Traceback (most recent call last):
...
SchemaError: expected series 'column' to have type int64, got object
For more complex cases, it is useful to see all of the errors raised during
the validate
call so that you can debug the causes of errors on different
columns and checks. The lazy
keyword argument in the validate
method
of all schemas and schema components gives you the option of doing just this:
import pandas as pd
import pandera as pa
from pandera import Check, Column, DataFrameSchema
schema = pa.DataFrameSchema(
columns={
"int_column": Column(int),
"float_column": Column(float, Check.greater_than(0)),
"str_column": Column(str, Check.equal_to("a")),
"date_column": Column(pa.DateTime),
},
strict=True
)
df = pd.DataFrame({
"int_column": ["a", "b", "c"],
"float_column": [0, 1, 2],
"str_column": ["a", "b", "d"],
"unknown_column": None,
})
schema.validate(df, lazy=True)
Traceback (most recent call last):
...
pandera.errors.SchemaErrors: A total of 5 schema errors were found.
Error Counts
------------
- column_not_in_schema: 1
- column_not_in_dataframe: 1
- schema_component_check: 3
Schema Error Summary
--------------------
failure_cases n_failure_cases
schema_context column check
DataFrameSchema <NA> column_in_dataframe [date_column] 1
column_in_schema [unknown_column] 1
Column float_column dtype('float64') [int64] 1
int_column dtype('int64') [object] 1
str_column equal_to(a) [b, d] 2
Usage Tip
---------
Directly inspect all errors by catching the exception:
```
try:
schema.validate(dataframe, lazy=True)
except SchemaErrors as err:
err.failure_cases # dataframe of schema errors
err.data # invalid dataframe
```
As you can see from the output above, a SchemaErrors
exception is raised with a summary of the error counts and failure cases
caught by the schema. You can also see from the Usage Tip that you can
catch these errors and inspect the failure cases in a more granular form:
try:
schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
print("Schema errors and failure cases:")
print(err.failure_cases)
print("\nDataFrame object that failed validation:")
print(err.data)
Schema errors and failure cases:
schema_context column check check_number \
0 DataFrameSchema None column_in_schema None
1 DataFrameSchema None column_in_dataframe None
2 Column int_column dtype('int64') None
3 Column float_column dtype('float64') None
4 Column float_column greater_than(0) 0
5 Column str_column equal_to(a) 0
6 Column str_column equal_to(a) 0
failure_case index
0 unknown_column None
1 date_column None
2 object None
3 int64 None
4 0 0
5 b 1
6 d 2
DataFrame object that failed validation:
int_column float_column str_column unknown_column
0 a 0 a None
1 b 1 b None
2 c 2 d None
Data Synthesis Strategies#
new in 0.6.0
pandera
provides a utility for generating synthetic data purely from
pandera schema or schema component objects. Under the hood, the schema metadata
is collected to create a data-generating strategy using
hypothesis, which is a
property-based testing library.
Basic Usage#
Once you’ve defined a schema, it’s easy to generate examples:
import pandera as pa
schema = pa.DataFrameSchema(
{
"column1": pa.Column(int, pa.Check.eq(10)),
"column2": pa.Column(float, pa.Check.eq(0.25)),
"column3": pa.Column(str, pa.Check.eq("foo")),
}
)
print(schema.example(size=3))
column1 column2 column3
0 10 0.25 foo
1 10 0.25 foo
2 10 0.25 foo
Note that here we’ve constrained the specific values in each column using
Check
s in order to make the data generation process
deterministic for documentation purposes.
Usage in Unit Tests#
The example
method is available for all schemas and schema components, and
is primarily meant to be used interactively. It could be used in a script to
generate test cases, but hypothesis
recommends against doing this and
instead using the strategy
method to create a hypothesis
strategy
that can be used in pytest
unit tests.
import hypothesis
def processing_fn(df):
return df.assign(column4=df.column1 * df.column2)
@hypothesis.given(schema.strategy(size=5))
def test_processing_fn(dataframe):
result = processing_fn(dataframe)
assert "column4" in result
The above example is trivial, but you get the idea! Schema objects can create
a strategy
that can then be collected by a pytest
runner. We could also run the tests explicitly ourselves, or run them as a
unittest.TestCase
. For more information on testing with hypothesis, see the
hypothesis quick start guide.
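For instance, here is a minimal sketch (assuming the schema and processing_fn above) of wrapping the same property-based test in a unittest.TestCase, and of invoking the hypothesis-decorated test directly:
import unittest

class TestProcessingFn(unittest.TestCase):
    @hypothesis.given(schema.strategy(size=5))
    def test_processing_fn(self, dataframe):
        result = processing_fn(dataframe)
        assert "column4" in result

# hypothesis-decorated tests are plain callables, so they can also be run
# directly without a test runner:
test_processing_fn()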
A more practical example involves using
schema transformations. We can modify
the function above to make sure that processing_fn
actually outputs the
correct result:
out_schema = schema.add_columns({"column4": pa.Column(float)})
@pa.check_output(out_schema)
def processing_fn(df):
return df.assign(column4=df.column1 * df.column2)
@hypothesis.given(schema.strategy(size=5))
def test_processing_fn(dataframe):
processing_fn(dataframe)
Now the test_processing_fn
simply becomes an execution test, raising a
SchemaError
if processing_fn
doesn’t add
column4
to the dataframe.
Strategies and Examples from Schema Models#
You can also use the class-based API to generate examples. Here’s the equivalent schema model for the above examples:
from pandera.typing import Series, DataFrame
class InSchema(pa.SchemaModel):
column1: Series[int] = pa.Field(eq=10)
column2: Series[float] = pa.Field(eq=0.25)
column3: Series[str] = pa.Field(eq="foo")
class OutSchema(InSchema):
column4: Series[float]
@pa.check_types
def processing_fn(df: DataFrame[InSchema]) -> DataFrame[OutSchema]:
return df.assign(column4=df.column1 * df.column2)
@hypothesis.given(InSchema.strategy(size=5))
def test_processing_fn(dataframe):
processing_fn(dataframe)
Checks as Constraints#
As you may have noticed in the first example, Check
s
further constrain the data synthesized from a strategy. Without checks, the
example
method would simply generate any value of the specified type. You
can specify multiple checks on a column and pandera
should be able to
generate valid data under those constraints.
schema_multiple_checks = pa.DataFrameSchema({
"column1": pa.Column(
float, checks=[
pa.Check.gt(0),
pa.Check.lt(1e10),
pa.Check.notin([-100, -10, 0]),
]
)
})
for _ in range(5):
# generate 3 rows of the dataframe
sample_data = schema_multiple_checks.example(size=3)
# validate the sampled data
schema_multiple_checks(sample_data)
One caveat here is that it’s up to you to define a set of checks that are
jointly satisfiable. If not, an Unsatisfiable
exception will be raised:
schema_multiple_checks = pa.DataFrameSchema({
"column1": pa.Column(
float, checks=[
# nonsensical constraints
pa.Check.gt(0),
pa.Check.lt(-10),
]
)
})
schema_multiple_checks.example(size=3)
Traceback (most recent call last):
...
Unsatisfiable: Unable to satisfy assumptions of hypothesis example_generating_inner_function.
Check Strategy Chaining#
If you specify multiple checks for a particular column, this is what happens under the hood:
The first check in the list is the base strategy, which hypothesis uses to generate data.
All subsequent checks filter the values generated by the previous strategy so that they fulfill the constraints of the current check.
To optimize efficiency of the data-generation procedure, make sure to specify the most restrictive constraint of a column as the base strategy and build other constraints on top of it.
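For example, here is a minimal sketch (requiring the strategies extra) where the first, most restrictive check drives data generation and the second check only filters the generated values:
import pandera as pa

schema_chained = pa.DataFrameSchema({
    "col": pa.Column(
        int,
        checks=[
            # base strategy: generates integers in [0, 10]
            pa.Check.in_range(0, 10),
            # chained strategy: filters out the value 5
            pa.Check.ne(5),
        ],
    )
})
print(schema_chained.example(size=3))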
In-line Custom Checks#
One of the strengths of pandera
is its flexibility with regard to defining
custom checks on the fly:
schema_inline_check = pa.DataFrameSchema({
"col": pa.Column(str, pa.Check(lambda s: s.isin({"foo", "bar"})))
})
One of the disadvantages of this is that the fallback strategy is to simply
apply the check to the generated data, which can be highly inefficient. In this
case, hypothesis
will generate strings and try to find examples of strings
that are in the set {"foo", "bar"}
, which will be very slow and most likely
raise an Unsatisfiable
exception. To get around this limitation, you can
register custom checks and define strategies that correspond to them.
Defining Custom Strategies#
All built-in Check
s are associated with a data
synthesis strategy. You can define your own data synthesis strategies by using
the extensions API to register a custom check function with
a corresponding strategy.
Extensions#
new in 0.6.0
Registering Custom Check Methods#
One of the strengths of pandera is its flexibility in enabling you to define in-line custom checks on the fly:
import pandera as pa
# checks elements in a column/dataframe
element_wise_check = pa.Check(lambda x: x < 0, element_wise=True)
# applies the check function to a dataframe/series
vectorized_check = pa.Check(lambda series_or_df: series_or_df < 0)
However, there are two main disadvantages of schemas with inline custom checks:
they are not serializable with the IO interface.
you can’t use them to synthesize data because the checks are not associated with a
hypothesis
strategy.
pandera
now offers a way to register custom checks so that they’re
available in the Check
class as a check method. Here
let’s define a custom method that checks whether a pandas object contains
elements that lie within two values.
import pandera as pa
import pandera.extensions as extensions
import pandas as pd
@extensions.register_check_method(statistics=["min_value", "max_value"])
def is_between(pandas_obj, *, min_value, max_value):
return (min_value <= pandas_obj) & (pandas_obj <= max_value)
schema = pa.DataFrameSchema({
"col": pa.Column(int, pa.Check.is_between(min_value=1, max_value=10))
})
data = pd.DataFrame({"col": [1, 5, 10]})
print(schema(data))
col
0 1
1 5
2 10
As you can see, a custom check’s first argument is a pandas series or dataframe
by default (more on that later), followed by keyword-only arguments, specified
with the *
syntax.
The register_check_method() decorator requires you to explicitly name the check statistics via the statistics keyword argument; these are essentially the constraints placed by the check on the pandas data structure.
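Registered checks behave like built-in checks during validation; here is a minimal sketch (assuming the schema defined above) showing how an out-of-range value surfaces as a failure case:
bad_data = pd.DataFrame({"col": [1, 5, 100]})

try:
    schema(bad_data)
except pa.errors.SchemaError as exc:
    # 100 falls outside [1, 10], so it shows up in the failure cases
    print(exc.failure_cases)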
Specifying a Check Strategy#
To specify a check strategy with your custom check, you’ll need to install the strategies extension. First let’s look at a trivially simple example, where the check verifies whether a column is equal to a certain value:
def custom_equals(pandas_obj, *, value):
return pandas_obj == value
The corresponding strategy for this check would be:
from typing import Optional
import hypothesis
import pandera.strategies as st
def equals_strategy(
pandera_dtype: pa.DataType,
strategy: Optional[st.SearchStrategy] = None,
*,
value,
):
if strategy is None:
return st.pandas_dtype_strategy(
pandera_dtype, strategy=hypothesis.strategies.just(value),
)
return strategy.filter(lambda x: x == value)
As you may notice, the pandera strategy interface has two positional arguments followed by keyword-only arguments that match the check function's keyword-only
check statistics. The pandera_dtype
positional argument is useful for
ensuring the correct data type. In the above example, we’re using the
pandas_dtype_strategy()
strategy to make sure the
generated value
is of the correct data type.
The optional strategy
argument allows us to use the check strategy as a
base strategy or a chained strategy. There’s a detail that we’re
responsible for implementing in the strategy function body: we need to handle
two cases to account for strategy chaining:
when the strategy function is being used as a base strategy, i.e. when strategy is None
when the strategy function is being chained from a previously-defined strategy, i.e. when strategy is not None.
Finally, to register the custom check with the strategy, use the
register_check_method()
decorator:
@extensions.register_check_method(
statistics=["value"], strategy=equals_strategy
)
def custom_equals(pandas_obj, *, value):
return pandas_obj == value
Let’s unpack what’s going on here. The custom_equals function only has a single statistic, the value argument, which we’ve also specified in register_check_method(). This means that the associated check strategy must match its keyword-only arguments.
Going back to our is_between
function example, here’s what the strategy
would look like:
def in_between_strategy(
pandera_dtype: pa.DataType,
strategy: Optional[st.SearchStrategy] = None,
*,
min_value,
max_value
):
if strategy is None:
return st.pandas_dtype_strategy(
pandera_dtype,
min_value=min_value,
max_value=max_value,
exclude_min=False,
exclude_max=False,
)
return strategy.filter(lambda x: min_value <= x <= max_value)
@extensions.register_check_method(
statistics=["min_value", "max_value"],
strategy=in_between_strategy,
)
def is_between_with_strat(pandas_obj, *, min_value, max_value):
return (min_value <= pandas_obj) & (pandas_obj <= max_value)
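With the strategy registered, data synthesis works for this check too; a minimal sketch (requiring the strategies extra):
schema_with_strat = pa.DataFrameSchema({
    "col": pa.Column(
        float,
        pa.Check.is_between_with_strat(min_value=0, max_value=1),
    )
})
# values are drawn directly from the registered base strategy
print(schema_with_strat.example(size=3))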
Check Types#
The extensions module also supports registering element-wise and groupby checks.
Element-wise Checks#
@extensions.register_check_method(
statistics=["val"],
check_type="element_wise",
)
def element_wise_equal_check(element, *, val):
return element == val
Note that the first argument of element_wise_equal_check
is a single
element in the column or dataframe.
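A minimal usage sketch (assuming the imports above): the registered check is applied to every element of the column individually:
element_schema = pa.DataFrameSchema({
    "col": pa.Column(int, pa.Check.element_wise_equal_check(val=10))
})
print(element_schema(pd.DataFrame({"col": [10, 10, 10]})))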
Groupby Checks#
In this groupby check, we’re verifying that the values of one column for
group_a
are, on average, greater than those of group_b
:
from typing import Dict
@extensions.register_check_method(
statistics=["group_a", "group_b"],
check_type="groupby",
)
def groupby_check(dict_groups: Dict[str, pd.Series], *, group_a, group_b):
return dict_groups[group_a].mean() > dict_groups[group_b].mean()
data = pd.DataFrame({
"values": [20, 10, 1, 15],
"groups": list("xxyy"),
})
schema = pa.DataFrameSchema({
"values": pa.Column(
int,
pa.Check.groupby_check(group_a="x", group_b="y", groupby="groups"),
),
"groups": pa.Column(str),
})
print(schema(data))
values groups
0 20 x
1 10 x
2 1 y
3 15 y
Registered Custom Checks with the Class-based API#
Since registered checks are part of the Check
namespace,
you can also use custom checks with the class-based API:
from pandera.typing import Series
class Schema(pa.SchemaModel):
col1: Series[str] = pa.Field(custom_equals="value")
col2: Series[int] = pa.Field(is_between={"min_value": 0, "max_value": 10})
data = pd.DataFrame({
"col1": ["value"] * 5,
"col2": range(5)
})
print(Schema.validate(data))
col1 col2
0 value 0
1 value 1
2 value 2
3 value 3
4 value 4
DataFrame checks can be attached by using the Config class. Any field names that do not conflict with existing fields of BaseConfig and do not start with an underscore (_) are interpreted as names of registered checks. If the value is a tuple or dict, it is interpreted as the positional or keyword arguments of the check; otherwise it is interpreted as the first argument to the check.
For example, to register dataframe checks with zero, one, and two statistics, one could do the following:
import pandera as pa
import pandera.extensions as extensions
import numpy as np
import pandas as pd
@extensions.register_check_method()
def is_small(df):
return sum(df.shape) < 1000
@extensions.register_check_method(statistics=["fraction"])
def total_missing_fraction_less_than(df, *, fraction: float):
return (1 - df.count().sum().item() / df.apply(len).sum().item()) < fraction
@extensions.register_check_method(statistics=["col_a", "col_b"])
def col_mean_a_greater_than_b(df, *, col_a: str, col_b: str):
return df[col_a].mean() > df[col_b].mean()
from pandera.typing import Series
class Schema(pa.SchemaModel):
col1: Series[float] = pa.Field(nullable=True, ignore_na=False)
col2: Series[float] = pa.Field(nullable=True, ignore_na=False)
class Config:
is_small = ()
total_missing_fraction_less_than = 0.6
col_mean_a_greater_than_b = {"col_a": "col2", "col_b": "col1"}
data = pd.DataFrame({
"col1": [float('nan')] * 3 + [0.5, 0.3, 0.1],
"col2": np.arange(6.),
})
print(Schema.validate(data))
col1 col2
0 NaN 0.0
1 NaN 1.0
2 NaN 2.0
3 0.5 3.0
4 0.3 4.0
5 0.1 5.0
Data Format Conversion#
new in 0.9.0
The class-based API provides configuration options for converting data to/from
supported serialization formats in the context of
check_types()
-decorated functions.
Note
Currently, pandera.typing.pandas.DataFrame
is the only data
type that supports this feature.
Consider this simple example:
import pandera as pa
from pandera.typing import DataFrame, Series
class InSchema(pa.SchemaModel):
str_col: Series[str] = pa.Field(unique=True, isin=[*"abcd"])
int_col: Series[int]
class OutSchema(InSchema):
float_col: pa.typing.Series[float]
@pa.check_types
def transform(df: DataFrame[InSchema]) -> DataFrame[OutSchema]:
return df.assign(float_col=1.1)
With the schema type annotations and
check_types()
decorator, the transform
function validates DataFrame inputs and outputs according to the InSchema
and OutSchema
definitions.
But what if your input data is serialized in parquet format, and you want to
read it into memory, validate the DataFrame, and then pass it to a downstream
function for further analysis? Similarly, what if you want the output of
transform
to be a list of dictionary records instead of a pandas DataFrame?
The to/from_format
Configuration Options#
To easily fulfill the use cases described above, you can implement the
read/write logic by hand, or you can configure schemas to do so. We can first
define a subclass of InSchema
with additional configuration so that our
transform
function can read data directly from parquet files or buffers:
class InSchemaParquet(InSchema):
class Config:
from_format = "parquet"
Then, we define a subclass of OutSchema
to specify that transform
should output a list of dictionaries representing the rows of the output
dataframe.
class OutSchemaDict(OutSchema):
class Config:
to_format = "dict"
to_format_kwargs = {"orient": "records"}
Note that the {to/from}_format_kwargs
configuration option should be
supplied with a dictionary of key-word arguments to be passed into the
respective pandas to_{format}
method.
Finally, we redefine our transform
function:
@pa.check_types
def transform(df: DataFrame[InSchemaParquet]) -> DataFrame[OutSchemaDict]:
return df.assign(float_col=1.1)
We can test this out using a buffer to store the parquet file.
Note
A string or path-like object representing the filepath to a parquet file
would also be a valid input to transform
.
import io
import json

import pandas as pd
buffer = io.BytesIO()
data = pd.DataFrame({"str_col": [*"abc"], "int_col": range(3)})
data.to_parquet(buffer)
buffer.seek(0)
dict_output = transform(buffer)
print(json.dumps(dict_output, indent=4))
[
{
"str_col": "a",
"int_col": 0,
"float_col": 1.1
},
{
"str_col": "b",
"int_col": 1,
"float_col": 1.1
},
{
"str_col": "c",
"int_col": 2,
"float_col": 1.1
}
]
Takeaway#
Data Format Conversion using the {to/from}_format
configuration option
can modify the behavior of check_types()
-decorated
functions to convert input data from a particular serialization format into
a dataframe. Additionally, you can convert the output data from a dataframe to
potentially another format.
This dovetails well with the FastAPI Integration for validating the inputs and outputs of app endpoints.
Supported DataFrame Libraries#
Pandera started out as a pandas-specific dataframe validation library, and moving forward its core functionality will continue to support pandas. However, pandera’s adoption has resulted in the realization that it can be a much more powerful tool by supporting other dataframe-like formats.
Domain-specific Data Validation#
The pandas ecosystem provides support for domain-specific data manipulation, and by extension pandera can provide access to data types, methods, and data container types specific to these libraries.
GeoPandas: an extension of pandas that adds geospatial data processing capabilities.
Data Validation with GeoPandas#
new in 0.9.0
GeoPandas is an extension of Pandas that adds
support for geospatial data. You can use pandera to validate GeoDataFrame()
and GeoSeries()
objects directly. First, install
pandera
with the geopandas
extra:
pip install pandera[geopandas]
Then you can use pandera schemas to validate geodataframes. In the example below we’ll use the object-based API to define a DataFrameSchema for validation.
import geopandas as gpd
import pandas as pd
import pandera as pa
from shapely.geometry import Polygon
geo_schema = pa.DataFrameSchema({
"geometry": pa.Column("geometry"),
"region": pa.Column(str),
})
geo_df = gpd.GeoDataFrame({
"geometry": [
Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0)))
],
"region": ["NA", "SA"]
})
print(geo_schema.validate(geo_df))
geometry region
0 POLYGON ((0.00000 0.00000, 0.00000 1.00000, 1.... NA
1 POLYGON ((0.00000 0.00000, 0.00000 -1.00000, -... SA
You can also use the GeometryDtype
data type in either instantiated or
un-instantiated form:
geo_schema = pa.DataFrameSchema({
"geometry": pa.Column(gpd.array.GeometryDtype),
# or
"geometry": pa.Column(gpd.array.GeometryDtype()),
})
If you want to validate-on-instantiation, you can use the GeoDataFrame generic type with a schema model, defined below using the class-based API:
from pandera.typing import Series
from pandera.typing.geopandas import GeoDataFrame, GeoSeries
class Schema(pa.SchemaModel):
geometry: GeoSeries
region: Series[str]
# create a geodataframe that's validated on object initialization
df = GeoDataFrame[Schema](
{
'geometry': [
Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0)))
],
'region': ['NA','SA']
}
)
print(df)
geometry region
0 POLYGON ((0.00000 0.00000, 0.00000 1.00000, 1.... NA
1 POLYGON ((0.00000 0.00000, 0.00000 -1.00000, -... SA
Scaling Up Data Validation#
Pandera provides multiple ways of scaling up data validation to dataframes that don’t fit into memory. Fortunately, pandera doesn’t have to re-invent the wheel. Standing on shoulders of giants, it integrates with the existing ecosystem of libraries that allow you to perform validations on out-of-memory dataframes.
Dask: apply pandera schemas to Dask dataframe partitions.
Fugue: apply pandera schemas to distributed dataframe partitions with Fugue.
Koalas [Deprecated]: a pandas drop-in replacement, distributed using a Spark backend.
Pyspark Pandas: exposes a pandas drop-in replacement dataframe API, distributed with a Spark backend.
Modin: a pandas drop-in replacement, distributed using a Ray or Dask backend.
Data Validation with Dask#
new in 0.8.0
Dask is a distributed
compute framework that offers a pandas-like dataframe API.
You can use pandera to validate DataFrame()
and Series()
objects directly. First, install
pandera
with the dask
extra:
pip install pandera[dask]
Then you can use pandera schemas to validate dask dataframes. In the example
below we’ll use the class-based API to define a
SchemaModel
for validation.
import dask.dataframe as dd
import pandas as pd
import pandera as pa
from pandera.typing.dask import DataFrame, Series
class Schema(pa.SchemaModel):
state: Series[str]
city: Series[str]
price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})
ddf = dd.from_pandas(
pd.DataFrame(
{
'state': ['FL','FL','FL','CA','CA','CA'],
'city': [
'Orlando',
'Miami',
'Tampa',
'San Francisco',
'Los Angeles',
'San Diego',
],
'price': [8, 12, 10, 16, 20, 18],
}
),
npartitions=2
)
pandera_ddf = Schema(ddf)
print(pandera_ddf)
Dask DataFrame Structure:
state city price
npartitions=2
0 object object int64
3 ... ... ...
5 ... ... ...
Dask Name: validate, 2 graph layers
As you can see, passing the dask dataframe into Schema
will produce
another dask dataframe which hasn’t been evaluated yet. What this means is
that pandera will only validate when the dask graph is evaluated.
print(pandera_ddf.compute())
state city price
0 FL Orlando 8
1 FL Miami 12
2 FL Tampa 10
3 CA San Francisco 16
4 CA Los Angeles 20
5 CA San Diego 18
You can also use the check_types()
decorator to validate
dask dataframes at runtime:
@pa.check_types
def function(ddf: DataFrame[Schema]) -> DataFrame[Schema]:
return ddf[ddf["state"] == "CA"]
print(function(ddf).compute())
state city price
3 CA San Francisco 16
4 CA Los Angeles 20
5 CA San Diego 18
And of course, you can use the object-based API to validate dask dataframes:
schema = pa.DataFrameSchema({
"state": pa.Column(str),
"city": pa.Column(str),
"price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20))
})
print(schema(ddf).compute())
state city price
0 FL Orlando 8
1 FL Miami 12
2 FL Tampa 10
3 CA San Francisco 16
4 CA Los Angeles 20
5 CA San Diego 18
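Because validation is deferred, errors only surface once the graph is computed; here is a minimal sketch (assuming the Schema and imports above; depending on how validation is invoked the exception may be a SchemaError or SchemaErrors):
invalid_ddf = dd.from_pandas(
    pd.DataFrame({"state": ["FL"], "city": ["Orlando"], "price": [100]}),
    npartitions=1,
)

lazy_result = Schema(invalid_ddf)  # no error raised yet
try:
    lazy_result.compute()  # validation happens here
except (pa.errors.SchemaError, pa.errors.SchemaErrors) as exc:
    print(exc)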
Data Validation with Fugue#
Validation on big data comes in two forms. The first is performing one set of
validations on data that doesn’t fit in memory. The second happens when a large dataset
is composed of multiple groups that require different validations. In pandas semantics,
this would be the equivalent of a groupby-validate
operation. This section will cover
using pandera
for both of these scenarios.
Pandera
has support for Spark
and Dask
DataFrames through Modin
and
PySpark Pandas
. Another option for running pandera
on top of native Spark
or Dask
engines is Fugue . Fugue
is
an open source abstraction layer that ports Python
, pandas
, and SQL
code to
Spark
and Dask
. Operations will be applied on DataFrames natively, minimizing
overhead.
What is Fugue?#
Fugue
serves as an interface to distributed computing. Because of its non-invasive design,
existing Python
code can be scaled to a distributed setting without significant changes.
To run the example, Fugue
needs to be installed separately. Using pip:
pip install fugue[spark]
This will also install PySpark
because of the spark
extra. Dask
is available
with the dask
extra.
Example#
In this example, a pandas DataFrame
is created with state
, city
and price
columns. Pandera
will be used to validate that the price
column values are within
a certain range.
import pandas as pd
data = pd.DataFrame(
{
'state': ['FL','FL','FL','CA','CA','CA'],
'city': [
'Orlando', 'Miami', 'Tampa', 'San Francisco', 'Los Angeles', 'San Diego'
],
'price': [8, 12, 10, 16, 20, 18],
}
)
print(data)
state city price
0 FL Orlando 8
1 FL Miami 12
2 FL Tampa 10
3 CA San Francisco 16
4 CA Los Angeles 20
5 CA San Diego 18
Validation is then applied using pandera. A price_validation
function is
created that runs the validation. None of this will be new.
from pandera import Column, DataFrameSchema, Check
price_check = DataFrameSchema(
{"price": Column(int, Check.in_range(min_value=5,max_value=20))}
)
def price_validation(data:pd.DataFrame) -> pd.DataFrame:
return price_check.validate(data)
The transform
function in Fugue
is the easiest way to use Fugue
with existing Python
functions as seen in the following code snippet. The first two arguments are the DataFrame
and
function to apply. The keyword argument schema
is required because schema is strictly enforced
in distributed settings. Here, the schema
is simply * because no new columns are added.
The last part of the transform
function is the engine
. Here, a SparkSession
object
is used to run the code on top of Spark
. For Dask, users can pass a string "dask"
or
can pass a Dask Client. Passing nothing uses the default pandas-based engine. Because we
passed a SparkSession in this example, the output is a Spark DataFrame.
from fugue import transform
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_df = transform(data, price_validation, schema="*", engine=spark)
spark_df.show()
+-----+-------------+-----+
|state| city|price|
+-----+-------------+-----+
| FL| Orlando| 8|
| FL| Miami| 12|
| FL| Tampa| 10|
| CA|San Francisco| 16|
| CA| Los Angeles| 20|
| CA| San Diego| 18|
+-----+-------------+-----+
Validation by Partition#
There is an interesting use case that arises with bigger datasets. Frequently, there are logical
groupings of data that require different validations. In the earlier sample data, the
price range for the records with state
FL is lower than the range for the state
CA.
Two DataFrameSchema
will be created to reflect this. Notice their ranges
for the Check
differ.
price_check_FL = DataFrameSchema({
"price": Column(int, Check.in_range(min_value=7,max_value=13)),
})
price_check_CA = DataFrameSchema({
"price": Column(int, Check.in_range(min_value=15,max_value=21)),
})
price_checks = {'CA': price_check_CA, 'FL': price_check_FL}
A slight modification is needed to our price_validation
function. Fugue
will partition
the whole dataset into multiple pandas DataFrames
. Think of this as a groupby
. By the time price_validation is called, the dataframe it receives only contains the data for one state. The appropriate
DataFrameSchema
is pulled and then applied.
To partition our data by state
, all we need to do is pass it into the transform
function
through the partition
argument. This splits up the data across different workers before they
each run the price_validation
function. Again, this is like a groupby-validation.
def price_validation(df:pd.DataFrame) -> pd.DataFrame:
location = df['state'].iloc[0]
check = price_checks[location]
check.validate(df)
return df
spark_df = transform(data,
price_validation,
schema="*",
partition=dict(by="state"),
engine=spark)
spark_df.show()
SparkDataFrame
state:str|city:str |price:long
---------+---------------------------------------------------------+----------
CA |San Francisco |16
CA |Los Angeles |20
CA |San Diego |18
FL |Orlando |8
FL |Miami |12
FL |Tampa |10
Total count: 6
Note
Because operations in a distributed setting are applied per partition, statistical
validators will be applied on each partition rather than the global dataset. If no
partitioning scheme is specified, Spark
and Dask
use default partitions. Be
careful about using operations like mean, min, and max without partitioning beforehand.
All row-wise validations scale well with this set-up.
Returning Errors#
Pandera
will raise a SchemaError
by default that gets buried by the Spark error
messages. To return the errors as a DataFrame, we can use the following approach. If
there are no errors in the data, it will just return an empty DataFrame.
To keep the errors for each partition, you can attach the partition key as a column in the returned DataFrame (see the sketch at the end of this section).
from pandera.errors import SchemaErrors
out_schema = "schema_context:str, column:str, check:str, \
check_number:int, failure_case:str, index:int"
out_columns = ["schema_context", "column", "check",
"check_number", "failure_case", "index"]
price_check = DataFrameSchema(
{"price": Column(int, Check.in_range(min_value=12,max_value=20))}
)
def price_validation(data:pd.DataFrame) -> pd.DataFrame:
try:
price_check.validate(data, lazy=True)
return pd.DataFrame(columns=out_columns)
except SchemaErrors as err:
return err.failure_cases
transform(data, price_validation, schema=out_schema, engine=spark).show()
+--------------+------+----------------+------------+------------+-----+
|schema_context|column| check|check_number|failure_case|index|
+--------------+------+----------------+------------+------------+-----+
| Column| price|in_range(12, 20)| 0| 8| 0|
| Column| price|in_range(12, 20)| 0| 10| 0|
+--------------+------+----------------+------------+------------+-----+
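Here is a minimal sketch (building on the example above) of attaching the partition key; note that the output fugue schema also needs the extra state:str column:
out_schema_keyed = out_schema + ", state:str"

def price_validation_keyed(data: pd.DataFrame) -> pd.DataFrame:
    state = data["state"].iloc[0]
    try:
        price_check.validate(data, lazy=True)
        return pd.DataFrame(columns=out_columns + ["state"])
    except SchemaErrors as err:
        # keep the partition key alongside the failure cases
        return err.failure_cases.assign(state=state)

transform(
    data,
    price_validation_keyed,
    schema=out_schema_keyed,
    partition=dict(by="state"),
    engine=spark,
).show()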
Data Validation with Koalas#
Note
Koalas has been deprecated since version 0.10.0. Please refer to the pyspark page for validating pyspark dataframes.
Data Validation with Pyspark ⭐️ (New)#
new in 0.10.0
Pyspark is a
distributed compute framework that offers a pandas drop-in replacement dataframe
implementation via the pyspark.pandas API .
You can use pandera to validate DataFrame()
and Series()
objects directly. First, install
pandera
with the pyspark
extra:
pip install pandera[pyspark]
Then you can use pandera schemas to validate pyspark dataframes. In the example
below we’ll use the class-based API to define a
SchemaModel
for validation.
import pyspark.pandas as ps
import pandas as pd
import pandera as pa
from pandera.typing.pyspark import DataFrame, Series
class Schema(pa.SchemaModel):
state: Series[str]
city: Series[str]
price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})
# create a pyspark.pandas dataframe that's validated on object initialization
df = DataFrame[Schema](
{
'state': ['FL','FL','FL','CA','CA','CA'],
'city': [
'Orlando',
'Miami',
'Tampa',
'San Francisco',
'Los Angeles',
'San Diego',
],
'price': [8, 12, 10, 16, 20, 18],
}
)
print(df)
state city price
0 FL Orlando 8
1 FL Miami 12
2 FL Tampa 10
3 CA San Francisco 16
4 CA Los Angeles 20
5 CA San Diego 18
You can also use the check_types()
decorator to validate
pyspark pandas dataframes at runtime:
@pa.check_types
def function(df: DataFrame[Schema]) -> DataFrame[Schema]:
return df[df["state"] == "CA"]
print(function(df))
state city price
3 CA San Francisco 16
4 CA Los Angeles 20
5 CA San Diego 18
And of course, you can use the object-based API to validate pyspark.pandas dataframes:
schema = pa.DataFrameSchema({
"state": pa.Column(str),
"city": pa.Column(str),
"price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20))
})
print(schema(df))
state city price
0 FL Orlando 8
1 FL Miami 12
2 FL Tampa 10
3 CA San Francisco 16
4 CA Los Angeles 20
5 CA San Diego 18
Data Validation with Modin#
new in 0.8.0
Modin is a distributed
compute framework that offers a pandas drop-in replacement dataframe
implementation. You can use pandera to validate DataFrame()
and Series()
objects directly. First, install
pandera
with the modin
extra:
pip install pandera[modin] # installs both ray and dask backends
pip install pandera[modin-ray] # only ray backend
pip install pandera[modin-dask] # only dask backend
Then you can use pandera schemas to validate modin dataframes. In the example
below we’ll use the class-based API to define a
SchemaModel
for validation.
import modin.pandas as pd
import pandera as pa
from pandera.typing.modin import DataFrame, Series
class Schema(pa.SchemaModel):
state: Series[str]
city: Series[str]
price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})
# create a modin dataframe that's validated on object initialization
df = DataFrame[Schema](
{
'state': ['FL','FL','FL','CA','CA','CA'],
'city': [
'Orlando',
'Miami',
'Tampa',
'San Francisco',
'Los Angeles',
'San Diego',
],
'price': [8, 12, 10, 16, 20, 18],
}
)
print(df)
state city price
0 FL Orlando 8
1 FL Miami 12
2 FL Tampa 10
3 CA San Francisco 16
4 CA Los Angeles 20
5 CA San Diego 18
You can also use the check_types()
decorator to validate
modin dataframes at runtime:
@pa.check_types
def function(df: DataFrame[Schema]) -> DataFrame[Schema]:
return df[df["state"] == "CA"]
print(function(df))
state city price
3 CA San Francisco 16
4 CA Los Angeles 20
5 CA San Diego 18
And of course, you can use the object-based API to validate modin dataframes:
schema = pa.DataFrameSchema({
"state": pa.Column(str),
"city": pa.Column(str),
"price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20))
})
print(schema(df))
state city price
0 FL Orlando 8
1 FL Miami 12
2 FL Tampa 10
3 CA San Francisco 16
4 CA Los Angeles 20
5 CA San Diego 18
Note
Don’t see a library that you want supported? Check out the github issues to see if that library is in the roadmap. If it isn’t, open up a new issue to add support for it!
Integrations#
Pandera ships with integrations with other tools in the Python ecosystem, with the goal of interoperating with libraries that you know and love.
FastAPI: use pandera SchemaModels in your FastAPI app.
Frictionless: convert frictionless schemas to pandera schemas.
Hypothesis: use the hypothesis library to generate valid data under your schema’s constraints.
Mypy: type-lint your pandas and pandera code with mypy for static type safety [experimental 🧪].
Pydantic: use pandera SchemaModels when defining your pydantic BaseModels.
FastAPI#
new in 0.9.0
Since both FastAPI and Pandera integrate seamlessly with Pydantic, you can
use the SchemaModel
types to validate incoming
or outgoing data with respect to your API endpoints.
Using SchemaModels to Validate Endpoint Inputs and Outputs#
Suppose we want to process transactions, where each transaction has an
id
and cost
. We can model this with a pandera schema model:
# pylint: skip-file
from typing import Optional
from pydantic import BaseModel, Field
import pandera as pa
class Transactions(pa.SchemaModel):
id: pa.typing.Series[int]
cost: pa.typing.Series[float] = pa.Field(ge=0, le=1000)
class Config:
coerce = True
Also suppose that we expect our endpoint to add a name
to the transaction
data:
class TransactionsOut(Transactions):
id: pa.typing.Series[int]
cost: pa.typing.Series[float]
name: pa.typing.Series[str]
Let’s also assume that the output of the endpoint should be a list of dictionary
records containing the named transactions data. We can do this easily with the
to_format
option in the schema model BaseConfig
.
class TransactionsDictOut(TransactionsOut):
class Config:
to_format = "dict"
to_format_kwargs = {"orient": "records"}
Note that the to_format_kwargs
is a dictionary of key-word arguments
to be passed into the respective pandas to_{format}
method.
Next we’ll create a FastAPI app and define a /transactions/
POST endpoint:
from fastapi import FastAPI, File
from pandera.typing import DataFrame
app = FastAPI()
@app.post("/transactions/", response_model=DataFrame[TransactionsDictOut])
def create_transactions(transactions: DataFrame[Transactions]):
output = transactions.assign(name="foo")
... # do other stuff, e.g. update backend database with transactions
return output
Reading File Uploads#
Similar to the TransactionsDictOut
example to convert dataframes to a
particular format as an endpoint response, pandera also provides a
from_format
schema model configuration option to read a dataframe from
a particular serialization format.
class TransactionsParquet(Transactions):
class Config:
from_format = "parquet"
Let’s also define a response model for the /file/
upload endpoint:
class TransactionsJsonOut(TransactionsOut):
class Config:
to_format = "json"
to_format_kwargs = {"orient": "records"}
class ResponseModel(BaseModel):
filename: str
df: pa.typing.DataFrame[TransactionsJsonOut]
In the next example, we use the pandera
UploadFile
type to upload a parquet file
to the /file/
POST endpoint and return a response containing the filename
and the modified data in json format.
from pandera.typing.fastapi import UploadFile
@app.post("/file/", response_model=ResponseModel)
def create_upload_file(
file: UploadFile[DataFrame[TransactionsParquet]] = File(...),
):
return {
"filename": file.filename,
"df": file.data.assign(name="foo"),
}
Pandera’s UploadFile
type is a subclass of FastAPI’s
UploadFile
but it exposes a .data
property containing the pandera-validated dataframe.
Takeaway#
With the FastAPI and Pandera integration, you can use Pandera
SchemaModel
types to validate the dataframe inputs
and outputs of your FastAPI endpoints.
Reading Third-Party Schema#
new in 0.7.0
Pandera now accepts schemas from other data validation frameworks. This requires
a pandera installation with the io
extension; please see the
installation instructions for more details.
Frictionless Data Schema#
Note
Please see the Frictionless schema documentation for more information on this standard.
- pandera.io.from_frictionless_schema(schema)[source]#
Create a DataFrameSchema from either a frictionless json/yaml schema file saved on disk, or from a frictionless schema already loaded into memory.
Each field from the frictionless schema will be converted to a pandera column specification using FrictionlessFieldParser to map field characteristics to pandera column specifications.
- Parameters
schema (Union[str, Path, Dict, Schema]) – the frictionless schema object (or a string/Path to the location on disk of a schema specification) to parse.
- Return type
DataFrameSchema
- Returns
dataframe schema with frictionless field specs converted to pandera column checks and constraints for use as normal.
- Example
Here, we’re defining a very basic frictionless schema in memory before parsing it and then querying the resulting DataFrameSchema object as per any other Pandera schema:
>>> from pandera.io import from_frictionless_schema
>>>
>>> FRICTIONLESS_SCHEMA = {
...     "fields": [
...         {
...             "name": "column_1",
...             "type": "integer",
...             "constraints": {"minimum": 10, "maximum": 99}
...         },
...         {
...             "name": "column_2",
...             "type": "string",
...             "constraints": {"maxLength": 10, "pattern": "\S+"}
...         },
...     ],
...     "primaryKey": "column_1"
... }
>>> schema = from_frictionless_schema(FRICTIONLESS_SCHEMA)
>>> schema.columns["column_1"].checks
[<Check in_range: in_range(10, 99)>]
>>> schema.columns["column_1"].required
True
>>> schema.columns["column_1"].unique
True
>>> schema.columns["column_2"].checks
[<Check str_length: str_length(None, 10)>, <Check str_matches: str_matches(re.compile('^\\S+$'))>]
Under the hood, this uses the FrictionlessFieldParser class to parse each frictionless field (column):
- class pandera.io.FrictionlessFieldParser(field, primary_keys)[source]#
Parses frictionless data schema field specifications so we can convert them to an equivalent Pandera Column schema.
For this implementation, we are using field names, constraints and types but leaving other frictionless parameters out (e.g. foreign keys, type formats, titles, descriptions).
- Parameters
field – a field object from a frictionless schema.
primary_keys – the primary keys from a frictionless schema. These are used to ensure primary key fields are treated properly - no duplicates, no missing values etc.
- property checks: Optional[Dict]#
Convert a set of frictionless schema field constraints into checks.
This parses the standard set of frictionless constraints which can be found here and maps them into the equivalent pandera checks.
- property coerce: bool#
Determine whether values within this field should be coerced.
This currently returns True for all fields within a frictionless schema.
- Return type
bool
- property dtype: str#
Determine what type of field this is, so we can feed that into DataType. If no type is specified in the frictionless schema, we default to string values.
- Return type
str
- Returns
the pandas-compatible representation of this field type as a string.
- property nullable: bool#
Determine whether this field can contain missing values.
If a field is a primary key, this will return False.
- Return type
bool
- property regex: bool#
Determine whether this field name should be used for regex matches.
This currently returns False for all fields within a frictionless schema.
- Return type
bool
Mypy#
new in 0.8.0
Pandera integrates with mypy to provide static type-linting of dataframes, relying on pandas-stubs for typing information.
pip install pandera[mypy]
Then enable the plugin in your mypy.ini
or setup.cfg
file:
[mypy]
plugins = pandera.mypy
Note
Mypy static type-linting is supported only for pandas dataframes.
Warning
This functionality is experimental 🧪. Since the pandas-stubs type stub annotations don’t always match the official pandas effort to support type annotations, installing the pandera[mypy] extra may yield false positives in your pandas code, many of which are documented in tests/mypy/modules.
We encourage beta users to file an issue
if they find any false positives or negatives being reported by mypy
.
A list of such issues can be found here.
In the example below, we define a few schemas to see how type-linting with pandera works.
from typing import cast
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
class Schema(pa.SchemaModel):
id: Series[int]
name: Series[str]
class SchemaOut(pa.SchemaModel):
age: Series[int]
class AnotherSchema(pa.SchemaModel):
id: Series[int]
first_name: Series[str]
The mypy linter will complain if the output type of the function body doesn’t match the function’s return signature.
def fn(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
return df.assign(age=30).pipe(DataFrame[SchemaOut]) # mypy okay
def fn_pipe_incorrect_type(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
return df.assign(age=30).pipe(DataFrame[AnotherSchema]) # mypy error
# error: Argument 1 to "pipe" of "NDFrame" has incompatible type "Type[DataFrame[Any]]"; # noqa
# expected "Union[Callable[..., DataFrame[SchemaOut]], Tuple[Callable[..., DataFrame[SchemaOut]], str]]" [arg-type] # noqa
def fn_assign_copy(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
return df.assign(age=30) # mypy error
# error: Incompatible return value type (got "pandas.core.frame.DataFrame",
# expected "pandera.typing.pandas.DataFrame[SchemaOut]") [return-value]
It’ll also complain if the input type doesn’t match the expected input type.
Note that we’re using the pandera.typing.pandas.DataFrame generic type to define dataframes that are validated against the SchemaModel type variable on initialization.
schema_df = DataFrame[Schema]({"id": [1], "name": ["foo"]})
pandas_df = pd.DataFrame({"id": [1], "name": ["foo"]})
another_df = DataFrame[AnotherSchema]({"id": [1], "first_name": ["foo"]})
fn(schema_df) # mypy okay
fn(pandas_df) # mypy error
# error: Argument 1 to "fn" has incompatible type "pandas.core.frame.DataFrame"; # noqa
# expected "pandera.typing.pandas.DataFrame[Schema]" [arg-type]
fn(another_df) # mypy error
# error: Argument 1 to "fn" has incompatible type "DataFrame[AnotherSchema]";
# expected "DataFrame[Schema]" [arg-type]
To make mypy happy with respect to the return type, you can either initialize a dataframe of the expected type:
def fn_pipe_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
return df.assign(age=30).pipe(DataFrame[SchemaOut]) # mypy okay
Note
If you use the approach above with the check_types() decorator, pandera will do its best not to validate the dataframe twice if it’s already been initialized with the DataFrame[Schema](**data) syntax.
Or use typing.cast() to indicate to mypy that the return value of the function is of the correct type:
def fn_cast_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
return cast(DataFrame[SchemaOut], df.assign(age=30)) # mypy okay
Limitations#
An important caveat to static type-linting with pandera dataframe types is that, since pandas dataframes are mutable objects, there’s no way for mypy to know whether a mutated instance of a SchemaModel-typed dataframe has the correct contents. Fortunately, we can simply rely on the check_types() decorator to verify that the output dataframe is valid.
Consider the examples below:
def fn_pipe_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
return df.assign(age=30).pipe(DataFrame[SchemaOut]) # mypy okay
def fn_cast_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
return cast(DataFrame[SchemaOut], df.assign(age=30)) # mypy okay
@pa.check_types
def fn_mutate_inplace(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
out = df.assign(age=30).pipe(DataFrame[SchemaOut])
out.drop(["age"], axis=1, inplace=True)
return out # okay for mypy, pandera raises error
@pa.check_types
def fn_assign_and_get_index(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
return df.assign(foo=30).iloc[:3] # okay for mypy, pandera raises error
Even though the outputs of these functions are incorrect, mypy doesn’t catch the error during static type-linting, but pandera will raise a SchemaError or SchemaErrors exception at runtime, depending on whether you’re doing lazy validation or not.
@pa.check_types
def fn_cast_dataframe_invalid(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
return cast(
DataFrame[SchemaOut], df
) # okay for mypy, pandera raises error
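For illustration, here is a minimal sketch of handling these runtime errors, reusing schema_df and the schemas defined above (the fn_lazy name is illustrative, and passing lazy=True to check_types to collect all failures into a SchemaErrors report is an assumption about the decorator’s options):
# mypy is satisfied, but pandera raises SchemaError at runtime because the
# output is missing the "age" column required by SchemaOut
try:
    fn_cast_dataframe_invalid(schema_df)
except pa.errors.SchemaError as exc:
    print("validation failed:", exc)

# with lazy validation, failing checks are collected into a single SchemaErrors
# exception instead of raising on the first failure
@pa.check_types(lazy=True)
def fn_lazy(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return cast(DataFrame[SchemaOut], df)

try:
    fn_lazy(schema_df)
except pa.errors.SchemaErrors as exc:
    print(exc.failure_cases)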
Pydantic#
new in 0.8.0
Using Pandera Schemas in Pydantic Models#
SchemaModel is fully compatible with pydantic. You can specify a SchemaModel in a pydantic BaseModel as you would any other field:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
import pydantic
class SimpleSchema(pa.SchemaModel):
str_col: Series[str] = pa.Field(unique=True)
class PydanticModel(pydantic.BaseModel):
x: int
df: DataFrame[SimpleSchema]
valid_df = pd.DataFrame({"str_col": ["hello", "world"]})
PydanticModel(x=1, df=valid_df)
invalid_df = pd.DataFrame({"str_col": ["hello", "hello"]})
PydanticModel(x=1, df=invalid_df)
Traceback (most recent call last):
...
ValidationError: 1 validation error for PydanticModel
df
series 'str_col' contains duplicate values:
1 hello
Name: str_col, dtype: object (type=value_error)
Other pandera components are also compatible with pydantic:
Note
The SeriesSchema, DataFrameSchema and schema_components types validate the type of the schema object itself, i.e. they are useful when your pydantic BaseModel contains a schema object rather than a pandas object.
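For instance, a minimal sketch of what that looks like (the SchemaContainer name and its fields are illustrative):
import pandas as pd
import pandera as pa
import pydantic

class SchemaContainer(pydantic.BaseModel):
    df_schema: pa.DataFrameSchema
    series_schema: pa.SeriesSchema

# okay: both fields are schema objects
SchemaContainer(
    df_schema=pa.DataFrameSchema({"col": pa.Column(int)}),
    series_schema=pa.SeriesSchema(str),
)

# raises a pydantic ValidationError: a dataframe is not a DataFrameSchema
try:
    SchemaContainer(
        df_schema=pd.DataFrame({"col": [1]}),
        series_schema=pa.SeriesSchema(str),
    )
except pydantic.ValidationError as exc:
    print(exc)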
Using Pydantic Models in Pandera Schemas#
new in 0.10.0
You can also use a pydantic BaseModel in a pandera schema. Suppose you had a Record model:
from pydantic import BaseModel
import pandera as pa
class Record(BaseModel):
name: str
xcoord: int
ycoord: int
The PydanticModel datatype enables you to specify the Record model as a row-wise type.
import pandas as pd
from pandera.engines.pandas_engine import PydanticModel
class PydanticSchema(pa.SchemaModel):
"""Pandera schema using the pydantic model."""
class Config:
"""Config with dataframe-level data type."""
dtype = PydanticModel(Record)
coerce = True # this is required, otherwise a SchemaInitError is raised
Note
By combining dtype=PydanticModel(...) and coerce=True, pandera will apply the pydantic model validation process to each row of the dataframe, converting the model back to a dictionary with the BaseModel.dict() method.
The equivalent pandera schema would look like this:
class PanderaSchema(pa.SchemaModel):
"""Pandera schema that's equivalent to PydanticSchema."""
name: pa.typing.Series[str]
xcoord: pa.typing.Series[int]
ycoord: pa.typing.Series[int]
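A minimal usage sketch of validating a dataframe against the PydanticSchema defined above (the column values are illustrative):
import pandas as pd

df = pd.DataFrame({
    "name": ["foo", "bar"],
    "xcoord": [1, 2],
    "ycoord": [3, 4],
})

# every row is passed through the Record constructor; a row that pydantic
# cannot validate raises a pandera SchemaError
validated = PydanticSchema.validate(df)
print(validated)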
Note
Since the PydanticModel datatype applies the BaseModel constructor to each row of the dataframe, using PydanticModel might not scale well with larger datasets.
If you want to help benchmark, consider contributing a benchmark script.
Note
Don’t see a library that you want supported? Check out the github issues to see if that library is in the roadmap. If it isn’t, open up a new issue to add support for it!
API#
- The core objects for defining pandera schemas.
- Data types for type checking and coercion.
- Alternative class-based API for defining pandera schemas.
- Decorators for integrating pandera schemas with python functions.
- Bootstrap schemas from real data.
- Utility functions for reading/writing schemas.
- Module of functions for generating data from schemas.
- Utility functions for extending pandera functionality.
- Pandera-specific exceptions.
Core#
Schemas#
- A light-weight pandas DataFrame validator.
- Series validator.
Schema Components#
- Validate types and properties of DataFrame columns.
- Validate types and properties of a DataFrame Index.
- Validate types and properties of a DataFrame MultiIndex.
Checks#
- Check a pandas Series or DataFrame for certain properties.
- Special type of Check that defines hypothesis tests on data.
Data Types#
Library-agnostic dtypes#
- Base class of all Pandera data types.
- Semantic representation of a boolean data type.
- Semantic representation of a timestamp data type, plus an alias of it.
- Semantic representation of a delta time data type.
- Semantic representation of a categorical data type.
- Semantic representations of floating data types (generic, plus 16-, 32-, 64-, and 128-bit variants).
- Semantic representations of integer data types (generic, plus 8-, 16-, 32-, and 64-bit variants).
- Semantic representations of unsigned integer data types (generic, plus 8-, 16-, 32-, and 64-bit variants).
- Semantic representations of complex number data types (generic, plus 64-, 128-, and 256-bit variants).
- Semantic representation of a decimal data type.
- Semantic representation of a string data type.
Pandas Dtypes#
Listed here for compatibility with pandera versions < 0.7. Passing native pandas dtypes to pandera components is preferred.
- Semantic representations of the pandas-specific data types provided by the pandas_engine module (see the API reference for the full list).
- Semantic representation of a date data type.
GeoPandas Dtypes#
new in 0.9.0
Pydantic Dtypes#
new in 0.10.0
- A pydantic model datatype applying to rows in a dataframe.
Utility functions#
- Returns True if first argument is lower/equal in DataType hierarchy.
- Predicates that return True if a data type falls into a given category (e.g. boolean, numeric, integer, unsigned integer, float, complex, string, or datetime types).
Engines#
- Base Engine metaclass.
- Numpy data type engine.
- Pandas data type engine.
Schema Models#
Schema Model#
- Definition of a DataFrameSchema using the class-based API.
Model Components#
- Used to provide extra information about a field of a SchemaModel.
- Decorator to make SchemaModel method a column/index check function.
- Decorator to make SchemaModel method a dataframe-wide check function.
Typing#
- Typing module.
Config#
- Define DataFrameSchema-wide options.
Decorators#
- Validate function argument when function is called.
- Validate function output.
- Check schema for multiple inputs and outputs.
- Validate function inputs and output based on type annotations.
Schema Inference#
- Infer schema for pandas DataFrame or Series object.
IO Utilities#
The io module and built-in Hypothesis checks require a pandera installation with the corresponding extension; see the installation instructions for more details.
- Utilities for creating a DataFrameSchema from a yaml definition and for writing a schema back out to yaml or a python script.
Data Synthesis Strategies#
- Generate synthetic data from a schema definition.
Extensions#
- pandera API extensions.
Errors#
- Raised when object does not pass schema validation constraints.
- Raised when multiple schema errors are lazily collected into one error.
- Raised when schema initialization fails.
- Raised when schema definition is invalid on object validation.
Contributing#
Whether you are a novice or experienced software developer, all contributions and suggestions are welcome!
Getting Started#
If you are looking to contribute to the pandera codebase, the best place to start is the GitHub “issues” tab. This is also a great place for filing bug reports and making suggestions for ways in which we can improve the code and documentation.
Contributing to the Codebase#
The code is hosted on GitHub, so you will need to use Git to clone the project and make changes to the codebase.
First create your own fork of pandera, then clone it:
# replace <my-username> with your github username
git clone https://github.com/<my-username>/pandera.git
Once you’ve obtained a copy of the code, create a development environment that’s separate from your existing Python environment so that you can make and test changes without compromising your own work environment.
An excellent guide on setting up python environments can be found here. Pandera offers an environment.yml to set up a conda-based environment and requirements-dev.txt for a virtualenv.
Environment Setup#
Option 1: miniconda
Setup#
Install miniconda, then run:
conda create -n pandera-dev python=3.8 # or any python version 3.7+
conda env update -n pandera-dev -f environment.yml
conda activate pandera-dev
pip install -e .
Option 2: virtualenv
Setup#
pip install virtualenv
virtualenv .venv/pandera-dev
source .venv/pandera-dev/bin/activate
pip install -r requirements-dev.txt
pip install -e .
Run Tests#
pytest tests
Build Documentation Locally#
make docs
Adding New Dependencies#
To add new dependencies to the project, make sure to alter the environment.yml file. Then, to sync the dependencies from the environment.yml file to requirements-dev.txt, run the following command:
python scripts/generate_pip_deps_from_conda.py
Moreover, to add new dependencies to setup.py, it is necessary to add them to the _extras_require dictionary.
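For reference, a hedged sketch of what that looks like inside setup.py (the extra name and package requirement below are purely illustrative):
# setup.py (sketch): optional dependencies are registered under _extras_require,
# which is then passed to setup(..., extras_require=_extras_require)
_extras_require = {
    # hypothetical extra and requirement, shown only to illustrate the pattern
    "my-extra": ["some-package >= 1.0"],
}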
Set up pre-commit#
This project uses pre-commit to ensure that code standard checks pass locally before pushing to the remote project repo. Follow the installation instructions, then set up the hooks with pre-commit install. Afterwards, black, pylint and mypy checks will run with every commit.
Make sure everything is working correctly by running
pre-commit run --all
Making Changes#
Before making changes to the codebase or documentation, create a new branch with:
git checkout -b <my-branch>
We recommend following the branch-naming convention described in Making Pull Requests.
Run the Full Test Suite Locally#
Before submitting your changes for review, make sure to check that your changes do not break any tests by running:
# option 1: if you're working with conda (recommended)
$ make nox-conda
# option 2: if you're working with virtualenv
$ make nox
Option 2 assumes that you have python environments for all of the versions that pandera supports.
Using mamba (optional)#
You can also use mamba, which is a faster implementation of miniconda, to run the nox test suite. Simply install it via conda-forge, and make nox-conda should use it under the hood.
$ conda install -c conda-forge mamba
$ make nox-conda
Project Releases#
Releases are organized under milestones, which are associated with a corresponding branch. This project uses semantic versioning, and we recommend prioritizing issues associated with the next release.
Contributing Documentation#
Maybe the easiest, fastest, and most useful way to contribute to this project (and any other project) is to contribute documentation. If you find an API within the project that doesn’t have an example or description, or could be clearer in its explanation, contribute yours!
You can also find issues for improving documentation under the docs label. If you have ideas for documentation improvements, you can create a new issue here.
This project uses Sphinx for auto-documentation and RST syntax for docstrings. Once you have the code downloaded and you find something that is in need of some TLC, take a look at the Sphinx documentation or well-documented examples within the codebase for guidance on contributing.
You can build the html documentation by running nox -s docs. The built documentation can be found in docs/_build.
Contributing Bugfixes#
Bugs are reported under the bug label, so if you find a bug create a new issue here.
Contributing Enhancements#
New feature issues can be found under the enhancements label. You can request a feature by creating a new issue here.
Making Pull Requests#
Once your changes are ready to be submitted, make sure to push your changes to your fork of the GitHub repo before creating a pull request. Depending on the type of issue the pull request is resolving, your pull request should merge onto the appropriate branch:
Bugfixes#
- branch naming convention: bugfix/<issue number> or bugfix/<bugfix-name>
- pull request to: dev
Documentation#
- branch naming convention: docs/<issue number> or docs/<doc-name>
- pull request to: the release/x.x.x branch if specified in the issue milestone, otherwise dev
Enhancements#
- branch naming convention: feature/<issue number> or feature/<feature-name>
- pull request to: the release/x.x.x branch if specified in the issue milestone, otherwise dev
We will review your changes and might ask you to make additional changes before the pull request is ready to merge. Once it’s ready, we will merge it, and you will have successfully contributed to the codebase!
Questions, Ideas, General Discussion#
Head on over to the discussion section if you have questions or ideas, want to show off something that you did with pandera, or want to discuss a topic related to the project.
Dataframe Schema Style Guides#
We have guidelines regarding dataframe and schema styles that are encouraged for each pull request:
If specifying a single column DataFrame, this can be expressed as a one-liner:
DataFrameSchema({"col1": Column(...)})
If specifying one column with multiple lines, or multiple columns:
DataFrameSchema( { "col1": Column( int, checks=[ Check(...), Check(...), ] ), } )
If specifying columns with additional arguments that fit in one line:
DataFrameSchema( {"a": Column(int, nullable=True)}, strict=True )
If specifying columns with additional arguments that don’t fit in one line:
DataFrameSchema( { "a": Column( int, nullable=True, coerce=True, ... ), "b": Column( ..., ) }, strict=True)
Deprecation policy#
This project adopts a rolling policy regarding the minimum supported version of its dependencies, based on NEP 29:
Python: 42 months
NumPy: 24 months
Pandas: 18 months
This means the latest minor (X.Y) version from N months prior. Patch versions (x.y.Z) are not pinned, and only the latest available at the moment of publishing the pandera release is guaranteed to work.
How to Cite#
If you use pandera in the context of academic or industry research, please consider citing the paper and/or software package.
Paper#
@InProceedings{ niels_bantilan-proc-scipy-2020,
author = { {N}iels {B}antilan },
title = { pandera: {S}tatistical {D}ata {V}alidation of {P}andas {D}ataframes },
booktitle = { {P}roceedings of the 19th {P}ython in {S}cience {C}onference },
pages = { 116 - 124 },
year = { 2020 },
editor = { {M}eghann {A}garwal and {C}hris {C}alloway and {D}illon {N}iederhut and {D}avid {S}hupe },
doi = { 10.25080/Majora-342d178e-010 }
}
Software Package#
License and Credits#
pandera is licensed under the MIT license and is written and maintained by Niels Bantilan (niels@pandera.ci).