The Open-source Framework for Precision Data Testing#
Data validation for scientists, engineers, and analysts seeking correctness.
pandera
is a Union.ai
open source project that provides a flexible and expressive API for performing data
validation on dataframe-like objects to make data processing pipelines more readable
and robust.
Dataframes contain information that pandera
explicitly validates at runtime.
This is useful in production-critical data pipelines or reproducible research
settings. With pandera
, you can:
Define a schema once and use it to validate different dataframe types including pandas, dask, modin, and pyspark.pandas.
Check the types and properties of columns in a
pd.DataFrame
or values in apd.Series
.Perform more complex statistical validation like hypothesis testing.
Seamlessly integrate with existing data analysis/processing pipelines via function decorators.
Define dataframe models with the class-based API with pydantic-style syntax and validate dataframes using the typing syntax.
Synthesize data from schema objects for property-based testing with pandas data structures.
Lazily Validate dataframes so that all validation rules are executed before raising an error.
Integrate with a rich ecosystem of python tools like pydantic, fastapi and mypy.
Install#
Install with pip
:
pip install pandera
Or conda
:
conda install -c conda-forge pandera
Extras#
Installing additional functionality:
pip install pandera[hypotheses] # hypothesis checks
pip install pandera[io] # yaml/script schema io utilities
pip install pandera[strategies] # data synthesis strategies
pip install pandera[mypy] # enable static type-linting of pandas
pip install pandera[fastapi] # fastapi integration
pip install pandera[dask] # validate dask dataframes
pip install pandera[pyspark] # validate pyspark dataframes
pip install pandera[modin] # validate modin dataframes
pip install pandera[modin-ray] # validate modin dataframes with ray
pip install pandera[modin-dask] # validate modin dataframes with dask
pip install pandera[geopandas] # validate geopandas geodataframes
conda install -c conda-forge pandera-hypotheses # hypothesis checks
conda install -c conda-forge pandera-io # yaml/script schema io utilities
conda install -c conda-forge pandera-strategies # data synthesis strategies
conda install -c conda-forge pandera-mypy # enable static type-linting of pandas
conda install -c conda-forge pandera-fastapi # fastapi integration
conda install -c conda-forge pandera-dask # validate dask dataframes
conda install -c conda-forge pandera-pyspark # validate pyspark dataframes
conda install -c conda-forge pandera-modin # validate modin dataframes
conda install -c conda-forge pandera-modin-ray # validate modin dataframes with ray
conda install -c conda-forge pandera-modin-dask # validate modin dataframes with dask
conda install -c conda-forge pandera-geopandas # validate geopandas geodataframes
Quick Start#
import pandas as pd
import pandera as pa
# data to validate
df = pd.DataFrame({
"column1": [1, 4, 0, 10, 9],
"column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
"column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})
# define schema
schema = pa.DataFrameSchema({
"column1": pa.Column(int, checks=pa.Check.le(10)),
"column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
"column3": pa.Column(str, checks=[
pa.Check.str_startswith("value_"),
# define custom checks as functions that take a series as input and
# outputs a boolean or boolean Series
pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
]),
})
validated_df = schema(df)
print(validated_df)
column1 column2 column3
0 1 -1.3 value_1
1 4 -1.4 value_2
2 0 -2.9 value_3
3 10 -10.1 value_2
4 9 -20.4 value_1
You can pass the built-in python types that are supported by
pandas, or strings representing the
legal pandas datatypes,
or pandera’s DataType
:
schema = pa.DataFrameSchema({
# built-in python types
"int_column": pa.Column(int),
"float_column": pa.Column(float),
"str_column": pa.Column(str),
# pandas dtype string aliases
"int_column2": pa.Column("int64"),
"float_column2": pa.Column("float64"),
# pandas > 1.0.0 support native "string" type
"str_column2": pa.Column("str"),
# pandera DataType
"int_column3": pa.Column(pa.Int),
"float_column3": pa.Column(pa.Float),
"str_column3": pa.Column(pa.String),
})
For more details on data types, see DataType
Dataframe Model#
pandera
also provides an alternative API for expressing schemas inspired
by dataclasses and
pydantic. The equivalent
DataFrameModel
for the above
DataFrameSchema
would be:
from pandera.typing import Series
class Schema(pa.DataFrameModel):
column1: int = pa.Field(le=10)
column2: float = pa.Field(lt=-1.2)
column3: str = pa.Field(str_startswith="value_")
@pa.check("column3")
def column_3_check(cls, series: Series[str]) -> Series[bool]:
"""Check that column3 values have two elements after being split with '_'"""
return series.str.split("_", expand=True).shape[1] == 2
Schema.validate(df)
Informative Errors#
If the dataframe does not pass validation checks, pandera
provides
useful error messages. An error
argument can also be supplied to
Check
for custom error messages.
In the case that a validation Check
is violated:
import pandas as pd
from pandera import Column, DataFrameSchema, Int, Check
simple_schema = DataFrameSchema({
"column1": Column(
Int, Check(lambda x: 0 <= x <= 10, element_wise=True,
error="range checker [0, 10]"))
})
# validation rule violated
fail_check_df = pd.DataFrame({
"column1": [-20, 5, 10, 30],
})
simple_schema(fail_check_df)
Traceback (most recent call last):
...
SchemaError: <Schema Column: 'column1' type=<class 'int'>> failed element-wise validator 0:
<Check <lambda>: range checker [0, 10]>
failure cases:
index failure_case
0 0 -20
1 3 30
And in the case of a mis-specified column name:
# column name mis-specified
wrong_column_df = pd.DataFrame({
"foo": ["bar"] * 10,
"baz": [1] * 10
})
simple_schema.validate(wrong_column_df)
Traceback (most recent call last):
...
pandera.SchemaError: column 'column1' not in dataframe
foo baz
0 bar 1
1 bar 1
2 bar 1
3 bar 1
4 bar 1
Contributing#
All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.
A detailed overview on how to contribute can be found in the contributing guide on GitHub.
Issues#
Submit issues, feature requests or bugfixes on github.
Need Help?#
There are many ways of getting help with your questions. You can ask a question on Github Discussions page or reach out to the maintainers and pandera community on Discord
How to Cite#
If you use pandera
in the context of academic or industry research, please
consider citing the paper and/or software package.
Paper#
@InProceedings{ niels_bantilan-proc-scipy-2020,
author = { {N}iels {B}antilan },
title = { pandera: {S}tatistical {D}ata {V}alidation of {P}andas {D}ataframes },
booktitle = { {P}roceedings of the 19th {P}ython in {S}cience {C}onference },
pages = { 116 - 124 },
year = { 2020 },
editor = { {M}eghann {A}garwal and {C}hris {C}alloway and {D}illon {N}iederhut and {D}avid {S}hupe },
doi = { 10.25080/Majora-342d178e-010 }
}
Software Package#
License and Credits#
pandera
is licensed under the MIT license.
and is written and maintained by Niels Bantilan (niels@pandera.ci)