The Open-source Framework for Precision Data Testing
Data validation for scientists, engineers, and analysts seeking correctness.
pandera is a Union.ai open source project that provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust.

Dataframes contain information that pandera explicitly validates at runtime. This is useful in production-critical data pipelines or reproducible research settings. With pandera, you can:
- Define a schema once and use it to validate different dataframe types, including pandas, dask, modin, and pyspark.pandas.
- Check the types and properties of columns in a pd.DataFrame or values in a pd.Series.
- Perform more complex statistical validation like hypothesis testing.
- Seamlessly integrate with existing data analysis/processing pipelines via function decorators (see the sketch after this list).
- Define dataframe models with the class-based API with pydantic-style syntax and validate dataframes using the typing syntax.
- Synthesize data from schema objects for property-based testing with pandas data structures.
- Lazily validate dataframes so that all validation rules are executed before raising an error.
- Integrate with a rich ecosystem of Python tools like pydantic, FastAPI, and mypy.
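For instance, here is a minimal sketch of the decorator integration; the in_schema, out_schema, and add_y names are illustrative, not part of pandera:

import pandas as pd
import pandera as pa

in_schema = pa.DataFrameSchema({"x": pa.Column(int)})
out_schema = pa.DataFrameSchema({"x": pa.Column(int), "y": pa.Column(float)})

# check_input validates the first argument, check_output the return value
@pa.check_input(in_schema)
@pa.check_output(out_schema)
def add_y(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(y=df["x"] * 0.5)

add_y(pd.DataFrame({"x": [1, 2, 3]}))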
Install#
Install with pip:
pip install pandera
Or conda:
conda install -c conda-forge pandera
Extras
Installing additional functionality:
pip install pandera[hypotheses] # hypothesis checks
pip install pandera[io] # yaml/script schema io utilities
pip install pandera[strategies] # data synthesis strategies
pip install pandera[mypy] # enable static type-linting of pandas
pip install pandera[fastapi] # fastapi integration
pip install pandera[dask] # validate dask dataframes
pip install pandera[pyspark] # validate pyspark dataframes
pip install pandera[modin] # validate modin dataframes
pip install pandera[modin-ray] # validate modin dataframes with ray
pip install pandera[modin-dask] # validate modin dataframes with dask
pip install pandera[geopandas] # validate geopandas geodataframes
Or with conda:

conda install -c conda-forge pandera-hypotheses # hypothesis checks
conda install -c conda-forge pandera-io # yaml/script schema io utilities
conda install -c conda-forge pandera-strategies # data synthesis strategies
conda install -c conda-forge pandera-mypy # enable static type-linting of pandas
conda install -c conda-forge pandera-fastapi # fastapi integration
conda install -c conda-forge pandera-dask # validate dask dataframes
conda install -c conda-forge pandera-pyspark # validate pyspark dataframes
conda install -c conda-forge pandera-modin # validate modin dataframes
conda install -c conda-forge pandera-modin-ray # validate modin dataframes with ray
conda install -c conda-forge pandera-modin-dask # validate modin dataframes with dask
conda install -c conda-forge pandera-geopandas # validate geopandas geodataframes
Quick Start
import pandas as pd
import pandera as pa
# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})
# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a series as input and
        # output a boolean or a boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})
validated_df = schema(df)
print(validated_df)
column1 column2 column3
0 1 -1.3 value_1
1 4 -1.4 value_2
2 0 -2.9 value_3
3 10 -10.1 value_2
4 9 -20.4 value_1
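Schemas can also synthesize valid test data, per the property-based testing feature listed above. A minimal sketch, assuming the strategies extra is installed (example_schema is an illustrative name):

# requires `pip install pandera[strategies]`, which pulls in hypothesis
example_schema = pa.DataFrameSchema({
    "column1": pa.Column(int, pa.Check.ge(0)),
    "column2": pa.Column(float, pa.Check.lt(100)),
})

# draw a valid example dataframe from the schema
print(example_schema.example(size=3))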
You can pass the built-in Python types that are supported by pandas, strings representing the legal pandas datatypes, or pandera's DataType:
schema = pa.DataFrameSchema({
    # built-in python types
    "int_column": pa.Column(int),
    "float_column": pa.Column(float),
    "str_column": pa.Column(str),
    # pandas dtype string aliases
    "int_column2": pa.Column("int64"),
    "float_column2": pa.Column("float64"),
    # pandas > 1.0.0 supports the native "string" type
    "str_column2": pa.Column("str"),
    # pandera DataType
    "int_column3": pa.Column(pa.Int),
    "float_column3": pa.Column(pa.Float),
    "str_column3": pa.Column(pa.String),
})
For more details on data types, see DataType.
Dataframe Model

pandera also provides an alternative API for expressing schemas, inspired by dataclasses and pydantic. The equivalent DataFrameModel for the above DataFrameSchema would be:
from pandera.typing import Series

class Schema(pa.DataFrameModel):
    column1: int = pa.Field(le=10)
    column2: float = pa.Field(lt=-1.2)
    column3: str = pa.Field(str_startswith="value_")

    @pa.check("column3")
    def column_3_check(cls, series: Series[str]) -> bool:
        """Check that column3 values have two elements after being split with '_'."""
        return series.str.split("_", expand=True).shape[1] == 2

Schema.validate(df)
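A DataFrameModel can also serve as a function type annotation. A minimal sketch (the transform function is illustrative):

from pandera.typing import DataFrame

# check_types validates arguments and return values annotated with DataFrame[Schema]
@pa.check_types
def transform(df: DataFrame[Schema]) -> DataFrame[Schema]:
    return df

transform(df)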
Informative Errors

If the dataframe does not pass validation checks, pandera provides useful error messages. An error argument can also be supplied to Check for custom error messages.

In the case that a validation Check is violated:
simple_schema = pa.DataFrameSchema({
    "column1": pa.Column(
        int, pa.Check(lambda x: 0 <= x <= 10, element_wise=True,
                      error="range checker [0, 10]"))
})

# validation rule violated
fail_check_df = pd.DataFrame({
    "column1": [-20, 5, 10, 30],
})
simple_schema(fail_check_df)
Traceback (most recent call last):
...
SchemaError: Column 'column1' failed element-wise validator number 0: range checker [0, 10] failure cases: -20, 30
And in the case of a mis-specified column name:
# column name mis-specified
wrong_column_df = pd.DataFrame({
    "foo": ["bar"] * 10,
    "baz": [1] * 10
})
simple_schema.validate(wrong_column_df)
Traceback (most recent call last):
...
pandera.SchemaError: column 'column1' not in dataframe
foo baz
0 bar 1
1 bar 1
2 bar 1
3 bar 1
4 bar 1
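To handle validation failures programmatically instead of letting them raise, catch the exception and inspect its attributes. A minimal sketch:

try:
    simple_schema(fail_check_df)
except pa.errors.SchemaError as exc:
    print(exc.failure_cases)  # dataframe of failing index/value pairs
    print(exc.data)           # the original, invalid dataframe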
Error Reports

If the dataframe is validated lazily with lazy=True, errors will be aggregated into an error report. The error report groups DATA and SCHEMA errors to give an overview of error sources within a dataframe. Take the following schema and dataframe:
schema = pa.DataFrameSchema({"id": pa.Column(int, pa.Check.lt(10))}, name="MySchema", strict=True)
df = pd.DataFrame({"id": [1, None, 30], "extra_column": [1, 2, 3]})
schema.validate(df, lazy=True)
Validating the above dataframe will result in data-level errors, namely the id column having a value that fails a check, as well as schema-level errors, such as the extra column and the None value:
Traceback (most recent call last):
...
SchemaErrors: {
    "SCHEMA": {
        "COLUMN_NOT_IN_SCHEMA": [
            {
                "schema": "MySchema",
                "column": "MySchema",
                "check": "column_in_schema",
                "error": "column 'extra_column' not in DataFrameSchema {'id': <Schema Column(name=id, type=DataType(int64))>}"
            }
        ],
        "SERIES_CONTAINS_NULLS": [
            {
                "schema": "MySchema",
                "column": "id",
                "check": "not_nullable",
                "error": "non-nullable series 'id' contains null values:\n1   NaN\nName: id, dtype: float64"
            }
        ],
        "WRONG_DATATYPE": [
            {
                "schema": "MySchema",
                "column": "id",
                "check": "dtype('int64')",
                "error": "expected series 'id' to have type int64, got float64"
            }
        ]
    },
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "MySchema",
                "column": "id",
                "check": "less_than(10)",
                "error": "Column 'id' failed element-wise validator number 0: less_than(10) failure cases: 30.0"
            }
        ]
    }
}
This error report can be useful for debugging, with each item in the various lists corresponding to a SchemaError object.
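To work with the aggregated report programmatically, catch the SchemaErrors exception raised by lazy validation. A minimal sketch:

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    # dataframe aggregating every failure case across all checks
    print(exc.failure_cases)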
Contributing
All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.
A detailed overview on how to contribute can be found in the contributing guide on GitHub.
Issues
Submit issues, feature requests, or bug fixes on GitHub.
Need Help?
There are many ways of getting help with your questions. You can ask a question on the GitHub Discussions page or reach out to the maintainers and the pandera community on Discord.
How to Cite

If you use pandera in the context of academic or industry research, please consider citing the paper and/or software package.
Paper
@InProceedings{ niels_bantilan-proc-scipy-2020,
  author    = { {N}iels {B}antilan },
  title     = { pandera: {S}tatistical {D}ata {V}alidation of {P}andas {D}ataframes },
  booktitle = { {P}roceedings of the 19th {P}ython in {S}cience {C}onference },
  pages     = { 116 - 124 },
  year      = { 2020 },
  editor    = { {M}eghann {A}garwal and {C}hris {C}alloway and {D}illon {N}iederhut and {D}avid {S}hupe },
  doi       = { 10.25080/Majora-342d178e-010 }
}
Software Package
License and Credits

pandera is licensed under the MIT license and is written and maintained by Niels Bantilan (niels@pandera.ci).