The Open-source Framework for Precision Data Testing#

Data validation for scientists, engineers, and analysts seeking correctness.


pandera is a Union.ai open source project that provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust.

Dataframes contain information that pandera explicitly validates at runtime. This is useful in production-critical data pipelines or reproducible research settings. With pandera, you can:

  1. Define a schema once and use it to validate different dataframe types including pandas, dask, modin, and pyspark.pandas.

  2. Check the types and properties of columns in a pd.DataFrame or values in a pd.Series.

  3. Perform more complex statistical validation like hypothesis testing.

  4. Seamlessly integrate with existing data analysis/processing pipelines via function decorators (see the sketch after this list).

  5. Define dataframe models with a class-based API using pydantic-style syntax, and validate dataframes with the typing syntax.

  6. Synthesize data from schema objects for property-based testing with pandas data structures.

  7. Lazily validate dataframes so that all validation rules are executed before raising an error.

  8. Integrate with a rich ecosystem of python tools like pydantic, fastapi and mypy.
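
For example, the function decorator integration (item 4) might look like the following minimal sketch, using pandera's check_input decorator; the double function and its in_schema are illustrative, not part of pandera:

import pandas as pd
import pandera as pa

in_schema = pa.DataFrameSchema({"value": pa.Column(int, pa.Check.ge(0))})

# validate the input dataframe before the function body runs
@pa.check_input(in_schema)
def double(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(value=df["value"] * 2)

double(pd.DataFrame({"value": [1, 2, 3]}))  # passes validation, returns doubled values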

Install#

Install with pip:

pip install pandera

Or conda:

conda install -c conda-forge pandera

Extras#

Installing additional functionality:

With pip:

pip install pandera[hypotheses]  # hypothesis checks
pip install pandera[io]          # yaml/script schema io utilities
pip install pandera[strategies]  # data synthesis strategies
pip install pandera[mypy]        # enable static type-linting of pandas
pip install pandera[fastapi]     # fastapi integration
pip install pandera[dask]        # validate dask dataframes
pip install pandera[pyspark]     # validate pyspark dataframes
pip install pandera[modin]       # validate modin dataframes
pip install pandera[modin-ray]   # validate modin dataframes with ray
pip install pandera[modin-dask]  # validate modin dataframes with dask
pip install pandera[geopandas]   # validate geopandas geodataframes

Or with conda:

conda install -c conda-forge pandera-hypotheses  # hypothesis checks
conda install -c conda-forge pandera-io          # yaml/script schema io utilities
conda install -c conda-forge pandera-strategies  # data synthesis strategies
conda install -c conda-forge pandera-mypy        # enable static type-linting of pandas
conda install -c conda-forge pandera-fastapi     # fastapi integration
conda install -c conda-forge pandera-dask        # validate dask dataframes
conda install -c conda-forge pandera-pyspark     # validate pyspark dataframes
conda install -c conda-forge pandera-modin       # validate modin dataframes
conda install -c conda-forge pandera-modin-ray   # validate modin dataframes with ray
conda install -c conda-forge pandera-modin-dask  # validate modin dataframes with dask
conda install -c conda-forge pandera-geopandas   # validate geopandas geodataframes
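
As an illustration of the hypotheses extra, here is a minimal sketch of a two-sample t-test check using pandera's Hypothesis class; the dataframe and column names are illustrative, and scipy (installed by the extra) is required:

import pandas as pd
import pandera as pa

df = pd.DataFrame({
    "height_in_feet": [6.5, 7.0, 6.1, 5.1, 4.0],
    "sex": ["M", "M", "F", "F", "F"],
})

schema = pa.DataFrameSchema({
    "height_in_feet": pa.Column(float, checks=[
        # test the hypothesis that male heights are greater than female heights
        pa.Hypothesis.two_sample_ttest(
            sample1="M",
            sample2="F",
            groupby="sex",
            relationship="greater_than",
            alpha=0.05,
            equal_var=True,
        ),
    ]),
    "sex": pa.Column(str),
})

schema.validate(df)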

Quick Start#

import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a series as input and
        # outputs a boolean or boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema(df)
print(validated_df)
   column1  column2  column3
0        1     -1.3  value_1
1        4     -1.4  value_2
2        0     -2.9  value_3
3       10    -10.1  value_2
4        9    -20.4  value_1
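
Schema objects like the one above can also synthesize valid data for property-based testing (item 6 in the feature list); a minimal sketch, assuming the strategies extra is installed and using an illustrative simple_int_schema:

simple_int_schema = pa.DataFrameSchema({
    "column1": pa.Column(int, pa.Check.ge(0)),
})

# generate a valid example dataframe via the hypothesis library
print(simple_int_schema.example(size=3))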

You can pass the built-in Python types that are supported by pandas, strings representing the legal pandas datatypes, or pandera's DataType:

schema = pa.DataFrameSchema({
    # built-in python types
    "int_column": pa.Column(int),
    "float_column": pa.Column(float),
    "str_column": pa.Column(str),

    # pandas dtype string aliases
    "int_column2": pa.Column("int64"),
    "float_column2": pa.Column("float64"),
    # pandas >= 1.0.0 supports the native "string" type
    "str_column2": pa.Column("str"),

    # pandera DataType
    "int_column3": pa.Column(pa.Int),
    "float_column3": pa.Column(pa.Float),
    "str_column3": pa.Column(pa.String),
})

For more details on data types, see DataType.

Dataframe Model#

pandera also provides an alternative API for expressing schemas inspired by dataclasses and pydantic. The equivalent DataFrameModel for the above DataFrameSchema would be:

from pandera.typing import Series

class Schema(pa.DataFrameModel):

    column1: int = pa.Field(le=10)
    column2: float = pa.Field(lt=-1.2)
    column3: str = pa.Field(str_startswith="value_")

    @pa.check("column3")
    def column_3_check(cls, series: Series[str]) -> bool:
        """Check that column3 values have two elements after being split with '_'"""
        return series.str.split("_", expand=True).shape[1] == 2

Schema.validate(df)
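
Dataframe models also plug into the typing syntax mentioned above; a minimal sketch using the check_types decorator and pandera.typing.DataFrame (the transform function is illustrative):

from pandera.typing import DataFrame

@pa.check_types
def transform(data: DataFrame[Schema]) -> DataFrame[Schema]:
    # both the input and the returned dataframe are validated against Schema
    return data

transform(df)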

Informative Errors#

If the dataframe does not pass validation checks, pandera provides useful error messages. An error argument can also be supplied to Check for custom error messages.

In the case that a validation Check is violated:

simple_schema = pa.DataFrameSchema({
    "column1": pa.Column(
        int, pa.Check(lambda x: 0 <= x <= 10, element_wise=True,
                   error="range checker [0, 10]"))
})

# validation rule violated
fail_check_df = pd.DataFrame({
    "column1": [-20, 5, 10, 30],
})

simple_schema(fail_check_df)
Traceback (most recent call last):
...
SchemaError: <Schema Column(name=column1, type=DataType(int64))> failed element-wise validator 0:
<Check <lambda>: range checker [0, 10]>
failure cases:
   index  failure_case
0      0           -20
1      3            30

And in the case of a mis-specified column name:

# column name mis-specified
wrong_column_df = pd.DataFrame({
   "foo": ["bar"] * 10,
   "baz": [1] * 10
})

simple_schema.validate(wrong_column_df)
Traceback (most recent call last):
...
pandera.SchemaError: column 'column1' not in dataframe
   foo  baz
0  bar    1
1  bar    1
2  bar    1
3  bar    1
4  bar    1
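
Validation errors can also be caught and inspected programmatically; a minimal sketch using the failure_cases and data attributes of SchemaError:

try:
    simple_schema.validate(fail_check_df)
except pa.errors.SchemaError as err:
    print(err.failure_cases)  # dataframe of failing index/value pairs
    print(err.data)           # the invalid dataframe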

Error Reports#

If the dataframe is validated lazily with lazy=True, errors are aggregated into an error report. The error report groups DATA and SCHEMA errors to give an overview of error sources within a dataframe. Take the following schema and dataframe:

schema = pa.DataFrameSchema({"id": pa.Column(int, pa.Check.lt(10))}, name="MySchema", strict=True)
df = pd.DataFrame({"id": [1, None, 30], "extra_column": [1, 2, 3]})
schema.validate(df, lazy=True)

Validating the above dataframe results in data-level errors, namely the id column having a value that fails the check, as well as schema-level errors: the extra column and the None value.

Traceback (most recent call last):
...
SchemaErrors: {
    "SCHEMA": {
        "COLUMN_NOT_IN_SCHEMA": [
            {
                "schema": "MySchema",
                "column": "MySchema",
                "check": "column_in_schema",
                "error": "column 'extra_column' not in DataFrameSchema {'id': <Schema Column(name=id, type=DataType(int64))>}"
            }
        ],
        "SERIES_CONTAINS_NULLS": [
            {
                "schema": "MySchema",
                "column": "id",
                "check": "not_nullable",
                "error": "non-nullable series 'id' contains null values:1   NaNName: id, dtype: float64"
            }
        ],
        "WRONG_DATATYPE": [
            {
                "schema": "MySchema",
                "column": "id",
                "check": "dtype('int64')",
                "error": "expected series 'id' to have type int64, got float64"
            }
        ]
    },
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "MySchema",
                "column": "id",
                "check": "less_than(10)",
                "error": "Column 'id' failed element-wise validator number 0: less_than(10) failure cases: 30.0"
            }
        ]
    }
}

This error report can be useful for debugging, with each item in the various lists corresponding to a SchemaError.
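
To work with the report programmatically, catch the SchemaErrors exception raised by lazy validation; a minimal sketch:

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)  # dataframe of all schema and data errors
    print(err.data)           # the invalid dataframe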

Contributing#

All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.

A detailed overview on how to contribute can be found in the contributing guide on GitHub.

Issues#

Submit issues, feature requests, or bug fixes on GitHub.

Need Help?#

There are many ways of getting help with your questions. You can ask a question on the GitHub Discussions page or reach out to the maintainers and pandera community on Discord.

How to Cite#

If you use pandera in the context of academic or industry research, please consider citing the paper and/or software package.

Paper#

@InProceedings{ niels_bantilan-proc-scipy-2020,
  author    = { {N}iels {B}antilan },
  title     = { pandera: {S}tatistical {D}ata {V}alidation of {P}andas {D}ataframes },
  booktitle = { {P}roceedings of the 19th {P}ython in {S}cience {C}onference },
  pages     = { 116 - 124 },
  year      = { 2020 },
  editor    = { {M}eghann {A}garwal and {C}hris {C}alloway and {D}illon {N}iederhut and {D}avid {S}hupe },
  doi       = { 10.25080/Majora-342d178e-010 }
}

Software Package#

DOI

License and Credits#

pandera is licensed under the MIT license and is written and maintained by Niels Bantilan (niels@pandera.ci).
