A Statistical Data Testing Toolkit#

A data validation library for scientists, engineers, and analysts seeking correctness.


pandera provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust.

Dataframes contain information that pandera explicitly validates at runtime. This is useful in production-critical data pipelines or reproducible research settings. With pandera, you can:

  1. Define a schema once and use it to validate different dataframe types including pandas, dask, modin, and pyspark.pandas.

  2. Check the types and properties of columns in a pd.DataFrame or values in a pd.Series.

  3. Perform more complex statistical validation like hypothesis testing.

  4. Seamlessly integrate with existing data analysis/processing pipelines via function decorators.

  5. Define schema models with the class-based API with pydantic-style syntax and validate dataframes using the typing syntax.

  6. Synthesize data from schema objects for property-based testing with pandas data structures.

  7. Lazily validate dataframes so that all validation rules are executed before raising an error.

  8. Integrate with a rich ecosystem of Python tools like pydantic, fastapi, and mypy.


Install with pip:

pip install pandera

Or conda:

conda install -c conda-forge pandera


Installing additional functionality:

pip install pandera[hypotheses]  # hypothesis checks
pip install pandera[io]          # yaml/script schema io utilities
pip install pandera[strategies]  # data synthesis strategies
pip install pandera[mypy]        # enable static type-linting of pandas
pip install pandera[fastapi]     # fastapi integration
pip install pandera[dask]        # validate dask dataframes
pip install pandera[pyspark]     # validate pyspark dataframes
pip install pandera[modin]       # validate modin dataframes
pip install pandera[modin-ray]   # validate modin dataframes with ray
pip install pandera[modin-dask]  # validate modin dataframes with dask
pip install pandera[geopandas]   # validate geopandas geodataframes

conda install -c conda-forge pandera-hypotheses  # hypothesis checks
conda install -c conda-forge pandera-io          # yaml/script schema io utilities
conda install -c conda-forge pandera-strategies  # data synthesis strategies
conda install -c conda-forge pandera-mypy        # enable static type-linting of pandas
conda install -c conda-forge pandera-fastapi     # fastapi integration
conda install -c conda-forge pandera-dask        # validate dask dataframes
conda install -c conda-forge pandera-pyspark     # validate pyspark dataframes
conda install -c conda-forge pandera-modin       # validate modin dataframes
conda install -c conda-forge pandera-modin-ray   # validate modin dataframes with ray
conda install -c conda-forge pandera-modin-dask  # validate modin dataframes with dask
conda install -c conda-forge pandera-geopandas   # validate geopandas geodataframes

Quick Start#

import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        # define custom checks as functions that take a Series as input and
        # output a boolean or a boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema(df)
print(validated_df)
   column1  column2  column3
0        1     -1.3  value_1
1        4     -1.4  value_2
2        0     -2.9  value_3
3       10    -10.1  value_2
4        9    -20.4  value_1

You can pass built-in Python types that pandas supports, strings naming valid pandas data types, or pandera’s DataType:

schema = pa.DataFrameSchema({
    # built-in python types
    "int_column": pa.Column(int),
    "float_column": pa.Column(float),
    "str_column": pa.Column(str),

    # pandas dtype string aliases
    "int_column2": pa.Column("int64"),
    "float_column2": pa.Column("float64"),
    # pandas > 1.0.0 support native "string" type
    "str_column2": pa.Column("str"),

    # pandera DataType
    "int_column3": pa.Column(pa.Int),
    "float_column3": pa.Column(pa.Float),
    "str_column3": pa.Column(pa.String),
})

For more details on data types, see DataType.

Schema Model#

pandera also provides an alternative API for expressing schemas inspired by dataclasses and pydantic. The equivalent SchemaModel for the above DataFrameSchema would be:

from pandera.typing import Series

class Schema(pa.SchemaModel):

    column1: Series[int] = pa.Field(le=10)
    column2: Series[float] = pa.Field(lt=-1.2)
    column3: Series[str] = pa.Field(str_startswith="value_")

    # register the method as a check on column3; without the decorator it never runs
    @pa.check("column3")
    def column_3_check(cls, series: Series[str]) -> Series[bool]:
        """Check that column3 values have two elements after being split with '_'"""
        return series.str.split("_", expand=True).shape[1] == 2

Schema.validate(df)


Informative Errors#

If the dataframe does not pass validation checks, pandera provides useful error messages. An error argument can also be supplied to Check for custom error messages.

In the case that a validation Check is violated:

import pandas as pd

from pandera import Column, DataFrameSchema, Int, Check

simple_schema = DataFrameSchema({
    "column1": Column(
        Int, Check(lambda x: 0 <= x <= 10, element_wise=True,
                   error="range checker [0, 10]"))
})

# validation rule violated
fail_check_df = pd.DataFrame({
    "column1": [-20, 5, 10, 30],
})

simple_schema(fail_check_df)

Traceback (most recent call last):
SchemaError: <Schema Column: 'column1' type=<class 'int'>> failed element-wise validator 0:
<Check <lambda>: range checker [0, 10]>
failure cases:
   index  failure_case
0      0           -20
1      3            30

And in the case of a mis-specified column name:

# column name mis-specified
wrong_column_df = pd.DataFrame({
    "foo": ["bar"] * 10,
    "baz": [1] * 10
})

simple_schema.validate(wrong_column_df)

Traceback (most recent call last):
pandera.SchemaError: column 'column1' not in dataframe
   foo  baz
0  bar    1
1  bar    1
2  bar    1
3  bar    1
4  bar    1


All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.

A detailed overview on how to contribute can be found in the contributing guide on GitHub.


Submit issues, feature requests, or bug fixes on GitHub.

Need Help?#

There are many ways to get help with your questions. You can ask a question on the GitHub Discussions page or reach out to the maintainers and the pandera community on Discord.

How to Cite#

If you use pandera in the context of academic or industry research, please consider citing the paper and/or software package.


@InProceedings{ niels_bantilan-proc-scipy-2020,
  author    = { {N}iels {B}antilan },
  title     = { pandera: {S}tatistical {D}ata {V}alidation of {P}andas {D}ataframes },
  booktitle = { {P}roceedings of the 19th {P}ython in {S}cience {C}onference },
  pages     = { 116 - 124 },
  year      = { 2020 },
  editor    = { {M}eghann {A}garwal and {C}hris {C}alloway and {D}illon {N}iederhut and {D}avid {S}hupe },
  doi       = { 10.25080/Majora-342d178e-010 }
}

Software Package#

software package

License and Credits#

pandera is licensed under the MIT license and is written and maintained by Niels Bantilan (niels@pandera.ci).
