A Statistical Data Testing Toolkit#

A data validation library for scientists, engineers, and analysts seeking correctness.

pandera provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust.

Dataframes contain information that pandera explicitly validates at runtime. This is useful in production-critical data pipelines or reproducible research settings. With pandera, you can:

  1. Define a schema once and use it to validate different dataframe types including pandas, dask, modin, and pyspark.pandas.

  2. Check the types and properties of columns in a pd.DataFrame or values in a pd.Series.

  3. Perform more complex statistical validation like hypothesis testing.

  4. Seamlessly integrate with existing data analysis/processing pipelines via function decorators.

  5. Define schema models with the class-based API using pydantic-style syntax and validate dataframes with the typing syntax.

  6. Synthesize data from schema objects for property-based testing with pandas data structures.

  7. Lazily validate dataframes so that all validation rules are executed before raising an error (see the sketch after this list).

  8. Integrate with a rich ecosystem of python tools like pydantic, fastapi and mypy.
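
As a minimal sketch of lazy validation (the column names and values here are made up), passing lazy=True collects all failure cases into a single SchemaErrors exception instead of raising on the first error:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "column1": pa.Column(int, pa.Check.le(10)),
    "column2": pa.Column(float, pa.Check.lt(-1.2)),
})

df = pd.DataFrame({"column1": [11, 2], "column2": [1.0, -2.0]})

try:
    # lazy=True evaluates every check before raising
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    # a dataframe summarizing all failure cases
    print(err.failure_cases)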

Install#

Install with pip:

pip install pandera

Or conda:

conda install -c conda-forge pandera

Extras#

Installing additional functionality:

pip install pandera[hypotheses]  # hypothesis checks
pip install pandera[io]          # yaml/script schema io utilities
pip install pandera[strategies]  # data synthesis strategies
pip install pandera[mypy]        # enable static type-linting of pandas
pip install pandera[fastapi]     # fastapi integration
pip install pandera[dask]        # validate dask dataframes
pip install pandera[pyspark]     # validate pyspark dataframes
pip install pandera[modin]       # validate modin dataframes
pip install pandera[modin-ray]   # validate modin dataframes with ray
pip install pandera[modin-dask]  # validate modin dataframes with dask
pip install pandera[geopandas]   # validate geopandas geodataframes
conda install -c conda-forge pandera-hypotheses  # hypothesis checks
conda install -c conda-forge pandera-io          # yaml/script schema io utilities
conda install -c conda-forge pandera-strategies  # data synthesis strategies
conda install -c conda-forge pandera-mypy        # enable static type-linting of pandas
conda install -c conda-forge pandera-fastapi     # fastapi integration
conda install -c conda-forge pandera-dask        # validate dask dataframes
conda install -c conda-forge pandera-pyspark     # validate pyspark dataframes
conda install -c conda-forge pandera-modin       # validate modin dataframes
conda install -c conda-forge pandera-modin-ray   # validate modin dataframes with ray
conda install -c conda-forge pandera-modin-dask  # validate modin dataframes with dask
conda install -c conda-forge pandera-geopandas   # validate geopandas geodataframes

Quick Start#

import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a series as input and
        # outputs a boolean or boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema(df)
print(validated_df)
   column1  column2  column3
0        1     -1.3  value_1
1        4     -1.4  value_2
2        0     -2.9  value_3
3       10    -10.1  value_2
4        9    -20.4  value_1

You can pass the built-in python types that are supported by pandas, or strings representing the legal pandas datatypes, or pandera’s DataType:

schema = pa.DataFrameSchema({
    # built-in python types
    "int_column": pa.Column(int),
    "float_column": pa.Column(float),
    "str_column": pa.Column(str),

    # pandas dtype string aliases
    "int_column2": pa.Column("int64"),
    "float_column2": pa.Column("float64"),
    # pandas >= 1.0.0 supports the native "string" type
    "str_column2": pa.Column("str"),

    # pandera DataType
    "int_column3": pa.Column(pa.Int),
    "float_column3": pa.Column(pa.Float),
    "str_column3": pa.Column(pa.String),
})

For more details on data types, see DataType.

Schema Model#

pandera also provides an alternative API for expressing schemas inspired by dataclasses and pydantic. The equivalent SchemaModel for the above DataFrameSchema would be:

from pandera.typing import Series

class Schema(pa.SchemaModel):

    column1: Series[int] = pa.Field(le=10)
    column2: Series[float] = pa.Field(lt=-1.2)
    column3: Series[str] = pa.Field(str_startswith="value_")

    @pa.check("column3")
    def column_3_check(cls, series: Series[str]) -> Series[bool]:
        """Check that column3 values have two elements after being split with '_'"""
        return series.str.split("_", expand=True).shape[1] == 2

Schema.validate(df)

Informative Errors#

If the dataframe does not pass validation checks, pandera provides useful error messages. An error argument can also be supplied to Check for custom error messages.

In the case that a validation Check is violated:

import pandas as pd

from pandera import Column, DataFrameSchema, Int, Check

simple_schema = DataFrameSchema({
    "column1": Column(
        Int, Check(lambda x: 0 <= x <= 10, element_wise=True,
                   error="range checker [0, 10]"))
})

# validation rule violated
fail_check_df = pd.DataFrame({
    "column1": [-20, 5, 10, 30],
})

simple_schema(fail_check_df)
Traceback (most recent call last):
...
SchemaError: <Schema Column: 'column1' type=<class 'int'>> failed element-wise validator 0:
<Check <lambda>: range checker [0, 10]>
failure cases:
   index  failure_case
0      0           -20
1      3            30

And in the case of a mis-specified column name:

# column name mis-specified
wrong_column_df = pd.DataFrame({
   "foo": ["bar"] * 10,
   "baz": [1] * 10
})

simple_schema.validate(wrong_column_df)
Traceback (most recent call last):
...
pandera.SchemaError: column 'column1' not in dataframe
   foo  baz
0  bar    1
1  bar    1
2  bar    1
3  bar    1
4  bar    1

Contributing#

All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.

A detailed overview on how to contribute can be found in the contributing guide on GitHub.

Issues#

Submit issues, feature requests or bugfixes on github.

Need Help?#

There are many ways of getting help with your questions. You can ask a question on the GitHub Discussions page or reach out to the maintainers and the pandera community on Discord.

Try Pandera#

Tip

You can access the full screen jupyter notebook environment here.

DataFrame Schemas#

The DataFrameSchema class enables the specification of a schema that verifies the columns and index of a pandas DataFrame object.

The DataFrameSchema object consists of Columns and an Index.

import pandera as pa

from pandera import Column, DataFrameSchema, Check, Index

schema = DataFrameSchema(
    {
        "column1": Column(int),
        "column2": Column(float, Check(lambda s: s < -1.2)),
        # you can provide a list of validators
        "column3": Column(str, [
           Check(lambda s: s.str.startswith("value")),
           Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
        ]),
    },
    index=Index(int),
    strict=True,
    coerce=True,
)

You can refer to Schema Models to see how to define dataframe schemas using the alternative pydantic/dataclass-style syntax.

Column Validation#

A Column must specify the properties of a column in a dataframe object. It can be optionally verified for its data type, null values or duplicate values. The column can be coerced into the specified type, and the required parameter allows control over whether or not the column is allowed to be missing.

Similarly to pandas, the data type can be specified as any of the following (a short sketch illustrating these options follows the list):

  • a string alias, as long as it is recognized by pandas.

  • a python type: int, float, bool, str

  • a numpy data type

  • a pandas extension type: it can be an instance (e.g. pd.CategoricalDtype(["a", "b"])) or a class (e.g. pandas.CategoricalDtype) if it can be initialized with default values.

  • a pandera DataType: it can also be an instance or a class.
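
A minimal sketch of these options (the column names here are made up):

import numpy as np
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    # string alias recognized by pandas
    "a": pa.Column("int64"),
    # built-in python type; null values allowed
    "b": pa.Column(float, nullable=True),
    # numpy data type, coerced before checks run
    "c": pa.Column(np.int32, coerce=True),
    # pandas extension type instance
    "d": pa.Column(pd.CategoricalDtype(["x", "y"])),
    # pandera DataType; column may be absent from the dataframe
    "e": pa.Column(pa.String, required=False),
})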

Column checks allow for the DataFrame’s values to be checked against a user-provided function. Check objects also support grouping by a different column so that the user can make assertions about subsets of the column of interest.

Column Hypotheses enable you to perform statistical hypothesis tests on a DataFrame in either wide or tidy format. See Hypothesis Testing for more details.

Null Values in Columns#

By default, SeriesSchema/Column objects assume that values are not nullable. In order to accept null values, you need to explicitly specify nullable=True, or else you’ll get an error.

import numpy as np
import pandas as pd
import pandera as pa

from pandera import Check, Column, DataFrameSchema

df = pd.DataFrame({"column1": [5, 1, np.nan]})

non_null_schema = DataFrameSchema({
    "column1": Column(float, Check(lambda x: x > 0))
})

non_null_schema.validate(df)
Traceback (most recent call last):
...
SchemaError: non-nullable series contains null values: {2: nan}

null_schema = DataFrameSchema({
    "column1": Column(float, Check(lambda x: x > 0), nullable=True)
})

print(null_schema.validate(df))
   column1
0      5.0
1      1.0
2      NaN

Coercing Types on Columns#

If you specify Column(dtype, ..., coerce=True) as part of the DataFrameSchema definition, calling schema.validate will first coerce the column into the specified dtype before applying validation checks.

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema

df = pd.DataFrame({"column1": [1, 2, 3]})
schema = DataFrameSchema({"column1": Column(str, coerce=True)})

validated_df = schema.validate(df)
assert isinstance(validated_df.column1.iloc[0], str)

Note

Note the special case of integer columns, which do not support nan values. In this case, schema.validate will complain if coerce=True and null values are allowed in the column.

df = pd.DataFrame({"column1": [1., 2., 3, np.nan]})
schema = DataFrameSchema({
    "column1": Column(int, coerce=True, nullable=True)
})

validated_df = schema.validate(df)
Traceback (most recent call last):
...
pandera.errors.SchemaError: Error while coercing 'column1' to type int64: Cannot convert non-finite values (NA or inf) to integer

The best way to handle this case is to simply specify the column as a float or object.

schema_object = DataFrameSchema({
    "column1": Column(object, coerce=True, nullable=True)
})
schema_float = DataFrameSchema({
    "column1": Column(float, coerce=True, nullable=True)
})

print(schema_object.validate(df).dtypes)
print(schema_float.validate(df).dtypes)
column1    object
dtype: object
column1    float64
dtype: object

If you want to coerce all of the columns specified in the DataFrameSchema, you can specify the coerce argument with DataFrameSchema(..., coerce=True).
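
A minimal sketch of schema-level coercion (the column names and values are made up):

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "column1": pa.Column(int),
        "column2": pa.Column(str),
    },
    # coerce every column to its specified dtype before running checks
    coerce=True,
)

df = pd.DataFrame({"column1": ["1", "2"], "column2": [1, 2]})
validated_df = schema.validate(df)
print(validated_df.dtypes)  # column1: int64, column2: object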

Required Columns#

By default all columns specified in the schema are required, meaning that if a column is missing in the input DataFrame an exception will be thrown. If you want to make a column optional, specify required=False in the column constructor:

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema

df = pd.DataFrame({"column2": ["hello", "pandera"]})
schema = DataFrameSchema({
    "column1": Column(int, required=False),
    "column2": Column(str)
})

validated_df = schema.validate(df)
print(validated_df)
   column2
0    hello
1  pandera

Since required=True by default, missing columns would raise an error:

schema = DataFrameSchema({
    "column1": Column(int),
    "column2": Column(str),
})

schema.validate(df)
Traceback (most recent call last):
...
pandera.SchemaError: column 'column1' not in dataframe
   column2
0    hello
1  pandera

Stand-alone Column Validation#

In addition to being used in the context of a DataFrameSchema, Column objects can also be used to validate columns in a dataframe on their own:

import pandas as pd
import pandera as pa

df = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": ["a", "b", "c"],
})

column1_schema = pa.Column(int, name="column1")
column2_schema = pa.Column(str, name="column2")

# pass the dataframe as an argument to the Column object callable
df = column1_schema(df)
validated_df = column2_schema(df)

# or explicitly use the validate method
df = column1_schema.validate(df)
validated_df = column2_schema.validate(df)

# use the DataFrame.pipe method to validate two columns
validated_df = df.pipe(column1_schema).pipe(column2_schema)

For multi-column use cases, the DataFrameSchema is still recommended, but if you have one or a small number of columns to verify, using Column objects by themselves is appropriate.

Column Regex Pattern Matching#

In the case that your dataframe has multiple columns that share common statistical properties, you might want to specify a regex pattern that matches a set of meaningfully grouped columns that have str names.

import numpy as np
import pandas as pd
import pandera as pa

categories = ["A", "B", "C"]

np.random.seed(100)

dataframe = pd.DataFrame({
    "cat_var_1": np.random.choice(categories, size=100),
    "cat_var_2": np.random.choice(categories, size=100),
    "num_var_1": np.random.uniform(0, 10, size=100),
    "num_var_2": np.random.uniform(20, 30, size=100),
})

schema = pa.DataFrameSchema({
    "num_var_.+": pa.Column(
        float,
        checks=pa.Check.greater_than_or_equal_to(0),
        regex=True,
    ),
    "cat_var_.+": pa.Column(
        pa.Category,
        checks=pa.Check.isin(categories),
        coerce=True,
        regex=True,
    ),
})

print(schema.validate(dataframe).head())
  cat_var_1 cat_var_2  num_var_1  num_var_2
0         A         A   6.804147  24.743304
1         A         C   3.684308  22.774633
2         A         C   5.911288  28.416588
3         C         A   4.790627  21.951250
4         C         B   4.504166  28.563142

You can also regex pattern match on pd.MultiIndex columns:

np.random.seed(100)

dataframe = pd.DataFrame({
    ("cat_var_1", "y1"): np.random.choice(categories, size=100),
    ("cat_var_2", "y2"): np.random.choice(categories, size=100),
    ("num_var_1", "x1"): np.random.uniform(0, 10, size=100),
    ("num_var_2", "x2"): np.random.uniform(0, 10, size=100),
})

schema = pa.DataFrameSchema({
    ("num_var_.+", "x.+"): pa.Column(
        float,
        checks=pa.Check.greater_than_or_equal_to(0),
        regex=True,
    ),
    ("cat_var_.+", "y.+"): pa.Column(
        pa.Category,
        checks=pa.Check.isin(categories),
        coerce=True,
        regex=True,
    ),
})

print(schema.validate(dataframe).head())
  cat_var_1 cat_var_2 num_var_1 num_var_2
         y1        y2        x1        x2
0         A         A  6.804147  4.743304
1         A         C  3.684308  2.774633
2         A         C  5.911288  8.416588
3         C         A  4.790627  1.951250
4         C         B  4.504166  8.563142

Handling Dataframe Columns not in the Schema#

By default, columns that aren’t specified in the schema aren’t checked. If you want to check that the DataFrame only contains columns in the schema, specify strict=True:

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema

schema = DataFrameSchema(
    {"column1": Column(int)},
    strict=True)

df = pd.DataFrame({"column2": [1, 2, 3]})

schema.validate(df)
Traceback (most recent call last):
...
SchemaError: column 'column2' not in DataFrameSchema {'column1': <Schema Column: 'None' type=DataType(int64)>}

Alternatively, if your DataFrame contains columns that are not in the schema, and you would like these to be dropped on validation, you can specify strict='filter'.

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema

df = pd.DataFrame({"column1": ["drop", "me"],"column2": ["keep", "me"]})
schema = DataFrameSchema({"column2": Column(str)}, strict='filter')

validated_df = schema.validate(df)
print(validated_df)
   column2
0     keep
1       me

Validating the order of the columns#

For some applications the order of the columns is important. For example:

  • If you want to use selection by position instead of the more common selection by label.

  • Machine learning: Many ML libraries will cast a Dataframe to numpy arrays, for which order becomes crucial.

To validate the order of the Dataframe columns, specify ordered=True:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    columns={"a": pa.Column(int), "b": pa.Column(int)}, ordered=True
)
df = pd.DataFrame({"b": [1], "a": [1]})
print(schema.validate(df))
Traceback (most recent call last):
...
SchemaError: column 'b' out-of-order

Validating the joint uniqueness of columns#

In some cases you might want to ensure that a group of columns is jointly unique:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    columns={col: pa.Column(int) for col in ["a", "b", "c"]},
    unique=["a", "c"],
)
df = pd.DataFrame.from_records([
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 2, "c": 3},
])
schema.validate(df)
Traceback (most recent call last):
...
SchemaError: columns '('a', 'c')' not unique:
column  index  failure_case
0      a      0             1
1      a      1             1
2      c      0             3
3      c      1             3

To control how unique errors are reported, the report_duplicates argument accepts:

  • exclude_first: (default) report all duplicates except the first occurrence

  • exclude_last: report all duplicates except the last occurrence

  • all: report all duplicates

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    columns={col: pa.Column(int) for col in ["a", "b", "c"]},
    unique=["a", "c"],
    report_duplicates = "exclude_first",
)
df = pd.DataFrame.from_records([
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 2, "c": 3},
])
schema.validate(df)
Traceback (most recent call last):
...
SchemaError: columns '('a', 'c')' not unique:
column  index  failure_case
0      a      1             1
1      c      1             3

Index Validation#

You can also specify an Index in the DataFrameSchema.

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema, Index, Check

schema = DataFrameSchema(
   columns={"a": Column(int)},
   index=Index(
       str,
       Check(lambda x: x.str.startswith("index_"))))

df = pd.DataFrame(
    data={"a": [1, 2, 3]},
    index=["index_1", "index_2", "index_3"])

print(schema.validate(df))
         a
index_1  1
index_2  2
index_3  3

In the case that the DataFrame index doesn’t pass the Check:

df = pd.DataFrame(
    data={"a": [1, 2, 3]},
    index=["foo1", "foo2", "foo3"])

schema.validate(df)
Traceback (most recent call last):
...
SchemaError: <Schema Index> failed element-wise validator 0:
<lambda>
failure cases:
             index  count
failure_case
foo1           [0]      1
foo2           [1]      1
foo3           [2]      1

MultiIndex Validation#

pandera also supports multi-index column and index validation.

MultiIndex Columns#

Specifying multi-index columns follows the pandas syntax of specifying tuples for each level in the index hierarchy:

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema, Index

schema = DataFrameSchema({
    ("foo", "bar"): Column(int),
    ("foo", "baz"): Column(str)
})

df = pd.DataFrame({
    ("foo", "bar"): [1, 2, 3],
    ("foo", "baz"): ["a", "b", "c"],
})

print(schema.validate(df))
  foo
  bar baz
0   1   a
1   2   b
2   3   c

MultiIndex Indexes#

The MultiIndex class allows you to define multi-index indexes by composing a list of pandera.Index objects.

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema, Index, MultiIndex, Check

schema = DataFrameSchema(
    columns={"column1": Column(int)},
    index=MultiIndex([
        Index(str,
              Check(lambda s: s.isin(["foo", "bar"])),
              name="index0"),
        Index(int, name="index1"),
    ])
)

df = pd.DataFrame(
    data={"column1": [1, 2, 3]},
    index=pd.MultiIndex.from_arrays(
        [["foo", "bar", "foo"], [0, 1,2 ]],
        names=["index0", "index1"]
    )
)

print(schema.validate(df))
               column1
index0 index1
foo    0             1
bar    1             2
foo    2             3

Get Pandas Data Types#

Pandas provides a dtype parameter for casting a dataframe to a specific dtype schema. DataFrameSchema provides a dtypes property which returns a dictionary whose keys are column names and values are DataType.

Some examples of where this can be provided to pandas are:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    columns={
      "column1": pa.Column(int),
      "column2": pa.Column(pa.Category),
      "column3": pa.Column(bool)
    },
)

df = (
    pd.DataFrame.from_dict(
        {
            "a": {"column1": 1, "column2": "valueA", "column3": True},
            "b": {"column1": 1, "column2": "valueB", "column3": True},
        },
        orient="index",
    )
    .astype({col: str(dtype) for col, dtype in schema.dtypes.items()})
    .sort_index(axis=1)
)

print(schema.validate(df))
   column1 column2  column3
a        1  valueA     True
b        1  valueB     True

DataFrameSchema Transformations#

Once you’ve defined a schema, you can then make modifications to it, both at the schema level – such as adding or removing columns and setting or resetting the index – and at the column level – such as changing the data type or checks.

This is useful for re-using schema objects in a data pipeline when additional computation has been done on a dataframe, where the column objects may have changed or perhaps where additional checks may be required.

import pandas as pd
import pandera as pa

data = pd.DataFrame({"col1": range(1, 6)})

schema = pa.DataFrameSchema(
    columns={"col1": pa.Column(int, pa.Check(lambda s: s >= 0))},
    strict=True)

transformed_schema = schema.add_columns({
    "col2": pa.Column(str, pa.Check(lambda s: s == "value")),
    "col3": pa.Column(float, pa.Check(lambda x: x == 0.0)),
})

# validate original data
data = schema.validate(data)

# transformation
transformed_data = data.assign(col2="value", col3=0.0)

# validate transformed data
print(transformed_schema.validate(transformed_data))
   col1   col2  col3
0     1  value   0.0
1     2  value   0.0
2     3  value   0.0
3     4  value   0.0
4     5  value   0.0

Similarly, if you want dropped columns to be explicitly validated in a data pipeline:

import pandera as pa

schema = pa.DataFrameSchema(
    columns={
        "col1": pa.Column(int, pa.Check(lambda s: s >= 0)),
        "col2": pa.Column(str, pa.Check(lambda x: x <= 0)),
        "col3": pa.Column(object, pa.Check(lambda x: x == 0)),
    },
    strict=True,
)

new_schema = schema.remove_columns(["col2", "col3"])
print(new_schema)
<Schema DataFrameSchema(
    columns={
        'col1': <Schema Column(name=col1, type=DataType(int64))>
    },
    checks=[],
    coerce=False,
    dtype=None,
    index=None,
    strict=True
    name=None,
    ordered=False,
    unique_column_names=False
)>

If during the course of a data pipeline one of your columns is moved into the index, you can simply update the initial input schema using the set_index() method to create a schema for the pipeline output.

import pandera as pa

from pandera import Column, DataFrameSchema, Check, Index

schema = DataFrameSchema(
    {
        "column1": Column(int),
        "column2": Column(float)
    },
    index=Index(int, name = "column3"),
    strict=True,
    coerce=True,
)
print(schema.set_index(["column1"], append = True))
<Schema DataFrameSchema(
    columns={
        'column2': <Schema Column(name=column2, type=DataType(float64))>
    },
    checks=[],
    coerce=True,
    dtype=None,
    index=<Schema MultiIndex(
        indexes=[
            <Schema Index(name=column3, type=DataType(int64))>
            <Schema Index(name=column1, type=DataType(int64))>
        ]
        coerce=False,
        strict=False,
        name=None,
        ordered=True
    )>,
    strict=True
    name=None,
    ordered=False,
    unique_column_names=False
)>

The available methods for altering the schema are: add_columns(), remove_columns(), update_columns(), rename_columns(), set_index(), and reset_index().
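
As a brief sketch of two of these methods (the column names are made up; see the API reference for full details), update_columns() modifies the properties of existing columns and rename_columns() maps old column names to new ones:

import pandera as pa

schema = pa.DataFrameSchema({
    "col1": pa.Column(int),
    "col2": pa.Column(float),
})

# allow null values in col1
updated_schema = schema.update_columns({"col1": {"nullable": True}})

# rename col2 to col2_renamed
renamed_schema = schema.rename_columns({"col2": "col2_renamed"})

print(updated_schema.columns["col1"].nullable)  # True
print(list(renamed_schema.columns))             # ['col1', 'col2_renamed']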

Schema Models#

new in 0.5.0

pandera provides a class-based API that’s heavily inspired by pydantic. In contrast to the object-based API, you can define schema models in much the same way you’d define pydantic models.

Schema Models are annotated with the pandera.typing module using the standard typing syntax. Models can be explicitly converted to a DataFrameSchema or used to validate a DataFrame directly.

Note

Due to current limitations in the pandas library (see the discussion here), pandera annotations are only used for run-time validation and cannot be leveraged by static-type checkers like mypy.

Basic Usage#

import pandas as pd
import pandera as pa
from pandera.typing import Index, DataFrame, Series


class InputSchema(pa.SchemaModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)

class OutputSchema(InputSchema):
    revenue: Series[float]

@pa.check_types
def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    return df.assign(revenue=100.0)


df = pd.DataFrame({
    "year": ["2001", "2002", "2003"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
})

transform(df)

invalid_df = pd.DataFrame({
    "year": ["2001", "2002", "1999"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
})
transform(invalid_df)
Traceback (most recent call last):
...
pandera.errors.SchemaError: <Schema Column: 'year' type=DataType(int64)> failed element-wise validator 0:
<Check greater_than: greater_than(2000)>
failure cases:
   index  failure_case
0      2          1999

As you can see in the example above, you can define a schema by sub-classing SchemaModel and defining column/index fields as class attributes. The check_types() decorator is required to perform validation of the dataframe at run-time.

Note that Field applies to both Column and Index objects, exposing the built-in Checks via keyword arguments.

(New in 0.6.2) When you access a class attribute defined on the schema, it will return the name of the column used in the validated pd.DataFrame. In the example above, this will simply be the string “year”.

print(f"Column name for 'year' is {InputSchema.year}\n")
print(df.loc[:, [InputSchema.year, "day"]])
Column name for 'year' is year

   year  day
0  2001  200
1  2002  156
2  2003  365

Validate on Initialization#

new in 0.8.0

Pandera provides an interface for validating dataframes on initialization. This API uses the pandera.typing.pandas.DataFrame generic type to validate data against the SchemaModel type variable on initialization:

import pandas as pd
import pandera as pa

from pandera.typing import DataFrame, Series


class Schema(pa.SchemaModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})

df = DataFrame[Schema](
    {
        'state': ['NY','FL','GA','CA'],
        'city': ['New York', 'Miami', 'Atlanta', 'San Francisco'],
        'price': [8, 12, 10, 16],
    }
)
print(df)
  state           city  price
0    NY       New York      8
1    FL          Miami     12
2    GA        Atlanta     10
3    CA  San Francisco     16

Refer to Supported DataFrame Libraries to see how this syntax applies to other supported dataframe types.

Converting to DataFrameSchema#

You can easily convert a SchemaModel class into a DataFrameSchema:

print(InputSchema.to_schema())
<Schema DataFrameSchema(
    columns={
        'year': <Schema Column(name=year, type=DataType(int64))>
        'month': <Schema Column(name=month, type=DataType(int64))>
        'day': <Schema Column(name=day, type=DataType(int64))>
    },
    checks=[],
    coerce=False,
    dtype=None,
    index=None,
    strict=False
    name=InputSchema,
    ordered=False,
    unique_column_names=False
)>

You can also use the validate() method to validate dataframes:

print(InputSchema.validate(df))
   year  month  day
0  2001      3  200
1  2002      6  156
2  2003     12  365

Or you can use the SchemaModel() class directly to validate dataframes, which is syntactic sugar that simply delegates to the validate() method.

print(InputSchema(df))
   year  month  day
0  2001      3  200
1  2002      6  156
2  2003     12  365

Excluded attributes#

Class variables which begin with an underscore will be automatically excluded from the model. Config is also a reserved name. However, aliases can be used to circumvent these limitations.
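
A minimal sketch of the exclusion rule described above (the attribute names are made up):

import pandera as pa
from pandera.typing import Series

class Schema(pa.SchemaModel):
    a: Series[int]
    # underscore-prefixed class variables are not treated as schema fields
    _metadata = {"owner": "data-team"}

print(list(Schema.to_schema().columns))  # ['a']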

Supported dtypes#

Any dtypes supported by pandera can be used as type parameters for Series and Index. There are, however, a couple of gotchas.

Dtype aliases#
import pandera as pa
from pandera.typing import Series, String

class Schema(pa.SchemaModel):
    a: Series[String]

Type vs. instance#

You must give a type, not an instance.

Good:

import pandas as pd
import pandera as pa
from pandera.typing import Series

class Schema(pa.SchemaModel):
    a: Series[pd.StringDtype]

Bad:

class Schema(pa.SchemaModel):
    a: Series[pd.StringDtype()]
Traceback (most recent call last):
...
TypeError: Parameters to generic types must be types. Got string[python].

Parametrized dtypes#

Pandas supports a couple of parametrized dtypes. As of pandas 1.2.0:

Kind of Data         Data Type          Parameters
tz-aware datetime    DatetimeTZDtype    unit, tz
Categorical          CategoricalDtype   categories, ordered
period               PeriodDtype        freq
sparse               SparseDtype        dtype, fill_value
intervals            IntervalDtype      subtype

Annotated#

Parameters can be given via typing.Annotated. It requires python >= 3.9 or typing_extensions, which is already a requirement of Pandera. Unfortunately typing.Annotated has not been backported to python 3.6.

Good:

try:
    from typing import Annotated  # python 3.9+
except ImportError:
    from typing_extensions import Annotated

class Schema(pa.SchemaModel):
    col: Series[Annotated[pd.DatetimeTZDtype, "ns", "est"]]

Furthermore, you must pass all parameters in the order defined in the dtype’s constructor (see table).

Bad:

class Schema(pa.SchemaModel):
    col: Series[Annotated[pd.DatetimeTZDtype, "utc"]]

Schema.to_schema()
Traceback (most recent call last):
...
TypeError: Annotation 'DatetimeTZDtype' requires all positional arguments ['unit', 'tz'].

Field#

Good:

class SchemaFieldDatetimeTZDtype(pa.SchemaModel):
    col: Series[pd.DatetimeTZDtype] = pa.Field(dtype_kwargs={"unit": "ns", "tz": "EST"})

You cannot use both typing.Annotated and dtype_kwargs.

Bad:

class SchemaFieldDatetimeTZDtype(pa.SchemaModel):
    col: Series[Annotated[pd.DatetimeTZDtype, "ns", "est"]] = pa.Field(dtype_kwargs={"unit": "ns", "tz": "EST"})

SchemaFieldDatetimeTZDtype.to_schema()
Traceback (most recent call last):
...
TypeError: Cannot specify redundant 'dtype_kwargs' for pandera.typing.Series[typing_extensions.Annotated[pandas.core.dtypes.dtypes.DatetimeTZDtype, 'ns', 'est']].
Usage Tip: Drop 'typing.Annotated'.

Required Columns#

By default all columns specified in the schema are required, meaning that if a column is missing in the input DataFrame an exception will be thrown. If you want to make a column optional, annotate it with typing.Optional.

from typing import Optional

import pandas as pd
import pandera as pa
from pandera.typing import Series


class Schema(pa.SchemaModel):
    a: Series[str]
    b: Optional[Series[int]]


df = pd.DataFrame({"a": ["2001", "2002", "2003"]})
Schema.validate(df)

Schema Inheritance#

You can also use inheritance to build schemas on top of a base schema.

class BaseSchema(pa.SchemaModel):
    year: Series[str]

class FinalSchema(BaseSchema):
    year: Series[int] = pa.Field(ge=2000, coerce=True)  # overwrite the base type
    passengers: Series[int]
    idx: Index[int] = pa.Field(ge=0)

df = pd.DataFrame({
    "year": ["2000", "2001", "2002"],
})

@pa.check_types
def transform(df: DataFrame[BaseSchema]) -> DataFrame[FinalSchema]:
    return (
        df.assign(passengers=[61000, 50000, 45000])
        .set_index(pd.Index([1, 2, 3]))
        .astype({"year": int})
    )

print(transform(df))
   year  passengers
1  2000       61000
2  2001       50000
3  2002       45000

Config#

Schema-wide options can be controlled via the Config class on the SchemaModel subclass. The full set of options can be found in the BaseConfig class.

class Schema(pa.SchemaModel):

    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)

    class Config:
        name = "BaseSchema"
        strict = True
        coerce = True
        foo = "bar"  # Interpreted as dataframe check

It is not required for the Config to subclass BaseConfig but it must be named ‘Config’.

See Registered Custom Checks with the Class-based API for details on using registered dataframe checks.

MultiIndex#

The MultiIndex capabilities are also supported with the class-based API:

import pandera as pa
from pandera.typing import Index, Series

class MultiIndexSchema(pa.SchemaModel):

    year: Index[int] = pa.Field(gt=2000, coerce=True)
    month: Index[int] = pa.Field(ge=1, le=12, coerce=True)
    passengers: Series[int]

    class Config:
        # provide multi index options in the config
        multiindex_name = "time"
        multiindex_strict = True
        multiindex_coerce = True

index = MultiIndexSchema.to_schema().index
print(index)
<Schema MultiIndex(
    indexes=[
        <Schema Index(name=year, type=DataType(int64))>
        <Schema Index(name=month, type=DataType(int64))>
    ]
    coerce=True,
    strict=True,
    name=time,
    ordered=True
)>

from pprint import pprint

pprint({name: col.checks for name, col in index.columns.items()})
{'month': [<Check greater_than_or_equal_to: greater_than_or_equal_to(1)>,
        <Check less_than_or_equal_to: less_than_or_equal_to(12)>],
'year': [<Check greater_than: greater_than(2000)>]}

Multiple Index annotations are automatically converted into a MultiIndex. MultiIndex options are given in the Config.

Index Name#

Use check_name to validate the index name of a single-index dataframe:

import pandas as pd
import pandera as pa
from pandera.typing import Index, Series

class Schema(pa.SchemaModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    passengers: Series[int]
    idx: Index[int] = pa.Field(ge=0, check_name=True)

df = pd.DataFrame({
    "year": [2001, 2002, 2003],
    "passengers": [61000, 50000, 45000],
})

Schema.validate(df)
Traceback (most recent call last):
...
pandera.errors.SchemaError: Expected <class 'pandera.schema_components.Index'> to have name 'idx', found 'None'

The default value of check_name is None, which translates to True for columns and multi-index components.

Custom Checks#

Unlike the object-based API, custom checks can be specified as class methods.

Column/Index checks#
import pandera as pa
from pandera.typing import Index, Series

class CustomCheckSchema(pa.SchemaModel):

    a: Series[int] = pa.Field(gt=0, coerce=True)
    abc: Series[int]
    idx: Index[str]

    @pa.check("a", name="foobar")
    def custom_check(cls, a: Series[int]) -> Series[bool]:
        return a < 100

    @pa.check("^a", regex=True, name="foobar")
    def custom_check_regex(cls, a: Series[int]) -> Series[bool]:
        return a > 0

    @pa.check("idx")
    def check_idx(cls, idx: Index[int]) -> Series[bool]:
        return idx.str.contains("dog")

Note

  • You can supply the keyword arguments of the Check class initializer to get the flexibility of groupby checks

  • Similarly to pydantic, the classmethod() decorator is added behind the scenes if omitted.

  • You still may need to add the @classmethod decorator after the check() decorator if your static-type checker or linter complains.

  • Since checks are class methods, the first argument value they receive is a SchemaModel subclass, not an instance of a model.

from typing import Dict

class GroupbyCheckSchema(pa.SchemaModel):

    value: Series[int] = pa.Field(gt=0, coerce=True)
    group: Series[str] = pa.Field(isin=["A", "B"])

    @pa.check("value", groupby="group", regex=True, name="check_means")
    def check_groupby(cls, grouped_value: Dict[str, Series[int]]) -> bool:
        return grouped_value["A"].mean() < grouped_value["B"].mean()

df = pd.DataFrame({
    "value": [100, 110, 120, 10, 11, 12],
    "group": list("AAABBB"),
})

print(GroupbyCheckSchema.validate(df))
Traceback (most recent call last):
...
pandera.errors.SchemaError: <Schema Column: 'value' type=DataType(int64)> failed series validator 1:
<Check check_means>

DataFrame Checks#

You can also define dataframe-level checks, similar to the object-based API, using the dataframe_check() decorator:

import pandas as pd
import pandera as pa
from pandera.typing import Index, Series

class DataFrameCheckSchema(pa.SchemaModel):

    col1: Series[int] = pa.Field(gt=0, coerce=True)
    col2: Series[float] = pa.Field(gt=0, coerce=True)
    col3: Series[float] = pa.Field(lt=0, coerce=True)

    @pa.dataframe_check
    def product_is_negative(cls, df: pd.DataFrame) -> Series[bool]:
        return df["col1"] * df["col2"] * df["col3"] < 0

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [5, 6, 7],
    "col3": [-1, -2, -3],
})

DataFrameCheckSchema.validate(df)

Inheritance#

The custom checks are inherited and therefore can be overwritten by the subclass.

import pandas as pd
import pandera as pa
from pandera.typing import Index, Series

class Parent(pa.SchemaModel):

    a: Series[int] = pa.Field(coerce=True)

    @pa.check("a", name="foobar")
    def check_a(cls, a: Series[int]) -> Series[bool]:
        return a < 100


class Child(Parent):

    a: Series[int] = pa.Field(coerce=False)

    @pa.check("a", name="foobar")
    def check_a(cls, a: Series[int]) -> Series[bool]:
        return a > 100

is_a_coerce = Child.to_schema().columns["a"].coerce
print(f"coerce: {is_a_coerce}")
coerce: False

df = pd.DataFrame({"a": [1, 2, 3]})
print(Child.validate(df))
Traceback (most recent call last):
...
pandera.errors.SchemaError: <Schema Column: 'a' type=DataType(int64)> failed element-wise validator 0:
<Check foobar>
failure cases:
    index  failure_case
0      0             1
1      1             2
2      2             3

Aliases#

SchemaModel supports columns which are not valid python variable names via the argument alias of Field.

Checks must reference the aliased names.

import pandera as pa
import pandas as pd

class Schema(pa.SchemaModel):
    col_2020: pa.typing.Series[int] = pa.Field(alias=2020)
    idx: pa.typing.Index[int] = pa.Field(alias="_idx", check_name=True)

    @pa.check(2020)
    def int_column_lt_100(cls, series):
        return series < 100


df = pd.DataFrame({2020: [99]}, index=[0])
df.index.name = "_idx"

print(Schema.validate(df))
      2020
_idx
0       99

(New in 0.6.2) The alias is respected when using the class attribute to get the underlying pd.DataFrame column name or index level name.

print(Schema.col_2020)
2020

Very similar to the example above, you can also use the variable name directly within the class scope, and it will respect the alias.

Note

To access a variable from the class scope, you need to make it a class attribute, and therefore assign it a default Field.

import pandera as pa
import pandas as pd

class Schema(pa.SchemaModel):
    a: pa.typing.Series[int] = pa.Field()
    col_2020: pa.typing.Series[int] = pa.Field(alias=2020)

    @pa.check(col_2020)
    def int_column_lt_100(cls, series):
        return series < 100

    @pa.check(a)
    def int_column_gt_100(cls, series):
        return series > 100


df = pd.DataFrame({2020: [99], "a": [101]})
print(Schema.validate(df))
      2020    a
0       99  101

Series Schemas#

The SeriesSchema class allows for the validation of pandas Series objects, and is very similar to the columns and indexes described in DataFrame Schemas.

import pandas as pd
import pandera as pa


# specify multiple validators
schema = pa.SeriesSchema(
    str,
    checks=[
        pa.Check(lambda s: s.str.startswith("foo")),
        pa.Check(lambda s: s.str.endswith("bar")),
        pa.Check(lambda x: len(x) > 3, element_wise=True)
    ],
    nullable=False,
    unique=False,
    name="my_series")

validated_series = schema.validate(
    pd.Series(["foobar", "foobar", "foobar"], name="my_series"))
print(validated_series)
0    foobar
1    foobar
2    foobar
Name: my_series, dtype: object

Checks#

Checking column properties#

Check objects accept a function as a required argument, which is expected to take a pd.Series input and output a boolean or a Series of boolean values. For the check to pass, all of the elements in the boolean series must evaluate to True, for example:

import pandas as pd
import pandera as pa

check_lt_10 = pa.Check(lambda s: s <= 10)

schema = pa.DataFrameSchema({"column1": pa.Column(int, check_lt_10)})
schema.validate(pd.DataFrame({"column1": range(10)}))

Multiple checks can be applied to a column:

schema = pa.DataFrameSchema({
    "column2": pa.Column(str, [
        pa.Check(lambda s: s.str.startswith("value")),
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

Built-in Checks#

For common validation tasks, built-in checks are available in pandera.

import pandera as pa
from pandera import Column, Check, DataFrameSchema

schema = DataFrameSchema({
    "small_values": Column(float, Check.less_than(100)),
    "one_to_three": Column(int, Check.isin([1, 2, 3])),
    "phone_number": Column(str, Check.str_matches(r'^[a-z0-9-]+$')),
})

See the Check API reference for a complete list of built-in checks.

Vectorized vs. Element-wise Checks#

By default, Check objects operate on pd.Series objects. If you want to make atomic checks for each element in the Column, then you can provide the element_wise=True keyword argument:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "a": pa.Column(
        int,
        checks=[
            # a vectorized check that returns a bool
            pa.Check(lambda s: s.mean() > 5, element_wise=False),

            # a vectorized check that returns a boolean series
            pa.Check(lambda s: s > 0, element_wise=False),

            # an element-wise check that returns a bool
            pa.Check(lambda x: x > 0, element_wise=True),
        ]
    ),
})
df = pd.DataFrame({"a": [4, 4, 5, 6, 6, 7, 8, 9]})
schema.validate(df)

element_wise=False by default, so you can take advantage of the speed gains provided by the pd.Series API by writing vectorized checks.

Handling Null Values#

By default, pandera drops null values before passing the object being validated into the check function. For Series objects null elements are dropped (this also applies to columns), and for DataFrame objects, rows with any null value are dropped.

If you want to check the properties of a pandas data structure while preserving null values, specify Check(..., ignore_na=False) when defining a check.

Note that this is different from the nullable argument in Column objects, which simply checks for null values in a column.
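
A minimal sketch of this behavior (the column name and values are made up):

import numpy as np
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "column1": pa.Column(
        float,
        # with ignore_na=False, null values are passed to the check function,
        # so the NaN element below fails the check instead of being dropped
        checks=pa.Check(lambda s: s > 0, ignore_na=False),
        nullable=True,
    ),
})

df = pd.DataFrame({"column1": [1.0, 2.0, np.nan]})
# schema.validate(df)  # raises SchemaError because NaN > 0 evaluates to False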

Column Check Groups#

Column checks support grouping by a different column so that you can make assertions about subsets of the column of interest. This changes the function signature of the Check function so that its input is a dict where keys are the group names and values are subsets of the series being validated.

Specifying groupby as a column name, list of column names, or callable changes the expected signature of the Check function argument to:

Callable[[Dict[Any, pd.Series]], Union[bool, pd.Series]]

where the dict keys are the discrete keys in the groupby columns.

In the example below we define a DataFrameSchema with column checks for height_in_feet using a single column, multiple columns, and a more complex groupby function that creates a new column age_less_than_15 on the fly.

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "height_in_feet": pa.Column(
        float, [
            # groupby as a single column
            pa.Check(
                lambda g: g[False].mean() > 6,
                groupby="age_less_than_20"),

            # define multiple groupby columns
            pa.Check(
                lambda g: g[(True, "F")].sum() == 9.1,
                groupby=["age_less_than_20", "sex"]),

            # groupby as a callable with signature:
            # (DataFrame) -> DataFrameGroupBy
            pa.Check(
                lambda g: g[(False, "M")].median() == 6.75,
                groupby=lambda df: (
                    df.assign(age_less_than_15=lambda d: d["age"] < 15)
                    .groupby(["age_less_than_15", "sex"]))),
        ]),
    "age": pa.Column(int, pa.Check(lambda s: s > 0)),
    "age_less_than_20": pa.Column(bool),
    "sex": pa.Column(str, pa.Check(lambda s: s.isin(["M", "F"])))
})

df = (
    pd.DataFrame({
        "height_in_feet": [6.5, 7, 6.1, 5.1, 4],
        "age": [25, 30, 21, 18, 13],
        "sex": ["M", "M", "F", "F", "F"]
    })
    .assign(age_less_than_20=lambda x: x["age"] < 20)
)

schema.validate(df)

Wide Checks#

pandera is primarily designed to operate on long-form data (commonly known as tidy data), where each row is an observation and each column is an attribute associated with an observation.

However, pandera also supports checks on wide-form data to operate across columns in a DataFrame. For example, if you want to make assertions about height across two groups, the tidy dataset and schema might look like this:

import pandas as pd
import pandera as pa


df = pd.DataFrame({
    "height": [5.6, 6.4, 4.0, 7.1],
    "group": ["A", "B", "A", "B"],
})

schema = pa.DataFrameSchema({
    "height": pa.Column(
        float,
        pa.Check(lambda g: g["A"].mean() < g["B"].mean(), groupby="group")
    ),
    "group": pa.Column(str)
})

schema.validate(df)

Whereas the equivalent wide-form schema would look like this:

df = pd.DataFrame({
    "height_A": [5.6, 4.0],
    "height_B": [6.4, 7.1],
})

schema = pa.DataFrameSchema(
    columns={
        "height_A": pa.Column(float),
        "height_B": pa.Column(float),
    },
    # define checks at the DataFrameSchema-level
    checks=pa.Check(
        lambda df: df["height_A"].mean() < df["height_B"].mean()
    )
)

schema.validate(df)

You can see that when checks are supplied to the DataFrameSchema checks keyword argument, the check function should expect a pandas DataFrame and should return a bool, a Series of booleans, or a DataFrame of boolean values.

Raise UserWarning on Check Failure#

In some cases, you might want to raise a UserWarning and continue execution of your program. The Check and Hypothesis classes and their built-in methods support the keyword argument raise_warning, which is False by default. If set to True, the check will raise a UserWarning instead of raising a SchemaError exception.

Note

Use this feature carefully! If the check is for informational purposes and not critical for data integrity then use raise_warning=True. However, if the assumptions expressed in a Check are necessary conditions to considering your data valid, do not set this option to true.

One scenario where you’d want to do this would be in a data pipeline that does some preprocessing, checks for normality in certain columns, and writes the resulting dataset to a table. In this case, you want to see if your normality assumptions are not fulfilled by certain columns, but you still want the resulting table for further analysis.

import warnings

import numpy as np
import pandas as pd
import pandera as pa

from scipy.stats import normaltest


np.random.seed(1000)

df = pd.DataFrame({
    "var1": np.random.normal(loc=0, scale=1, size=1000),
    "var2": np.random.uniform(low=0, high=10, size=1000),
})

normal_check = pa.Hypothesis(
    test=normaltest,
    samples="normal_variable",
    # null hypotheses: sample comes from a normal distribution. The
    # relationship function checks if we cannot reject the null hypothesis,
    # i.e. the p-value is greater or equal to alpha.
    relationship=lambda stat, pvalue, alpha=0.05: pvalue >= alpha,
    error="normality test",
    raise_warning=True,
)

schema = pa.DataFrameSchema(
    columns={
        "var1": pa.Column(checks=normal_check),
        "var2": pa.Column(checks=normal_check),
    }
)

# catch and print warnings
with warnings.catch_warnings(record=True) as caught_warnings:
    warnings.simplefilter("always")
    validated_df = schema(df)
    for warning in caught_warnings:
        print(warning.message)
<Schema Column(name=var2, type=None)> failed series or dataframe validator 0:
<Check _hypothesis_check: normality test>
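
The same keyword also works for a plain Check, whether custom or built-in (a minimal sketch; the column name is made up):

import warnings

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "col": pa.Column(int, pa.Check.le(10, raise_warning=True)),
})

with warnings.catch_warnings(record=True) as caught_warnings:
    warnings.simplefilter("always")
    # the out-of-range value triggers a UserWarning instead of a SchemaError
    validated_df = schema.validate(pd.DataFrame({"col": [1, 100]}))
    for warning in caught_warnings:
        print(warning.message)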

Registering Custom Checks#

pandera now offers an interface to register custom check functions so that they’re available in the Check namespace. See the extensions document for more information.
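
A brief sketch of what this looks like with the register_check_method decorator in pandera.extensions (see the extensions document for the authoritative API; the check name and statistics here are made up):

import pandera as pa
import pandera.extensions as extensions

@extensions.register_check_method(statistics=["min_value", "max_value"])
def is_between(pandas_obj, *, min_value, max_value):
    return (min_value <= pandas_obj) & (pandas_obj <= max_value)

# the registered check is now available in the Check namespace
schema = pa.DataFrameSchema({
    "col": pa.Column(int, pa.Check.is_between(min_value=1, max_value=10))
})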

Hypothesis Testing#

pandera enables you to perform statistical hypothesis tests on your data.

Note

The hypothesis feature requires a pandera installation with the hypotheses extra. See the installation instructions for more details.

Overview#

The Hypothesis class defines built-in methods, which can be called as in this example of a two-sample t-test:

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema, Check, Hypothesis

from scipy import stats

df = (
    pd.DataFrame({
        "height_in_feet": [6.5, 7, 6.1, 5.1, 4],
        "sex": ["M", "M", "F", "F", "F"]
    })
)

schema = DataFrameSchema({
    "height_in_feet": Column(
        float, [
            Hypothesis.two_sample_ttest(
                sample1="M",
                sample2="F",
                groupby="sex",
                relationship="greater_than",
                alpha=0.05,
                equal_var=True),
    ]),
    "sex": Column(str)
})

schema.validate(df)
Traceback (most recent call last):
...
pandera.SchemaError: <Schema Column: 'height_in_feet' type=float64> failed series validator 0: hypothesis_check: failed two sample ttest between 'M' and 'F'

You can also define custom hypotheses by passing in functions to the test and relationship arguments.

The test function takes as input one or multiple array-like objects and should return a stat, which is the test statistic, and pvalue for assessing statistical significance. It also takes keyword arguments supplied by the test_kwargs dict when initializing a Hypothesis object.

The relationship function should take all of the outputs of test as positional arguments, in addition to keyword arguments supplied by the relationship_kwargs dict.

Here’s an implementation of the two-sample t-test that uses the scipy implementation:

def two_sample_ttest(array1, array2):
    # the "height_in_feet" series is first grouped by "sex" and then
    # passed into the custom `test` function as two separate arrays in the
    # order specified in the `samples` argument.
    return stats.ttest_ind(array1, array2)


def null_relationship(stat, pvalue, alpha=0.01):
    return pvalue / 2 >= alpha


schema = DataFrameSchema({
    "height_in_feet": Column(
        float, [
            Hypothesis(
                test=two_sample_ttest,
                samples=["M", "F"],
                groupby="sex",
                relationship=null_relationship,
                relationship_kwargs={"alpha": 0.05}
            )
    ]),
    "sex": Column(str, checks=Check.isin(["M", "F"]))
})

schema.validate(df)

Wide Hypotheses#

pandera is primarily designed to operate on long-form data (commonly known as tidy data), where each row is an observation and columns are attributes associated with the observation.

However, pandera also supports hypothesis testing on wide-form data to operate across columns in a DataFrame.

For example, if you want to make assertions about height across two groups, the tidy dataset and schema might look like this:

import pandas as pd
import pandera as pa

from pandera import Check, DataFrameSchema, Column, Hypothesis

df = pd.DataFrame({
    "height": [5.6, 7.5, 4.0, 7.9],
    "group": ["A", "B", "A", "B"],
})

schema = DataFrameSchema({
    "height": Column(
        float, Hypothesis.two_sample_ttest(
            "A", "B",
            groupby="group",
            relationship="less_than",
            alpha=0.05
        )
    ),
    "group": Column(str, Check(lambda s: s.isin(["A", "B"])))
})

schema.validate(df)

The equivalent wide-form schema would look like this:

import pandas as pd
import pandera as pa

from pandera import DataFrameSchema, Column, Hypothesis

df = pd.DataFrame({
    "height_A": [5.6, 4.0],
    "height_B": [7.5, 7.9],
})

schema = DataFrameSchema(
    columns={
        "height_A": Column(Float),
        "height_B": Column(Float),
    },
    # define checks at the DataFrameSchema-level
    checks=Hypothesis.two_sample_ttest(
        "height_A", "height_B",
        relationship="less_than",
        alpha=0.05
    )
)

schema.validate(df)

Pandera Data Types#

new in 0.7.0

Motivations#

Pandera defines its own interface for data types in order to abstract the specifics of dataframe-like data structures in the python ecosystem, such as Apache Spark, Apache Arrow and xarray.

Note

In the following section Pandera Data Type refers to a pandera.dtypes.DataType object whereas native data type refers to data types used by third-party libraries that Pandera supports (e.g. pandas).

Most of the time, it is transparent to end users since pandera columns and indexes accept native data types. However, it is possible to extend the pandera interface by:

  • modifying the data type check performed during schema validation.

  • modifying the behavior of the coerce argument for DataFrameSchema.

  • adding your own custom data types.

DataType basics#

All pandera data types inherit from pandera.dtypes.DataType and must be hashable.

A data type implements a small set of key methods, most importantly check() and coerce().

For pandera’s validation methods to be aware of a data type, it has to be registered with the targeted engine via pandera.engines.engine.Engine.register_dtype(). An engine is in charge of mapping a pandera DataType with a native data type counterpart belonging to a third-party library. The mapping can be queried with pandera.engines.engine.Engine.dtype().

As of pandera 0.7.0, only the pandas Engine is supported.

Example#

Let’s extend pandas.BooleanDtype coercion to handle the string literals "True" and "False".

import pandas as pd
import pandera as pa
from pandera import dtypes
from pandera.engines import pandas_engine


@pandas_engine.Engine.register_dtype  # step 1
@dtypes.immutable  # step 2
class LiteralBool(pandas_engine.BOOL):  # step 3
    def coerce(self, series: pd.Series) -> pd.Series:
        """Coerce a pandas.Series to boolean types."""
        if pd.api.types.is_string_dtype(series):
            series = series.replace({"True": 1, "False": 0})
        return series.astype("boolean")


data = pd.Series(["True", "False"], name="literal_bools")

# step 4
print(
    pa.SeriesSchema(LiteralBool(), coerce=True, name="literal_bools")
    .validate(data)
    .dtype
)
boolean

The example above performs the following steps:

  1. Register the data type with the pandas engine.

  2. pandera.dtypes.immutable() creates an immutable (and hashable) dataclass().

  3. Inherit pandera.engines.pandas_engine.BOOL, which is the pandera representation of pandas.BooleanDtype. This is not mandatory but it makes our life easier by having already implemented all the required methods.

  4. Check that our new data type can coerce the string literals.

So far we did not override the default behavior:

import pandera as pa

pa.SeriesSchema("boolean", coerce=True).validate(data)
Traceback (most recent call last):
...
pandera.errors.SchemaError: Error while coercing 'literal_bools' to type boolean: Need to pass bool-like values

To completely replace the default BOOL, we need to supply all the equivalent representations to register_dtype(). Behind the scenes, when pa.SeriesSchema("boolean") is called the corresponding pandera data type is looked up using pandera.engines.engine.Engine.dtype().

print(f"before: {pandas_engine.Engine.dtype('boolean').__class__}")


@pandas_engine.Engine.register_dtype(
    equivalents=["boolean", pd.BooleanDtype, pd.BooleanDtype()],
)
@dtypes.immutable
class LiteralBool(pandas_engine.BOOL):
    def coerce(self, series: pd.Series) -> pd.Series:
        """Coerce a pandas.Series to boolean types."""
        if pd.api.types.is_string_dtype(series):
            series = series.replace({"True": 1, "False": 0})
        return series.astype("boolean")


print(f"after: {pandas_engine.Engine.dtype('boolean').__class__}")

for dtype in ["boolean", pd.BooleanDtype, pd.BooleanDtype()]:
    pa.SeriesSchema(dtype, coerce=True).validate(data)
before: <class 'pandera.engines.pandas_engine.BOOL'>
after: <class 'LiteralBool'>

Note

For convenience, we specified both pd.BooleanDtype and pd.BooleanDtype() as equivalents. That gives us more flexibility in what pandera schemas can recognize (see the last for-loop above).

Parametrized data types#

Some data types can be parametrized. One common example is pandas.CategoricalDtype.

The equivalents argument of register_dtype() does not handle this situation, but it will automatically register a classmethod() with signature from_parametrized_dtype(cls, equivalent: ...) if the decorated DataType defines it. The equivalent argument must be type-annotated because it is leveraged to dispatch the input of dtype() to the appropriate from_parametrized_dtype class method.

For example, here is a snippet from pandera.engines.pandas_engine.Category:

import pandas as pd
from pandera import dtypes

@classmethod
def from_parametrized_dtype(
    cls, cat: Union[dtypes.Category, pd.CategoricalDtype]
):
    """Convert a categorical to
    a Pandera :class:`pandera.dtypes.pandas_engine.Category`."""
    return cls(categories=cat.categories, ordered=cat.ordered)  # type: ignore

Note

The dispatch mechanism relies on functools.singledispatch(). Unlike the built-in implementation, typing.Union is recognized.

Defining the coerce_value method#

For a pandera data type to correctly report coercion errors, it needs to know how to coerce an individual value into the specified type.

All pandas data types are supported: numpy-based datatypes use the underlying numpy dtype to coerce an individual value, and pandas-native datatypes like CategoricalDtype and BooleanDtype are also supported.

As an example of a special-cased coerce_value implementation, see the source code for pandera.engines.pandas_engine.Category.coerce_value():

def coerce_value(self, value: Any) -> Any:
    """Coerce an value to a particular type."""
    if value not in self.categories:  # type: ignore
        raise TypeError(
            f"value {value} cannot be coerced to type {self.type}"
        )
    return value

Logical data types#

Taking inspiration from the visions project, pandera provides an interface for defining logical data types.

Physical types represent the actual, underlying representation of the data (e.g. Int8, Float32, String), whereas logical types represent the abstracted understanding of that data (e.g. IP addresses, URLs, paths).

Validating a logical data type consists of validating the supporting physical data type (see Motivations) and a check on actual values. For example, an IP address data type would validate that:

  1. The data container type is a String.

  2. The actual values are well-formed addresses.

Non-native pandas dtypes can also be wrapped in a numpy.object_ and verified using the data itself, since the object dtype alone is not enough to verify correctness. An example is the standard decimal.Decimal class, which can be validated via the pandera DataType Decimal.

To implement a logical data type, you just need to implement the method pandera.dtypes.DataType.check() and make use of the data_container argument to perform checks on the values of the data.

For example, you can create an IPAddress datatype that inherits from the numpy string physical type, thereby storing the values as strings, and checks whether the values actually match an IP address regular expression.

import re
from typing import Optional, Iterable, Union

@pandas_engine.Engine.register_dtype
@dtypes.immutable
class IPAddress(pandas_engine.NpString):

    def check(
        self,
        pandera_dtype: dtypes.DataType,
        data_container: Optional[pd.Series] = None,
    ) -> Union[bool, Iterable[bool]]:

        # ensure that the data container's data type is a string,
        # using the parent class's check implementation
        correct_type = super().check(pandera_dtype)
        if not correct_type:
            return correct_type

        # ensure the values are well-formed IP addresses
        exp = re.compile(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")
        return data_container.map(lambda x: exp.match(x) is not None)

    def __str__(self) -> str:
        return str(self.__class__.__name__)

    def __repr__(self) -> str:
        return f"DataType({self})"


schema = pa.DataFrameSchema(columns={"ips": pa.Column(IPAddress)})
schema.validate(pd.DataFrame({"ips": ["0.0.0.0", "0.0.0.1", "0.0.0.a"]}))
Traceback (most recent call last):
...
pandera.errors.SchemaError: expected series 'ips' to have type IPAddress:
failure cases:
   index failure_case
0      2      0.0.0.a

Decorators for Pipeline Integration#

If you have an existing data pipeline that uses pandas data structures, you can use the check_input() and check_output() decorators to easily check function arguments or returned variables from existing functions.

Check Input#

Validates input pandas DataFrame/Series before entering the wrapped function.

import pandas as pd
import pandera as pa

from pandera import DataFrameSchema, Column, Check, check_input


df = pd.DataFrame({
   "column1": [1, 4, 0, 10, 9],
   "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
})

in_schema = DataFrameSchema({
   "column1": Column(int,
                     Check(lambda x: 0 <= x <= 10, element_wise=True)),
   "column2": Column(float, Check(lambda x: x < -1.2)),
})

# by default, check_input assumes that the first argument is
# dataframe/series.
@check_input(in_schema)
def preprocessor(dataframe):
    dataframe["column3"] = dataframe["column1"] + dataframe["column2"]
    return dataframe

preprocessed_df = preprocessor(df)
print(preprocessed_df)
   column1  column2  column3
0        1     -1.3     -0.3
1        4     -1.4      2.6
2        0     -2.9     -2.9
3       10    -10.1     -0.1
4        9    -20.4    -11.4

You can also provide the argument name as a string:

@check_input(in_schema, "dataframe")
def preprocessor(dataframe):
    ...

Or an integer representing the index in the positional arguments.

@check_input(in_schema, 1)
def preprocessor(foo, dataframe):
    ...

Check Output#

The same as check_input, but this decorator checks the output DataFrame/Series of the decorated function.

import pandas as pd
import pandera as pa

from pandera import DataFrameSchema, Column, Check, check_output


preprocessed_df = pd.DataFrame({
   "column1": [1, 4, 0, 10, 9],
})

# assert that all elements in "column1" are zero
out_schema = DataFrameSchema({
    "column1": Column(int, Check(lambda x: x == 0))
})


# by default assumes that the pandas DataFrame/Series is the only output
@check_output(out_schema)
def zero_column_1(df):
    df["column1"] = 0
    return df


# you can also specify the index of the output to validate if the output is list-like
@check_output(out_schema, 1)
def zero_column_1_arg(df):
    df["column1"] = 0
    return "foobar", df


# or the key containing the data structure to verify if the output is dict-like
@check_output(out_schema, "out_df")
def zero_column_1_dict(df):
    df["column1"] = 0
    return {"out_df": df, "out_str": "foobar"}


# for more complex outputs, you can specify a function
@check_output(out_schema, lambda x: x[1]["out_df"])
def zero_column_1_custom(df):
    df["column1"] = 0
    return ("foobar", {"out_df": df})


zero_column_1(preprocessed_df)
zero_column_1_arg(preprocessed_df)
zero_column_1_dict(preprocessed_df)
zero_column_1_custom(preprocessed_df)

Check IO#

For convenience, you can also use the check_io() decorator where you can specify input and output schemas more concisely:

import pandas as pd
import pandera as pa

from pandera import DataFrameSchema, Column


df = pd.DataFrame({
   "column1": [1, 4, 0, 10, 9],
   "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
})

in_schema = DataFrameSchema({
   "column1": Column(int),
   "column2": Column(float),
})

out_schema = in_schema.add_columns({"column3": Column(float)})

@pa.check_io(df1=in_schema, df2=in_schema, out=out_schema)
def preprocessor(df1, df2):
    return (df1 + df2).assign(column3=lambda x: x.column1 + x.column2)

preprocessed_df = preprocessor(df, df)
print(preprocessed_df)
   column1  column2  column3
0        2     -2.6     -0.6
1        8     -2.8      5.2
2        0     -5.8     -5.8
3       20    -20.2     -0.2
4       18    -40.8    -22.8

Decorate Functions and Coroutines#

All pandera decorators work on synchronous as well as asynchronous code, on both bound and unbound functions/coroutines. For example, one can use the same decorators on:

  • sync/async functions

  • sync/async methods

  • sync/async class methods

  • sync/async static methods

All decorators work on sync/async regular/class/static methods of metaclasses as well.

import pandera as pa
from pandera.typing import DataFrame, Series

class Schema(pa.SchemaModel):
    col1: Series[int]

    class Config:
        strict = True

@pa.check_types
async def coroutine(df: DataFrame[Schema]) -> DataFrame[Schema]:
    return df

@pa.check_types
async def function(df: DataFrame[Schema]) -> DataFrame[Schema]:
    return df

class SomeClass:
    @pa.check_output(Schema.to_schema())
    async def regular_coroutine(self, df) -> DataFrame[Schema]:
        return df

    @classmethod
    @pa.check_input(Schema.to_schema(), "df")
    async def class_coroutine(cls, df):
        return Schema.validate(df)

    @staticmethod
    @pa.check_io(df=Schema.to_schema(), out=Schema.to_schema())
    def static_method(df):
        return df

Schema Inference#

New in version 0.4.0

For simple use cases, writing a schema definition manually is pretty straightforward with pandera. However, it can get tedious to do this for dataframes that have many columns of various data types.

To help you handle these cases, the infer_schema() function enables you to quickly infer a draft schema from a pandas dataframe or series. Below is a simple example:

import pandas as pd
import pandera as pa

from pandera import Check, Column, DataFrameSchema

df = pd.DataFrame({
    "column1": [5, 10, 20],
    "column2": ["a", "b", "c"],
    "column3": pd.to_datetime(["2010", "2011", "2012"]),
})
schema = pa.infer_schema(df)
print(schema)
 <Schema DataFrameSchema(
     columns={
         'column1': <Schema Column(name=column1, type=DataType(int64))>
         'column2': <Schema Column(name=column2, type=DataType(object))>
         'column3': <Schema Column(name=column3, type=DataType(datetime64[ns]))>
     },
     checks=[],
     coerce=True,
     dtype=None,
     index=<Schema Index(name=None, type=DataType(int64))>,
     strict=False
     name=None,
     ordered=False,
     unique_column_names=False
 )>

These inferred schemas are rough drafts that shouldn’t be used for validation without modification. You can modify the inferred schema to obtain the schema definition that you’re satisfied with.

For DataFrameSchema objects, methods such as add_columns(), remove_columns(), and update_column() create modified copies of the schema (see the example after the list below).

For SeriesSchema objects:

  • set_checks()
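For example, an illustrative refinement of the inferred draft above might tighten one column's checks and drop another (the specific tweaks here are hypothetical):

refined_schema = schema.update_column(
    "column1", checks=pa.Check.in_range(0, 100)
).remove_columns(["column3"])

refined_schema.validate(df)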

The section below describes two workflows for persisting and modifying an inferred schema.

Schema Persistence#

The schema persistence feature requires a pandera installation with the io extension. See the installation instructions for more details.

There are a few ways of persisting schemas, inferred or otherwise.

Write to a Python script#

You can write your schema to a python script with to_script():

# supply a file-like object, Path, or str to write to a file. If not
# specified, to_script will output the code as a string.
schema_script = schema.to_script()
print(schema_script)
 from pandas import Timestamp
 from pandera import DataFrameSchema, Column, Check, Index, MultiIndex

 schema = DataFrameSchema(
     columns={
         "column1": Column(
             dtype=pandera.engines.numpy_engine.Int64,
             checks=[
                 Check.greater_than_or_equal_to(min_value=5.0),
                 Check.less_than_or_equal_to(max_value=20.0),
             ],
             nullable=False,
             unique=False,
             coerce=False,
             required=True,
             regex=False,
             description=None,
             title=None,
         ),
         "column2": Column(
             dtype=pandera.engines.numpy_engine.Object,
             checks=None,
             nullable=False,
             unique=False,
             coerce=False,
             required=True,
             regex=False,
             description=None,
             title=None,
         ),
         "column3": Column(
             dtype=pandera.engines.pandas_engine.DateTime,
             checks=[
                 Check.greater_than_or_equal_to(
                     min_value=Timestamp("2010-01-01 00:00:00")
                 ),
                 Check.less_than_or_equal_to(
                     max_value=Timestamp("2012-01-01 00:00:00")
                 ),
             ],
             nullable=False,
             unique=False,
             coerce=False,
             required=True,
             regex=False,
             description=None,
             title=None,
         ),
     },
     index=Index(
         dtype=pandera.engines.numpy_engine.Int64,
         checks=[
             Check.greater_than_or_equal_to(min_value=0.0),
             Check.less_than_or_equal_to(max_value=2.0),
         ],
         nullable=False,
         coerce=False,
         name=None,
         description=None,
         title=None,
     ),
     coerce=True,
     strict=False,
     name=None,
 )

With the schema written out as a python script, you can iterate on the inferred schema and use it to validate data once you are satisfied with your schema definition.

Write to YAML#

You can also write the schema object to a yaml file with to_yaml(), and then read it back into memory with from_yaml().

# supply a file-like object, Path, or str to write to a file. If not
# specified, to_yaml will output a yaml string.
yaml_schema = schema.to_yaml()
print(yaml_schema.replace(f"{pa.__version__}", "{PANDERA_VERSION}"))
 schema_type: dataframe
 version: {PANDERA_VERSION}
 columns:
   column1:
     title: null
     description: null
     dtype: int64
     nullable: false
     checks:
       greater_than_or_equal_to: 5.0
       less_than_or_equal_to: 20.0
     unique: false
     coerce: false
     required: true
     regex: false
   column2:
     title: null
     description: null
     dtype: object
     nullable: false
     checks: null
     unique: false
     coerce: false
     required: true
     regex: false
   column3:
     title: null
     description: null
     dtype: datetime64[ns]
     nullable: false
     checks:
       greater_than_or_equal_to: '2010-01-01 00:00:00'
       less_than_or_equal_to: '2012-01-01 00:00:00'
     unique: false
     coerce: false
     required: true
     regex: false
 checks: null
 index:
 - title: null
   description: null
   dtype: int64
   nullable: false
   checks:
     greater_than_or_equal_to: 0.0
     less_than_or_equal_to: 2.0
   name: null
   unique: false
   coerce: false
 coerce: true
 strict: false
 unique: null
 ordered: false

You can edit this yaml file to modify the schema. For example, you can specify new column names under the columns key, and the respective values map onto keyword arguments in the Column class.

Note

Currently, only built-in Check methods are supported under the checks key.
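To load the edited yaml back into a schema object, here's a minimal sketch assuming the io extra is installed (from_yaml() also accepts a path to a yaml file):

import pandera.io

# deserialize the yaml string back into a DataFrameSchema object
loaded_schema = pandera.io.from_yaml(yaml_schema)
print(loaded_schema)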

Write to JSON#

Finally, you can also write the schema object to a json file with to_json(), and then read it back into memory with from_json().

# supply a file-like object, Path, or str to write to a file. If not
# specified, to_json will output a json string.
json_schema = schema.to_json(indent=4)
print(json_schema.replace(f"{pa.__version__}", "{PANDERA_VERSION}"))
 {
     "schema_type": "dataframe",
     "version": "{PANDERA_VERSION}",
     "columns": {
         "column1": {
             "title": null,
             "description": null,
             "dtype": "int64",
             "nullable": false,
             "checks": {
                 "greater_than_or_equal_to": 5.0,
                 "less_than_or_equal_to": 20.0
             },
             "unique": false,
             "coerce": false,
             "required": true,
             "regex": false
         },
         "column2": {
             "title": null,
             "description": null,
             "dtype": "object",
             "nullable": false,
             "checks": null,
             "unique": false,
             "coerce": false,
             "required": true,
             "regex": false
         },
         "column3": {
             "title": null,
             "description": null,
             "dtype": "datetime64[ns]",
             "nullable": false,
             "checks": {
                 "greater_than_or_equal_to": "2010-01-01 00:00:00",
                 "less_than_or_equal_to": "2012-01-01 00:00:00"
             },
             "unique": false,
             "coerce": false,
             "required": true,
             "regex": false
         }
     },
     "checks": null,
     "index": [
         {
             "title": null,
             "description": null,
             "dtype": "int64",
             "nullable": false,
             "checks": {
                 "greater_than_or_equal_to": 0.0,
                 "less_than_or_equal_to": 2.0
             },
             "name": null,
             "unique": false,
             "coerce": false
         }
     ],
     "coerce": true,
     "strict": false,
     "unique": null,
     "ordered": false
 }

You can edit this json file to update the schema as needed, and then load it back into a pandera schema object with from_json().

Lazy Validation#

New in version 0.4.0

By default, when you call the validate method on schema or schema component objects, a SchemaError is raised as soon as one of the assumptions specified in the schema is falsified. For example, for a DataFrameSchema object, the following situations will raise an exception:

  • a column specified in the schema is not present in the dataframe.

  • if strict=True, a column in the dataframe is not specified in the schema.

  • the data type does not match.

  • if coerce=True, the dataframe column cannot be coerced into the specified data type.

  • the Check specified in one of the columns returns False or a boolean series containing at least one False value.

For example:

import pandas as pd
import pandera as pa

from pandera import Check, Column, DataFrameSchema

df = pd.DataFrame({"column": ["a", "b", "c"]})

schema = pa.DataFrameSchema({"column": Column(int)})
schema.validate(df)
Traceback (most recent call last):
...
SchemaError: expected series 'column' to have type int64, got object

For more complex cases, it is useful to see all of the errors raised during the validate call so that you can debug the causes of errors on different columns and checks. The lazy keyword argument in the validate method of all schemas and schema components gives you the option of doing just this:

import pandas as pd
import pandera as pa

from pandera import Check, Column, DataFrameSchema

schema = pa.DataFrameSchema(
    columns={
        "int_column": Column(int),
        "float_column": Column(float, Check.greater_than(0)),
        "str_column": Column(str, Check.equal_to("a")),
        "date_column": Column(pa.DateTime),
    },
    strict=True
)

df = pd.DataFrame({
    "int_column": ["a", "b", "c"],
    "float_column": [0, 1, 2],
    "str_column": ["a", "b", "d"],
    "unknown_column": None,
})

schema.validate(df, lazy=True)
Traceback (most recent call last):
...
pandera.errors.SchemaErrors: A total of 5 schema errors were found.

Error Counts
------------
- column_not_in_schema: 1
- column_not_in_dataframe: 1
- schema_component_check: 3

Schema Error Summary
--------------------
                                                         failure_cases  n_failure_cases
schema_context  column       check
DataFrameSchema <NA>         column_in_dataframe         [date_column]                1
                             column_in_schema         [unknown_column]                1
Column          float_column dtype('float64')                  [int64]                1
                int_column   dtype('int64')                   [object]                1
                str_column   equal_to(a)                        [b, d]                2

Usage Tip
---------

Directly inspect all errors by catching the exception:

```
try:
    schema.validate(dataframe, lazy=True)
except SchemaErrors as err:
    err.failure_cases  # dataframe of schema errors
    err.data  # invalid dataframe
```

As you can see from the output above, a SchemaErrors exception is raised with a summary of the error counts and failure cases caught by the schema. You can also see from the Usage Tip that you can catch these errors and inspect the failure cases in a more granular form:

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print("Schema errors and failure cases:")
    print(err.failure_cases)
    print("\nDataFrame object that failed validation:")
    print(err.data)
Schema errors and failure cases:
    schema_context        column                check check_number  \
0  DataFrameSchema          None     column_in_schema         None
1  DataFrameSchema          None  column_in_dataframe         None
2           Column    int_column       dtype('int64')         None
3           Column  float_column     dtype('float64')         None
4           Column  float_column      greater_than(0)            0
5           Column    str_column          equal_to(a)            0
6           Column    str_column          equal_to(a)            0

     failure_case index
0  unknown_column  None
1     date_column  None
2          object  None
3           int64  None
4               0     0
5               b     1
6               d     2

DataFrame object that failed validation:
  int_column  float_column str_column unknown_column
0          a             0          a           None
1          b             1          b           None
2          c             2          d           None

Data Synthesis Strategies#

new in 0.6.0

pandera provides a utility for generating synthetic data purely from pandera schema or schema component objects. Under the hood, the schema metadata is collected to create a data-generating strategy using hypothesis, which is a property-based testing library.

Basic Usage#

Once you’ve defined a schema, it’s easy to generate examples:

import pandera as pa

schema = pa.DataFrameSchema(
    {
        "column1": pa.Column(int, pa.Check.eq(10)),
        "column2": pa.Column(float, pa.Check.eq(0.25)),
        "column3": pa.Column(str, pa.Check.eq("foo")),
    }
)
print(schema.example(size=3))
    column1  column2 column3
 0       10     0.25     foo
 1       10     0.25     foo
 2       10     0.25     foo

Note that here we’ve constrained the specific values in each column using Checks in order to make the data generation process deterministic for documentation purposes.

Usage in Unit Tests#

The example method is available for all schemas and schema components, and is primarily meant to be used interactively. It could be used in a script to generate test cases, but hypothesis recommends against doing this and instead using the strategy method to create a hypothesis strategy that can be used in pytest unit tests.

import hypothesis

def processing_fn(df):
    return df.assign(column4=df.column1 * df.column2)

@hypothesis.given(schema.strategy(size=5))
def test_processing_fn(dataframe):
    result = processing_fn(dataframe)
    assert "column4" in result

The above example is trivial, but you get the idea! Schema objects can create a strategy that can then be collected by a pytest runner. We could also run the tests explicitly ourselves, or run it as a unittest.TestCase. For more information on testing with hypothesis, see the hypothesis quick start guide.

A more practical example involves using schema transformations. We can modify the function above to make sure that processing_fn actually outputs the correct result:

out_schema = schema.add_columns({"column4": pa.Column(float)})

@pa.check_output(out_schema)
def processing_fn(df):
    return df.assign(column4=df.column1 * df.column2)

@hypothesis.given(schema.strategy(size=5))
def test_processing_fn(dataframe):
    processing_fn(dataframe)

Now the test_processing_fn simply becomes an execution test, raising a SchemaError if processing_fn doesn’t add column4 to the dataframe.
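For instance, a hypothetical implementation that forgets to add column4 would make the test fail:

@pa.check_output(out_schema)
def broken_processing_fn(df):
    # column4 is never added, so out_schema validation raises a SchemaError
    return df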

Strategies and Examples from Schema Models#

You can also use the class-based API to generate examples. Here’s the equivalent schema model for the above examples:

from pandera.typing import Series, DataFrame

class InSchema(pa.SchemaModel):
    column1: Series[int] = pa.Field(eq=10)
    column2: Series[float] = pa.Field(eq=0.25)
    column3: Series[str] = pa.Field(eq="foo")

class OutSchema(InSchema):
    column4: Series[float]

@pa.check_types
def processing_fn(df: DataFrame[InSchema]) -> DataFrame[OutSchema]:
    return df.assign(column4=df.column1 * df.column2)

@hypothesis.given(InSchema.strategy(size=5))
def test_processing_fn(dataframe):
    processing_fn(dataframe)

Checks as Constraints#

As you may have noticed in the first example, Checks further constrain the data synthesized from a strategy. Without checks, the example method would simply generate any value of the specified type. You can specify multiple checks on a column and pandera should be able to generate valid data under those constraints.

schema_multiple_checks = pa.DataFrameSchema({
    "column1": pa.Column(
        float, checks=[
            pa.Check.gt(0),
            pa.Check.lt(1e10),
            pa.Check.notin([-100, -10, 0]),
        ]
     )
})

for _ in range(5):
    # generate 3 rows of the dataframe
    sample_data = schema_multiple_checks.example(size=3)

    # validate the sampled data
    schema_multiple_checks(sample_data)

One caveat here is that it’s up to you to define a set of checks that are jointly satisfiable. If not, an Unsatisfiable exception will be raised:

schema_multiple_checks = pa.DataFrameSchema({
    "column1": pa.Column(
        float, checks=[
            # nonsensical constraints
            pa.Check.gt(0),
            pa.Check.lt(-10),
        ]
     )
})

schema_multiple_checks.example(size=3)
Traceback (most recent call last):
...
Unsatisfiable: Unable to satisfy assumptions of hypothesis example_generating_inner_function.
Check Strategy Chaining#

If you specify multiple checks for a particular column, this is what happens under the hood:

  • The first check in the list is the base strategy, which hypothesis uses to generate data.

  • All subsequent checks filter the values generated by the previous strategy so that they fulfill the constraints of the current check.

To optimize efficiency of the data-generation procedure, make sure to specify the most restrictive constraint of a column as the base strategy and build other constraints on top of it.
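For example, here is a minimal sketch of this ordering (the specific constraints are illustrative):

schema_ordered_checks = pa.DataFrameSchema({
    "column1": pa.Column(
        float, checks=[
            # the first check defines the base strategy: a bounded range of floats
            pa.Check.in_range(min_value=0, max_value=100),
            # subsequent checks filter the generated values
            pa.Check.notin([0.0, 50.0]),
        ]
     )
})

schema_ordered_checks.example(size=3)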

In-line Custom Checks#

One of the strengths of pandera is its flexibility with regard to defining custom checks on the fly:

schema_inline_check = pa.DataFrameSchema({
    "col": pa.Column(str, pa.Check(lambda s: s.isin({"foo", "bar"})))
})

One of the disadvantages of this is that the fallback strategy is to simply apply the check to the generated data, which can be highly inefficient. In this case, hypothesis will generate strings and try to find examples of strings that are in the set {"foo", "bar"}, which will be very slow and most likely raise an Unsatisfiable exception. To get around this limitation, you can register custom checks and define strategies that correspond to them.

Defining Custom Strategies#

All built-in Checks are associated with a data synthesis strategy. You can define your own data synthesis strategies by using the extensions API to register a custom check function with a corresponding strategy.

Extensions#

new in 0.6.0

Registering Custom Check Methods#

One of the strengths of pandera is its flexibility in enabling you to define in-line custom checks on the fly:

import pandera as pa

# checks elements in a column/dataframe
element_wise_check = pa.Check(lambda x: x < 0, element_wise=True)

# applies the check function to a dataframe/series
vectorized_check = pa.Check(lambda series_or_df: series_or_df < 0)

However, there are two main disadvantages of schemas with inline custom checks:

  1. they are not serializable with the IO interface.

  2. you can’t use them to synthesize data because the checks are not associated with a hypothesis strategy.

pandera now offers a way to register custom checks so that they’re available in the Check class as a check method. Here let’s define a custom method that checks whether a pandas object contains elements that lie within two values.

import pandera as pa
import pandera.extensions as extensions
import pandas as pd

@extensions.register_check_method(statistics=["min_value", "max_value"])
def is_between(pandas_obj, *, min_value, max_value):
    return (min_value <= pandas_obj) & (pandas_obj <= max_value)

schema = pa.DataFrameSchema({
    "col": pa.Column(int, pa.Check.is_between(min_value=1, max_value=10))
})

data = pd.DataFrame({"col": [1, 5, 10]})
print(schema(data))
   col
0    1
1    5
2   10

As you can see, a custom check’s first argument is a pandas series or dataframe by default (more on that later), followed by keyword-only arguments, specified with the * syntax.

The register_check_method() decorator requires you to explicitly name the check statistics via the statistics keyword argument; these are essentially the constraints placed by the check on the pandas data structure.

Specifying a Check Strategy#

To specify a check strategy with your custom check, you’ll need to install the strategies extension. First let’s look at a trivially simple example, where the check verifies whether a column is equal to a certain value:

def custom_equals(pandas_obj, *, value):
    return pandas_obj == value

The corresponding strategy for this check would be:

from typing import Optional
import hypothesis
import pandera.strategies as st

def equals_strategy(
    pandera_dtype: pa.DataType,
    strategy: Optional[st.SearchStrategy] = None,
    *,
    value,
):
    if strategy is None:
        return st.pandas_dtype_strategy(
            pandera_dtype, strategy=hypothesis.strategies.just(value),
        )
    return strategy.filter(lambda x: x == value)

As you may notice, the pandera strategy interface has two positional arguments followed by keyword-only arguments that match the check function’s keyword-only check statistics. The pandera_dtype positional argument is useful for ensuring the correct data type. In the above example, we’re using the pandas_dtype_strategy() strategy to make sure the generated value is of the correct data type.

The optional strategy argument allows us to use the check strategy as a base strategy or a chained strategy. There’s a detail that we’re responsible for implementing in the strategy function body: we need to handle two cases to account for strategy chaining:

  1. when the strategy function is being used as a base strategy, i.e. when strategy is None

  2. when the strategy function is being chained from a previously-defined strategy, i.e. when strategy is not None.

Finally, to register the custom check with the strategy, use the register_check_method() decorator:

@extensions.register_check_method(
    statistics=["value"], strategy=equals_strategy
)
def custom_equals(pandas_obj, *, value):
    return pandas_obj == value

Let’s unpack what’s going on here. The custom_equals function has a single statistic, the value argument, which we’ve also specified in register_check_method(). This means that the associated check strategy must match its keyword-only arguments.

Going back to our is_between function example, here’s what the strategy would look like:

def in_between_strategy(
    pandera_dtype: pa.DataType,
    strategy: Optional[st.SearchStrategy] = None,
    *,
    min_value,
    max_value
):
    if strategy is None:
        return st.pandas_dtype_strategy(
            pandera_dtype,
            min_value=min_value,
            max_value=max_value,
            exclude_min=False,
            exclude_max=False,
        )
    return strategy.filter(lambda x: min_value <= x <= max_value)

@extensions.register_check_method(
    statistics=["min_value", "max_value"],
    strategy=in_between_strategy,
)
def is_between_with_strat(pandas_obj, *, min_value, max_value):
    return (min_value <= pandas_obj) & (pandas_obj <= max_value)

Check Types#

The extensions module also supports registering element-wise and groupby checks.

Element-wise Checks#
@extensions.register_check_method(
    statistics=["val"],
    check_type="element_wise",
)
def element_wise_equal_check(element, *, val):
    return element == val

Note that the first argument of element_wise_equal_check is a single element in the column or dataframe.
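The registered check can then be used like any other Check method (an illustrative usage):

element_wise_schema = pa.DataFrameSchema({
    "col": pa.Column(int, pa.Check.element_wise_equal_check(val=1))
})

element_wise_schema(pd.DataFrame({"col": [1, 1, 1]}))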

Groupby Checks#

In this groupby check, we’re verifying that the values of one column for group_a are, on average, greater than those of group_b:

from typing import Dict

@extensions.register_check_method(
    statistics=["group_a", "group_b"],
    check_type="groupby",
)
def groupby_check(dict_groups: Dict[str, pd.Series], *, group_a, group_b):
    return dict_groups[group_a].mean() > dict_groups[group_b].mean()

data = pd.DataFrame({
    "values": [20, 10, 1, 15],
    "groups": list("xxyy"),
})

schema = pa.DataFrameSchema({
    "values": pa.Column(
        int,
        pa.Check.groupby_check(group_a="x", group_b="y", groupby="groups"),
    ),
    "groups": pa.Column(str),
})

print(schema(data))
   values groups
0      20      x
1      10      x
2       1      y
3      15      y

Registered Custom Checks with the Class-based API#

Since registered checks are part of the Check namespace, you can also use custom checks with the class-based API:

from pandera.typing import Series

class Schema(pa.SchemaModel):
    col1: Series[str] = pa.Field(custom_equals="value")
    col2: Series[int] = pa.Field(is_between={"min_value": 0, "max_value": 10})

data = pd.DataFrame({
    "col1": ["value"] * 5,
    "col2": range(5)
})

print(Schema.validate(data))
    col1  col2
0  value     0
1  value     1
2  value     2
3  value     3
4  value     4

DataFrame checks can be attached by using the Config class. Any field names that do not conflict with existing fields of BaseConfig and do not start with an underscore (_) are interpreted as names of registered checks. If the value is a tuple or dict, it is interpreted as the positional or keyword arguments of the check, and as the first argument otherwise.

For example, to register dataframe checks with zero, one, and two statistics, one could do the following:

import pandera as pa
import pandera.extensions as extensions
import numpy as np
import pandas as pd


@extensions.register_check_method()
def is_small(df):
    return sum(df.shape) < 1000


@extensions.register_check_method(statistics=["fraction"])
def total_missing_fraction_less_than(df, *, fraction: float):
    return (1 - df.count().sum().item() / df.apply(len).sum().item()) < fraction


@extensions.register_check_method(statistics=["col_a", "col_b"])
def col_mean_a_greater_than_b(df, *, col_a: str, col_b: str):
    return df[col_a].mean() > df[col_b].mean()


from pandera.typing import Series


class Schema(pa.SchemaModel):
    col1: Series[float] = pa.Field(nullable=True, ignore_na=False)
    col2: Series[float] = pa.Field(nullable=True, ignore_na=False)

    class Config:
        is_small = ()
        total_missing_fraction_less_than = 0.6
        col_mean_a_greater_than_b = {"col_a": "col2", "col_b": "col1"}


data = pd.DataFrame({
    "col1": [float('nan')] * 3 + [0.5, 0.3, 0.1],
    "col2": np.arange(6.),
})

print(Schema.validate(data))
   col1  col2
0   NaN   0.0
1   NaN   1.0
2   NaN   2.0
3   0.5   3.0
4   0.3   4.0
5   0.1   5.0

Data Format Conversion#

new in 0.9.0

The class-based API provides configuration options for converting data to/from supported serialization formats in the context of check_types()-decorated functions.

Note

Currently, pandera.typing.pandas.DataFrame is the only data type that supports this feature.

Consider this simple example:

import pandera as pa
from pandera.typing import DataFrame, Series

class InSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True, isin=[*"abcd"])
    int_col: Series[int]

class OutSchema(InSchema):
    float_col: pa.typing.Series[float]

@pa.check_types
def transform(df: DataFrame[InSchema]) -> DataFrame[OutSchema]:
    return df.assign(float_col=1.1)

With the schema type annotations and check_types() decorator, the transform function validates DataFrame inputs and outputs according to the InSchema and OutSchema definitions.

But what if your input data is serialized in parquet format, and you want to read it into memory, validate the DataFrame, and then pass it to a downstream function for further analysis? Similarly, what if you want the output of transform to be a list of dictionary records instead of a pandas DataFrame?

The to/from_format Configuration Options#

To easily fulfill the use cases described above, you can implement the read/write logic by hand, or you can configure schemas to do so. We can first define a subclass of InSchema with additional configuration so that our transform function can read data directly from parquet files or buffers:

class InSchemaParquet(InSchema):
    class Config:
        from_format = "parquet"

Then, we define a subclass of OutSchema to specify that transform should output a list of dictionaries representing the rows of the output dataframe.

class OutSchemaDict(OutSchema):
    class Config:
        to_format = "dict"
        to_format_kwargs = {"orient": "records"}

Note that the {to/from}_format_kwargs configuration option should be supplied with a dictionary of keyword arguments to be passed into the respective pandas to_{format} method.

Finally, we redefine our transform function:

@pa.check_types
def transform(df: DataFrame[InSchemaParquet]) -> DataFrame[OutSchemaDict]:
    return df.assign(float_col=1.1)

We can test this out using a buffer to store the parquet file.

Note

A string or path-like object representing the filepath to a parquet file would also be a valid input to transform.

import io
import json

buffer = io.BytesIO()
data = pd.DataFrame({"str_col": [*"abc"], "int_col": range(3)})
data.to_parquet(buffer)
buffer.seek(0)

dict_output = transform(buffer)
print(json.dumps(dict_output, indent=4))
[
    {
        "str_col": "a",
        "int_col": 0,
        "float_col": 1.1
    },
    {
        "str_col": "b",
        "int_col": 1,
        "float_col": 1.1
    },
    {
        "str_col": "c",
        "int_col": 2,
        "float_col": 1.1
    }
]

Takeaway#

The {to/from}_format configuration options modify the behavior of check_types()-decorated functions so that input data in a particular serialization format is converted into a dataframe, and so that the output dataframe can be converted into another format.

This dovetails well with the FastAPI Integration for validating the inputs and outputs of app endpoints.

Supported DataFrame Libraries#

Pandera started out as a pandas-specific dataframe validation library, and moving forward its core functionality will continue to support pandas. However, pandera’s adoption has resulted in the realization that it can be a much more powerful tool by supporting other dataframe-like formats.

Domain-specific Data Validation#

The pandas ecosystem provides support for domain-specific data manipulation, and by extension pandera can provide access to data types, methods, and data container types specific to these libraries.

GeoPandas

An extension of pandas that adds geospatial data processing capabilities.

Data Validation with GeoPandas#

new in 0.9.0

GeoPandas is an extension of Pandas that adds support for geospatial data. You can use pandera to validate GeoDataFrame() and GeoSeries() objects directly. First, install pandera with the geopandas extra:

pip install pandera[geopandas]

Then you can use pandera schemas to validate geodataframes. In the example below we’ll use the object-based API to define a DataFrameSchema for validation.

import geopandas as gpd
import pandas as pd
import pandera as pa
from shapely.geometry import Polygon

geo_schema = pa.DataFrameSchema({
    "geometry": pa.Column("geometry"),
    "region": pa.Column(str),
})

geo_df = gpd.GeoDataFrame({
    "geometry": [
        Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
        Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0)))
    ],
    "region": ["NA", "SA"]
})

print(geo_schema.validate(geo_df))
                                            geometry region
0  POLYGON ((0.00000 0.00000, 0.00000 1.00000, 1....     NA
1  POLYGON ((0.00000 0.00000, 0.00000 -1.00000, -...     SA

You can also use the GeometryDtype data type in either instantiated or un-instantiated form:

geo_schema = pa.DataFrameSchema({
    "geometry": pa.Column(gpd.array.GeometryDtype),
    # or
    "geometry": pa.Column(gpd.array.GeometryDtype()),
})

If you want to validate-on-instantiation, you can use the GeoDataFrame generic type with the schema model defined above:

from pandera.typing import Series
from pandera.typing.geopandas import GeoDataFrame, GeoSeries


class Schema(pa.SchemaModel):
    geometry: GeoSeries
    region: Series[str]


# create a geodataframe that's validated on object initialization
df = GeoDataFrame[Schema](
    {
        'geometry': [
            Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
            Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0)))
        ],
        'region': ['NA','SA']
    }
)
print(df)
                                            geometry region
0  POLYGON ((0.00000 0.00000, 0.00000 1.00000, 1....     NA
1  POLYGON ((0.00000 0.00000, 0.00000 -1.00000, -...     SA

Scaling Up Data Validation#

Pandera provides multiple ways of scaling up data validation to dataframes that don’t fit into memory. Fortunately, pandera doesn’t have to re-invent the wheel. Standing on the shoulders of giants, it integrates with the existing ecosystem of libraries that allow you to perform validations on out-of-memory dataframes.

Dask

Apply pandera schemas to Dask dataframe partitions.

Fugue

Apply pandera schemas to distributed dataframe partitions with Fugue.

Koalas [Deprecated]

A pandas drop-in replacement, distributed using a Spark backend.

Pyspark

Exposes a pyspark.pandas module, distributed using a Spark backend.

Modin

A pandas drop-in replacement, distributed using a Ray or Dask backend.

Data Validation with Dask#

new in 0.8.0

Dask is a distributed compute framework that offers a pandas-like dataframe API. You can use pandera to validate DataFrame() and Series() objects directly. First, install pandera with the dask extra:

pip install pandera[dask]

Then you can use pandera schemas to validate dask dataframes. In the example below we’ll use the class-based API to define a SchemaModel for validation.

import dask.dataframe as dd
import pandas as pd
import pandera as pa

from pandera.typing.dask import DataFrame, Series


class Schema(pa.SchemaModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


ddf = dd.from_pandas(
    pd.DataFrame(
        {
            'state': ['FL','FL','FL','CA','CA','CA'],
            'city': [
                'Orlando',
                'Miami',
                'Tampa',
                'San Francisco',
                'Los Angeles',
                'San Diego',
            ],
            'price': [8, 12, 10, 16, 20, 18],
        }
    ),
    npartitions=2
)
pandera_ddf = Schema(ddf)

print(pandera_ddf)
Dask DataFrame Structure:
                state    city  price
npartitions=2
0              object  object  int64
3                 ...     ...    ...
5                 ...     ...    ...
Dask Name: validate, 2 graph layers

As you can see, passing the dask dataframe into Schema will produce another dask dataframe which hasn’t been evaluated yet. What this means is that pandera will only validate when the dask graph is evaluated.

print(pandera_ddf.compute())
  state           city  price
0    FL        Orlando      8
1    FL          Miami     12
2    FL          Tampa     10
3    CA  San Francisco     16
4    CA    Los Angeles     20
5    CA      San Diego     18

You can also use the check_types() decorator to validate dask dataframes at runtime:

@pa.check_types
def function(ddf: DataFrame[Schema]) -> DataFrame[Schema]:
    return ddf[ddf["state"] == "CA"]

print(function(ddf).compute())
  state           city  price
3    CA  San Francisco     16
4    CA    Los Angeles     20
5    CA      San Diego     18

And of course, you can use the object-based API to validate dask dataframes:

schema = pa.DataFrameSchema({
    "state": pa.Column(str),
    "city": pa.Column(str),
    "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20))
})
print(schema(ddf).compute())
  state           city  price
0    FL        Orlando      8
1    FL          Miami     12
2    FL          Tampa     10
3    CA  San Francisco     16
4    CA    Los Angeles     20
5    CA      San Diego     18
Data Validation with Fugue#

Validation on big data comes in two forms. The first is performing one set of validations on data that doesn’t fit in memory. The second happens when a large dataset is composed of multiple groups that require different validations. In pandas semantics, this would be the equivalent of a groupby-validate operation. This section will cover using pandera for both of these scenarios.

Pandera has support for Spark and Dask DataFrames through Modin and PySpark Pandas. Another option for running pandera on top of native Spark or Dask engines is Fugue. Fugue is an open source abstraction layer that ports Python, pandas, and SQL code to Spark and Dask. Operations will be applied on DataFrames natively, minimizing overhead.

What is Fugue?#

Fugue serves as an interface to distributed computing. Because of its non-invasive design, existing Python code can be scaled to a distributed setting without significant changes.

To run the example, Fugue needs to be installed separately. Using pip:

pip install fugue[spark]

This will also install PySpark because of the spark extra. Dask is available with the dask extra.

Example#

In this example, a pandas DataFrame is created with state, city and price columns. Pandera will be used to validate that the price column values are within a certain range.

import pandas as pd

data = pd.DataFrame(
    {
        'state': ['FL','FL','FL','CA','CA','CA'],
        'city': [
            'Orlando', 'Miami', 'Tampa', 'San Francisco', 'Los Angeles', 'San Diego'
        ],
        'price': [8, 12, 10, 16, 20, 18],
    }
)
print(data)
  state           city  price
0    FL        Orlando      8
1    FL          Miami     12
2    FL          Tampa     10
3    CA  San Francisco     16
4    CA    Los Angeles     20
5    CA      San Diego     18

Validation is then applied using pandera. A price_validation function is created that runs the validation. None of this will be new.

from pandera import Column, DataFrameSchema, Check

price_check = DataFrameSchema(
    {"price": Column(int, Check.in_range(min_value=5,max_value=20))}
)

def price_validation(data:pd.DataFrame) -> pd.DataFrame:
    return price_check.validate(data)

The transform function in Fugue is the easiest way to use Fugue with existing Python functions as seen in the following code snippet. The first two arguments are the DataFrame and function to apply. The keyword argument schema is required because schema is strictly enforced in distributed settings. Here, the schema is simply * because no new columns are added.

The last part of the transform function is the engine. Here, a SparkSession object is used to run the code on top of Spark. For Dask, users can pass a string "dask" or can pass a Dask Client. Passing nothing uses the default pandas-based engine. Because we passed a SparkSession in this example, the output is a Spark DataFrame.

from fugue import transform
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = transform(data, price_validation, schema="*", engine=spark)
spark_df.show()
+-----+-------------+-----+
|state|         city|price|
+-----+-------------+-----+
|   FL|      Orlando|    8|
|   FL|        Miami|   12|
|   FL|        Tampa|   10|
|   CA|San Francisco|   16|
|   CA|  Los Angeles|   20|
|   CA|    San Diego|   18|
+-----+-------------+-----+
Validation by Partition#

There is an interesting use case that arises with bigger datasets. Frequently, there are logical groupings of data that require different validations. In the earlier sample data, the price range for the records with state FL is lower than the range for the state CA. Two DataFrameSchema objects will be created to reflect this. Notice that their ranges for the Check differ.

price_check_FL = DataFrameSchema({
    "price": Column(int, Check.in_range(min_value=7,max_value=13)),
})

price_check_CA = DataFrameSchema({
    "price": Column(int, Check.in_range(min_value=15,max_value=21)),
})

price_checks = {'CA': price_check_CA, 'FL': price_check_FL}

A slight modification is needed to our price_validation function. Fugue will partition the whole dataset into multiple pandas DataFrames; think of this as a groupby. By the time price_validation is called, each DataFrame only contains the data for one state. The appropriate DataFrameSchema is then pulled and applied.

To partition our data by state, all we need to do is pass it into the transform function through the partition argument. This splits up the data across different workers before they each run the price_validation function. Again, this is like a groupby-validation.

def price_validation(df:pd.DataFrame) -> pd.DataFrame:
    location = df['state'].iloc[0]
    check = price_checks[location]
    check.validate(df)
    return df

spark_df = transform(data,
          price_validation,
          schema="*",
          partition=dict(by="state"),
          engine=spark)

spark_df.show()
SparkDataFrame
state:str|city:str                                                 |price:long
---------+---------------------------------------------------------+----------
CA       |San Francisco                                            |16
CA       |Los Angeles                                              |20
CA       |San Diego                                                |18
FL       |Orlando                                                  |8
FL       |Miami                                                    |12
FL       |Tampa                                                    |10
Total count: 6

Note

Because operations in a distributed setting are applied per partition, statistical validators will be applied on each partition rather than the global dataset. If no partitioning scheme is specified, Spark and Dask use default partitions. Be careful about using operations like mean, min, and max without partitioning beforehand.

All row-wise validations scale well with this set-up.

Returning Errors#

Pandera will raise a SchemaError by default that gets buried by the Spark error messages. To return the errors as a DataFrame, we can use the following approach. If there are no errors in the data, it will just return an empty DataFrame.

To keep the errors for each partition, you can attach the partition key as a column in the returned DataFrame.

from pandera.errors import SchemaErrors

out_schema = "schema_context:str, column:str, check:str, \
check_number:int, failure_case:str, index:int"

out_columns = ["schema_context", "column", "check",
"check_number", "failure_case", "index"]

price_check = DataFrameSchema(
    {"price": Column(int, Check.in_range(min_value=12,max_value=20))}
)

def price_validation(data:pd.DataFrame) -> pd.DataFrame:
    try:
        price_check.validate(data, lazy=True)
        return pd.DataFrame(columns=out_columns)
    except SchemaErrors as err:
        return err.failure_cases

transform(data, price_validation, schema=out_schema, engine=spark).show()
+--------------+------+----------------+------------+------------+-----+
|schema_context|column|           check|check_number|failure_case|index|
+--------------+------+----------------+------------+------------+-----+
|        Column| price|in_range(12, 20)|           0|           8|    0|
|        Column| price|in_range(12, 20)|           0|          10|    0|
+--------------+------+----------------+------------+------------+-----+
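As noted above, you can keep track of which partition an error came from by attaching the partition key as a column in the returned failure cases. A hedged sketch (the price_validation_with_key name is illustrative, and the schema string passed to transform would also need an extra state:str column):

def price_validation_with_key(data: pd.DataFrame) -> pd.DataFrame:
    try:
        price_check.validate(data, lazy=True)
        return pd.DataFrame(columns=out_columns + ["state"])
    except SchemaErrors as err:
        # tag each failure case with the partition's state value
        return err.failure_cases.assign(state=data["state"].iloc[0])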
Data Validation with Koalas#

Note

Koalas has been deprecated since version 0.10.0. Please refer to the pyspark page for validating pyspark dataframes.

Data Validation with Pyspark ⭐️ (New)#

new in 0.10.0

Pyspark is a distributed compute framework that offers a pandas drop-in replacement dataframe implementation via the pyspark.pandas API. You can use pandera to validate DataFrame() and Series() objects directly. First, install pandera with the pyspark extra:

pip install pandera[pyspark]

Then you can use pandera schemas to validate pyspark dataframes. In the example below we’ll use the class-based API to define a SchemaModel for validation.

import pyspark.pandas as ps
import pandas as pd
import pandera as pa

from pandera.typing.pyspark import DataFrame, Series


class Schema(pa.SchemaModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


# create a pyspark.pandas dataframe that's validated on object initialization
df = DataFrame[Schema](
    {
        'state': ['FL','FL','FL','CA','CA','CA'],
        'city': [
            'Orlando',
            'Miami',
            'Tampa',
            'San Francisco',
            'Los Angeles',
            'San Diego',
        ],
        'price': [8, 12, 10, 16, 20, 18],
    }
)
print(df)
  state           city  price
0    FL        Orlando      8
1    FL          Miami     12
2    FL          Tampa     10
3    CA  San Francisco     16
4    CA    Los Angeles     20
5    CA      San Diego     18

You can also use the check_types() decorator to validate pyspark pandas dataframes at runtime:

@pa.check_types
def function(df: DataFrame[Schema]) -> DataFrame[Schema]:
    return df[df["state"] == "CA"]

print(function(df))
  state           city  price
3    CA  San Francisco     16
4    CA    Los Angeles     20
5    CA      San Diego     18

And of course, you can use the object-based API to validate pyspark.pandas dataframes:

schema = pa.DataFrameSchema({
    "state": pa.Column(str),
    "city": pa.Column(str),
    "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20))
})
print(schema(df))
  state           city  price
0    FL        Orlando      8
1    FL          Miami     12
2    FL          Tampa     10
3    CA  San Francisco     16
4    CA    Los Angeles     20
5    CA      San Diego     18
Data Validation with Modin#

new in 0.8.0

Modin is a distributed compute framework that offers a pandas drop-in replacement dataframe implementation. You can use pandera to validate DataFrame() and Series() objects directly. First, install pandera with the modin extra:

pip install pandera[modin]       # installs both ray and dask backends
pip install pandera[modin-ray]   # only ray backend
pip install pandera[modin-dask]  # only dask backend

Then you can use pandera schemas to validate modin dataframes. In the example below we’ll use the class-based API to define a SchemaModel for validation.

import modin.pandas as pd
import pandera as pa

from pandera.typing.modin import DataFrame, Series


class Schema(pa.SchemaModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


# create a modin dataframe that's validated on object initialization
df = DataFrame[Schema](
    {
        'state': ['FL','FL','FL','CA','CA','CA'],
        'city': [
            'Orlando',
            'Miami',
            'Tampa',
            'San Francisco',
            'Los Angeles',
            'San Diego',
        ],
        'price': [8, 12, 10, 16, 20, 18],
    }
)
print(df)
  state           city  price
0    FL        Orlando      8
1    FL          Miami     12
2    FL          Tampa     10
3    CA  San Francisco     16
4    CA    Los Angeles     20
5    CA      San Diego     18

You can also use the check_types() decorator to validate modin dataframes at runtime:

@pa.check_types
def function(df: DataFrame[Schema]) -> DataFrame[Schema]:
    return df[df["state"] == "CA"]

print(function(df))
  state           city  price
3    CA  San Francisco     16
4    CA    Los Angeles     20
5    CA      San Diego     18

And of course, you can use the object-based API to validate modin dataframes:

schema = pa.DataFrameSchema({
    "state": pa.Column(str),
    "city": pa.Column(str),
    "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20))
})
print(schema(df))
  state           city  price
0    FL        Orlando      8
1    FL          Miami     12
2    FL          Tampa     10
3    CA  San Francisco     16
4    CA    Los Angeles     20
5    CA      San Diego     18

Note

Don’t see a library that you want supported? Check out the github issues to see if that library is in the roadmap. If it isn’t, open up a new issue to add support for it!

Integrations#

Pandera ships with integrations with other tools in the Python ecosystem, with the goal of interoperating with libraries that you know and love.

FastAPI

Use pandera SchemaModels in your FastAPI app

Frictionless

Convert frictionless schemas to pandera schemas

Hypothesis

Use the hypothesis library to generate valid data under your schema’s constraints.

Mypy

Type-lint your pandas and pandera code with mypy for static type safety [experimental 🧪]

Pydantic

Use pandera SchemaModels when defining your pydantic BaseModels

FastAPI#

new in 0.9.0

Since both FastAPI and Pandera integrate seamlessly with Pydantic, you can use SchemaModel types to validate incoming or outgoing data with respect to your API endpoints.

Using SchemaModels to Validate Endpoint Inputs and Outputs#

Suppose we want to process transactions, where each transaction has an id and cost. We can model this with a pandera schema model:

from pydantic import BaseModel

import pandera as pa


class Transactions(pa.SchemaModel):
    id: pa.typing.Series[int]
    cost: pa.typing.Series[float] = pa.Field(ge=0, le=1000)

    class Config:
        coerce = True
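
Because coerce is enabled, validation converts compatible values to the annotated dtypes. A minimal sketch of validating data directly with the schema model (the raw values below are made up):

import pandas as pd

# string-typed values are coerced to the annotated int/float dtypes
raw = pd.DataFrame({"id": ["1", "2"], "cost": ["10.5", "12.0"]})
validated = Transactions.validate(raw)
print(validated.dtypes)  # id: int64, cost: float64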

Also suppose that we expect our endpoint to add a name to the transaction data:

class TransactionsOut(Transactions):
    id: pa.typing.Series[int]
    cost: pa.typing.Series[float]
    name: pa.typing.Series[str]

Let’s also assume that the output of the endpoint should be a list of dictionary records containing the named transactions data. We can do this easily with the to_format option in the schema model BaseConfig.

class TransactionsDictOut(TransactionsOut):
    class Config:
        to_format = "dict"
        to_format_kwargs = {"orient": "records"}

Note that to_format_kwargs is a dictionary of keyword arguments to be passed into the respective pandas to_{format} method.
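
For intuition, to_format = "dict" with orient = "records" mirrors calling the pandas to_dict method on the output dataframe; a quick illustration with made-up values:

import pandas as pd

out = pd.DataFrame({"id": [1], "cost": [10.5], "name": ["foo"]})
# this is the serialization that to_format="dict" applies to the endpoint output
print(out.to_dict(orient="records"))  # [{'id': 1, 'cost': 10.5, 'name': 'foo'}]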

Next we’ll create a FastAPI app and define a /transactions/ POST endpoint:

from fastapi import FastAPI, File
from pandera.typing import DataFrame

app = FastAPI()

@app.post("/transactions/", response_model=DataFrame[TransactionsDictOut])
def create_transactions(transactions: DataFrame[Transactions]):
    output = transactions.assign(name="foo")
    ...  # do other stuff, e.g. update backend database with transactions
    return output
Reading File Uploads#

Similar to the TransactionsDictOut example, which converts dataframes to a particular format for the endpoint response, pandera also provides a from_format schema model configuration option to read a dataframe from a particular serialization format.

class TransactionsParquet(Transactions):
    class Config:
        from_format = "parquet"

Let’s also define a response model for the /file/ upload endpoint:

class TransactionsJsonOut(TransactionsOut):
    class Config:
        to_format = "json"
        to_format_kwargs = {"orient": "records"}

class ResponseModel(BaseModel):
    filename: str
    df: pa.typing.DataFrame[TransactionsJsonOut]

In the next example, we use the pandera UploadFile type to upload a parquet file to the /file/ POST endpoint and return a response containing the filename and the modified data in json format.

from pandera.typing.fastapi import UploadFile

@app.post("/file/", response_model=ResponseModel)
def create_upload_file(
    file: UploadFile[DataFrame[TransactionsParquet]] = File(...),
):
    return {
        "filename": file.filename,
        "df": file.data.assign(name="foo"),
    }

Pandera’s UploadFile type is a subclass of FastAPI’s UploadFile, but it also exposes a .data property containing the pandera-validated dataframe.
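
As an illustrative sketch (assuming the app defined above, FastAPI’s test client, and a parquet engine such as pyarrow; the file name and values are made up), you can exercise the endpoint by uploading an in-memory parquet file:

import io

import pandas as pd
from fastapi.testclient import TestClient

client = TestClient(app)

# serialize a valid transactions dataframe to parquet in memory
buffer = io.BytesIO()
pd.DataFrame({"id": [1, 2], "cost": [10.5, 12.0]}).to_parquet(buffer)
buffer.seek(0)

response = client.post("/file/", files={"file": ("transactions.parquet", buffer)})
# the response contains the filename and the validated data with "name" added,
# serialized according to TransactionsJsonOut's to_format="json" config
print(response.json())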

Takeaway#

With the FastAPI and Pandera integration, you can use Pandera SchemaModel types to validate the dataframe inputs and outputs of your FastAPI endpoints.

Reading Third-Party Schema#

new in 0.7.0

Pandera now accepts schemas from other data validation frameworks. This requires a pandera installation with the io extra; please see the installation instructions for more details.

Frictionless Data Schema#

Note

Please see the Frictionless schema documentation for more information on this standard.

pandera.io.from_frictionless_schema(schema)[source]#

Create a DataFrameSchema from either a frictionless json/yaml schema file saved on disk, or from a frictionless schema already loaded into memory.

Each field from the frictionless schema will be converted to a pandera column specification using FrictionlessFieldParser to map field characteristics to pandera column specifications.

Parameters

schema (Union[str, Path, Dict, Schema]) – the frictionless schema object (or a string/Path to the location on disk of a schema specification) to parse.

Return type

DataFrameSchema

Returns

dataframe schema with frictionless field specs converted to pandera column checks and constraints for use as normal.

Example

Here, we’re defining a very basic frictionless schema in memory before parsing it and then querying the resulting DataFrameSchema object as per any other Pandera schema:

>>> from pandera.io import from_frictionless_schema
>>>
>>> FRICTIONLESS_SCHEMA = {
...     "fields": [
...         {
...             "name": "column_1",
...             "type": "integer",
...             "constraints": {"minimum": 10, "maximum": 99}
...         },
...         {
...             "name": "column_2",
...             "type": "string",
...             "constraints": {"maxLength": 10, "pattern": "\S+"}
...         },
...     ],
...     "primaryKey": "column_1"
... }
>>> schema = from_frictionless_schema(FRICTIONLESS_SCHEMA)
>>> schema.columns["column_1"].checks
[<Check in_range: in_range(10, 99)>]
>>> schema.columns["column_1"].required
True
>>> schema.columns["column_1"].unique
True
>>> schema.columns["column_2"].checks
[<Check str_length: str_length(None, 10)>, <Check str_matches: str_matches(re.compile('^\\S+$'))>]

Under the hood, this uses the FrictionlessFieldParser class to parse each frictionless field (column):

class pandera.io.FrictionlessFieldParser(field, primary_keys)[source]#

Parses frictionless data schema field specifications so we can convert them to an equivalent Pandera Column schema.

For this implementation, we are using field names, constraints and types but leaving other frictionless parameters out (e.g. foreign keys, type formats, titles, descriptions).

Parameters
  • field – a field object from a frictionless schema.

  • primary_keys – the primary keys from a frictionless schema. These are used to ensure primary key fields are treated properly - no duplicates, no missing values etc.

property checks: Optional[Dict]#

Convert a set of frictionless schema field constraints into checks.

This parses the standard set of frictionless constraints which can be found here and maps them into the equivalent pandera checks.

Return type

Optional[Dict]

Returns

a dictionary of pandera Check objects which capture the standard constraint logic of a frictionless schema field.

property coerce: bool#

Determine whether values within this field should be coerced.

This currently returns True for all fields within a frictionless schema.

Return type

bool

property dtype: str#

Determine what type of field this is, so we can feed that into DataType. If no type is specified in the frictionless schema, we default to string values.

Return type

str

Returns

the pandas-compatible representation of this field type as a string.

property nullable: bool#

Determine whether this field can contain missing values.

If a field is a primary key, this will return False.

Return type

bool

property regex: bool#

Determine whether this field name should be used for regex matches.

This currently returns False for all fields within a frictionless schema.

Return type

bool

property required: bool#

Determine whether this field must exist within the data.

This currently returns True for all fields within a frictionless schema.

Return type

bool

to_pandera_column()[source]#

Export this field to a column spec dictionary.

Return type

Dict

property unique: bool#

Determine whether this field must contain unique (non-duplicate) values.

If a field is a primary key, this will return True.

Return type

bool

Mypy#

new in 0.8.0

Pandera integrates with mypy to provide static type-linting of dataframes, relying on pandas-stubs for typing information.

pip install pandera[mypy]

Then enable the plugin in your mypy.ini or setup.cfg file:

[mypy]
plugins = pandera.mypy

Note

Mypy static type-linting is supported only for pandas dataframes.

Warning

This functionality is experimental 🧪. Since the pandas-stubs type stub annotations don’t always match the official pandas effort to support type annotations, installing the pandera[mypy] extra may yield false positives in your pandas code, many of which are documented in tests/mypy/modules.

We encourage beta users to file an issue if they find any false positives or negatives being reported by mypy. A list of such issues can be found here.

In the example below, we define a few schemas to see how type-linting with pandera works.

from typing import cast

import pandas as pd

import pandera as pa
from pandera.typing import DataFrame, Series


class Schema(pa.SchemaModel):
    id: Series[int]
    name: Series[str]


class SchemaOut(pa.SchemaModel):
    age: Series[int]


class AnotherSchema(pa.SchemaModel):
    id: Series[int]
    first_name: Series[str]

The mypy linter will complain if the output type of the function body doesn’t match the function’s return signature.

def fn(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay


def fn_pipe_incorrect_type(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[AnotherSchema])  # mypy error
    # error: Argument 1 to "pipe" of "NDFrame" has incompatible type "Type[DataFrame[Any]]";  # noqa
    # expected "Union[Callable[..., DataFrame[SchemaOut]], Tuple[Callable[..., DataFrame[SchemaOut]], str]]"  [arg-type]  # noqa


def fn_assign_copy(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30)  # mypy error
    # error: Incompatible return value type (got "pandas.core.frame.DataFrame",
    # expected "pandera.typing.pandas.DataFrame[SchemaOut]")  [return-value]

It’ll also complain if the input type doesn’t match the expected input type. Note that we’re using the pandera.typing.pandas.DataFrame generic type to define dataframes that are validated against the SchemaModel type variable on initialization.

schema_df = DataFrame[Schema]({"id": [1], "name": ["foo"]})
pandas_df = pd.DataFrame({"id": [1], "name": ["foo"]})
another_df = DataFrame[AnotherSchema]({"id": [1], "first_name": ["foo"]})


fn(schema_df)  # mypy okay

fn(pandas_df)  # mypy error
# error: Argument 1 to "fn" has incompatible type "pandas.core.frame.DataFrame";  # noqa
# expected "pandera.typing.pandas.DataFrame[Schema]"  [arg-type]

fn(another_df)  # mypy error
# error: Argument 1 to "fn" has incompatible type "DataFrame[AnotherSchema]";
# expected "DataFrame[Schema]"  [arg-type]

To make mypy happy with respect to the return type, you can either initialize a dataframe of the expected type:

def fn_pipe_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay

Note

If you use the approach above with the check_types() decorator, pandera will do its best not to validate the dataframe twice if it’s already been initialized with the DataFrame[Schema](**data) syntax.

Or use typing.cast() to indicate to mypy that the return value of the function is of the correct type.

def fn_cast_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return cast(DataFrame[SchemaOut], df.assign(age=30))  # mypy okay
Limitations#

An important caveat to static type-linting with pandera dataframe types is that, since pandas dataframes are mutable objects, there’s no way for mypy to know whether a mutated instance of a SchemaModel-typed dataframe has the correct contents. Fortunately, we can simply rely on the check_types() decorator to verify that the output dataframe is valid.

Consider the examples below:

def fn_pipe_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay


def fn_cast_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return cast(DataFrame[SchemaOut], df.assign(age=30))  # mypy okay


@pa.check_types
def fn_mutate_inplace(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    out = df.assign(age=30).pipe(DataFrame[SchemaOut])
    out.drop(["age"], axis=1, inplace=True)
    return out  # okay for mypy, pandera raises error


@pa.check_types
def fn_assign_and_get_index(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(foo=30).iloc[:3]  # okay for mypy, pandera raises error

Even though the outputs of these functions are incorrect, mypy doesn’t catch the error during static type-linting but pandera will raise a SchemaError or SchemaErrors exception at runtime, depending on whether you’re doing lazy validation or not.

@pa.check_types
def fn_cast_dataframe_invalid(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return cast(
        DataFrame[SchemaOut], df
    )  # okay for mypy, pandera raises error

Pydantic#

new in 0.8.0

Using Pandera Schemas in Pydantic Models#

SchemaModel is fully compatible with pydantic. You can specify a SchemaModel in a pydantic BaseModel as you would any other field:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
import pydantic


class SimpleSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True)


class PydanticModel(pydantic.BaseModel):
    x: int
    df: DataFrame[SimpleSchema]


valid_df = pd.DataFrame({"str_col": ["hello", "world"]})
PydanticModel(x=1, df=valid_df)

invalid_df = pd.DataFrame({"str_col": ["hello", "hello"]})
PydanticModel(x=1, df=invalid_df)
Traceback (most recent call last):
...
ValidationError: 1 validation error for PydanticModel
df
series 'str_col' contains duplicate values:
1    hello
Name: str_col, dtype: object (type=value_error)

Other pandera components are also compatible with pydantic:

Note

The SeriesSchema, DataFrameSchema and schema_components types validate the schema object itself (e.g. if your pydantic BaseModel contains a schema object), not a pandas object.
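
For instance, here is a minimal sketch (the model and field names are illustrative) of using a DataFrameSchema as a pydantic field; pydantic accepts a schema object and rejects anything else:

import pandera as pa
import pydantic


class SchemaContainer(pydantic.BaseModel):
    pa_schema: pa.DataFrameSchema  # validated as a pandera schema object


SchemaContainer(pa_schema=pa.DataFrameSchema({"col": pa.Column(int)}))  # ok
SchemaContainer(pa_schema="not a schema")  # raises a pydantic ValidationError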

Using Pydantic Models in Pandera Schemas#

new in 0.10.0

You can also use a pydantic BaseModel in a pandera schema. Suppose you had a Record model:

from pydantic import BaseModel

import pandera as pa


class Record(BaseModel):
    name: str
    xcoord: int
    ycoord: int

The PydanticModel datatype enables you to specify the Record model as a row-wise type.

import pandas as pd
from pandera.engines.pandas_engine import PydanticModel


class PydanticSchema(pa.SchemaModel):
    """Pandera schema using the pydantic model."""

    class Config:
        """Config with dataframe-level data type."""

        dtype = PydanticModel(Record)
        coerce = True  # this is required, otherwise a SchemaInitError is raised

Note

By combining dtype=PydanticModel(...) and coerce=True, pandera will apply the pydantic model validation process to each row of the dataframe, converting the model back to a dictionary with the BaseModel.dict() method.

The equivalent pandera schema would look like this:

class PanderaSchema(pa.SchemaModel):
    """Pandera schema that's equivalent to PydanticSchema."""

    name: pa.typing.Series[str]
    xcoord: pa.typing.Series[int]
    ycoord: pa.typing.Series[int]
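
Both schemas validate the same data; a minimal usage sketch, assuming the definitions above and made-up values:

import pandas as pd

df = pd.DataFrame({
    "name": ["foo", "bar"],
    "xcoord": [1, 2],
    "ycoord": [3, 4],
})

PydanticSchema.validate(df)  # applies the Record model row by row
PanderaSchema.validate(df)   # equivalent column-wise validation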

Note

Since the PydanticModel datatype applies the BaseModel constructor to each row of the dataframe, using PydanticModel might not scale well with larger datasets.

If you want to help benchmark, consider contributing a benchmark script.

Note

Don’t see a library that you want supported? Check out the github issues to see if that library is in the roadmap. If it isn’t, open up a new issue to add support for it!

API#

Core

The core objects for defining pandera schemas

Data Types

Data types for type checking and coercion.

Schema Models

Alternative class-based API for defining pandera schemas.

Decorators

Decorators for integrating pandera schemas with python functions.

Schema Inference

Bootstrap schemas from real data

IO Utilities

Utility functions for reading/writing schemas

Data Synthesis Strategies

Module of functions for generating data from schemas.

Extensions

Utility functions for extending pandera functionality

Errors

Pandera-specific exceptions

Core#

Schemas#

pandera.schemas.DataFrameSchema

A light-weight pandas DataFrame validator.

pandera.schemas.SeriesSchema

Series validator.

Schema Components#

pandera.schema_components.Column

Validate types and properties of DataFrame columns.

pandera.schema_components.Index

Validate types and properties of a DataFrame Index.

pandera.schema_components.MultiIndex

Validate types and properties of a DataFrame MultiIndex.

Checks#

pandera.checks.Check

Check a pandas Series or DataFrame for certain properties.

pandera.hypotheses.Hypothesis

Special type of Check that defines hypothesis tests on data.

Data Types#

Library-agnostic dtypes#

pandera.dtypes.DataType

Base class of all Pandera data types.

pandera.dtypes.Bool

Semantic representation of a boolean data type.

pandera.dtypes.Timestamp

Semantic representation of a timestamp data type.

pandera.dtypes.DateTime

alias of pandera.dtypes.Timestamp

pandera.dtypes.Timedelta

Semantic representation of a delta time data type.

pandera.dtypes.Category

Semantic representation of a categorical data type.

pandera.dtypes.Float

Semantic representation of a floating data type.

pandera.dtypes.Float16

Semantic representation of a floating data type stored in 16 bits.

pandera.dtypes.Float32

Semantic representation of a floating data type stored in 32 bits.

pandera.dtypes.Float64

Semantic representation of a floating data type stored in 64 bits.

pandera.dtypes.Float128

Semantic representation of a floating data type stored in 128 bits.

pandera.dtypes.Int

Semantic representation of an integer data type.

pandera.dtypes.Int8

Semantic representation of an integer data type stored in 8 bits.

pandera.dtypes.Int16

Semantic representation of an integer data type stored in 16 bits.

pandera.dtypes.Int32

Semantic representation of an integer data type stored in 32 bits.

pandera.dtypes.Int64

Semantic representation of an integer data type stored in 64 bits.

pandera.dtypes.UInt

Semantic representation of an unsigned integer data type.

pandera.dtypes.UInt8

Semantic representation of an unsigned integer data type stored in 8 bits.

pandera.dtypes.UInt16

Semantic representation of an unsigned integer data type stored in 16 bits.

pandera.dtypes.UInt32

Semantic representation of an unsigned integer data type stored in 32 bits.

pandera.dtypes.UInt64

Semantic representation of an unsigned integer data type stored in 64 bits.

pandera.dtypes.Complex

Semantic representation of a complex number data type.

pandera.dtypes.Complex64

Semantic representation of a complex number data type stored in 64 bits.

pandera.dtypes.Complex128

Semantic representation of a complex number data type stored in 128 bits.

pandera.dtypes.Complex256

Semantic representation of a complex number data type stored in 256 bits.

pandera.dtypes.Decimal

Semantic representation of a decimal data type.

pandera.dtypes.String

Semantic representation of a string data type.

Pandas Dtypes#

Listed here for compatibility with pandera versions < 0.7. Passing native pandas dtypes to pandera components is preferred.

GeoPandas Dtypes#

new in 0.9.0

Pydantic Dtypes#

new in 0.10.0

pandera.engines.pandas_engine.PydanticModel

A pydantic model datatype applying to rows in a dataframe.

Utility functions#

pandera.dtypes.is_subdtype

Returns True if first argument is lower/equal in DataType hierarchy.

pandera.dtypes.is_float

Return True if pandera.dtypes.DataType is a float.

pandera.dtypes.is_int

Return True if pandera.dtypes.DataType is an integer.

pandera.dtypes.is_uint

Return True if pandera.dtypes.DataType is an unsigned integer.

pandera.dtypes.is_complex

Return True if pandera.dtypes.DataType is a complex number.

pandera.dtypes.is_numeric

Return True if pandera.dtypes.DataType is a numeric data type.

pandera.dtypes.is_bool

Return True if pandera.dtypes.DataType is a boolean.

pandera.dtypes.is_string

Return True if pandera.dtypes.DataType is a string.

pandera.dtypes.is_datetime

Return True if pandera.dtypes.DataType is a datetime.

pandera.dtypes.is_timedelta

Return True if pandera.dtypes.DataType is a timedelta.

pandera.dtypes.immutable

dataclasses.dataclass() decorator with different default values: frozen=True, init=False, repr=False.

Engines#

pandera.engines.engine.Engine

Base Engine metaclass.

pandera.engines.numpy_engine.Engine

Numpy data type engine.

pandera.engines.pandas_engine.Engine

Pandas data type engine.

Schema Models#

Schema Model#

pandera.model.SchemaModel(*args, **kwargs)

Definition of a DataFrameSchema.

Model Components#

pandera.model_components.Field(*[, eq, ne, ...])

Used to provide extra information about a field of a SchemaModel.

pandera.model_components.check(*fields[, regex])

Decorator to make SchemaModel method a column/index check function.

pandera.model_components.dataframe_check([_fn])

Decorator to make SchemaModel method a dataframe-wide check function.

Typing#

pandera.typing

Typing module.

Config#

pandera.model.BaseConfig

Define DataFrameSchema-wide options.

Decorators#

pandera.decorators.check_input

Validate function argument when function is called.

pandera.decorators.check_output

Validate function output.

pandera.decorators.check_io

Check schema for multiple inputs and outputs.

pandera.decorators.check_types

Validate function inputs and output based on type annotations.

Schema Inference#

pandera.schema_inference.infer_schema

Infer schema for pandas DataFrame or Series object.

IO Utilities#

The io module and built-in Hypothesis checks require a pandera installation with the corresponding extension, see the installation instructions for more details.

pandera.io.from_yaml

Create DataFrameSchema from yaml file.

pandera.io.to_yaml

Write DataFrameSchema to yaml file.

pandera.io.to_script

Write DataFrameSchema to a python script.

Data Synthesis Strategies#

pandera.strategies

Generate synthetic data from a schema definition.

Extensions#

pandera.extensions

pandera API extensions

Errors#

pandera.errors.SchemaError

Raised when object does not pass schema validation constraints.

pandera.errors.SchemaErrors

Raised when multiple schema errors are lazily collected into one error.

pandera.errors.SchemaInitError

Raised when schema initialization fails.

pandera.errors.SchemaDefinitionError

Raised when schema definition is invalid on object validation.

Contributing#

Whether you are a novice or experienced software developer, all contributions and suggestions are welcome!

Getting Started#

If you are looking to contribute to the pandera codebase, the best place to start is the GitHub “issues” tab. This is also a great place for filing bug reports and making suggestions for ways in which we can improve the code and documentation.

Contributing to the Codebase#

The code is hosted on GitHub, so you will need to use Git to clone the project and make changes to the codebase.

First create your own fork of pandera, then clone it:

# replace <my-username> with your github username
git clone https://github.com/<my-username>/pandera.git

Once you’ve obtained a copy of the code, create a development environment that’s separate from your existing Python environment so that you can make and test changes without compromising your own work environment.

An excellent guide on setting up python environments can be found here. Pandera offers an environment.yml to set up a conda-based environment and a requirements-dev.txt for a virtualenv.

Environment Setup#
Option 1: miniconda Setup#

Install miniconda, then run:

conda create -n pandera-dev python=3.8  # or any python version 3.7+
conda env update -n pandera-dev -f environment.yml
conda activate pandera-dev
pip install -e .
Option 2: virtualenv Setup#
pip install virtualenv
virtualenv .venv/pandera-dev
source .venv/pandera-dev/bin/activate
pip install -r requirements-dev.txt
pip install -e .
Run Tests#
pytest tests
Build Documentation Locally#
make docs
Adding New Dependencies#

To add new dependencies to the project, make sure to alter the environment.yml file. Then, to sync the dependencies from the environment.yml file to requirements-dev.txt, run the following command:

python scripts/generate_pip_deps_from_conda.py

Moreover, to add new dependencies to setup.py, it is necessary to add them to the _extras_require dictionary.

Set up pre-commit#

This project uses pre-commit to ensure that code standard checks pass locally before pushing to the remote project repo. Follow the installation instructions, then set up the hooks with pre-commit install. Afterward, black, pylint and mypy checks will run on every commit.

Make sure everything is working correctly by running:

pre-commit run --all
Making Changes#

Before making changes to the codebase or documentation, create a new branch with:

git checkout -b <my-branch>

We recommend following the branch-naming convention described in Making Pull Requests.

Run the Full Test Suite Locally#

Before submitting your changes for review, make sure to check that your changes do not break any tests by running:

# option 1: if you're working with conda (recommended)
$ make nox-conda

# option 2: if you're working with virtualenv
$ make nox

Option 2 assumes that you have python environments for all of the versions that pandera supports.

Using mamba (optional)#

You can also use mamba, a faster drop-in replacement for conda, to run the nox test suite. Simply install it via conda-forge, and make nox-conda should use it under the hood.

$ conda install -c conda-forge mamba
$ make nox-conda
Project Releases#

Releases are organized under milestones, which are associated with a corresponding branch. This project uses semantic versioning, and we recommend prioritizing issues associated with the next release.

Contributing Documentation#

Maybe the easiest, fastest, and most useful way to contribute to this project (and any other project) is to contribute documentation. If you find an API within the project that doesn’t have an example or description, or could be clearer in its explanation, contribute yours!

You can also find issues for improving documentation under the docs label. If you have ideas for documentation improvements, you can create a new issue here.

This project uses Sphinx for auto-documentation and RST syntax for docstrings. Once you have the code downloaded and you find something that is in need of some TLC, take a look at the Sphinx documentation or well-documented examples within the codebase for guidance on contributing.

You can build the html documentation by running nox -s docs. The built documentation can be found in docs/_build.

Contributing Bugfixes#

Bugs are reported under the bug label, so if you find a bug create a new issue here.

Contributing Enhancements#

New feature issues can be found under the enhancements label. You can request a feature by creating a new issue here.

Making Pull Requests#

Once your changes are ready to be submitted, make sure to push your changes to your fork of the GitHub repo before creating a pull request. Depending on the type of issue the pull request is resolving, your pull request should merge onto the appropriate branch:

Bugfixes#
  • branch naming convention: bugfix/<issue number> or bugfix/<bugfix-name>

  • pull request to: dev

Documentation#
  • branch naming convention: docs/<issue number> or docs/<doc-name>

  • pull request to: release/x.x.x branch if specified in the issue milestone, otherwise dev

Enhancements#
  • branch naming convention: feature/<issue number> or feature/<feature-name>

  • pull request to: release/x.x.x branch if specified in the issue milestone, otherwise dev

We will review your changes, and might ask you to make additional changes before your pull request is ready to merge. Once it’s ready, we will merge it, and you will have successfully contributed to the codebase!

Questions, Ideas, General Discussion#

Head on over to the discussion section if you have questions or ideas, want to show off something that you did with pandera, or want to discuss a topic related to the project.

Dataframe Schema Style Guides#

We have guidelines regarding dataframe and schema styles that are encouraged for each pull request:

  • If specifying a single column DataFrame, this can be expressed as a one-liner:

    DataFrameSchema({"col1": Column(...)})
    
  • If specifying one column with multiple lines, or multiple columns:

    DataFrameSchema(
        {
            "col1": Column(
                int,
                checks=[
                    Check(...),
                    Check(...),
                ]
            ),
        }
    )
    
  • If specifying columns with additional arguments that fit in one line:

    DataFrameSchema(
        {"a": Column(int, nullable=True)},
        strict=True
    )
    
  • If specifying columns with additional arguments that don’t fit in one line:

    DataFrameSchema(
        {
            "a": Column(
                int,
                nullable=True,
                coerce=True,
                ...
            ),
            "b": Column(
                ...,
            )
        },
        strict=True)
    

Deprecation policy#

This project adopts a rolling policy regarding the minimum supported version of its dependencies, based on NEP 29:

  • Python: 42 months

  • NumPy: 24 months

  • Pandas: 18 months

This means the latest minor (X.Y) version from N months prior. Patch versions (x.y.Z) are not pinned, and only the latest patch version available at the moment of publishing a pandera release is guaranteed to work.

How to Cite#

If you use pandera in the context of academic or industry research, please consider citing the paper and/or software package.

Paper#

@InProceedings{ niels_bantilan-proc-scipy-2020,
  author    = { {N}iels {B}antilan },
  title     = { pandera: {S}tatistical {D}ata {V}alidation of {P}andas {D}ataframes },
  booktitle = { {P}roceedings of the 19th {P}ython in {S}cience {C}onference },
  pages     = { 116 - 124 },
  year      = { 2020 },
  editor    = { {M}eghann {A}garwal and {C}hris {C}alloway and {D}illon {N}iederhut and {D}avid {S}hupe },
  doi       = { 10.25080/Majora-342d178e-010 }
}

Software Package#

software package

License and Credits#

pandera is licensed under the MIT license and is written and maintained by Niels Bantilan (niels@pandera.ci).
