Validating with Checks¶
Checks are one of the fundamental constructs of pandera. They allow you to specify properties about dataframes, columns, indexes, and series objects, which are applied after data type validation/coercion and the core pandera checks are applied to the data to be validated.
Important
You can learn more about how data type validation works Data Type Validation.
Checking column properties¶
Check
objects accept a function as a required argument, which is
expected to take a pa.Series
input and output a boolean
or a Series
of boolean values. For the check to pass, all of the elements in the boolean
series must evaluate to True
, for example:
import pandera as pa
import pandas as pd
check_lt_10 = pa.Check(lambda s: s <= 10)
schema = pa.DataFrameSchema({"column1": pa.Column(int, check_lt_10)})
schema.validate(pd.DataFrame({"column1": range(10)}))
column1 | |
---|---|
0 | 0 |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
6 | 6 |
7 | 7 |
8 | 8 |
9 | 9 |
Multiple checks can be applied to a column:
schema = pa.DataFrameSchema({
"column2": pa.Column(str, [
pa.Check(lambda s: s.str.startswith("value")),
pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
]),
})
Built-in Checks¶
For common validation tasks, built-in checks are available in pandera
.
import pandera as pa
schema = pa.DataFrameSchema({
"small_values": pa.Column(float, pa.Check.less_than(100)),
"one_to_three": pa.Column(int, pa.Check.isin([1, 2, 3])),
"phone_number": pa.Column(str, pa.Check.str_matches(r'^[a-z0-9-]+$')),
})
See the Check
API reference for a complete list of built-in checks.
Vectorized vs. Element-wise Checks¶
By default, Check
objects operate on pd.Series
objects. If you want to make atomic checks for each element in the Column, then
you can provide the element_wise=True
keyword argument:
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema({
"a": pa.Column(
int,
checks=[
# a vectorized check that returns a bool
pa.Check(lambda s: s.mean() > 5, element_wise=False),
# a vectorized check that returns a boolean series
pa.Check(lambda s: s > 0, element_wise=False),
# an element-wise check that returns a bool
pa.Check(lambda x: x > 0, element_wise=True),
]
),
})
df = pd.DataFrame({"a": [4, 4, 5, 6, 6, 7, 8, 9]})
schema.validate(df)
a | |
---|---|
0 | 4 |
1 | 4 |
2 | 5 |
3 | 6 |
4 | 6 |
5 | 7 |
6 | 8 |
7 | 9 |
element_wise == False
by default so that you can take advantage of the
speed gains provided by the pd.Series
API by writing vectorized
checks.
Handling Null Values¶
By default, pandera
drops null values before passing the objects to
validate into the check function. For Series
objects null elements are
dropped (this also applies to columns), and for DataFrame
objects, rows
with any null value are dropped.
If you want to check the properties of a pandas data structure while preserving
null values, specify Check(..., ignore_na=False)
when defining a check.
Note that this is different from the nullable
argument in Column
objects, which simply checks for null values in a column.
Column Check Groups¶
Column
checks support grouping by a different column so that you
can make assertions about subsets of the column of interest. This
changes the function signature of the Check
function so that its
input is a dict where keys are the group names and values are subsets of the
series being validated.
Specifying groupby
as a column name, list of column names, or
callable changes the expected signature of the Check
function argument to:
Callable[Dict[Any, pd.Series] -> Union[bool, pd.Series]
where the dict keys are the discrete keys in the groupby
columns.
In the example below we define a DataFrameSchema
with column checks
for height_in_feet
using a single column, multiple columns, and a more
complex groupby function that creates a new column age_less_than_15
on the
fly.
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema({
"height_in_feet": pa.Column(
float, [
# groupby as a single column
pa.Check(
lambda g: g[False].mean() > 6,
groupby="age_less_than_20"),
# define multiple groupby columns
pa.Check(
lambda g: g[(True, "F")].sum() == 9.1,
groupby=["age_less_than_20", "sex"]),
# groupby as a callable with signature:
# (DataFrame) -> DataFrameGroupBy
pa.Check(
lambda g: g[(False, "M")].median() == 6.75,
groupby=lambda df: (
df.assign(age_less_than_15=lambda d: d["age"] < 15)
.groupby(["age_less_than_15", "sex"]))),
]),
"age": pa.Column(int, pa.Check(lambda s: s > 0)),
"age_less_than_20": pa.Column(bool),
"sex": pa.Column(str, pa.Check(lambda s: s.isin(["M", "F"])))
})
df = (
pd.DataFrame({
"height_in_feet": [6.5, 7, 6.1, 5.1, 4],
"age": [25, 30, 21, 18, 13],
"sex": ["M", "M", "F", "F", "F"]
})
.assign(age_less_than_20=lambda x: x["age"] < 20)
)
schema.validate(df)
height_in_feet | age | sex | age_less_than_20 | |
---|---|---|---|---|
0 | 6.5 | 25 | M | False |
1 | 7.0 | 30 | M | False |
2 | 6.1 | 21 | F | False |
3 | 5.1 | 18 | F | True |
4 | 4.0 | 13 | F | True |
Wide Checks¶
pandera
is primarily designed to operate on long-form data (commonly known
as tidy data), where each row
is an observation and each column is an attribute associated with an
observation.
However, pandera
also supports checks on wide-form data to operate across
columns in a DataFrame
. For example, if you want to make assertions about
height
across two groups, the tidy dataset and schema might look like this:
import pandas as pd
import pandera as pa
df = pd.DataFrame({
"height": [5.6, 6.4, 4.0, 7.1],
"group": ["A", "B", "A", "B"],
})
schema = pa.DataFrameSchema({
"height": pa.Column(
float,
pa.Check(lambda g: g["A"].mean() < g["B"].mean(), groupby="group")
),
"group": pa.Column(str)
})
schema.validate(df)
height | group | |
---|---|---|
0 | 5.6 | A |
1 | 6.4 | B |
2 | 4.0 | A |
3 | 7.1 | B |
Whereas the equivalent wide-form schema would look like this:
df = pd.DataFrame({
"height_A": [5.6, 4.0],
"height_B": [6.4, 7.1],
})
schema = pa.DataFrameSchema(
columns={
"height_A": pa.Column(float),
"height_B": pa.Column(float),
},
# define checks at the DataFrameSchema-level
checks=pa.Check(
lambda df: df["height_A"].mean() < df["height_B"].mean()
)
)
schema.validate(df)
height_A | height_B | |
---|---|---|
0 | 5.6 | 6.4 |
1 | 4.0 | 7.1 |
You can see that when checks are supplied to the DataFrameSchema
checks
key-word argument, the check function should expect a pandas DataFrame
and
should return a bool
, a Series
of booleans, or a DataFrame
of
boolean values.
Raise Warning Instead of Error on Check Failure¶
In some cases, you might want to raise a warning and continue execution
of your program. The Check
and Hypothesis
classes and their built-in
methods support the keyword argument raise_warning
, which is False
by default. If set to True
, the check will warn with a SchemaWarning
instead
of raising a SchemaError
exception.
Note
Use this feature carefully! If the check is for informational purposes and
not critical for data integrity then use raise_warning=True
. However,
if the assumptions expressed in a Check
are necessary conditions to
considering your data valid, do not set this option to true.
One scenario where you’d want to do this would be in a data pipeline that does some preprocessing, checks for normality in certain columns, and writes the resulting dataset to a table. In this case, you want to see if your normality assumptions are not fulfilled by certain columns, but you still want the resulting table for further analysis.
import warnings
import numpy as np
import pandas as pd
import pandera as pa
from scipy.stats import normaltest
np.random.seed(1000)
df = pd.DataFrame({
"var1": np.random.normal(loc=0, scale=1, size=1000),
"var2": np.random.uniform(low=0, high=10, size=1000),
})
normal_check = pa.Hypothesis(
test=normaltest,
samples="normal_variable",
# null hypotheses: sample comes from a normal distribution. The
# relationship function checks if we cannot reject the null hypothesis,
# i.e. the p-value is greater or equal to alpha.
relationship=lambda stat, pvalue, alpha=0.05: pvalue >= alpha,
error="normality test",
raise_warning=True,
)
schema = pa.DataFrameSchema(
columns={
"var1": pa.Column(checks=normal_check),
"var2": pa.Column(checks=normal_check),
}
)
# catch and print warnings
with warnings.catch_warnings(record=True) as caught_warnings:
warnings.simplefilter("always")
validated_df = schema(df)
for warning in caught_warnings:
print(warning.message)
Column 'var2' failed series or dataframe validator 0: <Check normaltest: normality test>
Registering Custom Checks¶
pandera
now offers an interface to register custom checks functions so
that they’re available in the Check
namespace. See
the extensions document for more information.