Hypothesis Testing
pandera enables you to perform statistical hypothesis tests on your data.
Note
The hypothesis feature requires a pandera installation that includes the hypotheses dependency set. See the installation instructions for more details.
Overview
The Hypothesis class defines built-in methods, which can be called as in this example of a two-sample t-test:
import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema, Check, Hypothesis
from scipy import stats

df = (
    pd.DataFrame({
        "height_in_feet": [6.5, 7, 6.1, 5.1, 4],
        "sex": ["M", "M", "F", "F", "F"]
    })
)

schema = DataFrameSchema({
    "height_in_feet": Column(
        float, [
            Hypothesis.two_sample_ttest(
                sample1="M",
                sample2="F",
                groupby="sex",
                relationship="greater_than",
                alpha=0.05,
                equal_var=True),
    ]),
    "sex": Column(str)
})

schema.validate(df)
Traceback (most recent call last):
...
pandera.SchemaError: <Schema Column: 'height_in_feet' type=float64> failed series validator 0: hypothesis_check: failed two sample ttest between 'M' and 'F'
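To see why validation fails on this small dataset, the statistic behind the check can be worked out by hand. The sketch below assumes that equal_var=True corresponds to the standard pooled-variance (Student) two-sample t-test. The helper name pooled_t_statistic is made up for illustration, and the critical value 2.353 is copied from a standard t table (one-sided, alpha = 0.05, 3 degrees of freedom) rather than computed.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(sample1, sample2):
    """Student's two-sample t statistic with a pooled variance estimate."""
    n1, n2 = len(sample1), len(sample2)
    dof = n1 + n2 - 2
    # Weighted average of the two sample variances.
    pooled_var = (
        (n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)
    ) / dof
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(sample1) - mean(sample2)) / std_err, dof


# The same heights as in the DataFrame above, split by sex.
males = [6.5, 7.0]
females = [6.1, 5.1, 4.0]

t_stat, dof = pooled_t_statistic(males, females)

# One-sided critical value for alpha = 0.05 and 3 degrees of freedom,
# taken from a standard t table (an assumption of this sketch).
T_CRITICAL = 2.353

# "greater_than" is supported only if the statistic exceeds the critical value.
rejects_null = t_stat > T_CRITICAL
print(f"t = {t_stat:.2f}, dof = {dof}, rejects null: {rejects_null}")
```

The statistic comes out at roughly 2.09, which is below the critical value, so the one-sided test does not reject the null hypothesis at the 5% level. That is consistent with the SchemaError shown above: with only five observations, the difference in means is not large enough to be significant.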
You can also define custom hypotheses by passing in functions to the test and relationship arguments.
The test function takes one or more array-like objects as input and should return a stat, which is the test statistic, and a pvalue for assessing statistical significance. It also takes keyword arguments supplied by the test_kwargs dict when initializing a Hypothesis object.
The relationship function should take all of the outputs of test as positional arguments, in addition to keyword arguments supplied by the relationship_kwargs dict.
Here’s a custom hypothesis that reimplements the two-sample t-test on top of scipy:
def two_sample_ttest(array1, array2):
    # the "height_in_feet" series is first grouped by "sex" and then
    # passed into the custom `test` function as two separate arrays in the
    # order specified in the `samples` argument.
    return stats.ttest_ind(array1, array2)


def null_relationship(stat, pvalue, alpha=0.01):
    # the check passes when half of scipy's two-sided p-value is at least
    # alpha, that is, when the null hypothesis is not rejected at that level.
    return pvalue / 2 >= alpha


schema = DataFrameSchema({
    "height_in_feet": Column(
        float, [
            Hypothesis(
                test=two_sample_ttest,
                samples=["M", "F"],
                groupby="sex",
                relationship=null_relationship,
                relationship_kwargs={"alpha": 0.05}
            )
    ]),
    "sex": Column(str, checks=Check.isin(["M", "F"]))
})

schema.validate(df)
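The contract between test and relationship can also be sketched in plain Python, without pandera or scipy. This is only an illustration of the calling convention described above, not pandera's internal code: run_hypothesis and permutation_test are hypothetical names, and an exact permutation test stands in for the t-test so that the sketch has no third-party dependencies.

```python
from itertools import combinations


def permutation_test(sample1, sample2):
    """Exact one-sided permutation test on the difference in means.

    Returns (stat, pvalue), the two outputs a test function should produce.
    """
    pooled = list(sample1) + list(sample2)
    n1, n2 = len(sample1), len(sample2)
    total = sum(pooled)

    def mean_diff(indices):
        group1_sum = sum(pooled[i] for i in indices)
        return group1_sum / n1 - (total - group1_sum) / n2

    # Observed split: sample1 occupies the first n1 positions of pooled.
    stat = mean_diff(range(n1))
    # Compare the observed split against every possible relabelling.
    diffs = [mean_diff(idx) for idx in combinations(range(len(pooled)), n1)]
    # The small tolerance guards against floating-point noise in ties.
    pvalue = sum(d >= stat - 1e-12 for d in diffs) / len(diffs)
    return stat, pvalue


def null_relationship(stat, pvalue, alpha=0.01):
    # Receives the outputs of the test positionally, plus keyword arguments.
    return pvalue >= alpha


def run_hypothesis(test, relationship, samples,
                   test_kwargs=None, relationship_kwargs=None):
    """Hypothetical driver showing how the pieces are wired together."""
    outputs = test(*samples, **(test_kwargs or {}))
    return relationship(*outputs, **(relationship_kwargs or {}))


males = [6.5, 7.0]
females = [6.1, 5.1, 4.0]

stat, pvalue = permutation_test(males, females)
passed = run_hypothesis(
    permutation_test,
    null_relationship,
    samples=[males, females],
    relationship_kwargs={"alpha": 0.05},
)
print(f"stat = {stat:.3f}, pvalue = {pvalue:.2f}, passed = {passed}")
```

Five observations allow only 10 relabellings, so the smallest achievable p-value is 0.1. The null hypothesis therefore cannot be rejected at alpha = 0.05, and the relationship returns True.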
Wide Hypotheses
pandera is primarily designed to operate on long-form data (commonly known as tidy data), where each row is an observation and columns are attributes associated with the observation.
However, pandera also supports hypothesis testing on wide-form data to operate across columns in a DataFrame.
For example, if you want to make assertions about height across two groups, the tidy dataset and schema might look like this:
import pandas as pd
import pandera as pa

from pandera import Check, DataFrameSchema, Column, Hypothesis

df = pd.DataFrame({
    "height": [5.6, 7.5, 4.0, 7.9],
    "group": ["A", "B", "A", "B"],
})

schema = DataFrameSchema({
    "height": Column(
        float, Hypothesis.two_sample_ttest(
            "A", "B",
            groupby="group",
            relationship="less_than",
            alpha=0.05
        )
    ),
    "group": Column(str, Check(lambda s: s.isin(["A", "B"])))
})

schema.validate(df)
The equivalent wide-form schema would look like this:
import pandas as pd
import pandera as pa

from pandera import DataFrameSchema, Column, Hypothesis

df = pd.DataFrame({
    "height_A": [5.6, 4.0],
    "height_B": [7.5, 7.9],
})

schema = DataFrameSchema(
    columns={
        # use the builtin float type; `Float` is not imported in this example
        "height_A": Column(float),
        "height_B": Column(float),
    },
    # define checks at the DataFrameSchema-level
    checks=Hypothesis.two_sample_ttest(
        "height_A", "height_B",
        relationship="less_than",
        alpha=0.05
    )
)

schema.validate(df)
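The two layouts carry the same information: grouping the long-form height column by group yields the same samples that the wide-form columns hold directly. A minimal plain-Python sketch of that correspondence, using dicts and lists instead of DataFrames (the helper name long_to_wide is hypothetical and not part of pandera or pandas):

```python
from collections import defaultdict

# Long-form (tidy) records: one observation per row.
long_rows = [
    {"height": 5.6, "group": "A"},
    {"height": 7.5, "group": "B"},
    {"height": 4.0, "group": "A"},
    {"height": 7.9, "group": "B"},
]


def long_to_wide(rows, value_key, group_key):
    """Collect values into one column per group, preserving row order."""
    columns = defaultdict(list)
    for row in rows:
        columns[f"{value_key}_{row[group_key]}"].append(row[value_key])
    return dict(columns)


wide = long_to_wide(long_rows, value_key="height", group_key="group")
print(wide)  # {'height_A': [5.6, 4.0], 'height_B': [7.5, 7.9]}
```

With real DataFrames, the same reshaping is usually done with pandas methods such as DataFrame.pivot (long to wide) or DataFrame.melt (wide to long).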