Hypothesis Testing¶

pandera enables you to perform statistical hypothesis tests on your data.

Note

The hypothesis feature requires a pandera installation with hypotheses dependency set. See the installation instructions for more details.

Overview¶

The Hypothesis class defines built in methods, which can be called as in this example of a two-sample t-test:

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema, Check, Hypothesis

from scipy import stats

df = (
    pd.DataFrame({
        "height_in_feet": [6.5, 7, 6.1, 5.1, 4],
        "sex": ["M", "M", "F", "F", "F"]
    })
)

schema = DataFrameSchema({
    "height_in_feet": Column(
        float, [
            Hypothesis.two_sample_ttest(
                sample1="M",
                sample2="F",
                groupby="sex",
                relationship="greater_than",
                alpha=0.05,
                equal_var=True),
    ]),
    "sex": Column(str)
})

try:
    schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)

Column 'height_in_feet' failed series or dataframe validator 0: <Check two_sample_ttest: failed two sample ttest between 'M' and 'F'>

You can also define custom hypotheses by passing in functions to the test and relationship arguments.

The test function takes as input one or multiple array-like objects and should return a stat, which is the test statistic, and pvalue for assessing statistical significance. It also takes key-word arguments supplied by the test_kwargs dict when initializing a Hypothesis object.

The relationship function should take all of the outputs of test as positional arguments, in addition to key-word arguments supplied by the relationship_kwargs dict.

Here’s an implementation of the two-sample t-test that uses the scipy implementation:

def two_sample_ttest(array1, array2):
    # the "height_in_feet" series is first grouped by "sex" and then
    # passed into the custom `test` function as two separate arrays in the
    # order specified in the `samples` argument.
    return stats.ttest_ind(array1, array2)


def null_relationship(stat, pvalue, alpha=0.01):
    return pvalue / 2 >= alpha


schema = DataFrameSchema({
    "height_in_feet": Column(
        float, [
            Hypothesis(
                test=two_sample_ttest,
                samples=["M", "F"],
                groupby="sex",
                relationship=null_relationship,
                relationship_kwargs={"alpha": 0.05}
            )
    ]),
    "sex": Column(str, checks=Check.isin(["M", "F"]))
})

schema.validate(df)

	height_in_feet	sex
0	6.5	M
1	7.0	M
2	6.1	F
3	5.1	F
4	4.0	F

Wide Hypotheses¶

pandera is primarily designed to operate on long-form data (commonly known as tidy data), where each row is an observation and columns are attributes associated with the observation.

However, pandera also supports hypothesis testing on wide-form data to operate across columns in a DataFrame.

For example, if you want to make assertions about height across two groups, the tidy dataset and schema might look like this:

import pandas as pd
import pandera as pa

from pandera import Check, DataFrameSchema, Column, Hypothesis

df = pd.DataFrame({
    "height": [5.6, 7.5, 4.0, 7.9],
    "group": ["A", "B", "A", "B"],
})

schema = DataFrameSchema({
    "height": Column(
        float, Hypothesis.two_sample_ttest(
            "A", "B",
            groupby="group",
            relationship="less_than",
            alpha=0.05
        )
    ),
    "group": Column(str, Check(lambda s: s.isin(["A", "B"])))
})

schema.validate(df)

	height	group
0	5.6	A
1	7.5	B
2	4.0	A
3	7.9	B

The equivalent wide-form schema would look like this:

import pandas as pd
import pandera as pa

from pandera import DataFrameSchema, Column, Hypothesis

df = pd.DataFrame({
    "height_A": [5.6, 4.0],
    "height_B": [7.5, 7.9],
})

schema = DataFrameSchema(
    columns={
        "height_A": Column(float),
        "height_B": Column(float),
    },
    # define checks at the DataFrameSchema-level
    checks=Hypothesis.two_sample_ttest(
        "height_A", "height_B",
        relationship="less_than",
        alpha=0.05
    )
)

schema.validate(df)

	height_A	height_B
0	5.6	7.5
1	4.0	7.9