Hypothesis Testing#

pandera enables you to perform statistical hypothesis tests on your data.

Note

The hypothesis feature requires a pandera installation with hypotheses dependency set. See the installation instructions for more details.

Overview#

The Hypothesis class defines built in methods, which can be called as in this example of a two-sample t-test:

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema, Check, Hypothesis

from scipy import stats

df = (
    pd.DataFrame({
        "height_in_feet": [6.5, 7, 6.1, 5.1, 4],
        "sex": ["M", "M", "F", "F", "F"]
    })
)

schema = DataFrameSchema({
    "height_in_feet": Column(
        float, [
            Hypothesis.two_sample_ttest(
                sample1="M",
                sample2="F",
                groupby="sex",
                relationship="greater_than",
                alpha=0.05,
                equal_var=True),
    ]),
    "sex": Column(str)
})

schema.validate(df)
Traceback (most recent call last):
...
pandera.SchemaError: <Schema Column: 'height_in_feet' type=float64> failed series validator 0: hypothesis_check: failed two sample ttest between 'M' and 'F'

You can also define custom hypotheses by passing in functions to the test and relationship arguments.

The test function takes as input one or multiple array-like objects and should return a stat, which is the test statistic, and pvalue for assessing statistical significance. It also takes key-word arguments supplied by the test_kwargs dict when initializing a Hypothesis object.

The relationship function should take all of the outputs of test as positional arguments, in addition to key-word arguments supplied by the relationship_kwargs dict.

Here’s an implementation of the two-sample t-test that uses the scipy implementation:

def two_sample_ttest(array1, array2):
    # the "height_in_feet" series is first grouped by "sex" and then
    # passed into the custom `test` function as two separate arrays in the
    # order specified in the `samples` argument.
    return stats.ttest_ind(array1, array2)


def null_relationship(stat, pvalue, alpha=0.01):
    return pvalue / 2 >= alpha


schema = DataFrameSchema({
    "height_in_feet": Column(
        float, [
            Hypothesis(
                test=two_sample_ttest,
                samples=["M", "F"],
                groupby="sex",
                relationship=null_relationship,
                relationship_kwargs={"alpha": 0.05}
            )
    ]),
    "sex": Column(str, checks=Check.isin(["M", "F"]))
})

schema.validate(df)

Wide Hypotheses#

pandera is primarily designed to operate on long-form data (commonly known as tidy data), where each row is an observation and columns are attributes associated with the observation.

However, pandera also supports hypothesis testing on wide-form data to operate across columns in a DataFrame.

For example, if you want to make assertions about height across two groups, the tidy dataset and schema might look like this:

import pandas as pd
import pandera as pa

from pandera import Check, DataFrameSchema, Column, Hypothesis

df = pd.DataFrame({
    "height": [5.6, 7.5, 4.0, 7.9],
    "group": ["A", "B", "A", "B"],
})

schema = DataFrameSchema({
    "height": Column(
        float, Hypothesis.two_sample_ttest(
            "A", "B",
            groupby="group",
            relationship="less_than",
            alpha=0.05
        )
    ),
    "group": Column(str, Check(lambda s: s.isin(["A", "B"])))
})

schema.validate(df)

The equivalent wide-form schema would look like this:

import pandas as pd
import pandera as pa

from pandera import DataFrameSchema, Column, Hypothesis

df = pd.DataFrame({
    "height_A": [5.6, 4.0],
    "height_B": [7.5, 7.9],
})

schema = DataFrameSchema(
    columns={
        "height_A": Column(Float),
        "height_B": Column(Float),
    },
    # define checks at the DataFrameSchema-level
    checks=Hypothesis.two_sample_ttest(
        "height_A", "height_B",
        relationship="less_than",
        alpha=0.05
    )
)

schema.validate(df)