pandera.api.hypotheses.Hypothesis

class pandera.api.hypotheses.Hypothesis(test, samples=None, groupby=None, relationship='equal', alpha=None, test_kwargs=None, relationship_kwargs=None, name=None, error=None, raise_warning=False, n_failure_cases=None, title=None, description=None, statistics=None, strategy=None, **check_kwargs)

Special type of Check that defines hypothesis tests on data.

Perform a hypothesis test on a Series or DataFrame.

Parameters
  • test (Callable) – The hypothesis test function. It should take one or more arrays as positional arguments and return a test statistic and a p-value. The arrays passed into the test function are determined by the samples argument.

  • samples (Union[str, List[str], None]) –

    For Column or SeriesSchema hypotheses, samples refers to the group keys in the groupby column(s), which are used to split the Series into a dict of Series. The Series for the selected group keys are passed into the test function as positional arguments.

    For DataFrame-level hypotheses, samples refers to a column or multiple columns to pass into the test function. The samples column(s) are passed into the test function as positional arguments.

  • groupby (Union[str, List[str], Callable, None]) –

    If a string or list of strings is provided, these columns are used to group the Column Series. If a callable is passed, the expected signature is DataFrame -> DataFrameGroupBy. The function has access to the entire dataframe, but the Column.name is selected from this DataFrameGroupBy object so that a SeriesGroupBy object is passed into the hypothesis_check function.

    Specifying this argument changes the fn signature to dict[str|tuple[str], Series] -> bool|pd.Series[bool], where specific groups can be obtained from the input dict.

  • relationship (Union[str, Callable]) –

    The relationship condition imposed on the result of the hypothesis test. Either the name of a built-in relationship or a callable (such as a lambda function) can be supplied.

    Available built-in relationships are: “greater_than”, “less_than”, “not_equal” or “equal”, where “equal” is the null hypothesis.

    If callable, the function should have the signature (stat: float, pvalue: float, **kwargs), where stat is the hypothesis test statistic, pvalue assesses statistical significance, and **kwargs are other arguments supplied via the relationship_kwargs argument.

    Default is “equal” for the null hypothesis.

  • alpha (Optional[float]) – significance level, if applicable to the hypothesis check.

  • test_kwargs (dict) – Keyword arguments to be supplied to the test.

  • relationship_kwargs (dict) – Keyword arguments to be supplied to the relationship function, e.g. alpha could be used to specify a significance threshold in a t-test.

  • name (Optional[str]) – optional name of hypothesis test

  • error (Optional[str]) – error message to show

  • raise_warning (bool) – if True, raise a UserWarning instead of a SchemaError when this specific check fails, so that no exception is thrown. This option should be used carefully, in cases where a failing check is informational and shouldn’t stop execution of the program.

  • n_failure_cases (Optional[int]) – report the first n unique failure cases. If None, report all failure cases.

  • title (Optional[str]) – A human-readable label for the check.

  • description (Optional[str]) – An arbitrary textual description of the check.

  • statistics (Optional[Dict[str, Any]]) – kwargs to pass into the check function. These values are serialized and represent the constraints of the checks.

  • strategy (Optional[SearchStrategy]) – A hypothesis strategy, used for implementing data synthesis strategies for this check.

  • check_kwargs – keyword arguments to pass into check_fn
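
The contract between relationship and relationship_kwargs described above can be sketched in plain Python, without pandera. In this minimal example, mean_greater and the numeric values are hypothetical names and data chosen for illustration; the sketch only shows how a callable with the signature (stat, pvalue, **kwargs) combines with a relationship_kwargs dict, not how the library invokes it internally.

```python
def mean_greater(stat: float, pvalue: float, alpha: float = 0.1) -> bool:
    """One-sided relationship: the first sample's mean exceeds the second's.

    Halving the p-value assumes the test reports a two-sided p-value.
    """
    return stat > 0 and pvalue / 2 < alpha


# Hypothetical test output: a positive statistic with a two-sided p-value.
stat, pvalue = 2.5, 0.03

# Values in relationship_kwargs override the callable's defaults.
relationship_kwargs = {"alpha": 0.01}

print(mean_greater(stat, pvalue))                         # True
print(mean_greater(stat, pvalue, **relationship_kwargs))  # False
```

With the default alpha of 0.1 the halved p-value (0.015) is below the threshold, so the relationship holds; tightening alpha to 0.01 through relationship_kwargs makes the same test result fail.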

Examples

Define a two-sample hypothesis test using scipy.

>>> import pandas as pd
>>> import pandera as pa
>>>
>>> from scipy import stats
>>>
>>> schema = pa.DataFrameSchema({
...     "height_in_feet": pa.Column(float, [
...         pa.Hypothesis(
...             test=stats.ttest_ind,
...             samples=["A", "B"],
...             groupby="group",
...             # assert that the mean height of group "A" is greater
...             # than that of group "B"
...             relationship=lambda stat, pvalue, alpha=0.1: (
...                 stat > 0 and pvalue / 2 < alpha
...             ),
...             # set alpha criterion to 5%
...             relationship_kwargs={"alpha": 0.05}
...         )
...     ]),
...     "group": pa.Column(str),
... })
>>> df = (
...     pd.DataFrame({
...         "height_in_feet": [8.1, 7, 5.2, 5.1, 4],
...         "group": ["A", "A", "B", "B", "B"]
...     })
... )
>>> schema.validate(df)[["height_in_feet", "group"]]
   height_in_feet group
0             8.1     A
1             7.0     A
2             5.2     B
3             5.1     B
4             4.0     B

See the hypothesis testing section of the pandera documentation for more usage details.

Attributes

RELATIONSHIPS

BACKEND_REGISTRY

CHECK_FUNCTION_REGISTRY

REGISTERED_CUSTOM_CHECKS

Methods

__init__

Perform a hypothesis test on a Series or DataFrame.

one_sample_ttest

Calculate a t-test for the mean of one sample.

two_sample_ttest

Calculate a t-test for the means of two samples.

__call__

Validate pandas DataFrame or Series.