pandera.api.hypotheses.Hypothesis

class pandera.api.hypotheses.Hypothesis(test, samples=None, groupby=None, relationship='equal', alpha=None, test_kwargs=None, relationship_kwargs=None, name=None, error=None, raise_warning=False, n_failure_cases=None, title=None, description=None, statistics=None, strategy=None, **check_kwargs)[source]

Special type of Check that defines hypothesis tests on data.

Perform a hypothesis test on a Series or DataFrame.

Parameters:
  • test (Callable) – The hypothesis test function. It should take one or more arrays as positional arguments and return a test statistic and a p-value. The arrays passed into the test function are determined by the samples argument.

  • samples (Union[str, List[str], None]) –

    For Column or SeriesSchema hypotheses, samples refers to the group keys in the groupby column(s) used to group the Series into a dict of Series. The Series for the selected group keys are passed into the test function as positional arguments.

    For DataFrame-level hypotheses, samples refers to a column or multiple columns to pass into the test function. The samples column(s) are passed into the test function as positional arguments.

  • groupby (Union[str, List[str], Callable, None]) –

    If a string or list of strings is provided, then these columns are used to group the Column Series. If a callable is passed, the expected signature is DataFrame -> DataFrameGroupBy. The function has access to the entire dataframe, but the Column.name is selected from this DataFrameGroupBy object so that a SeriesGroupBy object is passed into the hypothesis_check function.

    Specifying this argument changes the fn signature to: dict[str|tuple[str], Series] -> bool|pd.Series[bool]

    Where specific groups can be obtained from the input dict.

  • relationship (Union[str, Callable]) –

    Represents what relationship conditions are imposed on the hypothesis test. A function or lambda function can be supplied.

    Available built-in relationships are: “greater_than”, “less_than”, “not_equal” or “equal”, where “equal” is the null hypothesis.

    If a callable is supplied, it should have the signature (stat: float, pvalue: float, **kwargs), where stat is the hypothesis test statistic, pvalue assesses statistical significance, and **kwargs are other arguments supplied via the relationship_kwargs argument.

    Default is “equal” for the null hypothesis.

  • alpha (Optional[float, None]) – significance level, if applicable to the hypothesis check.

  • test_kwargs (dict) – Keyword arguments to be supplied to the test.

  • relationship_kwargs (dict) – Keyword arguments to be supplied to the relationship function. For example, alpha could be used to specify a threshold in a t-test.

  • name (Optional[str, None]) – optional name of hypothesis test

  • error (Optional[str, None]) – error message to show

  • raise_warning (bool) – if True, raise a SchemaWarning instead of a SchemaError when this check fails, so that no exception is thrown. Use this option carefully, in cases where a failing check is informational and shouldn’t stop execution of the program.

  • n_failure_cases (Optional[int, None]) – report the first n unique failure cases. If None, report all failure cases.

  • title (Optional[str, None]) – A human-readable label for the check.

  • description (Optional[str, None]) – An arbitrary textual description of the check.

  • statistics (Optional[Dict[str, Any], None]) – kwargs to pass into the check function. These values are serialized and represent the constraints of the checks.

  • strategy (Optional[Any, None]) – A hypothesis strategy, used for implementing data synthesis strategies for this check.

  • check_kwargs – keyword arguments to pass into check_fn
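
As a rough illustration of how samples and groupby interact for a Column-level hypothesis, the sketch below reproduces the grouping step with plain pandas. It is not pandera's internal code: the helper name select_samples is hypothetical and only shows which arrays would reach the test function as positional arguments, and in which order.

```python
import pandas as pd

df = pd.DataFrame({
    "height_in_feet": [8.1, 7, 5.2, 5.1, 4],
    "group": ["A", "A", "B", "B", "B"],
})


def select_samples(frame, column, groupby, samples):
    """Hypothetical helper: group one column and pick the requested groups."""
    # Iterating over the grouped column yields one (key, Series) pair per group.
    grouped = {key: series for key, series in frame.groupby(groupby)[column]}
    # The order of `samples` sets the order of the positional arguments.
    return [grouped[key] for key in samples]


sample_a, sample_b = select_samples(df, "height_in_feet", "group", ["A", "B"])
print(sample_a.tolist())
print(sample_b.tolist())
```

With samples=["A", "B"], the test function would be called roughly as test(sample_a, sample_b, **test_kwargs).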

Examples:

Define a two-sample hypothesis test using scipy.

>>> import pandas as pd
>>> import pandera as pa
>>>
>>> from scipy import stats
>>>
>>> schema = pa.DataFrameSchema({
...     "height_in_feet": pa.Column(float, [
...         pa.Hypothesis(
...             test=stats.ttest_ind,
...             samples=["A", "B"],
...             groupby="group",
...             # assert that the mean height of group "A" is greater
...             # than that of group "B"
...             relationship=lambda stat, pvalue, alpha=0.1: (
...                 stat > 0 and pvalue / 2 < alpha
...             ),
...             # set alpha criterion to 5%
...             relationship_kwargs={"alpha": 0.05}
...         )
...     ]),
...     "group": pa.Column(str),
... })
>>> df = (
...     pd.DataFrame({
...         "height_in_feet": [8.1, 7, 5.2, 5.1, 4],
...         "group": ["A", "A", "B", "B", "B"]
...     })
... )
>>> schema.validate(df)[["height_in_feet", "group"]]
   height_in_feet group
0             8.1     A
1             7.0     A
2             5.2     B
3             5.1     B
4             4.0     B

See here for more usage details.

Attributes

RELATIONSHIPS

BACKEND_REGISTRY

CHECK_FUNCTION_REGISTRY

REGISTERED_CUSTOM_CHECKS

Methods

__init__(test, samples=None, groupby=None, relationship='equal', alpha=None, test_kwargs=None, relationship_kwargs=None, name=None, error=None, raise_warning=False, n_failure_cases=None, title=None, description=None, statistics=None, strategy=None, **check_kwargs)[source]

Perform a hypothesis test on a Series or DataFrame.

classmethod one_sample_ttest(popmean, sample=None, groupby=None, relationship='equal', alpha=0.01, nan_policy='propagate', **kwargs)[source]

Calculate a t-test for the mean of one sample.

Parameters:
  • sample (Optional[str, None]) – The sample group to test. For Column and SeriesSchema hypotheses, this refers to the groupby level that is used to subset the Column being checked. For DataFrameSchema hypotheses, it refers to a column in the DataFrame.

  • groupby (Union[str, List[str], Callable, None]) –

    If a string or list of strings is provided, then these columns are used to group the Column Series. If a callable is passed, the expected signature is DataFrame -> DataFrameGroupBy. The function has access to the entire dataframe, but the Column.name is selected from this DataFrameGroupBy object so that a SeriesGroupBy object is passed into fn.

    Specifying this argument changes the fn signature to: dict[str|tuple[str], Series] -> bool|pd.Series[bool]

    Where specific groups can be obtained from the input dict.

  • popmean (float) – population mean to compare sample to.

  • relationship (str) – Represents what relationship conditions are imposed on the hypothesis test. Available relationships are: “greater_than”, “less_than”, “not_equal” and “equal”. For example, group1 greater_than group2 specifies an alternative hypothesis that the mean of group1 is greater than group 2 relative to a null hypothesis that they are equal.

  • alpha (float) – (Default value = 0.01) The significance level; the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.01 indicates a 1% risk of concluding that a difference exists when there is no actual difference.

  • raise_warning – if True, check raises SchemaWarning instead of SchemaError on validation.
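
This page does not spell out how the relationship names are evaluated against alpha. The sketch below is one plausible reading, assuming the built-in names follow the same convention as the lambda in the class-level example (a two-sided p-value is halved for one-sided alternatives). The dictionary SKETCH_RELATIONSHIPS and the function evaluate are hypothetical names used for illustration only.

```python
def greater_than(stat, pvalue, alpha=0.01):
    # One-sided alternative: positive statistic, halved two-sided p-value below alpha.
    return stat > 0 and pvalue / 2 < alpha


def less_than(stat, pvalue, alpha=0.01):
    # One-sided alternative in the other direction.
    return stat < 0 and pvalue / 2 < alpha


def not_equal(stat, pvalue, alpha=0.01):
    # Two-sided alternative: reject the null hypothesis when p is below alpha.
    return pvalue < alpha


def equal(stat, pvalue, alpha=0.01):
    # Null hypothesis: pass when there is no evidence against equality.
    return pvalue >= alpha


# Hypothetical lookup table; not the class attribute of the same purpose.
SKETCH_RELATIONSHIPS = {
    "greater_than": greater_than,
    "less_than": less_than,
    "not_equal": not_equal,
    "equal": equal,
}


def evaluate(relationship, stat, pvalue, **relationship_kwargs):
    """Resolve a relationship name (or use a callable as-is) and apply it."""
    fn = SKETCH_RELATIONSHIPS[relationship] if isinstance(relationship, str) else relationship
    return fn(stat, pvalue, **relationship_kwargs)


print(evaluate("greater_than", 3.0, 0.04, alpha=0.05))
print(evaluate("greater_than", 3.0, 0.04, alpha=0.01))
```

Under these assumptions, the same statistic and p-value can pass at one significance level and fail at a stricter one, which is why alpha is worth setting explicitly.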

Example:

If you want to compare one sample with a pre-defined mean:

>>> import pandas as pd
>>> import pandera as pa
>>>
>>>
>>> schema = pa.DataFrameSchema({
...     "height_in_feet": pa.Column(
...         float, [
...             pa.Hypothesis.one_sample_ttest(
...                 popmean=5,
...                 relationship="greater_than",
...                 alpha=0.1),
...     ]),
... })
>>> df = (
...     pd.DataFrame({
...         "height_in_feet": [8.1, 7, 6.5, 6.7, 5.1],
...     })
... )
>>> schema.validate(df)
   height_in_feet
0             8.1
1             7.0
2             6.5
3             6.7
4             5.1

Return type:

Hypothesis
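
To see the raw numbers behind the example above, the same data can be passed to scipy directly. This is a sketch that assumes one_sample_ttest delegates to scipy.stats.ttest_1samp, which this page does not state; the decision rule at the end is likewise an assumption about how "greater_than" is evaluated.

```python
from scipy import stats

heights = [8.1, 7.0, 6.5, 6.7, 5.1]

# Two-sided one-sample t-test against a population mean of 5.
result = stats.ttest_1samp(heights, popmean=5)

alpha = 0.1
# Assumed "greater_than" rule: positive statistic and one-sided p-value below alpha.
passes = bool(result.statistic > 0 and result.pvalue / 2 < alpha)

print(result.statistic, result.pvalue, passes)
```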

classmethod two_sample_ttest(sample1, sample2, groupby=None, relationship='equal', alpha=0.01, equal_var=True, nan_policy='propagate', **kwargs)[source]

Calculate a t-test for the means of two samples.

Perform a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.

Parameters:
  • sample1 (str) – The first sample group to test. For Column and SeriesSchema hypotheses, this refers to the level in the groupby column. For DataFrameSchema hypotheses, it refers to a column in the DataFrame.

  • sample2 (str) – The second sample group to test. For Column and SeriesSchema hypotheses, this refers to the level in the groupby column. For DataFrameSchema hypotheses, it refers to a column in the DataFrame.

  • groupby (Union[str, List[str], Callable, None]) –

    If a string or list of strings is provided, then these columns are used to group the Column Series. If a callable is passed, the expected signature is DataFrame -> DataFrameGroupBy. The function has access to the entire dataframe, but the Column.name is selected from this DataFrameGroupBy object so that a SeriesGroupBy object is passed into fn.

    Specifying this argument changes the fn signature to: dict[str|tuple[str], Series] -> bool|pd.Series[bool]

    Where specific groups can be obtained from the input dict.

  • relationship (str) – Represents what relationship conditions are imposed on the hypothesis test. Available relationships are: “greater_than”, “less_than”, “not_equal”, and “equal”. For example, group1 greater_than group2 specifies an alternative hypothesis that the mean of group1 is greater than group 2 relative to a null hypothesis that they are equal.

  • alpha (float) – (Default value = 0.01) The significance level; the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.01 indicates a 1% risk of concluding that a difference exists when there is no actual difference.

  • equal_var (bool) – (Default value = True) If True, perform a standard independent two-sample test that assumes equal population variances. If False, perform Welch’s t-test, which does not assume equal population variances.

  • nan_policy (str) – Defines how to handle NaN values in the input; one of {‘propagate’, ‘raise’, ‘omit’} (Default value = ‘propagate’). For more details see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
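
The effect of equal_var can be seen by calling scipy.stats.ttest_ind directly on two small samples. This is a sketch using scipy only, not pandera; the variable names are illustrative, and the data are the two groups from the example on this page.

```python
from scipy import stats

group_a = [8.1, 7.0]
group_b = [5.2, 5.1, 4.0]

# Student's t-test: pools the variance of both samples.
pooled = stats.ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test: estimates each variance separately and adjusts the degrees of freedom.
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```

For these samples Welch's test yields the larger p-value, so a check that passes with equal_var=True at a given alpha may fail with equal_var=False.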

Example:

Use the built-in class method to perform a two-sample t-test.

>>> import pandas as pd
>>> import pandera as pa
>>>
>>>
>>> schema = pa.DataFrameSchema({
...     "height_in_feet": pa.Column(
...         float, [
...             pa.Hypothesis.two_sample_ttest(
...                 sample1="A",
...                 sample2="B",
...                 groupby="group",
...                 relationship="greater_than",
...                 alpha=0.05,
...                 equal_var=True),
...     ]),
...     "group": pa.Column(str)
... })
>>> df = (
...     pd.DataFrame({
...         "height_in_feet": [8.1, 7, 5.2, 5.1, 4],
...         "group": ["A", "A", "B", "B", "B"]
...     })
... )
>>> schema.validate(df)[["height_in_feet", "group"]]
   height_in_feet group
0             8.1     A
1             7.0     A
2             5.2     B
3             5.1     B
4             4.0     B

Return type:

Hypothesis

__call__(check_obj, column=None)[source]

Validate DataFrame or Series.

Parameters:
  • check_obj (Any) – DataFrame or Series to validate.

  • column (Optional[str, None]) – for dataframe checks, apply the check function to this column.

Return type:

CheckResult

Returns:

CheckResult tuple containing:

check_output: boolean scalar, Series or DataFrame indicating which elements passed the check.

check_passed: boolean scalar indicating whether the check passed overall.

checked_object: the checked object itself. Depending on the options provided to the Check, this will be a Series, DataFrame, or if the groupby option is supported by the validation backend and specified, a Dict[str, Series] or Dict[str, DataFrame] where the keys are distinct groups.

failure_cases: subset of the checked object that failed.