pandera.Hypothesis.two_sample_ttest

classmethod Hypothesis.two_sample_ttest(sample1, sample2, groupby=None, relationship='equal', alpha=0.01, equal_var=True, nan_policy='propagate', raise_warning=False)[source]

Calculate a t-test for the means of two samples.

Perform a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.

Parameters
  • sample1 (str) – The first sample group to test. For Column and SeriesSchema hypotheses, refers to the level in the groupby column. For DataFrameSchema hypotheses, refers to column in the DataFrame.

  • sample2 (str) – The second sample group to test. For Column and SeriesSchema hypotheses, refers to the level in the groupby column. For DataFrameSchema hypotheses, refers to column in the DataFrame.

  • groupby (Union[str, List[str], Callable, None]) –

    If a string or list of strings is provided, then these columns are used to group the Column Series by groupby. If a callable is passed, the expected signature is DataFrame -> DataFrameGroupby. The function has access to the entire dataframe, but the Column.name is selected from this DataFrameGroupby object so that a SeriesGroupBy object is passed into fn.

    Specifying this argument changes the fn signature to: dict[str|tuple[str], Series] -> bool|pd.Series[bool]

    Where specific groups can be obtained from the input dict.

  • relationship (str) – Represents what relationship conditions are imposed on the hypothesis test. Available relationships are: “greater_than”, “less_than”, “not_equal”, and “equal”. For example, group1 greater_than group2 specifies an alternative hypothesis that the mean of group1 is greater than group 2 relative to a null hypothesis that they are equal.

  • alpha – (Default value = 0.01) The significance level; the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.01 indicates a 1% risk of concluding that a difference exists when there is no actual difference.

  • equal_var – (Default value = True) If True (default), perform a standard independent 2 sample test that assumes equal population variances. If False, perform Welch’s t-test, which does not assume equal population variance

  • nan_policy – Defines how to handle when input returns nan, one of {‘propagate’, ‘raise’, ‘omit’}, (Default value = ‘propagate’). For more details see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

  • raise_warning – if True, check raises UserWarning instead of SchemaError on validation.

Example

The the built-in class method to do a two-sample t-test.

>>> import pandera as pa
>>>
>>>
>>> schema = pa.DataFrameSchema({
...     "height_in_feet": pa.Column(
...         pa.Float, [
...             pa.Hypothesis.two_sample_ttest(
...                 sample1="A",
...                 sample2="B",
...                 groupby="group",
...                 relationship="greater_than",
...                 alpha=0.05,
...                 equal_var=True),
...     ]),
...     "group": pa.Column(pa.String)
... })
>>> df = (
...     pd.DataFrame({
...         "height_in_feet": [8.1, 7, 5.2, 5.1, 4],
...         "group": ["A", "A", "B", "B", "B"]
...     })
... )
>>> schema.validate(df)[["height_in_feet", "group"]]
   height_in_feet group
0             8.1     A
1             7.0     A
2             5.2     B
3             5.1     B
4             4.0     B