pandera.api.checks.Check

class pandera.api.checks.Check(check_fn, groups=None, groupby=None, ignore_na=True, element_wise=False, name=None, error=None, raise_warning=False, n_failure_cases=None, title=None, description=None, statistics=None, strategy=None, **check_kwargs)[source]

Check a data object for certain properties.

Apply a validation function to a data object.

Parameters:
  • check_fn (Callable) –

    A function to check the data object. For Column or SeriesSchema checks, if element_wise is False, this function should have the signature: Callable[[pd.Series], Union[pd.Series, bool]], where the output series is a boolean vector.

    If element_wise is True, this function should have the signature: Callable[[Any], bool], where Any is an element in the column.

    For DataFrameSchema checks, if element_wise=False, fn should have the signature: Callable[[pd.DataFrame], Union[pd.DataFrame, pd.Series, bool]], where the output dataframe or series contains booleans.

    If element_wise is True, fn is applied to each row in the dataframe with the signature Callable[[pd.Series], bool] where the series input is a row in the dataframe.

  • groups (Union[str, List[str], None]) – The dict input to the fn callable will be constrained to the groups specified by groups.

  • groupby (Union[str, List[str], Callable, None]) –

    If a string or list of strings is provided, these columns are used to group the Column series. If a callable is passed, the expected signature is: Callable[ [pd.DataFrame], pd.core.groupby.DataFrameGroupBy]

    In the case of Column checks, this function has access to the entire dataframe, but Column.name is selected from this DataFrameGroupBy object so that a SeriesGroupBy object is passed into check_fn.

    Specifying the groupby argument changes the check_fn signature to:

    Callable[[Dict[Union[str, Tuple[str]], pd.Series]], Union[bool, pd.Series]]

    where the input is a dictionary mapping keys to subsets of the column/dataframe.

  • ignore_na (bool) – If True, null values will be ignored when determining if a check passed or failed. For dataframes, ignores rows with any null value. New in version 0.4.0

  • element_wise (bool) – Whether or not to apply the validator in an element-wise fashion. If True, check_fn is applied to each element of a column, or to each row of a dataframe. If False, check_fn receives the whole data object.

  • name (Optional[str, None]) – optional name for the check.

  • error (Optional[str, None]) – custom error message if series fails validation check.

  • raise_warning (bool) – if True, raise a SchemaWarning instead of a SchemaError when this check fails, so that no exception is thrown. Use this option carefully, in cases where a failing check is informational and shouldn’t stop execution of the program.

  • n_failure_cases (Optional[int, None]) – report the first n unique failure cases. If None, report all failure cases.

  • title (Optional[str, None]) – A human-readable label for the check.

  • description (Optional[str, None]) – An arbitrary textual description of the check.

  • statistics (Optional[Dict[str, Any], None]) – kwargs to pass into the check function. These values are serialized and represent the constraints of the checks.

  • strategy (Optional[Any, None]) – A hypothesis strategy, used for implementing data synthesis strategies for this check. See the User Guide for more details.

  • check_kwargs – key-word arguments to pass into check_fn

Example:

The example below uses pandas, but will apply to any of the supported dataframe libraries.

>>> import pandas as pd
>>> import pandera as pa
>>>
>>>
>>> # column checks are vectorized by default
>>> check_positive = pa.Check(lambda s: s > 0)
>>>
>>> # define an element-wise check
>>> check_even = pa.Check(lambda x: x % 2 == 0, element_wise=True)
>>>
>>> # checks can be given human-readable metadata
>>> check_with_metadata = pa.Check(
...     lambda x: True,
...     title="Always passes",
...     description="This check always passes."
... )
>>>
>>> # specify assertions across categorical variables using `groupby`,
>>> # for example, make sure the mean measure for group "A" is always
>>> # larger than the mean measure for group "B"
>>> check_by_group = pa.Check(
...     lambda measures: measures["A"].mean() > measures["B"].mean(),
...     groupby=["group"],
... )
>>>
>>> # define a wide DataFrame-level check
>>> check_dataframe = pa.Check(
...     lambda df: df["measure_1"] > df["measure_2"])
>>>
>>> measure_checks = [check_positive, check_even, check_by_group]
>>>
>>> schema = pa.DataFrameSchema(
...     columns={
...         "measure_1": pa.Column(int, checks=measure_checks),
...         "measure_2": pa.Column(int, checks=measure_checks),
...         "group": pa.Column(str),
...     },
...     checks=check_dataframe
... )
>>>
>>> df = pd.DataFrame({
...     "measure_1": [10, 12, 14, 16],
...     "measure_2": [2, 4, 6, 8],
...     "group": ["B", "B", "A", "A"]
... })
>>>
>>> schema.validate(df)[["measure_1", "measure_2", "group"]]
    measure_1  measure_2 group
0         10          2     B
1         12          4     B
2         14          6     A
3         16          8     A

See here for more usage details.

Attributes

BACKEND_REGISTRY

CHECK_FUNCTION_REGISTRY

REGISTERED_CUSTOM_CHECKS

Methods

__init__(check_fn, groups=None, groupby=None, ignore_na=True, element_wise=False, name=None, error=None, raise_warning=False, n_failure_cases=None, title=None, description=None, statistics=None, strategy=None, **check_kwargs)[source]

Apply a validation function to a data object.

classmethod between(min_value, max_value, include_min=True, include_max=True, **kwargs)[source]

Alias of in_range()

Return type:

Check

classmethod eq(value, **kwargs)[source]

Alias of equal_to()

Return type:

Check

classmethod equal_to(value, **kwargs)[source]

Ensure all elements of a data container equal a certain value.

Parameters:

value (Any) – values in this data object must be equal to this value.

Return type:

Check

classmethod ge(min_value, **kwargs)[source]

Alias of greater_than_or_equal_to()

Return type:

Check

classmethod greater_than(min_value, **kwargs)[source]

Ensure values of a data container are strictly greater than a minimum value.

Parameters:

min_value (Any) – Lower bound to be exceeded. Must be a type comparable to the dtype of the data object to be validated (e.g. a numerical type for float or int and a datetime for datetime).

Return type:

Check

classmethod greater_than_or_equal_to(min_value, **kwargs)[source]

Ensure all values are greater than or equal to a certain value.

Parameters:

min_value (Any) – Allowed minimum value for values of the data. Must be a type comparable to the dtype of the data object to be validated.

Return type:

Check

classmethod gt(min_value, **kwargs)[source]

Alias of greater_than()

Return type:

Check

classmethod in_range(min_value, max_value, include_min=True, include_max=True, **kwargs)[source]

Ensure all values of a series are within an interval.

Both endpoints must be a type comparable to the dtype of the data object to be validated.

Parameters:
  • min_value (~T) – Left / lower endpoint of the interval.

  • max_value (~T) – Right / upper endpoint of the interval. Must not be smaller than min_value.

  • include_min (bool) – Defines whether min_value is also an allowed value (the default) or whether all values must be strictly greater than min_value.

  • include_max (bool) – Defines whether max_value is also an allowed value (the default) or whether all values must be strictly smaller than max_value.

Return type:

Check

classmethod isin(allowed_values, **kwargs)[source]

Ensure only allowed values occur within a series.

This checks whether all elements of a data object are part of the set of elements of allowed values. If allowed values is a string, the set of elements consists of all distinct characters of the string. Thus only single characters which occur in allowed_values at least once can meet this condition. If you want to check for substrings use Check.str_contains().

Parameters:
  • allowed_values (Iterable) – The set of allowed values. May be any iterable.

  • kwargs – key-word arguments passed into the Check initializer.

Return type:

Check
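
The string behaviour described above follows from how Python iterates over a string. The plain-Python sketch below illustrates the set semantics only; it does not call pandera, and the helper all_allowed is a hypothetical name introduced for this example.

```python
def all_allowed(values, allowed):
    """Return True if every value is a member of the allowed collection."""
    allowed_set = set(allowed)  # a string becomes a set of its characters
    return all(value in allowed_set for value in values)


# Passing the string "abc" yields the allowed set {"a", "b", "c"}.
single_chars_ok = all_allowed(["a", "c"], "abc")

# "ab" is not a single character, so it is not in that set.
substring_ok = all_allowed(["ab"], "abc")

# Wrapping the strings in a list treats each one as a whole value.
list_ok = all_allowed(["ab"], ["ab", "cd"])
```

To allow whole strings, pass them inside a list or another container rather than as a bare string.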

classmethod le(max_value, **kwargs)[source]

Alias of less_than_or_equal_to()

Return type:

Check

classmethod less_than(max_value, **kwargs)[source]

Ensure values of a series are strictly below a maximum value.

Parameters:

max_value (Any) – All elements of a series must be strictly smaller than this. Must be a type comparable to the dtype of the data object to be validated.

Return type:

Check

classmethod less_than_or_equal_to(max_value, **kwargs)[source]

Ensure values of a series are not greater than a maximum value.

Parameters:

max_value (Any) – Upper bound not to be exceeded. Must be a type comparable to the dtype of the data object to be validated.

Return type:

Check

classmethod lt(max_value, **kwargs)[source]

Alias of less_than()

Return type:

Check

classmethod ne(value, **kwargs)[source]

Alias of not_equal_to()

Return type:

Check

classmethod not_equal_to(value, **kwargs)[source]

Ensure no element of a data container equals a certain value.

Parameters:

value (Any) – This value must not occur in the data object.

Return type:

Check

classmethod notin(forbidden_values, **kwargs)[source]

Ensure some defined values don’t occur within a series.

Like Check.isin(), this check operates on single characters if it is applied to strings. If forbidden_values is a string, it is understood as a set of prohibited characters. Any string of length > 1 can’t be in it by design.

Parameters:
  • forbidden_values (Iterable) – The set of values which should not occur. May be any iterable.

  • raise_warning – if True, check raises SchemaWarning instead of SchemaError on validation.

Return type:

Check

classmethod str_contains(pattern, **kwargs)[source]

Ensure that a pattern can be found within each row.

Parameters:
  • pattern (Union[str, Pattern]) – Regular expression pattern to use for searching

  • kwargs – key-word arguments passed into the Check initializer.

Return type:

Check

classmethod str_endswith(string, **kwargs)[source]

Ensure that all values end with a certain string.

Parameters:
  • string (str) – String all values should end with

  • kwargs – key-word arguments passed into the Check initializer.

Return type:

Check

classmethod str_length(min_value=None, max_value=None, **kwargs)[source]

Ensure that the length of strings is within a specified range.

Parameters:
  • min_value (Optional[int, None]) – Minimum length of strings (default: no minimum)

  • max_value (Optional[int, None]) – Maximum length of strings (default: no maximum)

Return type:

Check

classmethod str_matches(pattern, **kwargs)[source]

Ensure that strings start with a regular expression match.

Parameters:
  • pattern (Union[str, Pattern]) – Regular expression pattern to use for matching

  • kwargs – key-word arguments passed into the Check initializer.

Return type:

Check
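
The difference between str_matches() and str_contains() resembles the difference between re.match and re.search in the standard library. The sketch below uses only the re module and assumes that the two checks follow these anchoring semantics, which their descriptions suggest but do not spell out. The pattern and values are illustrative.

```python
import re

pattern = re.compile(r"\d+")
values = ["42abc", "abc42"]

# re.match only succeeds when the pattern matches at the start of the string.
starts_with_match = [pattern.match(v) is not None for v in values]

# re.search succeeds when the pattern matches anywhere in the string.
contains_match = [pattern.search(v) is not None for v in values]
```

Here "abc42" contains digits but does not start with them, so it satisfies the search-style condition only.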

classmethod str_startswith(string, **kwargs)[source]

Ensure that all values start with a certain string.

Parameters:
  • string (str) – String all values should start with

  • kwargs – key-word arguments passed into the Check initializer.

Return type:

Check

classmethod unique_values_eq(values, **kwargs)[source]

Ensure that unique values in the data object contain all values.

Note

In contrast with isin(), this check makes sure that all the items in the values iterable are contained within the series.

Parameters:

values (Iterable) – The set of values that must be present. May be any iterable.

Return type:

Check
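
The note above contrasts two directions of containment. The plain-Python sketch below illustrates that contrast only, following the wording of this page; it does not call pandera, and both helper names are hypothetical. The description does not say whether values outside the list are tolerated, so the sketch leaves that case out.

```python
def all_observed_allowed(observed, allowed):
    """isin-style condition: every observed value is in the allowed set."""
    return set(observed) <= set(allowed)


def all_required_present(observed, required):
    """Condition from the note: every required value occurs in the data."""
    return set(required) <= set(observed)


observed = ["a", "b", "b"]
values = ["a", "b", "c"]

# Every observed value is allowed, so the isin-style condition holds.
isin_style = all_observed_allowed(observed, values)

# "c" never occurs in the data, so the required-values condition fails.
required_style = all_required_present(observed, values)
```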

__call__(check_obj, column=None)[source]

Validate DataFrame or Series.

Parameters:
  • check_obj (Any) – DataFrame or Series to validate.

  • column (Optional[str, None]) – for dataframe checks, apply the check function to this column.

Return type:

CheckResult

Returns:

CheckResult tuple containing:

check_output: boolean scalar, Series or DataFrame indicating which elements passed the check.

check_passed: boolean scalar indicating whether the check passed overall.

checked_object: the checked object itself. Depending on the options provided to the Check, this will be a Series, DataFrame, or if the groupby option is supported by the validation backend and specified, a Dict[str, Series] or Dict[str, DataFrame] where the keys are distinct groups.

failure_cases: subset of the checked_object that failed.