pandera.api.checks.Check
- class pandera.api.checks.Check(check_fn, groups=None, groupby=None, ignore_na=True, element_wise=False, name=None, error=None, raise_warning=False, n_failure_cases=None, title=None, description=None, statistics=None, strategy=None, **check_kwargs)
Check a data object for certain properties.
Apply a validation function to a data object.
- Parameters:
check_fn (Callable) – A function to check the data object. For Column or SeriesSchema checks, if element_wise is False (the default), this function should have the signature:
Callable[[pd.Series], Union[pd.Series, bool]]
where the output series is a boolean vector. If element_wise is True, this function should have the signature:
Callable[[Any], bool]
where Any is an element in the column.
For DataFrameSchema checks, if element_wise is False, check_fn should have the signature:
Callable[[pd.DataFrame], Union[pd.DataFrame, pd.Series, bool]]
where the output dataframe or series contains booleans. If element_wise is True, check_fn is applied to each row in the dataframe with the signature:
Callable[[pd.Series], bool]
where the series input is a row in the dataframe.
groups (Union[str, List[str], None]) – The dict input to the check_fn callable will be constrained to the groups specified by groups.
groupby (Union[str, List[str], Callable, None]) – If a string or list of strings is provided, these columns are used to group the Column series. If a callable is passed, the expected signature is:
Callable[[pd.DataFrame], pd.core.groupby.DataFrameGroupBy]
In the case of Column checks, this function has access to the entire dataframe, but Column.name is selected from this DataFrameGroupBy object so that a SeriesGroupBy object is passed into check_fn.
Specifying the groupby argument changes the check_fn signature to:
Callable[[Dict[Union[str, Tuple[str]], pd.Series]], Union[bool, pd.Series]]
where the input is a dictionary mapping keys to subsets of the column/dataframe.
ignore_na (bool) – If True, null values will be ignored when determining if a check passed or failed. For dataframes, ignores rows with any null value. New in version 0.4.0.
element_wise (bool) – Whether or not to apply the validator in an element-wise fashion. If bool, assumes that all checks should be applied to the column element-wise. If list, should be the same number of elements as checks.
error (Optional[str]) – Custom error message if the series fails the validation check.
raise_warning (bool) – If True, emit a SchemaWarning instead of raising a SchemaError for a specific check. This option should be used carefully, in cases where a failing check is informational and shouldn’t stop execution of the program.
n_failure_cases (Optional[int]) – Report the first n unique failure cases. If None, report all failure cases.
title (Optional[str]) – A human-readable label for the check.
description (Optional[str]) – An arbitrary textual description of the check.
statistics (Optional[Dict[str, Any]]) – kwargs to pass into the check function. These values are serialized and represent the constraints of the checks.
strategy (Optional[Any]) – A hypothesis strategy, used for implementing data synthesis strategies for this check. See the User Guide for more details.
check_kwargs – Keyword arguments to pass into check_fn.
- Example:
The example below uses pandas, but will apply to any of the supported dataframe libraries.
>>> import pandas as pd
>>> import pandera as pa
>>>
>>> # column checks are vectorized by default
>>> check_positive = pa.Check(lambda s: s > 0)
>>>
>>> # define an element-wise check
>>> check_even = pa.Check(lambda x: x % 2 == 0, element_wise=True)
>>>
>>> # checks can be given human-readable metadata
>>> check_with_metadata = pa.Check(
...     lambda x: True,
...     title="Always passes",
...     description="This check always passes."
... )
>>>
>>> # specify assertions across categorical variables using `groupby`,
>>> # for example, make sure the mean measure for group "A" is always
>>> # larger than the mean measure for group "B"
>>> check_by_group = pa.Check(
...     lambda measures: measures["A"].mean() > measures["B"].mean(),
...     groupby=["group"],
... )
>>>
>>> # define a wide DataFrame-level check
>>> check_dataframe = pa.Check(
...     lambda df: df["measure_1"] > df["measure_2"])
>>>
>>> measure_checks = [check_positive, check_even, check_by_group]
>>>
>>> schema = pa.DataFrameSchema(
...     columns={
...         "measure_1": pa.Column(int, checks=measure_checks),
...         "measure_2": pa.Column(int, checks=measure_checks),
...         "group": pa.Column(str),
...     },
...     checks=check_dataframe
... )
>>>
>>> df = pd.DataFrame({
...     "measure_1": [10, 12, 14, 16],
...     "measure_2": [2, 4, 6, 8],
...     "group": ["B", "B", "A", "A"]
... })
>>>
>>> schema.validate(df)[["measure_1", "measure_2", "group"]]
   measure_1  measure_2 group
0         10          2     B
1         12          4     B
2         14          6     A
3         16          8     A
See here for more usage details.
Attributes
BACKEND_REGISTRY
CHECK_FUNCTION_REGISTRY
REGISTERED_CUSTOM_CHECKS
Methods
- __init__(check_fn, groups=None, groupby=None, ignore_na=True, element_wise=False, name=None, error=None, raise_warning=False, n_failure_cases=None, title=None, description=None, statistics=None, strategy=None, **check_kwargs)
Apply a validation function to a data object.
- classmethod between(min_value, max_value, include_min=True, include_max=True, **kwargs)
Alias of in_range()
- Return type:
Check
- classmethod eq(value, **kwargs)
Alias of equal_to()
- Return type:
Check
- classmethod equal_to(value, **kwargs)
Ensure all elements of a data container equal a certain value.
- classmethod ge(min_value, **kwargs)
Alias of greater_than_or_equal_to()
- Return type:
Check
- classmethod greater_than(min_value, **kwargs)
Ensure values of a data container are strictly greater than a minimum value.
- classmethod greater_than_or_equal_to(min_value, **kwargs)
Ensure all values are greater than or equal to a certain value.
- classmethod gt(min_value, **kwargs)
Alias of greater_than()
- Return type:
Check
- classmethod in_range(min_value, max_value, include_min=True, include_max=True, **kwargs)
Ensure all values of a series are within an interval.
Both endpoints must be a type comparable to the dtype of the data object to be validated.
- Parameters:
min_value (~T) – Left / lower endpoint of the interval.
max_value (~T) – Right / upper endpoint of the interval. Must not be smaller than min_value.
include_min (bool) – Defines whether min_value is also an allowed value (the default) or whether all values must be strictly greater than min_value.
include_max (bool) – Defines whether max_value is also an allowed value (the default) or whether all values must be strictly smaller than max_value.
- Return type:
Check
- classmethod isin(allowed_values, **kwargs)
Ensure only allowed values occur within a series.
This checks whether all elements of a data object are part of the set of elements of allowed_values. If allowed_values is a string, the set of elements consists of all distinct characters of the string. Thus only single characters which occur in allowed_values at least once can meet this condition. If you want to check for substrings, use Check.str_contains().
- classmethod le(max_value, **kwargs)
Alias of less_than_or_equal_to()
- Return type:
Check
- classmethod less_than(max_value, **kwargs)
Ensure values of a series are strictly below a maximum value.
- classmethod less_than_or_equal_to(max_value, **kwargs)
Ensure values of a series are less than or equal to a maximum value.
- classmethod lt(max_value, **kwargs)
Alias of less_than()
- Return type:
Check
- classmethod ne(value, **kwargs)
Alias of not_equal_to()
- Return type:
Check
- classmethod not_equal_to(value, **kwargs)
Ensure no elements of a data container equal a certain value.
- classmethod notin(forbidden_values, **kwargs)
Ensure some defined values don’t occur within a series.
Like Check.isin(), this check operates on single characters if it is applied to strings. If forbidden_values is a string, it is understood as a set of prohibited characters. Any string of length > 1 can’t be in it by design.
- classmethod str_contains(pattern, **kwargs)
Ensure that a pattern can be found within each row.
- classmethod str_endswith(string, **kwargs)
Ensure that all values end with a certain string.
- classmethod str_length(min_value=None, max_value=None, **kwargs)
Ensure that the length of strings is within a specified range.
- classmethod str_matches(pattern, **kwargs)
Ensure that strings start with a match of the regular expression pattern.
- classmethod str_startswith(string, **kwargs)
Ensure that all values start with a certain string.
- classmethod unique_values_eq(values, **kwargs)
Ensure that unique values in the data object contain all values.
Note
In contrast with isin(), this check makes sure that all the items in the values iterable are contained within the series.
- __call__(check_obj, column=None)
Validate DataFrame or Series.
- Parameters:
- Return type:
CheckResult
- Returns:
CheckResult tuple containing:
check_output: boolean scalar, Series or DataFrame indicating which elements passed the check.
check_passed: boolean scalar indicating whether the check passed overall.
checked_object: the checked object itself. Depending on the options provided to the Check, this will be a Series, DataFrame, or, if the groupby option is supported by the validation backend and specified, a Dict[str, Series] or Dict[str, DataFrame] where the keys are distinct groups.
failure_cases: subset of the check_object that failed.