pandera.api.checks.Check
- class pandera.api.checks.Check(check_fn, groups=None, groupby=None, ignore_na=True, element_wise=False, name=None, error=None, raise_warning=False, n_failure_cases=None, title=None, description=None, statistics=None, strategy=None, determined_by_unique=False, **check_kwargs)
Check a data object for certain properties.
Apply a validation function to a data object.
- Parameters:
check_fn (Callable) – A function to check the data object. The expected signature depends on the schema type and on element_wise:
    For Column or SeriesSchema checks with element_wise=False: Callable[[pd.Series], Union[pd.Series, bool]], where the output series is a boolean vector.
    For Column or SeriesSchema checks with element_wise=True: Callable[[Any], bool], where Any is an element in the column.
    For DataFrameSchema checks with element_wise=False: Callable[[pd.DataFrame], Union[pd.DataFrame, pd.Series, bool]], where the output dataframe or series contains booleans.
    For DataFrameSchema checks with element_wise=True: Callable[[pd.Series], bool], where the function is applied to each row and the series input is a row in the dataframe.
groups (Union[str, list[str], None]) – The dict input to the check_fn callable will be constrained to the groups specified by groups.
groupby (Union[str, list[str], Callable, None]) – If a string or list of strings is provided, these columns are used to group the Column series. If a callable is passed, the expected signature is Callable[[pd.DataFrame], pd.core.groupby.DataFrameGroupBy]. In the case of Column checks, this function has access to the entire dataframe, but Column.name is selected from this DataFrameGroupBy object so that a SeriesGroupBy object is passed into check_fn. Specifying the groupby argument changes the check_fn signature to Callable[[Dict[Union[str, Tuple[str]], pd.Series]], Union[bool, pd.Series]], where the input is a dictionary mapping keys to subsets of the column/dataframe.
ignore_na (bool) – If True, null values will be ignored when determining if a check passed or failed. For dataframes, ignores rows with any null value. New in version 0.4.0.
element_wise (bool) – Whether or not to apply validator in an element-wise fashion. If bool, assumes that all checks should be applied to the column element-wise. If list, should be the same number of elements as checks.
error (UnionType[str, None]) – Custom error message if series fails validation check.
raise_warning (bool) – If True, raise a SchemaWarning instead of a SchemaError when this check fails, so that no exception is thrown. Use this option carefully, in cases where a failing check is informational and shouldn’t stop execution of the program.
n_failure_cases (UnionType[int, None]) – Report the first n unique failure cases. If None, report all failure cases.
title (UnionType[str, None]) – A human-readable label for the check.
description (UnionType[str, None]) – An arbitrary textual description of the check.
statistics (UnionType[dict[str, Any], None]) – kwargs to pass into the check function. These values are serialized and represent the constraints of the checks.
strategy (UnionType[Any, None]) – A hypothesis strategy, used for implementing data synthesis strategies for this check. See the User Guide for more details.
determined_by_unique (bool) – If True, indicates that this check’s result is fully determined by the unique values in the data, meaning duplicate values don’t affect the outcome. This enables significant performance optimizations for MultiIndex validation when dealing with large datasets. If True, the check function must produce the same result whether applied to unique values or full values.
check_kwargs – Keyword arguments to pass into check_fn.
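The determined_by_unique contract can be illustrated without pandera at all. The sketch below is a plain-Python model, not library code; the helper names `all_positive` and `has_six_items` are hypothetical. It contrasts a predicate that depends only on which values occur with one that depends on how many there are.

```python
# Plain-Python model of the determined_by_unique contract; not pandera code.
values = [3, 1, 3, 2, 1, 3]
unique_values = sorted(set(values))


def all_positive(xs):
    """Depends only on which values occur, so duplicates are irrelevant."""
    return all(x > 0 for x in xs)


def has_six_items(xs):
    """Depends on how many values there are, so duplicates matter."""
    return len(xs) == 6


# Safe to mark as determined by unique values: same verdict either way.
safe = all_positive(values) == all_positive(unique_values)

# Not safe: the verdict changes once duplicates are removed.
unsafe = has_six_items(values) != has_six_items(unique_values)
```

Only checks of the first kind satisfy the requirement that the result be the same on unique values and on the full data.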
- Example:
The example below uses pandas, but will apply to any of the supported dataframe libraries.
>>> import pandas as pd
>>> import pandera.pandas as pa
>>>
>>> # column checks are vectorized by default
>>> check_positive = pa.Check(lambda s: s > 0)
>>>
>>> # define an element-wise check
>>> check_even = pa.Check(lambda x: x % 2 == 0, element_wise=True)
>>>
>>> # checks can be given human-readable metadata
>>> check_with_metadata = pa.Check(
...     lambda x: True,
...     title="Always passes",
...     description="This check always passes."
... )
>>>
>>> # specify assertions across categorical variables using `groupby`,
>>> # for example, make sure the mean measure for group "A" is always
>>> # larger than the mean measure for group "B"
>>> check_by_group = pa.Check(
...     lambda measures: measures["A"].mean() > measures["B"].mean(),
...     groupby=["group"],
... )
>>>
>>> # define a wide DataFrame-level check
>>> check_dataframe = pa.Check(
...     lambda df: df["measure_1"] > df["measure_2"])
>>>
>>> measure_checks = [check_positive, check_even, check_by_group]
>>>
>>> schema = pa.DataFrameSchema(
...     columns={
...         "measure_1": pa.Column(int, checks=measure_checks),
...         "measure_2": pa.Column(int, checks=measure_checks),
...         "group": pa.Column(str),
...     },
...     checks=check_dataframe
... )
>>>
>>> df = pd.DataFrame({
...     "measure_1": [10, 12, 14, 16],
...     "measure_2": [2, 4, 6, 8],
...     "group": ["B", "B", "A", "A"]
... })
>>>
>>> schema.validate(df)[["measure_1", "measure_2", "group"]]
   measure_1  measure_2 group
0         10          2     B
1         12          4     B
2         14          6     A
3         16          8     A
See here for more usage details.
Attributes
one_sample_ttest
two_sample_ttest
BACKEND_REGISTRY
CHECK_FUNCTION_REGISTRY
REGISTERED_CUSTOM_CHECKS
Methods
- __init__(check_fn, groups=None, groupby=None, ignore_na=True, element_wise=False, name=None, error=None, raise_warning=False, n_failure_cases=None, title=None, description=None, statistics=None, strategy=None, determined_by_unique=False, **check_kwargs)
Apply a validation function to a data object.
- classmethod between(min_value, max_value, include_min=True, include_max=True, **kwargs)
Alias of in_range().
- Return type:
- classmethod cf_has_cell_methods(expected, **kwargs)
Require cell_methods attr equals expected.
Lightweight CF check that inspects .attrs["cell_methods"].
- Return type:
- classmethod cf_has_standard_names(names, **kwargs)
Require cf_xarray can resolve each standard name.
Needs cf_xarray installed. Each name must be resolvable via data.cf[name].
- Return type:
- classmethod cf_standard_name(expected_name, **kwargs)
Require standard_name attr equals expected_name.
Lightweight CF check that inspects .attrs["standard_name"] without requiring cf_xarray.
- Return type:
- classmethod cf_units(expected_units, **kwargs)
Require units attr equals expected_units.
Lightweight CF check that inspects .attrs["units"] without requiring cf_xarray.
- Return type:
- classmethod dim_size(dim, size, **kwargs)
Assert data.sizes[dim] == size.
Prefer schema sizes={dim: size} when defining a DataArraySchema or DatasetSchema.
- Return type:
- classmethod eq(value, **kwargs)
Alias of equal_to().
- Return type:
- classmethod equal_to(value, **kwargs)
Ensure all elements of a data container equal a certain value.
- classmethod ge(min_value, **kwargs)
Alias of greater_than_or_equal_to().
- Return type:
- classmethod greater_than(min_value, **kwargs)
Ensure values of a data container are strictly greater than a minimum value.
- classmethod greater_than_or_equal_to(min_value, **kwargs)
Ensure all values are greater than or equal to a certain value.
- classmethod gt(min_value, **kwargs)
Alias of greater_than().
- Return type:
- classmethod has_attrs(attrs, **kwargs)
Match key-value pairs on .attrs (xarray).
Prefer schema attrs= on DataArraySchema / DatasetSchema when that is the primary contract.
- Return type:
- classmethod has_coords(coords, **kwargs)
Require coordinate names on an xarray object.
Prefer schema coords= when declaring a full DataArraySchema / DatasetSchema.
- Return type:
- classmethod has_dims(dims, **kwargs)
Require dimension names (order-independent) on an xarray object.
Prefer DataArraySchema / DatasetSchema dims= when defining a schema; use this for dataset-level or ad hoc checks.
- Return type:
- classmethod has_encoding(encoding, **kwargs)
Match key-value pairs on .encoding (xarray).
Prefer schema encoding= on DataArraySchema / DatasetSchema when that is the primary contract.
- Return type:
- classmethod in_range(*args, min_value=None, max_value=None, include_min=True, include_max=True, **kwargs)
Ensure all values of a series are within an interval.
Both endpoints must be a type comparable to the dtype of the data object to be validated.
- Parameters:
args – Positional arguments. If a single value is provided, it represents the exact value. If two values are provided, they represent min_value and max_value respectively. If three values are provided, they represent min_value, max_value, and include_min respectively. If four values are provided, they represent min_value, max_value, include_min, and include_max respectively.
min_value (Optional[~T]) – Left / lower endpoint of the interval.
max_value (Optional[~T]) – Right / upper endpoint of the interval. Must not be smaller than min_value.
include_min (bool) – Defines whether min_value is also an allowed value (the default) or whether all values must be strictly greater than min_value.
include_max (bool) – Defines whether max_value is also an allowed value (the default) or whether all values must be strictly smaller than max_value.
- Example:
>>> import pandera as pa
>>>
>>> positional_check = pa.Check.in_range(0, 1)
>>> positional_include_min_check = pa.Check.in_range(0, 1, True)
>>> positional_include_min_max_check = pa.Check.in_range(0, 1, True, True)
>>> keyword_check = pa.Check.in_range(min_value=0, max_value=1)
>>> keyword_include_min_check = pa.Check.in_range(min_value=0, max_value=1, include_min=True)
>>> keyword_include_min_max_check = pa.Check.in_range(min_value=0, max_value=1, include_min=True, include_max=True)
- Return type:
- classmethod is_monotonic(dim, increasing=True, **kwargs)
Assert a 1-D coordinate is strictly monotonic along dim.
This is a value constraint on coordinate labels, not usually expressed by dims / sizes alone.
- Return type:
- classmethod isin(*args, allowed_values=None, **kwargs)
Ensure only allowed values occur within a series.
This checks whether all elements of a data object are part of the set of elements of allowed_values. If allowed_values is a string, the set of elements consists of all distinct characters of the string. Thus only single characters which occur in allowed_values at least once can meet this condition. If you want to check for substrings, use Check.str_contains().
- Parameters:
args – Positional arguments. If a single list/tuple is provided, it represents the allowed values. If multiple values are provided, they represent the allowed values.
allowed_values (UnionType[Iterable, None]) – The set of allowed values. May be any iterable.
kwargs – Keyword arguments passed into the Check initializer.
- Example:
>>> import pandera as pa
>>>
>>> positional_check = pa.Check.isin([1, 2, 3])
>>> positional_values_check = pa.Check.isin(1, 2, 3)
>>> keyword_check = pa.Check.isin(allowed_values=[1, 2, 3])
>>> keyword_values_check = pa.Check.isin(allowed_values=[1, 2, 3])
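The string caveat for isin is easy to miss, so here is a plain-Python model of the membership rule described above. It is not pandera code, and the helper `all_allowed` is hypothetical.

```python
# Plain-Python model of the membership rule; not pandera code.
allowed_as_string = set("abc")        # distinct characters: "a", "b", "c"
allowed_as_list = set(["abc", "de"])  # whole strings: "abc", "de"


def all_allowed(values, allowed):
    """Return True if every value is a member of the allowed set."""
    return all(v in allowed for v in values)


single_chars_pass = all_allowed(["a", "c"], allowed_as_string)
substring_fails = all_allowed(["ab"], allowed_as_string)
whole_strings_pass = all_allowed(["abc", "de"], allowed_as_list)
```

Passing a list of strings rather than a single string is therefore the way to allow multi-character values.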
- Return type:
- classmethod le(max_value, **kwargs)
Alias of less_than_or_equal_to().
- Return type:
- classmethod less_than(max_value, **kwargs)
Ensure values of a series are strictly below a maximum value.
- classmethod less_than_or_equal_to(max_value, **kwargs)
Ensure values of a series are less than or equal to a maximum value.
- classmethod lt(max_value, **kwargs)
Alias of less_than().
- Return type:
- classmethod ndim(n, **kwargs)
Assert dimensionality (DataArray.ndim or len(Dataset.dims)).
Often redundant with an explicit dims= tuple on the schema; kept for dataset-level checks and parity with a single scalar constraint.
- Return type:
- classmethod ne(value, **kwargs)
Alias of not_equal_to().
- Return type:
- classmethod no_duplicates_in_coord(coord, **kwargs)
Assert coordinate values are unique.
A value-level constraint on the coordinate index; not implied by schema dims or coords presence alone.
- Return type:
- classmethod not_equal_to(value, **kwargs)
Ensure no element of a data container equals a certain value.
- classmethod notin(*args, forbidden_values=None, **kwargs)
Ensure some defined values don’t occur within a series.
Like Check.isin(), this check operates on single characters if it is applied on strings. If forbidden_values is a string, it is understood as a set of prohibited characters. Any string of length > 1 can’t be in it by design.
- Parameters:
args – Positional arguments. If a single list/tuple is provided, it represents the forbidden values. If multiple values are provided, they represent the forbidden values.
forbidden_values (UnionType[Iterable, None]) – The set of values which should not occur. May be any iterable.
raise_warning – If True, check raises SchemaWarning instead of SchemaError on validation.
- Example:
>>> import pandera as pa
>>>
>>> positional_check = pa.Check.notin([1, 2, 3])
>>> positional_values_check = pa.Check.notin(1, 2, 3)
>>> keyword_check = pa.Check.notin(forbidden_values=[1, 2, 3])
- Return type:
- classmethod str_contains(pattern, **kwargs)
Ensure that a pattern can be found within each row.
- classmethod str_endswith(string, **kwargs)
Ensure that all values end with a certain string.
- classmethod str_length(*args, min_value=None, max_value=None, exact_value=None, **kwargs)
Ensure that the length of strings is within a specified range.
This method supports multiple calling conventions:
Check.str_length(5)                         # exact length of 5
Check.str_length(1, 5)                      # length between 1 and 5 (inclusive)
Check.str_length(min_value=1, max_value=5)  # same as above
Check.str_length(min_value=1)               # length >= 1
Check.str_length(max_value=5)               # length <= 5
- Parameters:
args – Positional arguments. If one value is provided, it represents the exact length. If two values are provided, they represent min_value and max_value respectively.
min_value (UnionType[int, None]) – Minimum length of strings (default: no minimum).
max_value (UnionType[int, None]) – Maximum length of strings (default: no maximum).
exact_value (UnionType[int, None]) – Exact length of strings (default: no exact value).
kwargs – Keyword arguments passed into the Check initializer.
- Return type:
- classmethod str_matches(pattern, **kwargs)
Ensure that each string starts with a match of the regular expression pattern.
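The difference between str_matches and str_contains mirrors the difference between matching at the start of a string and searching anywhere in it. The sketch below uses only the standard `re` module and assumes the two checks follow `re.match` and `re.search` semantics respectively, as their descriptions suggest; it does not call pandera.

```python
import re

pattern = r"\d{3}"
values = ["123-abc", "abc-123"]

# Anchored at the start of the string, as str_matches is described.
starts_with_match = [re.match(pattern, v) is not None for v in values]

# Found anywhere in the string, as str_contains is described.
contains_match = [re.search(pattern, v) is not None for v in values]
```

Under that assumption, "abc-123" would satisfy str_contains but not str_matches for this pattern.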
- classmethod str_startswith(string, **kwargs)
Ensure that all values start with a certain string.
- classmethod unique_values_eq(values, **kwargs)
Ensure that unique values in the data object contain all values.
Note
In contrast with isin(), this check makes sure that all the items in the values iterable are contained within the series.
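The contrast with isin() can be modelled with sets. This is a plain-Python sketch, not pandera code; the helper names are hypothetical, and it assumes, as the method name suggests, that unique_values_eq compares the set of unique values for equality.

```python
# Plain-Python model; not pandera code.
data = ["a", "b", "a"]


def isin_like(data, allowed):
    """Every observed value must be allowed (subset test)."""
    return set(data) <= set(allowed)


def unique_values_eq_like(data, values):
    """Assumption: observed unique values must equal the given values."""
    return set(data) == set(values)


subset_ok = isin_like(data, ["a", "b", "c"])
missing_value_fails = unique_values_eq_like(data, ["a", "b", "c"])
exact_values_ok = unique_values_eq_like(data, ["a", "b"])
```

Here the data passes the subset test even though "c" never occurs, while the equality test rejects it for exactly that reason.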
- __call__(check_obj, column=None)
Validate DataFrame or Series.
- Parameters:
- Return type:
CheckResult
- Returns:
CheckResult tuple containing:
check_output: boolean scalar, Series or DataFrame indicating which elements passed the check.
check_passed: boolean scalar indicating whether the check passed overall.
checked_object: the checked object itself. Depending on the options provided to the Check, this will be a Series, DataFrame, or if the groupby option is supported by the validation backend and specified, a Dict[str, Series] or Dict[str, DataFrame] where the keys are distinct groups.
failure_cases: subset of the checked_object that failed.
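To make the four fields concrete, the sketch below models them with a plain `NamedTuple` and Python lists in place of Series. `CheckResultSketch` and `run_elementwise` are hypothetical names used for illustration only; this is not pandera's CheckResult class or its backend logic.

```python
from typing import Any, NamedTuple


class CheckResultSketch(NamedTuple):
    """Hypothetical stand-in mirroring the four documented fields."""

    check_output: Any
    check_passed: bool
    checked_object: Any
    failure_cases: Any


def run_elementwise(check_fn, values):
    """Apply check_fn to each element and collect the four fields."""
    output = [bool(check_fn(v)) for v in values]
    return CheckResultSketch(
        check_output=output,
        check_passed=all(output),
        checked_object=values,
        failure_cases=[v for v, ok in zip(values, output) if not ok],
    )


result = run_elementwise(lambda x: x % 2 == 0, [2, 4, 5, 8])
```

In this model, check_output records the per-element verdicts, check_passed is their conjunction, and failure_cases holds only the elements that failed.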