pandera.api.checks.Check

class pandera.api.checks.Check(check_fn, groups=None, groupby=None, ignore_na=True, element_wise=False, name=None, error=None, raise_warning=False, n_failure_cases=None, title=None, description=None, statistics=None, strategy=None, determined_by_unique=False, **check_kwargs)[source]

Check a data object for certain properties.

Apply a validation function to a data object.

Parameters:
  • check_fn (Callable) –

    A function to check the data object. For Column or SeriesSchema checks, if element_wise is False, this function should have the signature: Callable[[pd.Series], Union[pd.Series, bool]], where the output series is a boolean vector.

    If element_wise is True, this function should have the signature: Callable[[Any], bool], where Any is an element in the column.

    For DataFrameSchema checks, if element_wise is False, check_fn should have the signature: Callable[[pd.DataFrame], Union[pd.DataFrame, pd.Series, bool]], where the output dataframe or series contains booleans.

    If element_wise is True, check_fn is applied to each row in the dataframe with the signature Callable[[pd.Series], bool], where the series input is a row in the dataframe.

  • groups (Union[str, list[str], None]) – The dict input to the check_fn callable will be constrained to the groups specified by groups.

  • groupby (Union[str, list[str], Callable, None]) –

    If a string or list of strings is provided, these columns are used to group the Column series. If a callable is passed, the expected signature is: Callable[[pd.DataFrame], pd.core.groupby.DataFrameGroupBy]

    In the case of Column checks, this function has access to the entire dataframe, but Column.name is selected from this DataFrameGroupBy object so that a SeriesGroupBy object is passed into check_fn.

    Specifying the groupby argument changes the check_fn signature to:

    Callable[[Dict[Union[str, Tuple[str]], pd.Series]], Union[bool, pd.Series]]

    where the input is a dictionary mapping keys to subsets of the column/dataframe.

  • ignore_na (bool) – If True, null values will be ignored when determining if a check passed or failed. For dataframes, ignores rows with any null value. New in version 0.4.0

  • element_wise (bool) – Whether to apply check_fn in an element-wise fashion. If True, check_fn is called on each element of a column or, for DataFrameSchema checks, on each row of the dataframe.

  • name (UnionType[str, None]) – optional name for the check.

  • error (UnionType[str, None]) – custom error message if series fails validation check.

  • raise_warning (bool) – if True, a failing check raises a SchemaWarning instead of a SchemaError, so no exception is thrown. Use this option with care, and only where a failing check is informational and shouldn’t stop execution of the program.

  • n_failure_cases (UnionType[int, None]) – report the first n unique failure cases. If None, report all failure cases.

  • title (UnionType[str, None]) – A human-readable label for the check.

  • description (UnionType[str, None]) – An arbitrary textual description of the check.

  • statistics (UnionType[dict[str, Any], None]) – kwargs to pass into the check function. These values are serialized and represent the constraints of the checks.

  • strategy (UnionType[Any, None]) – A hypothesis strategy, used for implementing data synthesis strategies for this check. See the User Guide for more details.

  • determined_by_unique (bool) – If True, indicates that this check’s result is fully determined by the unique values in the data, meaning duplicate values don’t affect the outcome. This enables significant performance optimizations for MultiIndex validation when dealing with large datasets. If True, the check function must produce the same result whether applied to unique values or full values.

  • check_kwargs – key-word arguments to pass into check_fn

Example:

The example below uses pandas, but will apply to any of the supported dataframe libraries.

>>> import pandas as pd
>>> import pandera.pandas as pa
>>>
>>>
>>> # column checks are vectorized by default
>>> check_positive = pa.Check(lambda s: s > 0)
>>>
>>> # define an element-wise check
>>> check_even = pa.Check(lambda x: x % 2 == 0, element_wise=True)
>>>
>>> # checks can be given human-readable metadata
>>> check_with_metadata = pa.Check(
...     lambda x: True,
...     title="Always passes",
...     description="This check always passes."
... )
>>>
>>> # specify assertions across categorical variables using `groupby`,
>>> # for example, make sure the mean measure for group "A" is always
>>> # larger than the mean measure for group "B"
>>> check_by_group = pa.Check(
...     lambda measures: measures["A"].mean() > measures["B"].mean(),
...     groupby=["group"],
... )
>>>
>>> # define a wide DataFrame-level check
>>> check_dataframe = pa.Check(
...     lambda df: df["measure_1"] > df["measure_2"])
>>>
>>> measure_checks = [check_positive, check_even, check_by_group]
>>>
>>> schema = pa.DataFrameSchema(
...     columns={
...         "measure_1": pa.Column(int, checks=measure_checks),
...         "measure_2": pa.Column(int, checks=measure_checks),
...         "group": pa.Column(str),
...     },
...     checks=check_dataframe
... )
>>>
>>> df = pd.DataFrame({
...     "measure_1": [10, 12, 14, 16],
...     "measure_2": [2, 4, 6, 8],
...     "group": ["B", "B", "A", "A"]
... })
>>>
>>> schema.validate(df)[["measure_1", "measure_2", "group"]]
    measure_1  measure_2 group
0         10          2     B
1         12          4     B
2         14          6     A
3         16          8     A

See here for more usage details.

Attributes

one_sample_ttest

two_sample_ttest

BACKEND_REGISTRY

CHECK_FUNCTION_REGISTRY

REGISTERED_CUSTOM_CHECKS

Methods

__init__(check_fn, groups=None, groupby=None, ignore_na=True, element_wise=False, name=None, error=None, raise_warning=False, n_failure_cases=None, title=None, description=None, statistics=None, strategy=None, determined_by_unique=False, **check_kwargs)[source]

Apply a validation function to a data object.

classmethod between(min_value, max_value, include_min=True, include_max=True, **kwargs)[source]

Alias of in_range()

Return type:

Check

classmethod cf_has_cell_methods(expected, **kwargs)[source]

Require cell_methods attr equals expected.

Lightweight CF check that inspects .attrs["cell_methods"].

Return type:

Check

classmethod cf_has_standard_names(names, **kwargs)[source]

Require cf_xarray can resolve each standard name.

Needs cf_xarray installed. Each name must be resolvable via data.cf[name].

Return type:

Check

classmethod cf_standard_name(expected_name, **kwargs)[source]

Require standard_name attr equals expected_name.

Lightweight CF check that inspects .attrs["standard_name"] without requiring cf_xarray.

Return type:

Check

classmethod cf_units(expected_units, **kwargs)[source]

Require units attr equals expected_units.

Lightweight CF check that inspects .attrs["units"] without requiring cf_xarray.

Return type:

Check

classmethod dim_size(dim, size, **kwargs)[source]

Assert data.sizes[dim] == size.

Prefer schema sizes={dim: size} when defining a DataArraySchema or DatasetSchema.

Return type:

Check

classmethod eq(value, **kwargs)[source]

Alias of equal_to()

Return type:

Check

classmethod equal_to(value, **kwargs)[source]

Ensure all elements of a data container equal a certain value.

Parameters:

value (Any) – values in this data object must be equal to this value.

Return type:

Check

classmethod ge(min_value, **kwargs)[source]

Alias of greater_than_or_equal_to()

Return type:

Check

classmethod greater_than(min_value, **kwargs)[source]

Ensure values of a data container are strictly greater than a minimum value.

Parameters:

min_value (Any) – Lower bound to be exceeded. Must be a type comparable to the dtype of the data object to be validated (e.g. a numerical type for float or int and a datetime for datetime).

Return type:

Check

classmethod greater_than_or_equal_to(min_value, **kwargs)[source]

Ensure all values are greater than or equal to a certain value.

Parameters:

min_value (Any) – Allowed minimum value for values of the data. Must be a type comparable to the dtype of the data object to be validated.

Return type:

Check

classmethod gt(min_value, **kwargs)[source]

Alias of greater_than()

Return type:

Check

classmethod has_attrs(attrs, **kwargs)[source]

Match key-value pairs on .attrs (xarray).

Parameters:

attrs (dict[str, Any]) – Dictionary of attribute name-value pairs to require.

Prefer schema attrs= on DataArraySchema / DatasetSchema when that is the primary contract.

Return type:

Check

classmethod has_coords(coords, **kwargs)[source]

Require coordinate names on an xarray object.

Parameters:

coords (Union[tuple[str, …], list[str]]) – Tuple or list of coordinate name strings.

Prefer schema coords= when declaring a full DataArraySchema / DatasetSchema.

Return type:

Check

classmethod has_dims(dims, **kwargs)[source]

Require dimension names (order-independent) on an xarray object.

Parameters:

dims (Union[tuple[str, …], list[str]]) – Tuple or list of dimension name strings.

Prefer DataArraySchema / DatasetSchema dims= when defining a schema; use this for dataset-level or ad hoc checks.

Return type:

Check

classmethod has_encoding(encoding, **kwargs)[source]

Match key-value pairs on .encoding (xarray).

Parameters:

encoding (dict[str, Any]) – Dictionary of encoding key-value pairs to require.

Prefer schema encoding= on DataArraySchema / DatasetSchema when that is the primary contract.

Return type:

Check

classmethod in_range(*args, min_value=None, max_value=None, include_min=True, include_max=True, **kwargs)[source]

Ensure all values of a series are within an interval.

Both endpoints must be a type comparable to the dtype of the data object to be validated.

Parameters:
  • args – Positional arguments. If a single value is provided, it represents the exact value. If two values are provided, they represent min_value and max_value respectively. If three values are provided, they represent min_value, max_value, and include_min respectively. If four values are provided, they represent min_value, max_value, include_min, and include_max respectively.

  • min_value (Optional[~T]) – Left / lower endpoint of the interval.

  • max_value (Optional[~T]) – Right / upper endpoint of the interval. Must not be smaller than min_value.

  • include_min (bool) – Defines whether min_value is also an allowed value (the default) or whether all values must be strictly greater than min_value.

  • include_max (bool) – Defines whether max_value is also an allowed value (the default) or whether all values must be strictly smaller than max_value.

Example:

>>> import pandera as pa
>>>
>>> positional_check = pa.Check.in_range(0, 1)
>>> positional_include_min_check = pa.Check.in_range(0, 1, True)
>>> positional_include_min_max_check = pa.Check.in_range(0, 1, True, True)
>>> keyword_check = pa.Check.in_range(min_value=0, max_value=1)
>>> keyword_include_min_check = pa.Check.in_range(min_value=0, max_value=1, include_min=True)
>>> keyword_include_min_max_check = pa.Check.in_range(min_value=0, max_value=1, include_min=True, include_max=True)

Return type:

Check

classmethod is_monotonic(dim, increasing=True, **kwargs)[source]

Assert a 1-D coordinate is strictly monotonic along dim.

This is a value constraint on coordinate labels, not usually expressed by dims / sizes alone.

Return type:

Check

classmethod isin(*args, allowed_values=None, **kwargs)[source]

Ensure only allowed values occur within a series.

This checks whether all elements of a data object are part of the set of elements of allowed values. If allowed values is a string, the set of elements consists of all distinct characters of the string. Thus only single characters which occur in allowed_values at least once can meet this condition. If you want to check for substrings use Check.str_contains().

Parameters:
  • args – Positional arguments. If a single list/tuple is provided, it represents the allowed values. If multiple values are provided, they represent the allowed values.

  • allowed_values (UnionType[Iterable, None]) – The set of allowed values. May be any iterable.

  • kwargs – key-word arguments passed into the Check initializer.

Example:

>>> import pandera as pa
>>>
>>> positional_check = pa.Check.isin([1, 2, 3])
>>> positional_values_check = pa.Check.isin(1, 2, 3)
>>> keyword_check = pa.Check.isin(allowed_values=[1, 2, 3])
>>> keyword_values_check = pa.Check.isin(allowed_values=[1, 2, 3])

Return type:

Check

classmethod le(max_value, **kwargs)[source]

Alias of less_than_or_equal_to()

Return type:

Check

classmethod less_than(max_value, **kwargs)[source]

Ensure values of a series are strictly below a maximum value.

Parameters:

max_value (Any) – All elements of a series must be strictly smaller than this. Must be a type comparable to the dtype of the data object to be validated.

Return type:

Check

classmethod less_than_or_equal_to(max_value, **kwargs)[source]

Ensure values of a series are less than or equal to a maximum value.

Parameters:

max_value (Any) – Upper bound not to be exceeded. Must be a type comparable to the dtype of the data object to be validated.

Return type:

Check

classmethod lt(max_value, **kwargs)[source]

Alias of less_than()

Return type:

Check

classmethod ndim(n, **kwargs)[source]

Assert dimensionality (DataArray.ndim or len(Dataset.dims)).

Often redundant with an explicit dims= tuple on the schema; kept for dataset-level checks and parity with a single scalar constraint.

Return type:

Check

classmethod ne(value, **kwargs)[source]

Alias of not_equal_to()

Return type:

Check

classmethod no_duplicates_in_coord(coord, **kwargs)[source]

Assert coordinate values are unique.

A value-level constraint on the coordinate index; not implied by schema dims or coords presence alone.

Return type:

Check

classmethod not_equal_to(value, **kwargs)[source]

Ensure no elements of a data container equal a certain value.

Parameters:

value (Any) – This value must not occur in the data object.

Return type:

Check

classmethod notin(*args, forbidden_values=None, **kwargs)[source]

Ensure some defined values don’t occur within a series.

Like Check.isin() this check operates on single characters if it is applied on strings. If forbidden_values is a string, it is understood as set of prohibited characters. Any string of length > 1 can’t be in it by design.

Parameters:
  • args – Positional arguments. If a single list/tuple is provided, it represents the forbidden values. If multiple values are provided, they represent the forbidden values.

  • forbidden_values (UnionType[Iterable, None]) – The set of values which should not occur. May be any iterable.

  • raise_warning – if True, check raises SchemaWarning instead of SchemaError on validation.

Example:

>>> import pandera as pa
>>>
>>> positional_check = pa.Check.notin([1, 2, 3])
>>> positional_values_check = pa.Check.notin(1, 2, 3)
>>> keyword_check = pa.Check.notin(forbidden_values=[1, 2, 3])

Return type:

Check

classmethod str_contains(pattern, **kwargs)[source]

Ensure that a pattern can be found within each row.

Parameters:
  • pattern (Union[str, Pattern]) – Regular expression pattern to use for searching

  • kwargs – key-word arguments passed into the Check initializer.

Return type:

Check

classmethod str_endswith(string, **kwargs)[source]

Ensure that all values end with a certain string.

Parameters:
  • string (str) – String all values should end with

  • kwargs – key-word arguments passed into the Check initializer.

Return type:

Check

classmethod str_length(*args, min_value=None, max_value=None, exact_value=None, **kwargs)[source]

Ensure that the length of strings is within a specified range.

This method supports multiple calling conventions:

Check.str_length(5)  # exact length of 5
Check.str_length(1, 5)  # length between 1 and 5 (inclusive)
Check.str_length(min_value=1, max_value=5)  # same as above
Check.str_length(min_value=1)  # length >= 1
Check.str_length(max_value=5)  # length <= 5

Parameters:
  • args – Positional arguments. If one value is provided, it represents the exact length. If two values are provided, they represent min_value and max_value respectively.

  • min_value (UnionType[int, None]) – Minimum length of strings (default: no minimum)

  • max_value (UnionType[int, None]) – Maximum length of strings (default: no maximum)

  • exact_value (UnionType[int, None]) – Exact length of strings. (default: no exact value)

  • kwargs – key-word arguments passed into the Check initializer.

Return type:

Check

classmethod str_matches(pattern, **kwargs)[source]

Ensure that each string starts with a match of the regular expression.

Parameters:
  • pattern (Union[str, Pattern]) – Regular expression pattern to use for matching

  • kwargs – key-word arguments passed into the Check initializer.

Return type:

Check

classmethod str_startswith(string, **kwargs)[source]

Ensure that all values start with a certain string.

Parameters:
  • string (str) – String all values should start with

  • kwargs – key-word arguments passed into the Check initializer.

Return type:

Check

classmethod unique_values_eq(values, **kwargs)[source]

Ensure that unique values in the data object contain all values.

Note

In contrast with isin(), this check makes sure that all the items in the values iterable are contained within the series.

Parameters:

values (Iterable) – The set of values that must be present. May be any iterable.

Return type:

Check

__call__(check_obj, column=None)[source]

Validate DataFrame or Series.

Parameters:
  • check_obj (Any) – DataFrame or Series to validate.

  • column (UnionType[str, None]) – for dataframe checks, apply the check function to this column.

Return type:

CheckResult

Returns:

CheckResult tuple containing:

  • check_output: boolean scalar, Series or DataFrame indicating which elements passed the check.

  • check_passed: boolean scalar indicating whether the check passed overall.

  • checked_object: the checked object itself. Depending on the options provided to the Check, this will be a Series, DataFrame, or if the groupby option is supported by the validation backend and specified, a Dict[str, Series] or Dict[str, DataFrame] where the keys are distinct groups.

  • failure_cases: subset of the checked object that failed the check.