pandera.api.dataframe.components.ComponentSchema

class pandera.api.dataframe.components.ComponentSchema(dtype=None, checks=None, parsers=None, nullable=False, unique=False, report_duplicates='all', coerce=False, name=None, title=None, description=None, default=None, metadata=None, drop_invalid_rows=False)

Base class for components of a data container, e.g. columns.

Initialize array schema.

Parameters:
  • dtype (Optional[Any]) – datatype of the column.

  • checks (Union[Check, List[Union[Check, Hypothesis]], None]) – one or more checks to run against the column.

    If element_wise is True, then the callable signature should be:

    Callable[Any, bool] where the Any input is a scalar element in the column. Otherwise, the input is assumed to be the data object (Series, DataFrame).

  • nullable (bool) – Whether or not column can contain null values.

  • unique (bool) – Whether column values must be unique. If True, duplicate values fail validation.

  • report_duplicates (Union[Literal['exclude_first'], Literal['exclude_last'], Literal['all']]) – how to report unique errors:

    • exclude_first: report all duplicates except the first occurrence
    • exclude_last: report all duplicates except the last occurrence
    • all: (default) report all duplicates

  • coerce (bool) – If True, when schema.validate is called the column will be coerced into the specified dtype. This has no effect on columns where dtype=None.

  • name (Any) – column name in dataframe to validate.

  • title (Optional[str]) – A human-readable label for the series.

  • description (Optional[str]) – An arbitrary textual description of the series.

  • metadata (Optional[dict]) – Optional key-value metadata.

  • default (Optional[Any]) – The default value for missing values in the series.

  • drop_invalid_rows (bool) – if True, drop invalid rows on validation.

Attributes

BACKEND_REGISTRY

properties

Get the properties of the schema for serialization purposes.

Methods

__init__(dtype=None, checks=None, parsers=None, nullable=False, unique=False, report_duplicates='all', coerce=False, name=None, title=None, description=None, default=None, metadata=None, drop_invalid_rows=False)

Initialize array schema.

Parameters:
  • dtype (Optional[Any]) – datatype of the column.

  • checks (Union[Check, List[Union[Check, Hypothesis]], None]) – one or more checks to run against the column.

    If element_wise is True, then the callable signature should be:

    Callable[Any, bool] where the Any input is a scalar element in the column. Otherwise, the input is assumed to be the data object (Series, DataFrame).

  • nullable (bool) – Whether or not column can contain null values.

  • unique (bool) – Whether column values must be unique. If True, duplicate values fail validation.

  • report_duplicates (Union[Literal['exclude_first'], Literal['exclude_last'], Literal['all']]) – how to report unique errors:

    • exclude_first: report all duplicates except the first occurrence
    • exclude_last: report all duplicates except the last occurrence
    • all: (default) report all duplicates

  • coerce (bool) – If True, when schema.validate is called the column will be coerced into the specified dtype. This has no effect on columns where dtype=None.

  • name (Any) – column name in dataframe to validate.

  • title (Optional[str]) – A human-readable label for the series.

  • description (Optional[str]) – An arbitrary textual description of the series.

  • metadata (Optional[dict]) – Optional key-value metadata.

  • default (Optional[Any]) – The default value for missing values in the series.

  • drop_invalid_rows (bool) – if True, drop invalid rows on validation.

coerce_dtype(check_obj)

Coerce type of the data by type specified in dtype.

Parameters:

check_obj (TDataObject) – data to coerce

Return type:

TDataObject

Returns:

data of the same type as the input

set_checks(checks)

Create a new SeriesSchema with a new set of Checks

Caution

This method will be deprecated in favor of update_checks in v0.15.0

Parameters:

checks (Union[Check, List[Union[Check, Hypothesis]]]) – checks to set on the new schema

Returns:

a new SeriesSchema with a new set of checks

update_checks(checks)

Create a new SeriesSchema with a new set of Checks

Parameters:

checks (Union[Check, List[Union[Check, Hypothesis]]]) – checks to set on the new schema

Returns:

a new SeriesSchema with a new set of checks

validate(check_obj, head=None, tail=None, sample=None, random_state=None, lazy=False, inplace=False)

Validate a series or specific column in dataframe.

Parameters:
  • check_obj – data object to validate.

  • head (Optional[int]) – validate the first n rows. Rows overlapping with tail or sample are de-duplicated.

  • tail (Optional[int]) – validate the last n rows. Rows overlapping with head or sample are de-duplicated.

  • sample (Optional[int]) – validate a random sample of n rows. Rows overlapping with head or tail are de-duplicated.

  • random_state (Optional[int]) – random seed for the sample argument.

  • lazy (bool) – if True, lazily evaluates the data against all validation checks and raises a SchemaErrors exception that collects every failure. Otherwise, raises a SchemaError as soon as one occurs.

  • inplace (bool) – if True, applies coercion to the object of validation, otherwise creates a copy of the data.

Returns:

validated DataFrame or Series.

__call__(check_obj, head=None, tail=None, sample=None, random_state=None, lazy=False, inplace=False)

Alias for validate method.

Return type:

TDataObject