pandera.api.pandas.container.DataFrameSchema

class pandera.api.pandas.container.DataFrameSchema(columns=None, checks=None, parsers=None, index=None, dtype=None, coerce=False, strict=False, name=None, ordered=False, unique=None, report_duplicates='all', unique_column_names=False, add_missing_columns=False, title=None, description=None, metadata=None, drop_invalid_rows=False)[source]

A light-weight pandas DataFrame validator.

Library-agnostic base class for DataFrameSchema definitions.

Parameters:
  • columns (mapping of column names to column schema components) – a dict where keys are column names and values are Column objects specifying the datatypes and properties of a particular column.

  • checks (Union[Check, List[Union[Check, Hypothesis]], None]) – dataframe-wide checks.

  • parsers (Union[Parser, List[Parser], None]) – dataframe-wide parsers.

  • index – specify the datatypes and properties of the index.

  • dtype (Optional[Any, None]) – datatype of the dataframe. This overrides the data types specified in any of the columns. If a string is specified, then assumes one of the valid pandas string values: http://pandas.pydata.org/pandas-docs/stable/basics.html#dtypes.

  • coerce (bool) – whether or not to coerce all of the columns on validation. This overrides any coerce setting at the column or index level. This has no effect on columns where dtype=None.

  • strict (Union[bool, Literal['filter']]) – ensure that all and only the columns defined in the schema are present in the dataframe. If set to 'filter', only the columns in the schema are passed to the validated dataframe. If set to 'filter' and columns defined in the schema are not present in the dataframe, an error is raised.

  • name (Optional[str, None]) – name of the schema.

  • ordered (bool) – whether or not to validate the column order.

  • unique (Union[str, List[str], None]) – a list of columns that should be jointly unique.

  • report_duplicates (Union[Literal['exclude_first'], Literal['exclude_last'], Literal['all']]) – how to report unique errors:
      - exclude_first: report all duplicates except the first occurrence
      - exclude_last: report all duplicates except the last occurrence
      - all: (default) report all duplicates

  • unique_column_names (bool) – whether or not column names must be unique.

  • add_missing_columns (bool) – add missing column names with either default value, if specified in column schema, or NaN if column is nullable.

  • title (Optional[str, None]) – A human-readable label for the schema.

  • description (Optional[str, None]) – An arbitrary textual description of the schema.

  • metadata (Optional[dict, None]) – Optional key-value data.

  • drop_invalid_rows (bool) – if True, drop invalid rows on validation.

Raises:

SchemaInitError – if impossible to build schema from parameters

Examples:

>>> import pandera as pa
>>>
>>> schema = pa.DataFrameSchema({
...     "str_column": pa.Column(str),
...     "float_column": pa.Column(float),
...     "int_column": pa.Column(int),
...     "date_column": pa.Column(pa.DateTime),
... })

Use the pandas API to define checks. Each check takes a function with the signature pd.Series -> Union[bool, pd.Series], where the output series contains boolean values.

>>> schema_withchecks = pa.DataFrameSchema({
...     "probability": pa.Column(
...         float, pa.Check(lambda s: (s >= 0) & (s <= 1))),
...
...     # check that the "category" column contains a few discrete
...     # values, and the majority of the entries are dogs.
...     "category": pa.Column(
...         str, [
...             pa.Check(lambda s: s.isin(["dog", "cat", "duck"])),
...             pa.Check(lambda s: (s == "dog").mean() > 0.5),
...         ]),
... })

See here for more usage details.

Attributes

BACKEND_REGISTRY

coerce

Whether to coerce series to specified type.

dtype

Get the dtype property.

dtypes

A dict where the keys are column names and values are DataTypes for the column.

properties

Get the properties of the schema for serialization purposes.

unique

List of columns that should be jointly unique.

Methods

validate(check_obj, head=None, tail=None, sample=None, random_state=None, lazy=False, inplace=False)[source]

Validate a DataFrame based on the schema specification.

Parameters:
  • check_obj (pd.DataFrame) – the dataframe to be validated.

  • head (Optional[int, None]) – validate the first n rows. Rows overlapping with tail or sample are de-duplicated.

  • tail (Optional[int, None]) – validate the last n rows. Rows overlapping with head or sample are de-duplicated.

  • sample (Optional[int, None]) – validate a random sample of n rows. Rows overlapping with head or tail are de-duplicated.

  • random_state (Optional[int, None]) – random seed for the sample argument.

  • lazy (bool) – if True, evaluates the dataframe against all validation checks and raises a single SchemaErrors exception that collects every failure. Otherwise, raises SchemaError as soon as the first failure occurs.

  • inplace (bool) – if True, applies coercion to the object of validation, otherwise creates a copy of the data.

Return type:

DataFrame

Returns:

validated DataFrame

Raises:

SchemaError – when DataFrame violates built-in or custom checks.

Example:

Calling schema.validate returns the dataframe.

>>> import pandas as pd
>>> import pandera as pa
>>>
>>> df = pd.DataFrame({
...     "probability": [0.1, 0.4, 0.52, 0.23, 0.8, 0.76],
...     "category": ["dog", "dog", "cat", "duck", "dog", "dog"]
... })
>>>
>>> schema_withchecks = pa.DataFrameSchema({
...     "probability": pa.Column(
...         float, pa.Check(lambda s: (s >= 0) & (s <= 1))),
...
...     # check that the "category" column contains a few discrete
...     # values, and the majority of the entries are dogs.
...     "category": pa.Column(
...         str, [
...             pa.Check(lambda s: s.isin(["dog", "cat", "duck"])),
...             pa.Check(lambda s: (s == "dog").mean() > 0.5),
...         ]),
... })
>>>
>>> schema_withchecks.validate(df)[["probability", "category"]]
   probability category
0         0.10      dog
1         0.40      dog
2         0.52      cat
3         0.23     duck
4         0.80      dog
5         0.76      dog

__call__(dataframe, head=None, tail=None, sample=None, random_state=None, lazy=False, inplace=False)[source]

Alias for DataFrameSchema.validate() method.

Parameters:
  • dataframe (pd.DataFrame) – the dataframe to be validated.

  • head (Optional[int, None]) – validate the first n rows. Rows overlapping with tail or sample are de-duplicated.

  • tail (Optional[int, None]) – validate the last n rows. Rows overlapping with head or sample are de-duplicated.

  • sample (Optional[int, None]) – validate a random sample of n rows. Rows overlapping with head or tail are de-duplicated.

  • random_state (Optional[int, None]) – random seed for the sample argument.

  • lazy (bool) – if True, evaluates the dataframe against all validation checks and raises a single SchemaErrors exception that collects every failure. Otherwise, raises SchemaError as soon as the first failure occurs.

  • inplace (bool) – if True, applies coercion to the object of validation, otherwise creates a copy of the data.

Return type:

TDataObject