pandera.schemas.DataFrameSchema

class pandera.schemas.DataFrameSchema(columns=None, checks=None, index=None, dtype=None, transformer=None, coerce=False, strict=False, name=None, ordered=False, pandas_dtype=None, unique=None)

A light-weight pandas DataFrame validator.

Initialize DataFrameSchema validator.

Parameters
  • columns (mapping of column names to column schema components) – a dict where keys are column names and values are Column objects specifying the datatypes and properties of a particular column.

  • checks (Union[Check, Hypothesis, List[Union[Check, Hypothesis]], None]) – dataframe-wide checks.

  • index – specify the datatypes and properties of the index.

  • dtype (Union[str, type, DataType, ExtensionDtype, dtype, None]) – datatype of the dataframe. This overrides the data types specified in any of the columns. If a string is specified, it is assumed to be one of the valid pandas string values: http://pandas.pydata.org/pandas-docs/stable/basics.html#dtypes.

  • transformer (Optional[Callable]) –

    a callable with signature: pandas.DataFrame -> pandas.DataFrame. If specified, calling validate will verify properties of the columns and return the transformed dataframe object.

    Warning

    This feature is deprecated and no longer has an effect on validated dataframes.

  • coerce (bool) – whether or not to coerce all of the columns on validation. This has no effect on columns where pandas_dtype=None.

  • strict (Union[bool, str]) – ensure that all and only the columns defined in the schema are present in the dataframe. If set to ‘filter’, only the columns in the schema will be passed to the validated dataframe. If set to ‘filter’ and columns defined in the schema are not present in the dataframe, an error is raised.

  • name (Optional[str]) – name of the schema.

  • ordered (bool) – whether or not to validate the columns order.

  • pandas_dtype (Union[str, type, DataType, ExtensionDtype, dtype, None]) –

    alias of dtype for backwards compatibility.

    Warning

    This option will be deprecated in 0.8.0.

  • unique (Union[str, List[str], None]) – a list of columns that should be jointly unique.

Raises
Examples

>>> import pandera as pa
>>>
>>> schema = pa.DataFrameSchema({
...     "str_column": pa.Column(str),
...     "float_column": pa.Column(float),
...     "int_column": pa.Column(int),
...     "date_column": pa.Column(pa.DateTime),
... })

Use the pandas API to define checks. Each check takes a function with the signature pd.Series -> Union[bool, pd.Series], where the output series contains boolean values.

>>> schema_withchecks = pa.DataFrameSchema({
...     "probability": pa.Column(
...         float, pa.Check(lambda s: (s >= 0) & (s <= 1))),
...
...     # check that the "category" column contains a few discrete
...     # values, and the majority of the entries are dogs.
...     "category": pa.Column(
...         str, [
...             pa.Check(lambda s: s.isin(["dog", "cat", "duck"])),
...             pa.Check(lambda s: (s == "dog").mean() > 0.5),
...         ]),
... })

See the DataFrame Schemas section of the pandera documentation for more usage details.

Attributes

coerce

Whether to coerce series to specified type.

dtype

Get the dtype property.

dtypes

A dict where the keys are column names and values are DataType objects for the column.

ordered

Whether or not to validate the columns order.

unique

List of columns that should be jointly unique.

Methods

__init__

Initialize DataFrameSchema validator.

add_columns

Create a copy of the DataFrameSchema with extra columns.

coerce_dtype

Coerce dataframe to the type specified in dtype.

example

Generate an example of a particular size.

from_yaml

Create DataFrameSchema from yaml file.

get_dtypes

Same as the dtypes property, but expands columns where regex == True based on the supplied dataframe.

remove_columns

Removes columns from a DataFrameSchema and returns a new copy.

rename_columns

Rename columns using a dictionary of key-value pairs.

reset_index

A method for resetting the Index of a DataFrameSchema.

select_columns

Select subset of columns in the schema.

set_index

A method for setting the Index of a DataFrameSchema, via an existing Column or list of columns.

strategy

Create a hypothesis strategy for generating a DataFrame.

to_script

Write DataFrameSchema to a Python script.

to_yaml

Write DataFrameSchema to yaml file.

update_column

Create a copy of a DataFrameSchema with updated properties for a single column.

update_columns

Create a copy of a DataFrameSchema with updated properties for multiple columns.

validate

Check if all columns in a dataframe have a column in the Schema.

__call__

Alias for DataFrameSchema.validate() method.