pandera.api.dataframe.container.DataFrameSchema

class pandera.api.dataframe.container.DataFrameSchema(columns=None, checks=None, parsers=None, index=None, dtype=None, coerce=False, strict=False, name=None, ordered=False, unique=None, report_duplicates='all', unique_column_names=False, add_missing_columns=False, title=None, description=None, metadata=None, drop_invalid_rows=False)[source]

Library-agnostic base class for DataFrameSchema definitions.

Parameters:
  • columns (mapping of column names to column schema components) – a dict where keys are column names and values are Column objects specifying the datatypes and properties of a particular column.

  • checks (Union[Check, List[Union[Check, Hypothesis]], None]) – dataframe-wide checks.

  • parsers (Union[Parser, List[Parser], None]) – dataframe-wide parsers.

  • index – specify the datatypes and properties of the index.

  • dtype (Optional[Any]) – datatype of the dataframe. This overrides the data types specified in any of the columns. If a string is specified, it is assumed to be one of the valid pandas string dtype values: http://pandas.pydata.org/pandas-docs/stable/basics.html#dtypes.

  • coerce (bool) – whether or not to coerce all of the columns on validation. This overrides any coerce setting at the column or index level. This has no effect on columns where dtype=None.

  • strict (Union[bool, Literal[‘filter’]]) – ensure that all and only the columns defined in the schema are present in the dataframe. If set to ‘filter’, only the columns in the schema are passed through to the validated dataframe; an error is still raised if a column defined in the schema is not present in the dataframe.

  • name (Optional[str]) – name of the schema.

  • ordered (bool) – whether or not to validate the columns order.

  • unique (Union[str, List[str], None]) – a column name or list of column names that should be jointly unique.

  • report_duplicates (Union[Literal[‘exclude_first’], Literal[‘exclude_last’], Literal[‘all’]]) – how to report unique errors. exclude_first: report all duplicates except the first occurrence; exclude_last: report all duplicates except the last occurrence; all (default): report all duplicates.

  • unique_column_names (bool) – whether or not column names must be unique.

  • add_missing_columns (bool) – add missing columns to the dataframe, filled with the default value if one is specified in the column schema, or NaN if the column is nullable.

  • title (Optional[str]) – A human-readable label for the schema.

  • description (Optional[str]) – An arbitrary textual description of the schema.

  • metadata (Optional[dict]) – Optional key-value metadata for the schema.

  • drop_invalid_rows (bool) – if True, drop invalid rows on validation.

Raises:

SchemaInitError – if the schema cannot be built from the given parameters.

Examples:

>>> import pandera as pa
>>>
>>> schema = pa.DataFrameSchema({
...     "str_column": pa.Column(str),
...     "float_column": pa.Column(float),
...     "int_column": pa.Column(int),
...     "date_column": pa.Column(pa.DateTime),
... })

Use the pandas API to define checks, which take a function with the signature pd.Series -> Union[bool, pd.Series], where the output series contains boolean values.

>>> schema_withchecks = pa.DataFrameSchema({
...     "probability": pa.Column(
...         float, pa.Check(lambda s: (s >= 0) & (s <= 1))),
...
...     # check that the "category" column contains a few discrete
...     # values, and the majority of the entries are dogs.
...     "category": pa.Column(
...         str, [
...             pa.Check(lambda s: s.isin(["dog", "cat", "duck"])),
...             pa.Check(lambda s: (s == "dog").mean() > 0.5),
...         ]),
... })

See the pandera documentation for more usage details.

Attributes

BACKEND_REGISTRY

coerce

Whether to coerce series to specified type.

dtype

Get the dtype property.

dtypes

A dict where the keys are column names and values are DataType s for the column.

properties

Get the properties of the schema for serialization purposes.

unique

List of columns that should be jointly unique.

Methods

__init__(columns=None, checks=None, parsers=None, index=None, dtype=None, coerce=False, strict=False, name=None, ordered=False, unique=None, report_duplicates='all', unique_column_names=False, add_missing_columns=False, title=None, description=None, metadata=None, drop_invalid_rows=False)[source]

Library-agnostic base class for DataFrameSchema definitions.


add_columns(extra_schema_cols)[source]

Create a copy of the DataFrameSchema with extra columns.

Parameters:

extra_schema_cols (Dict[str, Any]) – a dict of column name and Column object key-value pairs, in the same format as the columns argument.

Return type:

Self

Returns:

a new DataFrameSchema with the extra_schema_cols added.

Example:

To add columns to the schema, pass a dictionary with column name and Column instance key-value pairs.

>>> import pandera as pa
>>>
>>> example_schema = pa.DataFrameSchema(
...    {
...        "category": pa.Column(str),
...        "probability": pa.Column(float),
...    }
... )
>>> print(
...     example_schema.add_columns({"even_number": pa.Column(pa.Bool)})
... )
<Schema DataFrameSchema(
    columns={
        'category': <Schema Column(name=category, type=DataType(str))>
        'probability': <Schema Column(name=probability, type=DataType(float64))>
        'even_number': <Schema Column(name=even_number, type=DataType(bool))>
    },
    checks=[],
    parsers=[],
    coerce=False,
    dtype=None,
    index=None,
    strict=False,
    name=None,
    ordered=False,
    unique_column_names=False,
    metadata=None,
    add_missing_columns=False
)>

See also

remove_columns()

coerce_dtype(check_obj)[source]

Coerce object to the expected type.

Return type:

~TDataObject

classmethod from_json(source)[source]

Create DataFrameSchema from a json file.

Parameters:

source – str, Path to a json schema, or a serialized json string.

Return type:

Self

Returns:

dataframe schema.

classmethod from_yaml(yaml_schema)[source]

Create DataFrameSchema from yaml file.

Parameters:

yaml_schema – str, Path to yaml schema, or serialized yaml string.

Return type:

Self

Returns:

dataframe schema.

get_dtypes(check_obj)[source]

Same as the dtypes property, but expands columns where regex == True based on the supplied dataframe.

Return type:

Dict[str, DataType]

Returns:

dictionary of columns and their associated dtypes.

get_metadata()[source]

Provide metadata at the schema and column levels.

Return type:

Optional[dict]

remove_columns(cols_to_remove)[source]

Removes columns from a DataFrameSchema and returns a new copy.

Parameters:

cols_to_remove (List) – Columns to be removed from the DataFrameSchema

Return type:

Self

Returns:

a new DataFrameSchema without the cols_to_remove

Raises:

SchemaInitError: if column not in schema.

Example:

To remove a column or set of columns from a schema, pass a list of columns to be removed:

>>> import pandera as pa
>>>
>>> example_schema = pa.DataFrameSchema(
...     {
...         "category" : pa.Column(str),
...         "probability": pa.Column(float)
...     }
... )
>>>
>>> print(example_schema.remove_columns(["category"]))
<Schema DataFrameSchema(
    columns={
        'probability': <Schema Column(name=probability, type=DataType(float64))>
    },
    checks=[],
    parsers=[],
    coerce=False,
    dtype=None,
    index=None,
    strict=False,
    name=None,
    ordered=False,
    unique_column_names=False,
    metadata=None,
    add_missing_columns=False
)>

See also

add_columns()

rename_columns(rename_dict)[source]

Rename columns using a dictionary of key-value pairs.

Parameters:

rename_dict (Dict[str, str]) – dictionary of ‘old_name’: ‘new_name’ key-value pairs.

Return type:

Self

Returns:

DataFrameSchema (copy of original)

Raises:

SchemaInitError if column not in the schema.

Example:

To rename a column or set of columns, pass a dictionary of old column names and new column names, similar to the pandas DataFrame method.

>>> import pandera as pa
>>>
>>> example_schema = pa.DataFrameSchema({
...     "category" : pa.Column(str),
...     "probability": pa.Column(float)
... })
>>>
>>> print(
...     example_schema.rename_columns({
...         "category": "categories",
...         "probability": "probabilities"
...     })
... )
<Schema DataFrameSchema(
    columns={
        'categories': <Schema Column(name=categories, type=DataType(str))>
        'probabilities': <Schema Column(name=probabilities, type=DataType(float64))>
    },
    checks=[],
    parsers=[],
    coerce=False,
    dtype=None,
    index=None,
    strict=False,
    name=None,
    ordered=False,
    unique_column_names=False,
    metadata=None,
    add_missing_columns=False
)>

See also

update_column()

reset_index(level=None, drop=False)[source]

A method for resetting the Index of a DataFrameSchema.

Parameters:
  • level (Optional[List[Any]]) – index level(s) to reset. Resets all levels by default.

  • drop (bool) – if True, remove the reset level(s) from the schema entirely instead of reclassifying them as columns.

Return type:

Self

Returns:

a new DataFrameSchema with the specified index level(s) reset.

Raises:

SchemaInitError if no index set in schema.

Examples:

Similar to the pandas reset_index method on a pandas DataFrame, this method can be used to fully or partially reset indices of a schema.

To remove the entire index from the schema, just call the reset_index method with default parameters.

>>> import pandera as pa
>>>
>>> example_schema = pa.DataFrameSchema(
...     {"probability" : pa.Column(float)},
...     index = pa.Index(name="unique_id", dtype=int)
... )
>>>
>>> print(example_schema.reset_index())
<Schema DataFrameSchema(
    columns={
        'probability': <Schema Column(name=probability, type=DataType(float64))>
        'unique_id': <Schema Column(name=unique_id, type=DataType(int64))>
    },
    checks=[],
    parsers=[],
    coerce=False,
    dtype=None,
    index=None,
    strict=False,
    name=None,
    ordered=False,
    unique_column_names=False,
    metadata=None,
    add_missing_columns=False
)>

This reclassifies an index (or indices) as a column (or columns).

Similarly, to partially alter the index, pass the name of the column you would like to be removed to the level parameter, and you may also decide whether to drop the levels with the drop parameter.

>>> example_schema = pa.DataFrameSchema({
...     "category" : pa.Column(str)},
...     index = pa.MultiIndex([
...         pa.Index(name="unique_id1", dtype=int),
...         pa.Index(name="unique_id2", dtype=str)
...         ]
...     )
... )
>>> print(example_schema.reset_index(level = ["unique_id1"]))
<Schema DataFrameSchema(
    columns={
        'category': <Schema Column(name=category, type=DataType(str))>
        'unique_id1': <Schema Column(name=unique_id1, type=DataType(int64))>
    },
    checks=[],
    parsers=[],
    coerce=False,
    dtype=None,
    index=<Schema Index(name=unique_id2, type=DataType(str))>,
    strict=False,
    name=None,
    ordered=False,
    unique_column_names=False,
    metadata=None,
    add_missing_columns=False
)>

See also

set_index()

select_columns(columns)[source]

Select subset of columns in the schema.

New in version 0.4.5

Parameters:

columns (List[Any]) – list of column names to select.

Return type:

Self

Returns:

DataFrameSchema (copy of original) with only the selected columns, in the order specified.

Raises:

SchemaInitError if column not in the schema.

Example:

To subset and reorder a schema by column, and return a new schema:

>>> import pandera as pa
>>>
>>> example_schema = pa.DataFrameSchema({
...     "category": pa.Column(str),
...     "probability": pa.Column(float),
...     "timestamp": pa.Column(pa.DateTime)
... })
>>>
>>> print(example_schema.select_columns(['probability', 'category']))
<Schema DataFrameSchema(
    columns={
        'probability': <Schema Column(name=probability, type=DataType(float64))>
        'category': <Schema Column(name=category, type=DataType(str))>
    },
    checks=[],
    parsers=[],
    coerce=False,
    dtype=None,
    index=None,
    strict=False,
    name=None,
    ordered=False,
    unique_column_names=False,
    metadata=None,
    add_missing_columns=False
)>

Note

If an index is present in the schema, it will also be included in the new schema. The columns will be reordered to match the order in columns.

set_index(keys, drop=True, append=False)[source]

A method for setting the Index of a DataFrameSchema, via an existing Column or list of columns.

Parameters:
  • keys (List[str]) – column labels to set as the index.

  • drop (bool) – default True; remove the columns used for the new index from the schema's columns.

  • append (bool) – default False; if True, append the columns to the existing index, yielding a MultiIndex.

Return type:

Self

Returns:

a new DataFrameSchema with specified column(s) in the index.

Raises:

SchemaInitError if column not in the schema.

Examples:

Just as you would set the index in a pandas DataFrame from an existing column, you can set an index within the schema from an existing column in the schema.

>>> import pandera as pa
>>>
>>> example_schema = pa.DataFrameSchema({
...     "category" : pa.Column(str),
...     "probability": pa.Column(float)})
>>>
>>> print(example_schema.set_index(['category']))
<Schema DataFrameSchema(
    columns={
        'probability': <Schema Column(name=probability, type=DataType(float64))>
    },
    checks=[],
    parsers=[],
    coerce=False,
    dtype=None,
    index=<Schema Index(name=category, type=DataType(str))>,
    strict=False,
    name=None,
    ordered=False,
    unique_column_names=False,
    metadata=None,
    add_missing_columns=False
)>

If you have an existing index in your schema, and you would like to append a new column as an index to it (yielding a MultiIndex), just use set_index as you would in pandas.

>>> example_schema = pa.DataFrameSchema(
...     {
...         "column1": pa.Column(str),
...         "column2": pa.Column(int)
...     },
...     index=pa.Index(name = "column3", dtype = int)
... )
>>>
>>> print(example_schema.set_index(["column2"], append = True))
<Schema DataFrameSchema(
    columns={
        'column1': <Schema Column(name=column1, type=DataType(str))>
    },
    checks=[],
    parsers=[],
    coerce=False,
    dtype=None,
    index=<Schema MultiIndex(
        indexes=[
            <Schema Index(name=column3, type=DataType(int64))>
            <Schema Index(name=column2, type=DataType(int64))>
        ]
        coerce=False,
        strict=False,
        name=None,
        ordered=True
    )>,
    strict=False,
    name=None,
    ordered=False,
    unique_column_names=False,
    metadata=None,
    add_missing_columns=False
)>

See also

reset_index()

to_json(target: None = None, **kwargs) str[source]
to_json(target: PathLike, **kwargs) None

Write DataFrameSchema to json file.

Parameters:

target (Optional[PathLike]) – file target to write to. If None, dumps to string.

Return type:

Optional[str]

Returns:

json string if target is None, otherwise returns None.

to_script(fp=None)[source]

Write DataFrameSchema to python script.

Parameters:

fp (Optional[PathLike]) – str, Path to write the python script to.

Return type:

Self

Returns:

dataframe schema.

to_yaml(stream=None)[source]

Write DataFrameSchema to yaml file.

Parameters:

stream (Optional[PathLike]) – file stream to write to. If None, dumps to string.

Return type:

Optional[str]

Returns:

yaml string if stream is None, otherwise returns None.

update_column(column_name, **kwargs)[source]

Create copy of a DataFrameSchema with updated column properties.

Parameters:
  • column_name (str)

  • kwargs – key-word arguments supplied to Column

Return type:

Self

Returns:

a new DataFrameSchema with updated column

Raises:

SchemaInitError: if column not in schema or you try to change the name.

Example:

Calling schema.update_column returns the DataFrameSchema with the updated column.

>>> import pandera as pa
>>>
>>> example_schema = pa.DataFrameSchema({
...     "category" : pa.Column(str),
...     "probability": pa.Column(float)
... })
>>> print(
...     example_schema.update_column(
...         'category', dtype=pa.Category
...     )
... )
<Schema DataFrameSchema(
    columns={
        'category': <Schema Column(name=category, type=DataType(category))>
        'probability': <Schema Column(name=probability, type=DataType(float64))>
    },
    checks=[],
    parsers=[],
    coerce=False,
    dtype=None,
    index=None,
    strict=False,
    name=None,
    ordered=False,
    unique_column_names=False,
    metadata=None,
    add_missing_columns=False
)>

See also

rename_columns()

update_columns(update_dict)[source]

Create copy of a DataFrameSchema with updated column properties.

Parameters:

update_dict (Dict[str, Dict[str, Any]])

Return type:

Self

Returns:

a new DataFrameSchema with updated columns

Raises:

SchemaInitError: if column not in schema or you try to change the name.

Example:

Calling schema.update_columns returns the DataFrameSchema with the updated columns.

>>> import pandera as pa
>>>
>>> example_schema = pa.DataFrameSchema({
...     "category" : pa.Column(str),
...     "probability": pa.Column(float)
... })
>>>
>>> print(
...     example_schema.update_columns(
...         {"category": {"dtype":pa.Category}}
...     )
... )
<Schema DataFrameSchema(
    columns={
        'category': <Schema Column(name=category, type=DataType(category))>
        'probability': <Schema Column(name=probability, type=DataType(float64))>
    },
    checks=[],
    parsers=[],
    coerce=False,
    dtype=None,
    index=None,
    strict=False,
    name=None,
    ordered=False,
    unique_column_names=False,
    metadata=None,
    add_missing_columns=False
)>
__call__(dataframe, head=None, tail=None, sample=None, random_state=None, lazy=False, inplace=False)[source]

Alias for DataFrameSchema.validate() method.

Parameters:
  • dataframe (pd.DataFrame) – the dataframe to be validated.

  • head (int) – validate the first n rows. Rows overlapping with tail or sample are de-duplicated.

  • tail (int) – validate the last n rows. Rows overlapping with head or sample are de-duplicated.

  • sample (Optional[int]) – validate a random sample of n rows. Rows overlapping with head or tail are de-duplicated.

  • random_state (Optional[int]) – random seed for the sample argument.

  • lazy (bool) – if True, lazily evaluates the dataframe against all validation checks and raises a SchemaErrors exception. Otherwise, raises a SchemaError as soon as one occurs.

  • inplace (bool) – if True, applies coercion to the object of validation, otherwise creates a copy of the data.

Return type:

~TDataObject