pandera.schemas.DataFrameSchema
- class pandera.schemas.DataFrameSchema(columns=None, checks=None, index=None, dtype=None, coerce=False, strict=False, name=None, ordered=False, unique=None, report_duplicates='all', unique_column_names=False, title=None, description=None)
A light-weight pandas DataFrame validator.
Initialize DataFrameSchema validator.
- Parameters
columns (mapping of column names and column schema component.) – a dict where keys are column names and values are Column objects specifying the datatypes and properties of a particular column.
checks (CheckList) – dataframe-wide checks.
index – specify the datatypes and properties of the index.
dtype (PandasDtypeInputTypes) – datatype of the dataframe. This overrides the data types specified in any of the columns. If a string is specified, then assumes one of the valid pandas string values: http://pandas.pydata.org/pandas-docs/stable/basics.html#dtypes.
coerce (bool) – whether or not to coerce all of the columns on validation. This has no effect on columns where dtype=None.
strict (StrictType) – ensure that all and only the columns defined in the schema are present in the dataframe. If set to ‘filter’, only the columns in the schema will be passed to the validated dataframe. If set to ‘filter’ and columns defined in the schema are not present in the dataframe, an error is raised.
name (Optional[str]) – name of the schema.
ordered (bool) – whether or not to validate the columns order.
unique (Optional[Union[str, List[str]]]) – a list of columns that should be jointly unique.
report_duplicates (UniqueSettings) – how to report unique errors: exclude_first reports all duplicates except the first occurrence; exclude_last reports all duplicates except the last occurrence; all (default) reports all duplicates.
unique_column_names (bool) – whether or not column names must be unique.
title (Optional[str]) – A human-readable label for the schema.
description (Optional[str]) – An arbitrary textual description of the schema.
- Raises
SchemaInitError – if impossible to build schema from parameters
- Examples
>>> import pandera as pa
>>>
>>> schema = pa.DataFrameSchema({
...     "str_column": pa.Column(str),
...     "float_column": pa.Column(float),
...     "int_column": pa.Column(int),
...     "date_column": pa.Column(pa.DateTime),
... })
Use the pandas API to define checks, each of which takes a function with the signature:
pd.Series -> Union[bool, pd.Series]
where the output series contains boolean values.
>>> schema_withchecks = pa.DataFrameSchema({
...     "probability": pa.Column(
...         float, pa.Check(lambda s: (s >= 0) & (s <= 1))),
...
...     # check that the "category" column contains a few discrete
...     # values, and the majority of the entries are dogs.
...     "category": pa.Column(
...         str, [
...             pa.Check(lambda s: s.isin(["dog", "cat", "duck"])),
...             pa.Check(lambda s: (s == "dog").mean() > 0.5),
...         ]),
... })
See here for more usage details.
Attributes
coerce
Whether to coerce series to specified type.
description
An arbitrary textual description of the schema.
dtype
Get the dtype property.
dtypes
A dict where the keys are column names and values are DataTypes for the column.
ordered
Whether or not to validate the columns order.
title
A human-readable label for the schema.
unique
List of columns that should be jointly unique.
unique_column_names
Whether column names must be unique, that is, whether multiple columns with the same name are disallowed.
Methods
Initialize DataFrameSchema validator.
Create a copy of the DataFrameSchema with extra columns.
Coerce dataframe to the type specified in dtype.
Generate an example of a particular size.
Create DataFrameSchema from yaml file.
Same as the dtype property, but expands columns where regex == True based on the supplied dataframe.
Removes columns from a DataFrameSchema and returns a new copy.
Rename columns using a dictionary of key-value pairs.
A method for resetting the Index of a DataFrameSchema.
Select subset of columns in the schema.
A method for setting the Index of a DataFrameSchema, via an existing Column or list of columns.
Create a hypothesis strategy for generating a DataFrame.
Create DataFrameSchema from yaml file.
Write DataFrameSchema to yaml file.
Create copy of a DataFrameSchema with updated column properties.
Create copy of a DataFrameSchema with updated column properties.
Check if all columns in a dataframe have a column in the Schema.
Alias for DataFrameSchema.validate() method.