pandera.schemas.DataFrameSchema.__init__
- DataFrameSchema.__init__(columns=None, checks=None, index=None, dtype=None, transformer=None, coerce=False, strict=False, name=None, ordered=False, pandas_dtype=None, unique=None)
Initialize DataFrameSchema validator.
- Parameters
columns (mapping of column names and column schema component.) – a dict where keys are column names and values are Column objects specifying the datatypes and properties of a particular column.
checks (CheckList) – dataframe-wide checks.
index – specify the datatypes and properties of the index.
dtype (PandasDtypeInputTypes) – datatype of the dataframe. This overrides the data types specified in any of the columns. If a string is specified, it is assumed to be one of the valid pandas string values: http://pandas.pydata.org/pandas-docs/stable/basics.html#dtypes.
transformer (Callable) – a callable with signature: pandas.DataFrame -> pandas.DataFrame. If specified, calling validate will verify properties of the columns and return the transformed dataframe object.
Warning: This feature is deprecated and no longer has an effect on validated dataframes.
coerce (bool) – whether or not to coerce all of the columns on validation. This has no effect on columns where pandas_dtype=None.
strict (Union[bool, str]) – ensure that all and only the columns defined in the schema are present in the dataframe. If set to 'filter', only the columns in the schema will be passed to the validated dataframe. If set to 'filter' and columns defined in the schema are missing from the dataframe, an error is raised.
name (Optional[str]) – name of the schema.
ordered (bool) – whether or not to validate the column order.
pandas_dtype (PandasDtypeInputTypes) – alias of dtype for backwards compatibility.
Warning: This option will be deprecated in 0.8.0.
unique (Optional[Union[str, List[str]]]) – a list of columns that should be jointly unique.
- Raises
SchemaInitError – if impossible to build schema from parameters
SchemaInitError – if dtype and pandas_dtype are both supplied.
- Examples
>>> import pandera as pa
>>>
>>> schema = pa.DataFrameSchema({
...     "str_column": pa.Column(str),
...     "float_column": pa.Column(float),
...     "int_column": pa.Column(int),
...     "date_column": pa.Column(pa.DateTime),
... })
Use the pandas API to define checks. Each check takes a function with the signature:
pd.Series -> Union[bool, pd.Series]
where the output series contains boolean values.
>>> schema_withchecks = pa.DataFrameSchema({
...     "probability": pa.Column(
...         float, pa.Check(lambda s: (s >= 0) & (s <= 1))),
...
...     # check that the "category" column contains a few discrete
...     # values, and the majority of the entries are dogs.
...     "category": pa.Column(
...         str, [
...             pa.Check(lambda s: s.isin(["dog", "cat", "duck"])),
...             pa.Check(lambda s: (s == "dog").mean() > 0.5),
...         ]),
... })
See here for more usage details.