pandera.api.pyspark.container.DataFrameSchema

class pandera.api.pyspark.container.DataFrameSchema(columns=None, checks=None, dtype=None, coerce=False, strict=False, name=None, ordered=False, unique=None, report_duplicates='all', unique_column_names=False, title=None, description=None, metadata=None)[source]

A light-weight PySpark DataFrame validator.

Initialize DataFrameSchema validator.

Parameters:
  • columns (mapping of column names to column schema components) – a dict where keys are column names and values are Column objects specifying the datatypes and properties of each column.

  • checks (Optional[CheckList]) – dataframe-wide checks.

  • dtype (PySparkDtypeInputTypes) – datatype of the dataframe. This overrides the data types specified in any of the columns. If a string is specified, it is assumed to be one of the valid pyspark type strings: https://spark.apache.org/docs/latest/sql-ref-datatypes.html.

  • coerce (bool) – whether or not to coerce all of the columns on validation. This has no effect on columns where dtype=None

  • strict (StrictType) – ensure that all and only the columns defined in the schema are present in the dataframe. If set to 'filter', only the columns in the schema will be passed to the validated dataframe. If set to 'filter' and columns defined in the schema are not present in the dataframe, an error will be raised.

  • name (Optional[str]) – name of the schema.

  • ordered (bool) – whether or not to validate the columns order.

  • unique (Optional[Union[str, List[str]]]) – a list of columns that should be jointly unique.

  • report_duplicates (UniqueSettings) – how to report unique errors: exclude_first – report all duplicates except the first occurrence; exclude_last – report all duplicates except the last occurrence; all – (default) report all duplicates.

  • unique_column_names (bool) – whether or not column names must be unique.

  • title (Optional[str]) – A human-readable label for the schema.

  • description (Optional[str]) – An arbitrary textual description of the schema.

  • metadata (Optional[dict]) – optional key-value metadata for the schema.

Raises:

SchemaInitError – if impossible to build schema from parameters

Examples:

>>> import pandera.pyspark as psa
>>> import pyspark.sql.types as pt
>>>
>>> schema = psa.DataFrameSchema({
...     "str_column": psa.Column(str),
...     "float_column": psa.Column(float),
...     "int_column": psa.Column(int),
...     "date_column": psa.Column(pt.DateType),
... })

Use the pyspark API to define checks, which take a function with the signature ps.DataFrame -> bool, i.e. the output contains boolean values.

>>> schema_withchecks = psa.DataFrameSchema({
...     "probability": psa.Column(
...         pt.DoubleType(), psa.Check.greater_than(0)),
...
...     # check that entries in the "category" column
...     # start with "B"
...     "category": psa.Column(
...         pt.StringType(), psa.Check.str_startswith("B"),
...     ),
... })
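
The schema-wide options documented above can be combined with the column definitions. A minimal sketch, assuming the imports from the examples above (the column names and schema name are illustrative):

>>> schema_with_options = psa.DataFrameSchema(
...     columns={
...         "id": psa.Column(pt.IntegerType()),
...         "name": psa.Column(pt.StringType()),
...     },
...     coerce=True,      # coerce columns to the declared datatypes on validation
...     strict=True,      # reject dataframes with columns not declared in the schema
...     unique=["id"],    # "id" values must be unique
...     name="example_schema",
... )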

See here for more usage details.

Attributes

BACKEND_REGISTRY

coerce

Whether to coerce series to specified type.

dtype

Get the dtype property.

dtypes

A dict where the keys are column names and values are DataType objects for the columns.

properties

Get the properties of the schema for serialization purposes.

unique

List of columns that should be jointly unique.

Methods

__init__(columns=None, checks=None, dtype=None, coerce=False, strict=False, name=None, ordered=False, unique=None, report_duplicates='all', unique_column_names=False, title=None, description=None, metadata=None)[source]

Initialize DataFrameSchema validator.

Parameters:
  • columns (mapping of column names to column schema components) – a dict where keys are column names and values are Column objects specifying the datatypes and properties of each column.

  • checks (Optional[CheckList]) – dataframe-wide checks.

  • dtype (PySparkDtypeInputTypes) – datatype of the dataframe. This overrides the data types specified in any of the columns. If a string is specified, it is assumed to be one of the valid pyspark type strings: https://spark.apache.org/docs/latest/sql-ref-datatypes.html.

  • coerce (bool) – whether or not to coerce all of the columns on validation. This has no effect on columns where dtype=None

  • strict (StrictType) – ensure that all and only the columns defined in the schema are present in the dataframe. If set to 'filter', only the columns in the schema will be passed to the validated dataframe. If set to 'filter' and columns defined in the schema are not present in the dataframe, an error will be raised.

  • name (Optional[str]) – name of the schema.

  • ordered (bool) – whether or not to validate the columns order.

  • unique (Optional[Union[str, List[str]]]) – a list of columns that should be jointly unique.

  • report_duplicates (UniqueSettings) – how to report unique errors: exclude_first – report all duplicates except the first occurrence; exclude_last – report all duplicates except the last occurrence; all – (default) report all duplicates.

  • unique_column_names (bool) – whether or not column names must be unique.

  • title (Optional[str]) – A human-readable label for the schema.

  • description (Optional[str]) – An arbitrary textual description of the schema.

  • metadata (Optional[dict]) – optional key-value metadata for the schema.

Raises:

SchemaInitError – if impossible to build schema from parameters

Examples:

>>> import pandera.pyspark as psa
>>> import pyspark.sql.types as pt
>>>
>>> schema = psa.DataFrameSchema({
...     "str_column": psa.Column(str),
...     "float_column": psa.Column(float),
...     "int_column": psa.Column(int),
...     "date_column": psa.Column(pt.DateType),
... })

Use the pyspark API to define checks, which take a function with the signature ps.DataFrame -> bool, i.e. the output contains boolean values.

>>> schema_withchecks = psa.DataFrameSchema({
...     "probability": psa.Column(
...         pt.DoubleType(), psa.Check.greater_than(0)),
...
...     # check that entries in the "category" column
...     # start with "B"
...     "category": psa.Column(
...         pt.StringType(), psa.Check.str_startswith("B"),
...     ),
... })

See here for more usage details.

coerce_dtype(check_obj)[source]

Coerce object to the expected type.

Return type:

DataFrame
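
A hedged sketch of calling this directly, assuming a SparkSession spark and the imports from the examples above (the column name and data are illustrative):

>>> df = spark.createDataFrame([("9.5",), ("15.0",)], ["price"])
>>> coercing_schema = psa.DataFrameSchema(
...     {"price": psa.Column(pt.DoubleType())},
...     coerce=True,
... )
>>> coerced_df = coercing_schema.coerce_dtype(df)  # columns cast to the declared types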

classmethod from_json(source)[source]

Create DataFrameSchema from json file.

Parameters:

source – str, Path to json schema, or serialized yaml string.

Return type:

DataFrameSchema

Returns:

dataframe schema.
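
A minimal sketch, where "schema.json" is a hypothetical path to a schema previously produced by to_json():

>>> import pandera.pyspark as psa
>>> loaded_schema = psa.DataFrameSchema.from_json("schema.json")  # hypothetical path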

classmethod from_yaml(yaml_schema)[source]

Create DataFrameSchema from yaml file.

Parameters:

yaml_schema – str, Path to yaml schema, or serialized yaml string.

Return type:

DataFrameSchema

Returns:

dataframe schema.
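
A minimal sketch, where "schema.yaml" is a hypothetical path to a schema previously produced by to_yaml():

>>> import pandera.pyspark as psa
>>> loaded_schema = psa.DataFrameSchema.from_yaml("schema.yaml")  # hypothetical path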

get_dtypes(dataframe)[source]

Same as the dtype property, but expands columns where regex == True based on the supplied dataframe.

Return type:

Dict[str, DataType]

Returns:

dictionary of columns and their associated dtypes.
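
For illustration, a sketch assuming a SparkSession spark and the imports from the examples above (the column definition and data are illustrative):

>>> schema = psa.DataFrameSchema({"product": psa.Column(pt.StringType())})
>>> df = spark.createDataFrame([("Bread",), ("Butter",)], ["product"])
>>> dtype_map = schema.get_dtypes(df)  # dict mapping column names to pandera DataTypes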

get_metadata()[source]

Provide metadata at the column and schema levels.

Return type:

Optional[dict]
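
A minimal sketch using the schema-level metadata argument from the constructor, assuming the imports from the examples above (the metadata contents are illustrative):

>>> schema = psa.DataFrameSchema(
...     {"price": psa.Column(pt.IntegerType())},
...     name="product_schema",
...     metadata={"owner": "data-team"},  # illustrative schema-level metadata
... )
>>> meta = schema.get_metadata()  # metadata attached to the schema and its columns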

static register_default_backends(check_obj_cls)[source]

Register default backends.

This method is invoked in the get_backend method so that the appropriate validation backend is loaded at validation time instead of schema-definition time.

This method needs to be implemented by the schema subclass.

to_ddl()[source]

Recover fields of DataFrameSchema as a Pyspark DDL string.

Return type:

str

Returns:

String with current schema fields, in compact DDL format.
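
For example (a sketch assuming the imports from the examples above; the exact string depends on the declared column types):

>>> schema = psa.DataFrameSchema({
...     "product": psa.Column(pt.StringType()),
...     "price": psa.Column(pt.IntegerType()),
... })
>>> ddl = schema.to_ddl()  # a compact DDL string, e.g. along the lines of "product STRING, price INT"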

to_json(target: None = None, **kwargs) → str[source]
to_json(target: PathLike, **kwargs) → None

Write DataFrameSchema to json file.

Parameters:

target (Optional[PathLike]) – file target to write to. If None, dumps to string.

Return type:

Optional[str]

Returns:

json string if target is None, otherwise returns None.
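
A minimal sketch, assuming the imports from the examples above ("schema.json" is a hypothetical output path):

>>> schema = psa.DataFrameSchema({"price": psa.Column(pt.IntegerType())})
>>> json_str = schema.to_json()    # target is None, so the json string is returned
>>> schema.to_json("schema.json")  # writes to the hypothetical file path instead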

to_script(fp=None)[source]

Write DataFrameSchema to a Python script.

Parameters:

fp – str, Path to write the script to.

Return type:

DataFrameSchema

Returns:

dataframe schema.

to_structtype()[source]

Recover fields of DataFrameSchema as a Pyspark StructType object.

Because the output of this method is typically used to specify a read schema in Pyspark (avoiding automatic schema inference), nullable=False properties are ignored; nullability is instead checked by the pandera validations after the dataset is read.

Return type:

StructType

Returns:

StructType object with current schema fields.
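
A sketch of the read-schema use case described above, assuming a SparkSession spark, the imports from the examples above, and a hypothetical input file "products.csv":

>>> schema = psa.DataFrameSchema({
...     "product": psa.Column(pt.StringType()),
...     "price": psa.Column(pt.IntegerType()),
... })
>>> read_schema = schema.to_structtype()
>>> df = spark.read.schema(read_schema).csv("products.csv", header=True)
>>> validated = schema.validate(df)  # nullability and checks are enforced here, not at read time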

to_yaml(stream=None)[source]

Write DataFrameSchema to yaml file.

Parameters:

stream (Optional[PathLike]) – file stream to write to. If None, dumps to string.

Return type:

Optional[str]

Returns:

yaml string if stream is None, otherwise returns None.
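
A minimal sketch, assuming the imports from the examples above, including a round trip through from_yaml() (which accepts a serialized yaml string):

>>> schema = psa.DataFrameSchema({"price": psa.Column(pt.IntegerType())})
>>> yaml_str = schema.to_yaml()  # stream is None, so the yaml string is returned
>>> round_tripped = psa.DataFrameSchema.from_yaml(yaml_str)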

validate(check_obj, head=None, tail=None, sample=None, random_state=None, lazy=True, inplace=False)[source]

Validate a dataframe against the schema, checking that every column in the dataframe has a corresponding column in the schema.

Parameters:
  • check_obj (DataFrame) – DataFrame object i.e. the dataframe to be validated.

  • head (Optional[int]) – Not used since spark has no concept of head or tail

  • tail (Optional[int]) – Not used since spark has no concept of head or tail

  • sample (Optional[int]) – validate a random sample of rows, expressed as a fraction between 0 and 1; for example, 0.1 samples 10% of the rows. See the PySpark documentation: https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.sample.html

  • random_state (Optional[int]) – random seed for the sample argument.

  • lazy (bool) – if True, lazily evaluates the dataframe against all validation checks and raises a SchemaErrors exception. Otherwise, raises a SchemaError as soon as one occurs.

  • inplace (bool) – if True, applies coercion to the object of validation, otherwise creates a copy of the data.

Returns:

validated DataFrame

Raises:

SchemaError – when DataFrame violates built-in or custom checks.

Example:

Calling schema.validate returns the dataframe.

>>> import pandera.pyspark as psa
>>> from pyspark.sql import SparkSession
>>> import pyspark.sql.types as T
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> data = [("Bread", 9), ("Butter", 15)]
>>> spark_schema = T.StructType(
...         [
...             T.StructField("product", T.StringType(), False),
...             T.StructField("price", T.IntegerType(), False),
...         ],
...     )
>>> df = spark.createDataFrame(data=data, schema=spark_schema)
>>>
>>> schema_withchecks = psa.DataFrameSchema(
...         columns={
...             "product": psa.Column("str", checks=psa.Check.str_startswith("B")),
...             "price": psa.Column("int", checks=psa.Check.gt(5)),
...         },
...         name="product_schema",
...         description="schema for product info",
...         title="ProductSchema",
...     )
>>>
>>> schema_withchecks.validate(df).take(2)
[Row(product='Bread', price=9), Row(product='Butter', price=15)]

__call__(dataframe, head=None, tail=None, sample=None, random_state=None, lazy=True, inplace=False)[source]

Alias for DataFrameSchema.validate() method.

Parameters:
  • dataframe (DataFrame) – DataFrame object i.e. the dataframe to be validated.

  • head (int) – Not used since spark has no concept of head or tail.

  • tail (int) – Not used since spark has no concept of head or tail.

  • sample (Optional[int]) – validate a random sample of rows, expressed as a fraction between 0 and 1; for example, 0.1 samples 10% of the rows. See the PySpark documentation: https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.sample.html

  • random_state (Optional[int]) – random seed for the sample argument.

  • lazy (bool) – if True, lazily evaluates the dataframe against all validation checks and raises a SchemaErrors exception. Otherwise, raises a SchemaError as soon as one occurs.

  • inplace (bool) – if True, applies coercion to the object of validation, otherwise creates a copy of the data.
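
Example:

Since this is an alias for validate(), the schema and dataframe from the validate() example above can be applied directly:

>>> validated_df = schema_withchecks(df)  # equivalent to schema_withchecks.validate(df)
>>> validated_df.take(2)
[Row(product='Bread', price=9), Row(product='Butter', price=15)]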