pandera.api.pyspark.container.DataFrameSchema¶
- class pandera.api.pyspark.container.DataFrameSchema(columns=None, checks=None, parsers=None, index=None, dtype=None, coerce=False, strict=False, name=None, ordered=False, unique=None, report_duplicates='all', unique_column_names=False, add_missing_columns=False, title=None, description=None, metadata=None, drop_invalid_rows=False)[source]¶
A light-weight PySpark DataFrame validator.
Library-agnostic base class for DataFrameSchema definitions.
- Parameters:
columns (mapping of column names and column schema component.) – a dict where keys are column names and values are Column objects specifying the datatypes and properties of a particular column.
checks (
Union[Check,list[Union[Check,Hypothesis]],None]) – dataframe-wide checks.parsers (
Union[Parser,list[Parser],None]) – dataframe-wide parsers.index – specify the datatypes and properties of the index.
dtype (
UnionType[Any,None]) – datatype of the dataframe. This overrides the data types specified in any of the columns. If a string is specified, then assumes one of the valid pandas string values: http://pandas.pydata.org/pandas-docs/stable/basics.html#dtypes.coerce (
bool) – whether or not to coerce all of the columns on validation. This overrides any coerce setting at the column or index level. This has no effect on columns wheredtype=None.strict (
Union[bool,Literal[‘filter’]]) – ensure that all and only the columns defined in the schema are present in the dataframe. If set to ‘filter’, only the columns in the schema will be passed to the validated dataframe. If set to filter and columns defined in the schema are not present in the dataframe, will throw an error.ordered (
bool) – whether or not to validate the columns order.unique (
Union[str,list[str],None]) – a list of columns that should be jointly unique.report_duplicates (
Union[Literal[‘exclude_first’],Literal[‘exclude_last’],Literal[‘all’]]) – how to report unique errors - exclude_first: report all duplicates except first occurrence - exclude_last: report all duplicates except last occurrence - all: (default) report all duplicatesunique_column_names (
bool) – whether or not column names must be unique.add_missing_columns (
bool) – add missing column names with either default value, if specified in column schema, or NaN if column is nullable.title (
UnionType[str,None]) – A human-readable label for the schema.description (
UnionType[str,None]) – An arbitrary textual description of the schema.metadata (
UnionType[dict,None]) – An optional key-value data.drop_invalid_rows (
bool) – if True, drop invalid rows on validation.
- Raises:
SchemaInitError – if impossible to build schema from parameters
- Examples:
>>> import pandera.pandas as pa >>> >>> schema = pa.DataFrameSchema({ ... "str_column": pa.Column(str), ... "float_column": pa.Column(float), ... "int_column": pa.Column(int), ... "date_column": pa.Column(pa.DateTime), ... })
Use the pandas API to define checks, which takes a function with the signature:
pd.Series -> Union[bool, pd.Series]where the output series contains boolean values.>>> schema_withchecks = pa.DataFrameSchema({ ... "probability": pa.Column( ... float, pa.Check(lambda s: (s >= 0) & (s <= 1))), ... ... # check that the "category" column contains a few discrete ... # values, and the majority of the entries are dogs. ... "category": pa.Column( ... str, [ ... pa.Check(lambda s: s.isin(["dog", "cat", "duck"])), ... pa.Check(lambda s: (s == "dog").mean() > 0.5), ... ]), ... })
See here for more usage details.
Attributes
BACKEND_REGISTRYcoerceWhether to coerce series to specified type.
dtypeGet the dtype property.
dtypesA dict where the keys are column names and values are
DataTypes for the column.propertiesGet the properties of the schema for serialization purposes.
uniqueList of columns that should be jointly unique.
Methods
- classmethod from_json(source)[source]¶
Load schema from JSON (see
pandera.io.pyspark_sql_io).- Return type:
Self
- classmethod from_yaml(yaml_schema)[source]¶
Load schema from YAML (see
pandera.io.pyspark_sql_io).- Return type:
Self
- static register_default_backends(check_obj_cls)[source]¶
Register default backends.
This method is invoked in the get_backend method so that the appropriate validation backend is loaded at validation time instead of schema-definition time.
This method needs to be implemented by the schema subclass.
- to_ddl()[source]¶
Recover fields of DataFrameSchema as a Pyspark DDL string.
- Return type:
- Returns:
String with current schema fields, in compact DDL format.
- to_json(target: None = None, *, minimal: bool = True, **kwargs) str[source]¶
- to_json(target: PathLike, *, minimal: bool = True, **kwargs) None
Write schema to JSON (see
pandera.io.pyspark_sql_io).
- to_structtype()[source]¶
Recover fields of DataFrameSchema as a Pyspark StructType object.
- As the output of this method will be used to specify a read schema in Pyspark
(avoiding automatic schema inference), the False nullable properties are just ignored, as this check will be executed by the Pandera validations after a dataset is read.
- Return type:
- Returns:
StructType object with current schema fields.
- to_yaml(stream=None, *, minimal=True)[source]¶
Write schema to YAML (see
pandera.io.pyspark_sql_io).
- validate(check_obj, head=None, tail=None, sample=None, random_state=None, lazy=True, inplace=False)[source]¶
Check if all columns in a dataframe have a column in the Schema.
- Parameters:
check_obj (~PySparkFrame) – DataFrame object i.e. the dataframe to be validated.
head (
UnionType[int,None]) – Not used since spark has no concept of head or tailtail (
UnionType[int,None]) – Not used since spark has no concept of head or tailsample (
UnionType[int,None]) – validate a random sample of n% rows. Value ranges from 0-1, for example 10% rows can be sampled using setting value as 0.1. refer below documentation. https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.sample.htmlrandom_state (
UnionType[int,None]) – random seed for thesampleargument.lazy (
bool) – if True, lazily evaluates dataframe against all validation checks and raises aSchemaErrors. Otherwise, raiseSchemaErroras soon as one occurs.inplace (
bool) – if True, applies coercion to the object of validation, otherwise creates a copy of the data.
- Return type:
~PySparkFrame
- Returns:
validated
DataFrame- Raises:
SchemaError – when
DataFrameviolates built-in or custom checks.- Example:
Calling
schema.validatereturns the dataframe.>>> import pandera.pyspark as psa >>> from pyspark.sql import SparkSession >>> import pyspark.sql.types as T >>> spark = SparkSession.builder.getOrCreate() >>> data = [("Bread", 9), ("Butter", 15)] >>> spark_schema = T.StructType( ... [ ... T.StructField("product", T.StringType(), False), ... T.StructField("price", T.IntegerType(), False), ... ], ... ) >>> df = spark.createDataFrame(data=data, schema=spark_schema) >>> schema_with_checks = psa.DataFrameSchema( ... columns={ ... "product": psa.Column("str", checks=psa.Check.str_startswith("B")), ... "price": psa.Column("int", checks=psa.Check.gt(5)), ... }, ... name="product_schema", ... description="schema for product info", ... title="ProductSchema", ... ) >>> schema_with_checks.validate(df).take(2)
- __call__(dataframe, head=None, tail=None, sample=None, random_state=None, lazy=True, inplace=False)[source]¶
Alias for
DataFrameSchema.validate()method.- Parameters:
dataframe (pd.DataFrame) – the dataframe to be validated.
head (int) – validate the first n rows. Rows overlapping with tail or sample are de-duplicated.
tail (int) – validate the last n rows. Rows overlapping with head or sample are de-duplicated.
sample (
UnionType[int,None]) – validate a random sample of n rows. Rows overlapping with head or tail are de-duplicated.random_state (
UnionType[int,None]) – random seed for thesampleargument.lazy (
bool) – if True, lazily evaluates dataframe against all validation checks and raises aSchemaErrors. Otherwise, raiseSchemaErroras soon as one occurs.inplace (
bool) – if True, applies coercion to the object of validation, otherwise creates a copy of the data.
- Return type:
~PySparkFrame