pandera.api.pyspark.model.DataFrameModel¶
- class pandera.api.pyspark.model.DataFrameModel(*args, **kwargs)[source]¶
Definition of a
DataFrameSchema.new in 0.16.0
See the User Guide for more.
Check if all columns in a dataframe have a column in the Schema.
- Parameters:
check_obj – DataFrame object i.e. the dataframe to be validated.
head – Not used since spark has no concept of head or tail
tail – Not used since spark has no concept of head or tail
sample – validate a random sample of n% rows. Value ranges from 0-1, for example 10% rows can be sampled using setting value as 0.1. refer below documentation. https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.sample.html
random_state – random seed for the
sampleargument.lazy – if True, lazily evaluates dataframe against all validation checks and raises a
SchemaErrors. Otherwise, raiseSchemaErroras soon as one occurs.inplace – if True, applies coercion to the object of validation, otherwise creates a copy of the data.
- Returns:
validated
DataFrame- Raises:
SchemaError – when
DataFrameviolates built-in or custom checks.- Example:
Calling
schema.validatereturns the dataframe.>>> import pandera.pyspark as psa >>> from pyspark.sql import SparkSession >>> import pyspark.sql.types as T >>> spark = SparkSession.builder.getOrCreate() >>> >>> data = [("Bread", 9), ("Butter", 15)] >>> spark_schema = T.StructType( ... [ ... T.StructField("product", T.StringType(), False), ... T.StructField("price", T.IntegerType(), False), ... ], ... ) >>> df = spark.createDataFrame(data=data, schema=spark_schema) >>> >>> schema_withchecks = psa.DataFrameSchema( ... columns={ ... "product": psa.Column("str", checks=psa.Check.str_startswith("B")), ... "price": psa.Column("int", checks=psa.Check.gt(5)), ... }, ... name="product_schema", ... description="schema for product info", ... title="ProductSchema", ... ) >>> >>> schema_withchecks.validate(df).take(2) [Row(product='Bread', price=9), Row(product='Butter', price=15)]
Methods
- classmethod pydantic_validate(schema_model)[source]¶
Verify that the input is a compatible dataframe model.
- Return type:
- classmethod to_ddl()[source]¶
Recover fields of DataFrameModel as a PySpark DDL string.
- Return type:
- Returns:
String with current model fields, in compact DDL format.
- classmethod to_schema()[source]¶
Create
DataFrameSchemafrom theDataFrameModel.- Return type:
- classmethod to_structtype()[source]¶
Recover fields of DataFrameModel as a PySpark StructType object.
- Return type:
- Returns:
StructType object with current model fields.
- classmethod validate(check_obj, head=None, tail=None, sample=None, random_state=None, lazy=True, inplace=False)[source]¶
Check if all columns in a dataframe have a column in the Schema.
- Parameters:
check_obj (
DataFrame) – DataFrame object i.e. the dataframe to be validated.head (
UnionType[int,None]) – Not used since spark has no concept of head or tailtail (
UnionType[int,None]) – Not used since spark has no concept of head or tailsample (
UnionType[int,None]) – validate a random sample of n% rows. Value ranges from 0-1, for example 10% rows can be sampled using setting value as 0.1. refer below documentation. https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.sample.htmlrandom_state (
UnionType[int,None]) – random seed for thesampleargument.lazy (
bool) – if True, lazily evaluates dataframe against all validation checks and raises aSchemaErrors. Otherwise, raiseSchemaErroras soon as one occurs.inplace (
bool) – if True, applies coercion to the object of validation, otherwise creates a copy of the data.
- Return type:
DataFrame[~TDataFrameModel]- Returns:
validated
DataFrame- Raises:
SchemaError – when
DataFrameviolates built-in or custom checks.- Example:
Calling
schema.validatereturns the dataframe.>>> import pandera.pyspark as psa >>> from pyspark.sql import SparkSession >>> import pyspark.sql.types as T >>> spark = SparkSession.builder.getOrCreate() >>> >>> data = [("Bread", 9), ("Butter", 15)] >>> spark_schema = T.StructType( ... [ ... T.StructField("product", T.StringType(), False), ... T.StructField("price", T.IntegerType(), False), ... ], ... ) >>> df = spark.createDataFrame(data=data, schema=spark_schema) >>> >>> schema_withchecks = psa.DataFrameSchema( ... columns={ ... "product": psa.Column("str", checks=psa.Check.str_startswith("B")), ... "price": psa.Column("int", checks=psa.Check.gt(5)), ... }, ... name="product_schema", ... description="schema for product info", ... title="ProductSchema", ... ) >>> >>> schema_withchecks.validate(df).take(2) [Row(product='Bread', price=9), Row(product='Butter', price=15)]