pandera.api.pyspark.components.Column

class pandera.api.pyspark.components.Column(dtype=None, checks=None, nullable=False, coerce=False, required=True, name=None, regex=False, title=None, description=None, metadata=None)

Validate types and properties of DataFrame columns.

Create column validator object.

Parameters:

dtype (Union[str, int, float, bool, type, DataType, Type, BooleanType, StringType, IntegerType, DecimalType, FloatType, DateType, TimestampType, DoubleType, ShortType, ByteType, LongType, BinaryType]) – datatype of the column, used for type-checking the dataframe. If a string is specified, it is assumed to be one of the valid pyspark type strings: https://spark.apache.org/docs/latest/sql-ref-datatypes.html
checks (Union[Check, List[Check], None]) – checks to verify validity of the column
nullable (bool) – Whether or not column can contain null values.
coerce (bool) – If True, when schema.validate is called the column will be coerced into the specified dtype. This has no effect on columns where dtype=None.
required (bool) – Whether the column must be present in the dataframe. If False, the column is allowed to be missing.
name (Union[str, Tuple[str, …], None]) – column name in dataframe to validate.
regex (bool) – whether the name attribute should be treated as a regex pattern to apply to multiple columns in a dataframe.
title (Optional[str]) – A human-readable label for the column.
description (Optional[str]) – An arbitrary textual description of the column.
metadata (Optional[dict]) – Optional key-value data attached to the column.

Raises:

SchemaInitError – if the schema cannot be built from the given parameters

Example:

>>> import pyspark as ps
>>> from pyspark.sql import SparkSession
>>> import pandera.pyspark as pa
>>>
>>>
>>> schema = pa.DataFrameSchema({
...     "column": pa.Column(str)
... })
>>> spark = SparkSession.builder.getOrCreate()
>>> schema.validate(spark.createDataFrame([{"column": "foo"}, {"column": "bar"}])).show()
    +------+
    |column|
    +------+
    |   foo|
    |   bar|
    +------+

See the pandera PySpark SQL user guide for more usage details.

Attributes

`BACKEND_REGISTRY`
`dtype`	Get the pyspark dtype
`properties`	Get column properties.

Methods

__init__(dtype=None, checks=None, nullable=False, coerce=False, required=True, name=None, regex=False, title=None, description=None, metadata=None)

Create column validator object.

get_regex_columns(check_obj)

Get matching column names based on regex column name pattern.

Parameters:

check_obj – object whose column names are matched against the regex pattern.

Return type:

Iterable

Returns:

matching columns
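
The idea of selecting columns by a regex name pattern can be illustrated with a small standard-library sketch. This is not pandera's implementation: `match_columns` is a hypothetical helper, and it assumes matching is anchored at the start of each name (as with `re.match`); the library's backend may apply a different rule.

```python
import re
from typing import Iterable, List


def match_columns(pattern: str, columns: Iterable[str]) -> List[str]:
    """Return the column names that match ``pattern`` (hypothetical helper).

    Assumption: the pattern is anchored at the start of the name, as with
    ``re.match``; this is an illustration, not pandera's backend logic.
    """
    compiled = re.compile(pattern)
    return [name for name in columns if compiled.match(name)]


columns = ["price_usd", "price_eur", "product", "unit_price"]
print(match_columns(r"price_.+", columns))  # ['price_usd', 'price_eur']
```

With `regex=True`, a single `Column` definition would then apply to every column selected this way, here `price_usd` and `price_eur`.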

set_name(name)

Used to set or modify the name of a column object.

Parameters:

name (str) – the name of the column object

validate(check_obj, head=None, tail=None, sample=None, random_state=None, lazy=True, inplace=False, error_handler=None)

Validate a Column in a DataFrame object.

Parameters:

check_obj (DataFrame) – pyspark DataFrame to validate.
head (Optional[int]) – validate the first n rows. Rows overlapping with tail or sample are de-duplicated.
tail (Optional[int]) – validate the last n rows. Rows overlapping with head or sample are de-duplicated.
sample (Optional[int]) – validate a random sample of fractional rows. Rows overlapping with head or tail are de-duplicated.
random_state (Optional[int]) – random seed for the sample argument.
lazy (bool) – if True, evaluates the dataframe against all validation checks and raises SchemaErrors with every failure collected. Otherwise, raises SchemaError as soon as the first failure occurs.
inplace (bool) – if True, applies coercion to the object of validation, otherwise creates a copy of the data.
error_handler (ErrorHandler) – pyspark error handler object to provide the error in a dictionary format.

Return type:

DataFrame

Returns:

validated DataFrame.
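
The difference between the two `lazy` modes described above can be sketched in plain Python. Everything here is hypothetical and only illustrates the control flow: `SketchSchemaError`, `SketchSchemaErrors`, and `validate_rows` are stand-ins, not pandera classes or functions, and rows are modelled as dictionaries rather than a pyspark DataFrame.

```python
from typing import Any, Callable, Dict, List, Tuple


class SketchSchemaError(Exception):
    """Hypothetical stand-in for a single failed check (eager mode)."""


class SketchSchemaErrors(Exception):
    """Hypothetical stand-in for all failed checks collected in lazy mode."""

    def __init__(self, failures: List[str]) -> None:
        super().__init__(f"{len(failures)} check(s) failed: {failures}")
        self.failures = failures


Row = Dict[str, Any]
NamedCheck = Tuple[str, Callable[[Row], bool]]


def validate_rows(rows: List[Row], checks: List[NamedCheck], lazy: bool) -> List[Row]:
    """Run every check; collect failures if lazy, else stop at the first one."""
    failures: List[str] = []
    for name, check in checks:
        if all(check(row) for row in rows):
            continue
        if not lazy:
            # Eager mode: report the first failing check immediately.
            raise SketchSchemaError(name)
        failures.append(name)
    if failures:
        # Lazy mode: report all failing checks together.
        raise SketchSchemaErrors(failures)
    return rows


rows = [{"price": 3, "product": "apple"}, {"price": -1, "product": ""}]
checks = [
    ("price_positive", lambda row: row["price"] > 0),
    ("product_non_empty", lambda row: row["product"] != ""),
]

try:
    validate_rows(rows, checks, lazy=True)
except SketchSchemaErrors as exc:
    print(exc.failures)  # ['price_positive', 'product_non_empty']

try:
    validate_rows(rows, checks, lazy=False)
except SketchSchemaError as exc:
    print(exc)  # price_positive
```

Lazy mode is useful when a full report of problems is wanted in one pass; eager mode stops early and reports only the first failure.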

__call__(check_obj, head=None, tail=None, sample=None, random_state=None, lazy=False, inplace=False)

Alias for validate method.