pandera.api.pyspark.components.Column

class pandera.api.pyspark.components.Column(dtype=None, checks=None, nullable=False, coerce=False, required=True, name=None, regex=False, title=None, description=None, metadata=None)

Validate types and properties of DataFrame columns.

Create column validator object.

Raises:

SchemaInitError – if impossible to build schema from parameters

Example:

>>> import pyspark as ps
>>> from pyspark.sql import SparkSession
>>> import pandera.pyspark as pa
>>>
>>>
>>> schema = pa.DataFrameSchema({
...     "column": pa.Column(str)
... })
>>> spark = SparkSession.builder.getOrCreate()
>>> schema.validate(spark.createDataFrame([{"column": "foo"},{ "column":"bar"}])).show()
    +------+
    |column|
    +------+
    |   foo|
    |   bar|
    +------+

Attributes

BACKEND_REGISTRY

dtype

Get the dtype property.

properties

Get column properties.

Methods

__init__(dtype=None, checks=None, nullable=False, coerce=False, required=True, name=None, regex=False, title=None, description=None, metadata=None)

Create column validator object.

Raises:

SchemaInitError – if impossible to build schema from parameters

get_regex_columns(check_obj)

Get matching column names based on regex column name pattern.

Parameters:

check_obj – object whose column names are matched against the regex pattern

Return type:

Iterable

Returns:

matching columns
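
As a rough illustration of regex-based column selection, the sketch below filters a list of column names with the standard `re` module. It is not pandera's implementation: the helper name `match_columns` is hypothetical, and the use of `re.fullmatch` (the whole name must match) is an assumption about the matching semantics.

```python
import re
from typing import Iterable, List


def match_columns(pattern: str, columns: Iterable[str]) -> List[str]:
    """Return the column names that fully match ``pattern`` (hypothetical helper)."""
    compiled = re.compile(pattern)
    # Assumption: the whole column name has to match, not just a prefix.
    return [name for name in columns if compiled.fullmatch(name)]


columns = ["price_usd", "price_eur", "quantity"]
matched = match_columns(r"price_.+", columns)
print(matched)
```

With `regex=True`, the column's `name` would play the role of `pattern` here, and the names would come from the DataFrame being validated.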

static register_default_backends(check_obj_cls)

Register default backends.

This method is invoked in the get_backend method so that the appropriate validation backend is loaded at validation time instead of schema-definition time.

This method needs to be implemented by the schema subclass.
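
The deferred-registration idea described above can be sketched in plain Python. Everything here is illustrative: `SketchSchema` and the identity backend are hypothetical stand-ins, and the only point is that the registry stays empty until `get_backend` is first called.

```python
from typing import Any, Callable, Dict


class SketchSchema:
    """Hypothetical schema class; not pandera's actual code."""

    # Maps a data-object class to the backend that validates it.
    BACKEND_REGISTRY: Dict[type, Callable[[Any], Any]] = {}

    @staticmethod
    def register_default_backends(check_obj_cls: type) -> None:
        # Register a trivial identity backend the first time a class is seen.
        SketchSchema.BACKEND_REGISTRY.setdefault(check_obj_cls, lambda obj: obj)

    def get_backend(self, check_obj: Any) -> Callable[[Any], Any]:
        check_obj_cls = type(check_obj)
        # Registration happens here, at validation time, not at definition time.
        self.register_default_backends(check_obj_cls)
        return self.BACKEND_REGISTRY[check_obj_cls]


schema = SketchSchema()
registered_before = dict(SketchSchema.BACKEND_REGISTRY)  # still empty
backend = schema.get_backend([1, 2, 3])                  # triggers registration
```

Deferring registration this way means defining a schema does not require the backend's dependencies to be loaded until something is actually validated.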

validate(check_obj, head=None, tail=None, sample=None, random_state=None, lazy=False, inplace=False)

Validate a Column in a DataFrame object.

Parameters:
  • check_obj (PySparkFrame) – pyspark DataFrame to validate.

  • head (Optional[int]) – validate the first n rows. Rows overlapping with tail or sample are de-duplicated.

  • tail (Optional[int]) – validate the last n rows. Rows overlapping with head or sample are de-duplicated.

  • sample (Optional[int]) – validate a random sample of fractional rows. Rows overlapping with head or tail are de-duplicated.

  • random_state (Optional[int]) – random seed for the sample argument.

  • lazy (bool) – if True, evaluates the dataframe against all validation checks and raises SchemaErrors. Otherwise, raises a SchemaError as soon as the first failure occurs.

  • inplace (bool) – if True, applies coercion to the object of validation, otherwise creates a copy of the data.

  • error_handler – pyspark error handler object to provide the error in a dictionary format.

Return type:

PySparkFrame

Returns:

validated DataFrame.
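
The difference between `lazy=False` and `lazy=True` described above can be sketched without Spark. The names below (`SketchSchemaError`, `SketchSchemaErrors`, `run_checks`) are hypothetical and only mirror the documented behaviour: fail on the first broken check, or collect every failure and report them together.

```python
from typing import Callable, List, Sequence, Tuple


class SketchSchemaError(Exception):
    """Stands in for a single validation failure."""


class SketchSchemaErrors(Exception):
    """Stands in for a collection of validation failures."""

    def __init__(self, failures: List[str]) -> None:
        super().__init__(f"{len(failures)} check(s) failed: {failures}")
        self.failures = failures


def run_checks(
    values: Sequence[str],
    checks: Sequence[Tuple[str, Callable[[str], bool]]],
    lazy: bool = False,
) -> Sequence[str]:
    failures: List[str] = []
    for name, check in checks:
        if all(check(value) for value in values):
            continue
        if not lazy:
            # Eager mode: stop at the first failing check.
            raise SketchSchemaError(name)
        failures.append(name)
    if failures:
        # Lazy mode: report every failing check at once.
        raise SketchSchemaErrors(failures)
    return values


checks = [
    ("is_lowercase", str.islower),
    ("is_short", lambda value: len(value) <= 3),
]
data = ["foo", "BAR", "bazqux"]

try:
    run_checks(data, checks, lazy=True)
except SketchSchemaErrors as exc:
    lazy_failures = exc.failures
```

Lazy mode is usually the more convenient choice when auditing a dataset, since one pass surfaces every problem instead of only the first.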

__call__(check_obj, head=None, tail=None, sample=None, random_state=None, lazy=False, inplace=False)

Alias for validate method.

Return type:

TDataObject
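
A minimal sketch of the alias pattern, using a hypothetical `SketchColumn` class and a plain dict in place of a DataFrame: calling the object directly simply forwards to `validate`.

```python
from typing import Any, Dict


class SketchColumn:
    """Hypothetical column validator; not pandera's actual class."""

    def __init__(self, name: str) -> None:
        self.name = name

    def validate(self, check_obj: Dict[str, Any]) -> Dict[str, Any]:
        # Toy rule: the named column must be present in the object.
        if self.name not in check_obj:
            raise KeyError(f"missing column: {self.name}")
        return check_obj

    def __call__(self, check_obj: Dict[str, Any]) -> Dict[str, Any]:
        # Alias: calling the instance is the same as calling validate().
        return self.validate(check_obj)


column = SketchColumn("column")
record = {"column": "foo"}
via_call = column(record)
via_validate = column.validate(record)
```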