pandera.api.pyspark.components.Column
- class pandera.api.pyspark.components.Column(dtype=None, checks=None, nullable=False, coerce=False, required=True, name=None, regex=False, title=None, description=None, metadata=None)
Validate types and properties of DataFrame columns.
Create column validator object.
- Parameters:
dtype (Union[str, int, float, bool, type, DataType, BooleanType, StringType, IntegerType, DecimalType, FloatType, DateType, TimestampType, DoubleType, ShortType, ByteType, LongType, BinaryType]) – datatype of the column, used for type-checking the dataframe. If a string is specified, it is assumed to be one of the valid pyspark string values: https://spark.apache.org/docs/latest/sql-ref-datatypes.html
checks (Union[Check, list[Check], None]) – checks to verify validity of the column.
nullable (bool) – whether or not the column can contain null values.
coerce (bool) – if True, when schema.validate is called the column will be coerced into the specified dtype. This has no effect on columns where dtype=None.
required (bool) – whether the column must be present in the dataframe. If False, the column is allowed to be missing.
name (Union[str, tuple[str, …], None]) – column name in dataframe to validate.
regex (bool) – whether the name attribute should be treated as a regex pattern to apply to multiple columns in a dataframe.
title (Optional[str]) – a human-readable label for the column.
description (Optional[str]) – an arbitrary textual description of the column.
- Raises:
SchemaInitError – if impossible to build schema from parameters
- Example:
>>> import pyspark as ps
>>> from pyspark.sql import SparkSession
>>> import pandera.pyspark as pa
>>>
>>>
>>> schema = pa.DataFrameSchema({
...     "column": pa.Column(str)
... })
>>> spark = SparkSession.builder.getOrCreate()
>>> schema.validate(spark.createDataFrame([{"column": "foo"},{ "column":"bar"}])).show()
+------+
|column|
+------+
|   foo|
|   bar|
+------+
See here for more usage details.
Attributes
BACKEND_REGISTRY
dtype – Get the pyspark dtype.
properties – Get column properties.
Methods
- __init__(dtype=None, checks=None, nullable=False, coerce=False, required=True, name=None, regex=False, title=None, description=None, metadata=None)
Create column validator object.
- Parameters:
dtype (Union[str, int, float, bool, type, DataType, BooleanType, StringType, IntegerType, DecimalType, FloatType, DateType, TimestampType, DoubleType, ShortType, ByteType, LongType, BinaryType]) – datatype of the column, used for type-checking the dataframe. If a string is specified, it is assumed to be one of the valid pyspark string values: https://spark.apache.org/docs/latest/sql-ref-datatypes.html
checks (Union[Check, list[Check], None]) – checks to verify validity of the column.
nullable (bool) – whether or not the column can contain null values.
coerce (bool) – if True, when schema.validate is called the column will be coerced into the specified dtype. This has no effect on columns where dtype=None.
required (bool) – whether the column must be present in the dataframe. If False, the column is allowed to be missing.
name (Union[str, tuple[str, …], None]) – column name in dataframe to validate.
regex (bool) – whether the name attribute should be treated as a regex pattern to apply to multiple columns in a dataframe.
title (Optional[str]) – a human-readable label for the column.
description (Optional[str]) – an arbitrary textual description of the column.
- Raises:
SchemaInitError – if impossible to build schema from parameters
- Example:
>>> import pyspark as ps
>>> from pyspark.sql import SparkSession
>>> import pandera.pyspark as pa
>>>
>>>
>>> schema = pa.DataFrameSchema({
...     "column": pa.Column(str)
... })
>>> spark = SparkSession.builder.getOrCreate()
>>> schema.validate(spark.createDataFrame([{"column": "foo"},{ "column":"bar"}])).show()
+------+
|column|
+------+
|   foo|
|   bar|
+------+
See here for more usage details.
- get_regex_columns(check_obj)
Get matching column names based on regex column name pattern.
- Parameters:
columns – columns to regex pattern match
- Return type:
- Returns:
matching columns
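The idea of selecting columns by a regex name pattern can be sketched in plain Python. This is only an illustration, not pandera's implementation: the helper name match_columns and the sample column names are hypothetical, and the sketch assumes the pattern is matched from the start of each column name, which may differ from the library's exact matching rules.

```python
import re


def match_columns(pattern: str, columns: list[str]) -> list[str]:
    """Return the column names that match ``pattern`` from their first character.

    Hypothetical helper for illustration; not part of pandera.
    """
    compiled = re.compile(pattern)
    # re.match anchors at the start of the string (an assumption of this sketch).
    return [name for name in columns if compiled.match(name)]


# Hypothetical column names, as they might appear in a dataframe.
columns = ["price_usd", "price_eur", "quantity"]
matched = match_columns(r"price_.+", columns)
print(matched)
```

A single Column declared with regex=True and a name pattern would then apply its dtype and checks to every matching column, rather than to one literal column name.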
- set_name(name)
Used to set or modify the name of a column object.
- Parameters:
name (str) – the name of the column object
- validate(check_obj, head=None, tail=None, sample=None, random_state=None, lazy=True, inplace=False, error_handler=None)
Validate a Column in a DataFrame object.
- Parameters:
check_obj (DataFrame) – pyspark DataFrame to validate.
head (Optional[int]) – validate the first n rows. Rows overlapping with tail or sample are de-duplicated.
tail (Optional[int]) – validate the last n rows. Rows overlapping with head or sample are de-duplicated.
sample (Optional[int]) – validate a random sample of fractional rows. Rows overlapping with head or tail are de-duplicated.
random_state (Optional[int]) – random seed for the sample argument.
lazy (bool) – if True, lazily evaluates dataframe against all validation checks and raises a SchemaErrors. Otherwise, raise SchemaError as soon as one occurs.
inplace (bool) – if True, applies coercion to the object of validation, otherwise creates a copy of the data.
error_handler (ErrorHandler) – pyspark error handler object to provide the error in a dictionary format.
- Return type:
- Returns:
validated DataFrame.
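The difference between lazy and eager validation described for the lazy argument can be illustrated with a small pure-Python sketch. Everything here is hypothetical and stands in for the library's behaviour: the exception classes FirstError and CollectedErrors, the two check functions, and run_checks are invented for illustration and are not pandera's API.

```python
class FirstError(Exception):
    """Hypothetical stand-in for an error raised on the first failure."""


class CollectedErrors(Exception):
    """Hypothetical stand-in for an error that aggregates all failures."""

    def __init__(self, failures):
        super().__init__(f"{len(failures)} check(s) failed")
        self.failures = failures


# Hypothetical checks: each returns a message on failure, or None on success.
def not_empty(value):
    return "empty value" if value == "" else None


def max_len_3(value):
    return "longer than 3 characters" if len(value) > 3 else None


def run_checks(values, checks, lazy):
    """Apply every check to every value, eagerly or lazily."""
    failures = []
    for value in values:
        for check in checks:
            message = check(value)
            if message is None:
                continue
            if not lazy:
                # Eager mode: stop at the first failing check.
                raise FirstError(f"{value!r}: {message}")
            failures.append(f"{value!r}: {message}")
    if failures:
        # Lazy mode: report every failure at once.
        raise CollectedErrors(failures)
    return values


values = ["foo", "", "barbaz"]
checks = [not_empty, max_len_3]

try:
    run_checks(values, checks, lazy=True)
except CollectedErrors as exc:
    lazy_failures = exc.failures

try:
    run_checks(values, checks, lazy=False)
except FirstError as exc:
    eager_message = str(exc)
```

In this sketch the lazy run reports both failing values together, while the eager run stops at the first one, which mirrors the trade-off between a complete error report and failing fast.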