Preprocessing with Parsers
new in 0.19.0
Parsers allow you to do some custom preprocessing on dataframes, columns, and series objects before running the validation checks. This is useful when you want to normalize, clip, or otherwise clean data values before applying validation checks.
Important
This feature is only available in the pandas validation backend.
Parsing versus validation
Pandera distinguishes between data validation and parsing. Validation is the act of verifying whether data follows some set of constraints, whereas parsing transforms raw data so that it conforms to some desired set of constraints.
Pandera ships with a few core parsers that you may already be familiar with:
coerce=True will convert the datatypes of the incoming data to validate. This option is available in both DataFrameSchema and Column objects. See here for more details.
strict="filter" will remove columns in the data that are not specified in the DataFrameSchema. See here for more details.
add_missing_columns=True will add missing columns to the data if the Column is nullable or specifies a default value. See here.
The Parser abstraction allows you to specify any arbitrary transform that occurs before validation so that you can codify and standardize the preprocessing steps needed to get your raw data into a valid state.
Important
This feature is currently only supported with the pandas validation backend.
With parsers, you can codify and reuse preprocessing logic as part of the schema. Note that this feature is optional, meaning that you can always do preprocessing before calling schema.validate with the native dataframe API:
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema({"a": pa.Column(int, pa.Check.ge(0))})
data = pd.DataFrame({"a": [1, 2, -1]})
# clip negative values
data["a"] = data["a"].clip(lower=0)
schema.validate(data)
| | a |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 0 |
Let’s encode the preprocessing step as a parser:
schema = pa.DataFrameSchema({
    "a": pa.Column(
        int,
        parsers=pa.Parser(lambda s: s.clip(lower=0)),
        checks=pa.Check.ge(0),
    )
})
data = pd.DataFrame({"a": [1, 2, -1]})
schema.validate(data)
| | a |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 0 |
You can specify both dataframe- and column-level parsers, where dataframe-level parsers are performed before column-level parsers. Assuming that a schema contains parsers and checks, the validation process consists of the following steps:
dataframe-level parsing
column-level parsing
dataframe-level checks
column-level and index-level checks
Parsing columns
Parser objects accept a function as a required argument, which is expected to take a Series input and output a parsed Series, for example:
import numpy as np
schema = pa.DataFrameSchema({
    "sqrt_values": pa.Column(parsers=pa.Parser(lambda s: np.sqrt(s)))
})
schema.validate(pd.DataFrame({"sqrt_values": [1., 2., 3.]}))
| | sqrt_values |
|---|---|
| 0 | 1.000000 |
| 1 | 1.414214 |
| 2 | 1.732051 |
Multiple parsers can be applied to a column:
Important
The order of parsers is preserved at validation time.
schema = pa.DataFrameSchema({
    "string_numbers": pa.Column(
        str,
        parsers=[
            pa.Parser(lambda s: s.str.zfill(10)),
            pa.Parser(lambda s: s.str[2:]),
        ]
    ),
})
schema.validate(pd.DataFrame({"string_numbers": ["12345", "67890"]}))
| | string_numbers |
|---|---|
| 0 | 00012345 |
| 1 | 00067890 |
Parsing the dataframe
For any dataframe-wide preprocessing logic, you can specify the parsers kwarg in the DataFrameSchema object.
schema = pa.DataFrameSchema(
    parsers=pa.Parser(lambda df: df.transform("sqrt")),
    columns={
        "a": pa.Column(float),
        "b": pa.Column(float, parsers=pa.Parser(lambda s: s * -1)),
        "c": pa.Column(float, parsers=pa.Parser(lambda s: s + 1)),
    }
)
data = pd.DataFrame({
    "a": [2.0, 4.0, 9.0],
    "b": [2.0, 4.0, 9.0],
    "c": [2.0, 4.0, 9.0],
})
schema.validate(data)
| | a | b | c |
|---|---|---|---|
| 0 | 1.414214 | -1.414214 | 2.414214 |
| 1 | 2.000000 | -2.000000 | 3.000000 |
| 2 | 3.000000 | -3.000000 | 4.000000 |
Note
Similar to the column-level parsers, you can also provide a list of Parsers at the dataframe level.
Parsers in DataFrameModel
We can write a DataFrameModel that’s equivalent to the schema above with the parser() and dataframe_parser() decorators:
class DFModel(pa.DataFrameModel):
    a: float
    b: float
    c: float

    # dataframe-level parser: runs on the whole dataframe first
    @pa.dataframe_parser
    def sqrt(cls, df):
        return df.transform("sqrt")

    # column-level parsers: each runs on the named column
    @pa.parser("b")
    def negate(cls, series):
        return series * -1

    @pa.parser("c")
    def plus_one(cls, series):
        return series + 1