Data Validation with Dask
new in 0.8.0
Dask is a distributed compute framework that offers a pandas-like dataframe API. You can use pandera to validate dask DataFrame() and Series() objects directly. First, install pandera with the dask extra:
pip install 'pandera[dask]'
Then you can use pandera schemas to validate dask dataframes. In the example below we’ll use the class-based API to define a DataFrameModel for validation.
import dask.dataframe as dd
import pandas as pd
import pandera as pa

from pandera.typing.dask import DataFrame, Series


class Schema(pa.DataFrameModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


ddf = dd.from_pandas(
    pd.DataFrame(
        {
            'state': ['FL','FL','FL','CA','CA','CA'],
            'city': [
                'Orlando',
                'Miami',
                'Tampa',
                'San Francisco',
                'Los Angeles',
                'San Diego',
            ],
            'price': [8, 12, 10, 16, 20, 18],
        }
    ),
    npartitions=2
)
pandera_ddf = Schema(ddf)
pandera_ddf
Dask DataFrame Structure:

| | state | city | price |
|---|---|---|---|
| npartitions=2 | | | |
| 0 | string | string | int64 |
| 3 | ... | ... | ... |
| 5 | ... | ... | ... |

Dask Name: validate, 2 expressions
As you can see, passing the dask dataframe into Schema produces another dask dataframe that hasn’t been evaluated yet. This means that pandera only validates the data when the dask graph is evaluated.
pandera_ddf.compute()
| | state | city | price |
|---|---|---|---|
| 0 | FL | Orlando | 8 |
| 1 | FL | Miami | 12 |
| 2 | FL | Tampa | 10 |
| 3 | CA | San Francisco | 16 |
| 4 | CA | Los Angeles | 20 |
| 5 | CA | San Diego | 18 |
You can also use the check_types() decorator to validate dask dataframes at runtime:
@pa.check_types
def function(ddf: DataFrame[Schema]) -> DataFrame[Schema]:
    return ddf[ddf["state"] == "CA"]

function(ddf).compute()
| | state | city | price |
|---|---|---|---|
| 3 | CA | San Francisco | 16 |
| 4 | CA | Los Angeles | 20 |
| 5 | CA | San Diego | 18 |
And of course, you can use the object-based API to validate dask dataframes:
schema = pa.DataFrameSchema({
    "state": pa.Column(str),
    "city": pa.Column(str),
    "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20))
})
schema(ddf).compute()
| | state | city | price |
|---|---|---|---|
| 0 | FL | Orlando | 8 |
| 1 | FL | Miami | 12 |
| 2 | FL | Tampa | 10 |
| 3 | CA | San Francisco | 16 |
| 4 | CA | Los Angeles | 20 |
| 5 | CA | San Diego | 18 |