Data Validation with Dask

new in 0.8.0

Dask is a distributed compute framework that offers a pandas-like dataframe API. You can use pandera to validate dask DataFrame and Series objects directly. First, install pandera with the dask extra:

pip install 'pandera[dask]'

Then you can use pandera schemas to validate dask dataframes. In the example below we’ll use the class-based API to define a DataFrameModel for validation.

import dask.dataframe as dd
import pandas as pd
import pandera as pa

from pandera.typing.dask import DataFrame, Series


class Schema(pa.DataFrameModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


ddf = dd.from_pandas(
    pd.DataFrame(
        {
            'state': ['FL', 'FL', 'FL', 'CA', 'CA', 'CA'],
            'city': [
                'Orlando',
                'Miami',
                'Tampa',
                'San Francisco',
                'Los Angeles',
                'San Diego',
            ],
            'price': [8, 12, 10, 16, 20, 18],
        }
    ),
    npartitions=2
)
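# Calling the schema on a dask dataframe returns a new dask dataframe
# with a validation step added to its graph; nothing is computed yet.
pandera_ddf = Schema(ddf)
pandera_ddf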
Dask DataFrame Structure:
                state    city  price
npartitions=2
0              string  string  int64
3                 ...     ...    ...
5                 ...     ...    ...
Dask Name: validate, 2 expressions

As you can see, passing the dask dataframe into Schema produces another dask dataframe that hasn’t been evaluated yet. In other words, validation is lazy: pandera runs its checks only when the dask graph is evaluated, for example by calling compute().

pandera_ddf.compute()
  state           city  price
0    FL        Orlando      8
1    FL          Miami     12
2    FL          Tampa     10
3    CA  San Francisco     16
4    CA    Los Angeles     20
5    CA      San Diego     18

You can also use the check_types() decorator to validate dask dataframes at runtime:

@pa.check_types
def function(ddf: DataFrame[Schema]) -> DataFrame[Schema]:
    return ddf[ddf["state"] == "CA"]

function(ddf).compute()
  state           city  price
3    CA  San Francisco     16
4    CA    Los Angeles     20
5    CA      San Diego     18

And of course, you can use the object-based API to validate dask dataframes:

schema = pa.DataFrameSchema({
    "state": pa.Column(str),
    "city": pa.Column(str),
    "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20))
})
schema(ddf).compute()
  state           city  price
0    FL        Orlando      8
1    FL          Miami     12
2    FL          Tampa     10
3    CA  San Francisco     16
4    CA    Los Angeles     20
5    CA      San Diego     18