Data Validation with Dask

new in 0.8.0

Dask is a distributed compute framework that offers a pandas-like dataframe API. You can use pandera to validate DataFrame() and Series() objects directly. First, install pandera with the dask extra:

pip install pandera[dask]

Then you can use pandera schemas to validate dask dataframes. In the example below we’ll use the class-based API to define a SchemaModel for validation.

import dask.dataframe as dd
import pandas as pd
import pandera as pa

from pandera.typing.dask import DataFrame, Series


class Schema(pa.SchemaModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


ddf = dd.from_pandas(
    pd.DataFrame(
        {
            'state': ['FL','FL','FL','CA','CA','CA'],
            'city': [
                'Orlando',
                'Miami',
                'Tampa',
                'San Francisco',
                'Los Angeles',
                'San Diego',
            ],
            'price': [8, 12, 10, 16, 20, 18],
        }
    ),
    npartitions=2
)
pandera_ddf = Schema(ddf)

print(pandera_ddf)
Dask DataFrame Structure:
                state    city  price
npartitions=2
0              object  object  int64
3                 ...     ...    ...
5                 ...     ...    ...
Dask Name: validate, 4 tasks

As you can see, passing the dask dataframe into Schema will produce another dask dataframe which hasn’t been evaluated yet. What this means is that pandera will only validate when the dask graph is evaluated.

print(pandera_ddf.compute())
  state           city  price
0    FL        Orlando      8
1    FL          Miami     12
2    FL          Tampa     10
3    CA  San Francisco     16
4    CA    Los Angeles     20
5    CA      San Diego     18

You can also use the check_types() decorator to validate dask dataframes at runtime:

@pa.check_types
def function(ddf: DataFrame[Schema]) -> DataFrame[Schema]:
    return ddf[ddf["state"] == "CA"]

print(function(ddf).compute())
  state           city  price
3    CA  San Francisco     16
4    CA    Los Angeles     20
5    CA      San Diego     18

And of course, you can use the object-based API to validate dask dataframes:

schema = pa.DataFrameSchema({
    "state": pa.Column(str),
    "city": pa.Column(str),
    "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20))
})
print(schema(ddf).compute())
  state           city  price
0    FL        Orlando      8
1    FL          Miami     12
2    FL          Tampa     10
3    CA  San Francisco     16
4    CA    Los Angeles     20
5    CA      San Diego     18