.. currentmodule:: pandera

.. _scaling_modin:

Data Validation with Modin
==========================

*new in 0.8.0*

Modin is a distributed compute framework that offers a pandas drop-in
replacement dataframe implementation. You can use pandera to validate
:py:func:`~modin.pandas.DataFrame` and :py:func:`~modin.pandas.Series`
objects directly. First, install ``pandera`` with the ``modin`` extra:

.. code:: bash

    pip install pandera[modin]  # installs both ray and dask backends
    pip install pandera[modin-ray]  # only ray backend
    pip install pandera[modin-dask]  # only dask backend

Then you can use pandera schemas to validate modin dataframes. In the example
below we'll use the class-based API to define a
:py:class:`~pandera.api.model.pandas.DataFrameModel` for validation.

.. testcode:: scaling_modin
    :skipif: SKIP_MODIN

    import modin.pandas as pd
    import pandera as pa

    from pandera.typing.modin import DataFrame, Series


    class Schema(pa.DataFrameModel):
        state: Series[str]
        city: Series[str]
        price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


    # create a modin dataframe that's validated on object initialization
    df = DataFrame[Schema](
        {
            'state': ['FL','FL','FL','CA','CA','CA'],
            'city': [
                'Orlando',
                'Miami',
                'Tampa',
                'San Francisco',
                'Los Angeles',
                'San Diego',
            ],
            'price': [8, 12, 10, 16, 20, 18],
        }
    )
    print(df)

.. testoutput:: scaling_modin
    :skipif: SKIP_MODIN

      state           city  price
    0    FL        Orlando      8
    1    FL          Miami     12
    2    FL          Tampa     10
    3    CA  San Francisco     16
    4    CA    Los Angeles     20
    5    CA      San Diego     18

You can also use the :py:func:`~pandera.check_types` decorator to validate
modin dataframes at runtime:

.. testcode:: scaling_modin
    :skipif: SKIP_MODIN

    @pa.check_types
    def function(df: DataFrame[Schema]) -> DataFrame[Schema]:
        return df[df["state"] == "CA"]

    print(function(df))

.. testoutput:: scaling_modin
    :skipif: SKIP_MODIN

      state           city  price
    3    CA  San Francisco     16
    4    CA    Los Angeles     20
    5    CA      San Diego     18

And of course, you can use the object-based API to validate modin dataframes:

.. testcode:: scaling_modin
    :skipif: SKIP_MODIN

    schema = pa.DataFrameSchema({
        "state": pa.Column(str),
        "city": pa.Column(str),
        "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20))
    })
    print(schema(df))

.. testoutput:: scaling_modin
    :skipif: SKIP_MODIN

      state           city  price
    0    FL        Orlando      8
    1    FL          Miami     12
    2    FL          Tampa     10
    3    CA  San Francisco     16
    4    CA    Los Angeles     20
    5    CA      San Diego     18