Data Format Conversion

new in 0.9.0

The class-based API provides configuration options for converting data to/from supported serialization formats in the context of check_types() -decorated functions.

Note

Currently, pandera.typing.pandas.DataFrame is the only data type that supports this feature.

Consider this simple example:

import pandera as pa
from pandera.typing import DataFrame, Series

class InSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True, isin=[*"abcd"])
    int_col: Series[int]

class OutSchema(InSchema):
    float_col: pa.typing.Series[float]

@pa.check_types
def transform(df: DataFrame[InSchema]) -> DataFrame[OutSchema]:
    return df.assign(float_col=1.1)

With the schema type annotations and check_types() decorator, the transform function validates DataFrame inputs and outputs according to the InSchema and OutSchema definitions.

But what if your input data is serialized in parquet format, and you want to read it into memory, validate the DataFrame, and then pass it to a downstream function for further analysis? Similarly, what if you want the output of transform to be a list of dictionary records instead of a pandas DataFrame?

The to/from_format Configuration Options

To easily fulfill the use cases described above, you can implement the read/write logic by hand, or you can configure schemas to do so. We can first define a subclass of InSchema with additional configuration so that our transform function can read data directly from parquet files or buffers:

class InSchemaParquet(InSchema):
    class Config:
        from_format = "parquet"

Then, we define subclass of OutSchema to specify that transform should output a list of dictionaries representing the rows of the output dataframe.

class OutSchemaDict(OutSchema):
    class Config:
        to_format = "dict"
        to_format_kwargs = {"orient": "records"}

Note that the {to/from}_format_kwargs configuration option should be supplied with a dictionary of key-word arguments to be passed into the respective pandas to_{format} method.

Finally, we redefine our transform function:

@pa.check_types
def transform(df: DataFrame[InSchemaParquet]) -> DataFrame[OutSchemaDict]:
    return df.assign(float_col=1.1)

We can test this out using a buffer to store the parquet file.

Note

A string or path-like object representing the filepath to a parquet file would also be a valid input to transform.

import io
import json

buffer = io.BytesIO()
data = pd.DataFrame({"str_col": [*"abc"], "int_col": range(3)})
data.to_parquet(buffer)
buffer.seek(0)

dict_output = transform(buffer)
print(json.dumps(dict_output, indent=4))
[
    {
        "str_col": "a",
        "int_col": 0,
        "float_col": 1.1
    },
    {
        "str_col": "b",
        "int_col": 1,
        "float_col": 1.1
    },
    {
        "str_col": "c",
        "int_col": 2,
        "float_col": 1.1
    }
]

Takeaway

Data Format Conversion using the {to/from}_format configuration option can modify the behavior of check_types() -decorated functions to convert input data from a particular serialization format into a dataframe. Additionally, you can convert the output data from a dataframe to potentially another format.

This dovetails well with the FastAPI Integration for validating the inputs and outputs of app endpoints.