Data Format Conversion#

New in 0.9.0

The class-based API provides configuration options for converting data to and from supported serialization formats in the context of check_types()-decorated functions.

Note

Currently, pandera.typing.pandas.DataFrame is the only data type that supports this feature.

Consider this simple example:

import pandera as pa
from pandera.typing import DataFrame, Series

class InSchema(pa.DataFrameModel):
    str_col: Series[str] = pa.Field(unique=True, isin=[*"abcd"])
    int_col: Series[int]

class OutSchema(InSchema):
    float_col: pa.typing.Series[float]

@pa.check_types
def transform(df: DataFrame[InSchema]) -> DataFrame[OutSchema]:
    return df.assign(float_col=1.1)

With the schema type annotations and check_types() decorator, the transform function validates DataFrame inputs and outputs according to the InSchema and OutSchema definitions.

But what if your input data is serialized in parquet format, and you want to read it into memory, validate the DataFrame, and then pass it to a downstream function for further analysis? Similarly, what if you want the output of transform to be a list of dictionary records instead of a pandas DataFrame?

The to/from_format Configuration Options#

To easily fulfill the use cases described above, you can implement the read/write logic by hand, or you can configure schemas to do so. We can first define a subclass of InSchema with additional configuration so that our transform function can read data directly from parquet files or buffers:

class InSchemaParquet(InSchema):
    class Config:
        from_format = "parquet"

Then, we define a subclass of OutSchema to specify that transform should output a list of dictionaries representing the rows of the output dataframe:

class OutSchemaDict(OutSchema):
    class Config:
        to_format = "dict"
        to_format_kwargs = {"orient": "records"}

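Note that the {to/from}_format_kwargs configuration option should be supplied with a dictionary of keyword arguments, which is passed to the respective pandas reader or writer for that format (for example, pd.read_parquet for from_format = "parquet" and DataFrame.to_dict for to_format = "dict").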

Finally, we redefine our transform function:

@pa.check_types
def transform(df: DataFrame[InSchemaParquet]) -> DataFrame[OutSchemaDict]:
    return df.assign(float_col=1.1)

We can test this out using a buffer to store the parquet file.

Note

A string or path-like object representing the filepath to a parquet file would also be a valid input to transform.

import io
import json

import pandas as pd

buffer = io.BytesIO()
data = pd.DataFrame({"str_col": [*"abc"], "int_col": range(3)})
data.to_parquet(buffer)
buffer.seek(0)

dict_output = transform(buffer)
print(json.dumps(dict_output, indent=4))
[
    {
        "str_col": "a",
        "int_col": 0,
        "float_col": 1.1
    },
    {
        "str_col": "b",
        "int_col": 1,
        "float_col": 1.1
    },
    {
        "str_col": "c",
        "int_col": 2,
        "float_col": 1.1
    }
]

Custom Converters with Callables#

In addition to a literal string, from_format also accepts any callable that returns a pandas dataframe, for example pd.read_excel, pd.read_sql, or pd.read_gbq. Depending on the function passed, some of the keyword arguments in from_format_kwargs may be required rather than optional (for instance, pd.read_sql requires a connection object).

A callable can also be passed as the to_format option, together with the optional to_format_buffer option. Some pandas dataframe writing methods, such as pd.to_pickle, have a required path argument, which must be either a string file path or a writable binary file object. An example of writing data to a pickle file would be:

import tempfile

def custom_to_pickle(data, *args, **kwargs):
    return data.to_pickle(*args, **kwargs)

def custom_to_pickle_buffer():
    """Create a named temporary file handle to write the pickle file."""
    return tempfile.NamedTemporaryFile()

class OutSchemaPickleCallable(OutSchema):
    class Config:
        to_format = custom_to_pickle

        # If provided, the output of this function will be supplied as
        # the first positional argument to the ``to_format`` function.
        to_format_buffer = custom_to_pickle_buffer

In this example, we use a custom_to_pickle_buffer function as the to_format_buffer property, which returns a tempfile.NamedTemporaryFile(). This will be supplied as a positional argument to the custom_to_pickle function.

The supported formats, and the value to pass as the to_format or from_format argument for each, are:

Format      Argument
--------    ---------
dict        "dict"
csv         "csv"
json        "json"
feather     "feather"
parquet     "parquet"
pickle      "pickle"
Callable    Callable

Takeaway#

Data format conversion using the {to/from}_format configuration options modifies the behavior of check_types()-decorated functions so that input data is converted from a particular serialization format into a dataframe before validation. Likewise, the output dataframe can be converted into another format after validation.

This dovetails well with the FastAPI Integration for validating the inputs and outputs of app endpoints.