.. pandera documentation for synthesizing data
.. currentmodule:: pandera
.. _data synthesis strategies:
Data Synthesis Strategies
=========================
*new in 0.6.0*
``pandera`` provides a utility for generating synthetic data purely from
pandera schema or schema component objects. Under the hood, the schema metadata
is collected to create a data-generating strategy using
`hypothesis `__, which is a
property-based testing library.
Basic Usage
-----------
Once you've defined a schema, it's easy to generate examples:
.. testcode:: data_synthesis_strategies
:skipif: SKIP_STRATEGY
import pandera as pa
schema = pa.DataFrameSchema(
{
"column1": pa.Column(int, pa.Check.eq(10)),
"column2": pa.Column(float, pa.Check.eq(0.25)),
"column3": pa.Column(str, pa.Check.eq("foo")),
}
)
print(schema.example(size=3))
.. testoutput:: data_synthesis_strategies
:skipif: SKIP_STRATEGY
column1 column2 column3
0 10 0.25 foo
1 10 0.25 foo
2 10 0.25 foo
Note that here we've constrained the specific values in each column using
:class:`~pandera.api.checks.Check` s in order to make the data generation process
deterministic for documentation purposes.
Usage in Unit Tests
-------------------
The ``example`` method is available for all schemas and schema components, and
is primarily meant to be used interactively. It *could* be used in a script to
generate test cases, but ``hypothesis`` recommends against doing this and
instead using the ``strategy`` method to create a ``hypothesis`` strategy
that can be used in ``pytest`` unit tests.
.. testcode:: data_synthesis_strategies
:skipif: SKIP_STRATEGY
import hypothesis
def processing_fn(df):
return df.assign(column4=df.column1 * df.column2)
@hypothesis.given(schema.strategy(size=5))
def test_processing_fn(dataframe):
result = processing_fn(dataframe)
assert "column4" in result
The above example is trivial, but you get the idea! Schema objects can create
a ``strategy`` that can then be collected by a `pytest `__
runner. We could also run the tests explicitly ourselves, or run it as a
``unittest.TestCase``. For more information on testing with hypothesis, see the
`hypothesis quick start guide `__.
A more practical example involves using
:ref:`schema transformations`. We can modify
the function above to make sure that ``processing_fn`` actually outputs the
correct result:
.. testcode:: data_synthesis_strategies
:skipif: SKIP_STRATEGY
out_schema = schema.add_columns({"column4": pa.Column(float)})
@pa.check_output(out_schema)
def processing_fn(df):
return df.assign(column4=df.column1 * df.column2)
@hypothesis.given(schema.strategy(size=5))
def test_processing_fn(dataframe):
processing_fn(dataframe)
Now the ``test_processing_fn`` simply becomes an execution test, raising a
:class:`~pandera.errors.SchemaError` if ``processing_fn`` doesn't add
``column4`` to the dataframe.
Strategies and Examples from DataFrame Models
---------------------------------------------
You can also use the :ref:`class-based API` to generate examples.
Here's the equivalent dataframe model for the above examples:
.. testcode:: data_synthesis_strategies
:skipif: SKIP_STRATEGY
from pandera.typing import Series, DataFrame
class InSchema(pa.DataFrameModel):
column1: Series[int] = pa.Field(eq=10)
column2: Series[float] = pa.Field(eq=0.25)
column3: Series[str] = pa.Field(eq="foo")
class OutSchema(InSchema):
column4: Series[float]
@pa.check_types
def processing_fn(df: DataFrame[InSchema]) -> DataFrame[OutSchema]:
return df.assign(column4=df.column1 * df.column2)
@hypothesis.given(InSchema.strategy(size=5))
def test_processing_fn(dataframe):
processing_fn(dataframe)
Checks as Constraints
---------------------
As you may have noticed in the first example, :class:`~pandera.api.checks.Check` s
further constrain the data synthesized from a strategy. Without checks, the
``example`` method would simply generate any value of the specified type. You
can specify multiple checks on a column and ``pandera`` should be able to
generate valid data under those constraints.
.. testcode:: data_synthesis_strategies
:skipif: SKIP_STRATEGY
schema_multiple_checks = pa.DataFrameSchema({
"column1": pa.Column(
float, checks=[
pa.Check.gt(0),
pa.Check.lt(1e10),
pa.Check.notin([-100, -10, 0]),
]
)
})
for _ in range(5):
# generate 10 rows of the dataframe
sample_data = schema_multiple_checks.example(size=3)
# validate the sampled data
schema_multiple_checks(sample_data)
One caveat here is that it's up to you to define a set of checks that are
jointly satisfiable. If not, an ``Unsatisfiable`` exception will be raised:
.. testcode:: data_synthesis_strategies
:skipif: SKIP_STRATEGY
schema_multiple_checks = pa.DataFrameSchema({
"column1": pa.Column(
float, checks=[
# nonsensical constraints
pa.Check.gt(0),
pa.Check.lt(-10),
]
)
})
schema_multiple_checks.example(size=3)
.. testoutput:: data_synthesis_strategies
Traceback (most recent call last):
...
Unsatisfiable: Unable to satisfy assumptions of hypothesis example_generating_inner_function.
.. _check strategy chaining:
Check Strategy Chaining
~~~~~~~~~~~~~~~~~~~~~~~
If you specify multiple checks for a particular column, this is what happens
under the hood:
- The first check in the list is the *base strategy*, which ``hypothesis``
uses to generate data.
- All subsequent checks filter the values generated by the previous strategy
such that it fulfills the constraints of current check.
To optimize efficiency of the data-generation procedure, make sure to specify
the most restrictive constraint of a column as the *base strategy* and build
other constraints on top of it.
In-line Custom Checks
~~~~~~~~~~~~~~~~~~~~~
One of the strengths of ``pandera`` is its flexibility with regard to defining
custom checks on the fly:
.. testcode:: data_synthesis_strategies
:skipif: SKIP_STRATEGY
schema_inline_check = pa.DataFrameSchema({
"col": pa.Column(str, pa.Check(lambda s: s.isin({"foo", "bar"})))
})
One of the disadvantages of this is that the fallback strategy is to simply
apply the check to the generated data, which can be highly inefficient. In this
case, ``hypothesis`` will generate strings and try to find examples of strings
that are in the set ``{"foo", "bar"}``, which will be very slow and most likely
raise an ``Unsatisfiable`` exception. To get around this limitation, you can
register custom checks and define strategies that correspond to them.
.. _custom_strategies:
Defining Custom Strategies via the ``strategy`` kwarg
-----------------------------------------------------
The :class:`~pandera.api.checks.Check` constructor exposes a ``strategy``
keyword argument that allows you to define a data synthesis strategy that can
work as a *base strategy* or *chained strategy*. For example, suppose you define
a custom check that makes sure values in a column are in some specified range.
.. testcode:: data_synthesis_strategies
:skipif: SKIP_STRATEGY
check = pa.Check(lambda x: x.between(0, 100))
You can then define a strategy for this check with:
.. testcode:: data_synthesis_strategies
:skipif: SKIP_STRATEGY
def in_range_strategy(pandera_dtype, strategy=None):
if strategy is None:
# handle base strategy case
return st.floats(min_value=min_val, max_value=max_val).map(
# the map isn't strictly necessary, but shows an example of
# using the pandera_dtype argument
strategies.to_numpy_dtype(pandera_dtype).type
)
# handle chained strategy case
return strategy.filter(lambda val: 0 <= val <= 10)
check = pa.Check(lambda x: x.between(0, 100), strategy=in_range_strategy)
Notice that the ``in_range_strategy`` function takes two arguments: ``pandera_dtype``,
and ``strategy``. ``pandera_dtype`` is required, since this is almost always
required information when generating data. The ``strategy`` argument is optional,
where the default case assumes a *base strategy*, where the check is specified
as the first one in the list of checks specified at the column- or dataframe- level.
Defining Custom Strategies via Check Registration
-------------------------------------------------
All built-in :class:`~pandera.api.checks.Check` s are associated with a data
synthesis strategy. You can define your own data synthesis strategies by using
the :ref:`extensions API` to register a custom check function with
a corresponding strategy.