Data Synthesis Strategies¶
new in 0.6.0
pandera provides a utility for generating synthetic data purely from
pandera schema or schema component objects. Under the hood, the schema metadata
is collected to create a data-generating strategy using hypothesis, which is a
property-based testing library.
Basic Usage¶
Once you’ve defined a schema, it’s easy to generate examples:
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "column1": pa.Column(int, pa.Check.eq(10)),
        "column2": pa.Column(float, pa.Check.eq(0.25)),
        "column3": pa.Column(str, pa.Check.eq("foo")),
    }
)
schema.example(size=3)
| | column1 | column2 | column3 |
|---|---|---|---|
| 0 | 10 | 0.25 | foo |
| 1 | 10 | 0.25 | foo |
| 2 | 10 | 0.25 | foo |
Note that here we’ve constrained the specific values in each column using
Checks in order to make the data generation process deterministic for
documentation purposes.
Usage in Unit Tests¶
The example method is available for all schemas and schema components, and
is primarily meant to be used interactively. It could be used in a script to
generate test cases, but hypothesis recommends against this. Instead, use the
strategy method to create a hypothesis strategy that can be used in pytest
unit tests.
import hypothesis

def processing_fn(df):
    return df.assign(column4=df.column1 * df.column2)

@hypothesis.given(schema.strategy(size=5))
def test_processing_fn(dataframe):
    result = processing_fn(dataframe)
    assert "column4" in result
The above example is trivial, but you get the idea! Schema objects can create
a strategy that can then be collected by a pytest runner. We could also run
the tests explicitly ourselves, or run them as a unittest.TestCase. For more
information on testing with hypothesis, see the hypothesis quick start guide.
A more practical example involves using schema transformations. We can modify
the function above to make sure that processing_fn actually outputs the
correct result:
out_schema = schema.add_columns({"column4": pa.Column(float)})

@pa.check_output(out_schema)
def processing_fn(df):
    return df.assign(column4=df.column1 * df.column2)

@hypothesis.given(schema.strategy(size=5))
def test_processing_fn(dataframe):
    processing_fn(dataframe)
Now test_processing_fn simply becomes an execution test, raising a
SchemaError if processing_fn doesn’t add column4 to the dataframe.
Strategies and Examples from DataFrame Models¶
You can also use the class-based API to generate examples. Here’s the equivalent dataframe model for the above examples:
from pandera.typing import Series, DataFrame

class InSchema(pa.DataFrameModel):
    column1: Series[int] = pa.Field(eq=10)
    column2: Series[float] = pa.Field(eq=0.25)
    column3: Series[str] = pa.Field(eq="foo")

class OutSchema(InSchema):
    column4: Series[float]

@pa.check_types
def processing_fn(df: DataFrame[InSchema]) -> DataFrame[OutSchema]:
    return df.assign(column4=df.column1 * df.column2)

@hypothesis.given(InSchema.strategy(size=5))
def test_processing_fn(dataframe):
    processing_fn(dataframe)
Checks as Constraints¶
As you may have noticed in the first example, Checks further constrain the
data synthesized from a strategy. Without checks, the example method would
simply generate any value of the specified type. You can specify multiple
checks on a column and pandera should be able to generate valid data under
those constraints.
schema_multiple_checks = pa.DataFrameSchema({
    "column1": pa.Column(
        float, checks=[
            pa.Check.gt(0),
            pa.Check.lt(1e10),
            pa.Check.notin([-100, -10, 0]),
        ]
    )
})

for _ in range(5):
    # generate 3 rows of the dataframe
    sample_data = schema_multiple_checks.example(size=3)
    # validate the sampled data
    schema_multiple_checks(sample_data)
One caveat here is that it’s up to you to define a set of checks that are
jointly satisfiable. If not, an Unsatisfiable exception will be raised:
import hypothesis

schema_multiple_checks = pa.DataFrameSchema({
    "column1": pa.Column(
        float, checks=[
            # nonsensical constraints
            pa.Check.gt(0),
            pa.Check.lt(-10),
        ]
    )
})

schema_multiple_checks.example(size=3)
---------------------------------------------------------------------------
Unsatisfiable Traceback (most recent call last)
Cell In[6], line 13
1 import hypothesis
3 schema_multiple_checks = pa.DataFrameSchema({
4 "column1": pa.Column(
5 float, checks=[
(...)
10 )
11 })
---> 13 schema_multiple_checks.example(size=3)
File ~/checkouts/readthedocs.org/user_builds/pandera/conda/latest/lib/python3.11/site-packages/pandera/api/pandas/container.py:228, in DataFrameSchema.example(self, size, n_regex_columns)
221 with warnings.catch_warnings():
222 warnings.simplefilter(
223 "ignore",
224 category=hypothesis.errors.NonInteractiveExampleWarning,
225 )
226 return self.strategy(
227 size=size, n_regex_columns=n_regex_columns
--> 228 ).example()
File ~/checkouts/readthedocs.org/user_builds/pandera/conda/latest/lib/python3.11/site-packages/hypothesis/strategies/_internal/strategies.py:348, in SearchStrategy.example(self)
336 @given(self)
337 @settings(
338 database=None,
(...)
344 )
345 def example_generating_inner_function(ex):
346 self.__examples.append(ex)
--> 348 example_generating_inner_function()
349 shuffle(self.__examples)
350 return self.__examples.pop()
File ~/checkouts/readthedocs.org/user_builds/pandera/conda/latest/lib/python3.11/site-packages/hypothesis/strategies/_internal/strategies.py:337, in SearchStrategy.example.<locals>.example_generating_inner_function()
332 from hypothesis.core import given
334 # Note: this function has a weird name because it might appear in
335 # tracebacks, and we want users to know that they can ignore it.
336 @given(self)
--> 337 @settings(
338 database=None,
339 max_examples=100,
340 deadline=None,
341 verbosity=Verbosity.quiet,
342 phases=(Phase.generate,),
343 suppress_health_check=list(HealthCheck),
344 )
345 def example_generating_inner_function(ex):
346 self.__examples.append(ex)
348 example_generating_inner_function()
[... skipping hidden 1 frame]
File ~/checkouts/readthedocs.org/user_builds/pandera/conda/latest/lib/python3.11/site-packages/hypothesis/core.py:1207, in StateForActualGivenExecution.run_engine(self)
1205 if runner.valid_examples == 0:
1206 rep = get_pretty_function_description(self.test)
-> 1207 raise Unsatisfiable(f"Unable to satisfy assumptions of {rep}")
1209 # If we have not traced executions, warn about that now (but only when
1210 # we'd expect to do so reliably, i.e. on CPython>=3.12)
1211 if (
1212 sys.version_info[:2] >= (3, 12)
1213 and not PYPY
(...)
1216 ): # pragma: no cover
1217 # actually covered by our tests, but only on >= 3.12
Unsatisfiable: Unable to satisfy assumptions of example_generating_inner_function
Check Strategy Chaining¶
If you specify multiple checks for a particular column, this is what happens under the hood:
- The first check in the list is the base strategy, which hypothesis uses to generate data.
- All subsequent checks filter the values generated by the previous strategy so that they fulfill the constraints of the current check.
To optimize the efficiency of the data-generation procedure, make sure to specify the most restrictive constraint of a column as the base strategy and build other constraints on top of it.
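The effect of check ordering can be illustrated with a rough pure-Python analogy. This is not how pandera or hypothesis are implemented: it only mimics "draw from a base generator, then filter" with the standard library, and the helper acceptance_rate and the chosen ranges are made up for the illustration.

```python
import random


def acceptance_rate(draw, keep, n=10_000, seed=0):
    """Fraction of values produced by `draw` that survive the `keep` filter."""
    rng = random.Random(seed)
    kept = sum(1 for _ in range(n) if keep(draw(rng)))
    return kept / n


# Two constraints on an integer: 0 <= x <= 1_000_000 and 10 <= x <= 20.

# Loose constraint as the base, restrictive constraint as the filter:
# almost every generated value is thrown away.
loose_base = acceptance_rate(
    draw=lambda rng: rng.randint(0, 1_000_000),
    keep=lambda x: 10 <= x <= 20,
)

# Restrictive constraint as the base, loose constraint as the filter:
# every generated value is kept.
restrictive_base = acceptance_rate(
    draw=lambda rng: rng.randint(10, 20),
    keep=lambda x: 0 <= x <= 1_000_000,
)

print(f"loose base: {loose_base:.4f}, restrictive base: {restrictive_base:.4f}")
```

The same reasoning applies to a list of checks: the further the base strategy already narrows the values, the less work the downstream filters have to reject.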
In-line Custom Checks¶
One of the strengths of pandera is its flexibility with regard to defining
custom checks on the fly:
schema_inline_check = pa.DataFrameSchema({
    "col": pa.Column(str, pa.Check(lambda s: s.isin({"foo", "bar"})))
})
One of the disadvantages of this is that the fallback strategy is to simply
apply the check to the generated data, which can be highly inefficient. In this
case, hypothesis will generate strings and try to find examples of strings
that are in the set {"foo", "bar"}, which will be very slow and most likely
raise an Unsatisfiable exception. To get around this limitation, you can
register custom checks and define strategies that correspond to them.
Defining Custom Strategies via the strategy kwarg¶
The Check constructor exposes a strategy keyword argument that allows you to
define a data synthesis strategy that can work as a base strategy or chained
strategy. For example, suppose you define a custom check that makes sure
values in a column are in some specified range.
check = pa.Check(lambda x: x.between(0, 100))
You can then define a strategy for this check with:
import hypothesis.strategies as st
from pandera import strategies  # assumed source of to_numpy_dtype

def in_range_strategy(pandera_dtype, strategy=None):
    if strategy is None:
        # handle base strategy case
        return st.floats(min_value=0, max_value=100).map(
            # the map isn't strictly necessary, but shows an example of
            # using the pandera_dtype argument
            strategies.to_numpy_dtype(pandera_dtype).type
        )
    # handle chained strategy case: keep the bounds in sync with the check
    return strategy.filter(lambda val: 0 <= val <= 100)

check = pa.Check(lambda x: x.between(0, 100), strategy=in_range_strategy)
Notice that the in_range_strategy function takes two arguments: pandera_dtype
and strategy. pandera_dtype is required, since the data type is almost always
needed when generating data. The strategy argument is optional: when it is
None, the check acts as the base strategy, meaning it is the first one in the
list of checks specified at the column or dataframe level. Otherwise, it
receives the strategy built from the preceding checks and chains onto it.
Defining Custom Strategies via Check Registration¶
All built-in Checks are associated with a data synthesis strategy. You can
define your own data synthesis strategies by using the extensions API to
register a custom check function with a corresponding strategy.