Data Validation with Ibis¶
new in 0.25.0
Ibis is an open-source dataframe library that works with any data system. You can use the same API for 20 backends, from fast local engines like DuckDB, Polars, and DataFusion to distributed data systems like BigQuery, Snowflake, and Databricks.
Usage¶
With the Ibis integration, you can define Pandera schemas to validate Ibis
tables in Python. First, install pandera with the ibis extra alongside
the Ibis backend that you’re using:
pip install 'pandera[ibis]' 'ibis-framework[duckdb]'
Note
You can find the command to install the Ibis backend of your choice on the Installation page of the Ibis documentation.
Then, you can start validating Ibis tables using Pandera schemas. In the example
below, we’ll use the class-based API to define a
DataFrameModel, which we’ll then use to
validate an ibis.Table object.
import ibis
import pandera.ibis as pa
class Schema(pa.DataFrameModel):
state: str
city: str
price: int = pa.Field(in_range={"min_value": 5, "max_value": 20})
t = ibis.memtable(
{
'state': ['FL','FL','FL','CA','CA','CA'],
'city': [
'Orlando',
'Miami',
'Tampa',
'San Francisco',
'Los Angeles',
'San Diego',
],
'price': [8, 12, 10, 16, 20, 18],
}
)
Schema.validate(t).execute()
| state | city | price | |
|---|---|---|---|
| 0 | FL | Orlando | 8 |
| 1 | FL | Miami | 12 |
| 2 | FL | Tampa | 10 |
| 3 | CA | San Francisco | 16 |
| 4 | CA | Los Angeles | 20 |
| 5 | CA | San Diego | 18 |
You can also use the check_types() decorator to
validate Ibis table function annotations at runtime:
from pandera.typing.ibis import Table
@pa.check_types
def function(t: Table[Schema]) -> Table[Schema]:
return t.filter(t.state == "CA")
function(t).execute()
| state | city | price | |
|---|---|---|---|
| 0 | CA | San Francisco | 16 |
| 1 | CA | Los Angeles | 20 |
| 2 | CA | San Diego | 18 |
And of course, you can use the object-based API to define a
DataFrameSchema:
schema = pa.DataFrameSchema({
"state": pa.Column(str),
"city": pa.Column(str),
"price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20))
})
schema.validate(t).execute()
| state | city | price | |
|---|---|---|---|
| 0 | FL | Orlando | 8 |
| 1 | FL | Miami | 12 |
| 2 | FL | Tampa | 10 |
| 3 | CA | San Francisco | 16 |
| 4 | CA | Los Angeles | 20 |
| 5 | CA | San Diego | 18 |
Synthesizing data for testing¶
Warning
The Data Synthesis Strategies functionality is not yet supported in
the Ibis integration. At this time, you can use other frameworks to generate
test data for Ibis. For example, you can use the polars-native
parametric testing
functions to producing Polars DataFrames or LazyFrames, from which you can
construct ibis.memtables.
How it works¶
Compared to the way pandera handles pandas dataframes, the Ibis backend for
pandera leverages the fact that Ibis tables are lazy.
At a high level, this is what happens during schema validation:
Apply parsers: add missing columns if
add_missing_columns=True, coerce the datatypes ifcoerce=True, filter columns ifstrict="filter", and set defaults ifdefault=<value>.Apply checks: run all core, built-in, and custom checks on the data. Checks on metadata are done without
.execute()operations, but checks that inspect data values do.Raise an error: if data errors are found, a
SchemaErroris raised. Ifvalidate(..., lazy=True), aSchemaErrorsexception is raised with all of the validation errors present in the data.Return validated output: if no data errors are found, the validated object is returned.
pandera’s validation behavior aligns with the way ibis handles lazy
vs. eager operations. When you call schema.validate() on an Ibis table,
pandera will apply all of the parsers and checks that can be done without
any execute() operations. This means that it only does validations
at the schema-level, e.g. column names and data types.
Method Chaining¶
Using DataFrameSchema¶
import ibis
import pandera.ibis as pa
schema = pa.DataFrameSchema({"a": pa.Column(int)})
df = (
ibis.memtable({"a": [1.0, 2.0, 3.0]})
.cast({"a": "int64"})
.pipe(schema.validate) # this validates schema- and data-level properties
.mutate(b=ibis.literal("a"))
# do more lazy operations
.execute()
)
print(df)
a b
0 1 a
1 2 a
2 3 a
Using DataFrameModel¶
import ibis
import pandera.ibis as pa
class SimpleModel(pa.DataFrameModel):
a: int
df = (
ibis.memtable({"a": [1.0, 2.0, 3.0]})
.cast({"a": "int64"})
.pipe(SimpleModel.validate) # this validates schema- and data-level properties
.mutate(b=ibis.literal("a"))
# do more lazy operations
.execute()
)
print(df)
a b
0 1 a
1 2 a
2 3 a
Error Reporting¶
In the event of a validation error, pandera will raise a SchemaError
eagerly.
class SimpleModel(pa.DataFrameModel):
a: int
invalid_t = ibis.memtable({"a": ["1", "2", "3"]})
try:
SimpleModel.validate(invalid_t)
except pa.errors.SchemaError as exc:
print(exc)
expected column 'a' to have type int64, got string
And if you use lazy validation, pandera will raise a SchemaErrors
exception. This is particularly useful when you want to collect all of the validation errors
present in the data.
By default, Pandera will validate both schema- and data-level properties:
class ModelWithChecks(pa.DataFrameModel):
a: int
b: str = pa.Field(isin=[*"abc"])
c: float = pa.Field(ge=0.0, le=1.0)
invalid_t = ibis.memtable({
"a": ["1", "2", "3"],
"b": ["d", "e", "f"],
"c": [0.0, 1.1, -0.1],
})
ModelWithChecks.validate(invalid_t, lazy=True)
---------------------------------------------------------------------------
SchemaErrors Traceback (most recent call last)
Cell In[7], line 11
4 c: float = pa.Field(ge=0.0, le=1.0)
6 invalid_t = ibis.memtable({
7 "a": ["1", "2", "3"],
8 "b": ["d", "e", "f"],
9 "c": [0.0, 1.1, -0.1],
10 })
---> 11 ModelWithChecks.validate(invalid_t, lazy=True)
File ~/checkouts/readthedocs.org/user_builds/pandera/checkouts/v0.29.0/pandera/api/ibis/model.py:132, in DataFrameModel.validate(cls, check_obj, head, tail, sample, random_state, lazy, inplace)
119 @classmethod
120 @docstring_substitution(validate_doc=BaseSchema.validate.__doc__)
121 def validate(
(...) 129 inplace: bool = False,
130 ) -> Table[Self]:
131 """%(validate_doc)s"""
--> 132 result = cls.to_schema().validate(
133 check_obj, head, tail, sample, random_state, lazy, inplace
134 )
135 return cast(Table[Self], result)
File ~/checkouts/readthedocs.org/user_builds/pandera/checkouts/v0.29.0/pandera/api/ibis/container.py:87, in DataFrameSchema.validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
20 def validate(
21 self,
22 check_obj: ibis.Table,
(...) 28 inplace: bool = False,
29 ) -> ibis.Table:
30 """Validate an Ibis table against the schema.
31
32 :param ibis.Table check_obj: the table to be validated.
(...) 84 5 0.76 dog
85 """
---> 87 return self.get_backend(check_obj).validate(
88 check_obj=check_obj,
89 schema=self,
90 head=head,
91 tail=tail,
92 sample=sample,
93 random_state=random_state,
94 lazy=lazy,
95 inplace=inplace,
96 )
File ~/checkouts/readthedocs.org/user_builds/pandera/checkouts/v0.29.0/pandera/backends/ibis/container.py:127, in DataFrameSchemaBackend.validate(self, check_obj, schema, head, tail, sample, random_state, lazy, inplace)
125 return check_obj
126 else:
--> 127 raise SchemaErrors(
128 schema=schema,
129 schema_errors=error_handler.schema_errors,
130 data=check_obj,
131 )
133 return check_obj
SchemaErrors: {
"SCHEMA": {
"WRONG_DATATYPE": [
{
"schema": "ModelWithChecks",
"column": "a",
"check": "dtype('int64')",
"error": "expected column 'a' to have type int64, got string"
}
]
},
"DATA": {
"DATAFRAME_CHECK": [
{
"schema": "ModelWithChecks",
"column": "b",
"check": "isin(['a', 'b', 'c'])",
"error": "Column 'b' failed element-wise validator number 0: isin(['a', 'b', 'c']) failure cases: {'b': 'd'}, {'b': 'e'}, {'b': 'f'}"
},
{
"schema": "ModelWithChecks",
"column": "c",
"check": "greater_than_or_equal_to(0.0)",
"error": "Column 'c' failed element-wise validator number 0: greater_than_or_equal_to(0.0) failure cases: {'c': -0.1}"
},
{
"schema": "ModelWithChecks",
"column": "c",
"check": "less_than_or_equal_to(1.0)",
"error": "Column 'c' failed element-wise validator number 1: less_than_or_equal_to(1.0) failure cases: {'c': 1.1}"
}
]
}
}
Supported Data Types¶
pandera currently supports most of the
Ibis data types.
Built-in Python types like str, int, float, and bool will be
handled in the same way that Ibis handles them:
schema1 = ibis.schema({"x": int, "y": str, "z": float})
schema2 = ibis.schema({"x": "int64", "y": "string", "z": "float64"})
assert schema1 == schema2
So the following schemas are equivalent:
schema1 = pa.DataFrameSchema({
"a": pa.Column(int),
"b": pa.Column(str),
"c": pa.Column(float),
})
schema2 = pa.DataFrameSchema({
"a": pa.Column(ibis.dtype("int64")),
"b": pa.Column(ibis.dtype("string")),
"c": pa.Column(ibis.dtype("float64")),
})
assert schema1 == schema2
Nested Types¶
Warning
Using parameterized data types for nested Ibis data types is not yet supported in the Ibis integration.
Time-agnostic DateTime¶
In some use cases, it may not matter whether a column containing timestamp data has a timezone or not.
Warning
The time_zone_agnostic argument for the timestamp data type is not yet
supported in the Ibis integration.
Custom checks¶
All of the built-in Check methods are supported
in the Ibis integration.
To create custom checks, you can create functions that take a IbisData
named tuple as input and produces a ibis.Table as output. IbisData
contains two attributes:
A
tableattribute, which contains theibis.Tableobject you want to validate.A
keyattribute, which contains the column name you want to validate. This will beNonefor table-level checks.
Element-wise checks are also supported by setting element_wise=True. This
will require a function that takes in a single element of the column/dataframe
and returns a boolean scalar indicating whether the value passed.
Warning
Under the hood, element-wise checks use Python UDFs, which are likely to be much slower than vectorized checks.
Column-level Checks¶
For column-level checks, the custom check function should return an Ibis table containing a single boolean column or a single boolean scalar.
Here’s an example of a column-level custom check:
Using DataFrameSchema¶
from pandera.ibis import IbisData
def is_positive_vector(data: IbisData) -> ibis.Table:
"""Return a table with a single boolean column."""
return data.table.select(data.table[data.key] > 0)
def is_positive_scalar(data: IbisData) -> ibis.Table:
"""Return a table with a single boolean scalar."""
return data.table[data.key] > 0
def is_positive_element_wise(x: int) -> bool:
"""Take a single value and return a boolean scalar."""
return x > 0
schema_with_custom_checks = pa.DataFrameSchema({
"a": pa.Column(
int,
checks=[
pa.Check(is_positive_vector),
pa.Check(is_positive_scalar),
pa.Check(is_positive_element_wise, element_wise=True),
]
)
})
t = ibis.memtable({"a": [1, 2, 3]})
validated_t = t.pipe(schema_with_custom_checks.validate)
print(validated_t)
InMemoryTable
data:
PandasDataFrameProxy:
a
0 1
1 2
2 3
Using DataFrameModel¶
from pandera.ibis import IbisData
class ModelWithCustomChecks(pa.DataFrameModel):
a: int
@pa.check("a")
def is_positive_vector(cls, data: IbisData) -> ibis.Table:
"""Return a table with a single boolean column."""
return data.table.select(data.table[data.key] > 0)
@pa.check("a")
def is_positive_scalar(cls, data: IbisData) -> ibis.Table:
"""Return a table with a single boolean scalar."""
return data.table[data.key] > 0
t = ibis.memtable({"a": [1, 2, 3]})
validated_t = t.pipe(ModelWithCustomChecks.validate)
print(validated_t)
InMemoryTable
data:
PandasDataFrameProxy:
a
0 1
1 2
2 3
Warning
Element-wise checks using DataFrameModel are not yet supported in
the Ibis integration; use DataFrameSchema instead.
DataFrame-level Checks¶
If you need to validate values on an entire dataframe, you can specify a check at the dataframe level. The expected output is an Ibis table containing multiple boolean columns, a single boolean column, or a scalar boolean.
Using DataFrameSchema¶
from ibis import _, selectors as s
def col1_gt_col2(data: IbisData, col1: str, col2: str) -> ibis.Table:
"""Return a table with a single boolean column."""
return data.table.select(data.table[col1] > data.table[col2])
def is_positive_df(data: IbisData) -> ibis.Table:
"""Return a table with multiple boolean columns."""
return data.table.select(s.across(s.all(), _ > 0))
def is_positive_element_wise(x: int) -> bool:
"""Take a single value and return a boolean scalar."""
return x > 0
schema_with_df_checks = pa.DataFrameSchema(
columns={
"a": pa.Column(int),
"b": pa.Column(int),
},
checks=[
pa.Check(col1_gt_col2, col1="a", col2="b"),
pa.Check(is_positive_df),
pa.Check(is_positive_element_wise, element_wise=True),
]
)
t = ibis.memtable({"a": [2, 3, 4], "b": [1, 2, 3]})
validated_t = t.pipe(schema_with_df_checks.validate)
print(validated_t)
InMemoryTable
data:
PandasDataFrameProxy:
a b
0 2 1
1 3 2
2 4 3
Using DataFrameModel¶
class ModelWithDFChecks(pa.DataFrameModel):
a: int
b: int
@pa.dataframe_check
def cola_gt_colb(cls, data: IbisData) -> ibis.Table:
"""Return a table with a single boolean column."""
return data.table.select(data.table["a"] > data.table["b"])
@pa.dataframe_check
def is_positive_df(cls, data: IbisData) -> ibis.Table:
"""Return a table with multiple boolean columns."""
return data.table.select(s.across(s.all(), _ > 0))
t = ibis.memtable({"a": [2, 3, 4], "b": [1, 2, 3]})
validated_t = t.pipe(ModelWithDFChecks.validate)
print(validated_t)
InMemoryTable
data:
PandasDataFrameProxy:
a b
0 2 1
1 3 2
2 4 3
Warning
Element-wise checks using DataFrameModel are not yet supported in
the Ibis integration; use DataFrameSchema instead.
Supported and Unsupported Functionality¶
Since the Pandera-Ibis integration is less mature than pandas support, some of the functionality offered by the pandera with pandas DataFrames are not yet supported with Ibis tables.
Here is a list of supported and unsupported features. You can refer to the supported features matrix to see which features are implemented in the Ibis validation backend.