Data Validation with Polars
New in 0.19.0
Polars is a blazingly fast DataFrame library for manipulating structured data. Since the core is written in Rust, you get the performance of C/C++ while it provides SDKs in other languages like Python.
Usage
With the polars integration, you can define pandera schemas to validate polars
dataframes in Python. First, install pandera
with the polars
extra:
pip install 'pandera[polars]'
Important
If you're on an Apple Silicon machine, you'll need to install polars via
pip install polars-lts-cpu
Then you can use pandera schemas to validate polars dataframes. In the example
below we'll use the class-based API to define a DataFrameModel, which we then
use to validate a polars.LazyFrame object.
import pandera.polars as pa
import polars as pl
class Schema(pa.DataFrameModel):
state: str
city: str
price: int = pa.Field(in_range={"min_value": 5, "max_value": 20})
lf = pl.LazyFrame(
{
'state': ['FL','FL','FL','CA','CA','CA'],
'city': [
'Orlando',
'Miami',
'Tampa',
'San Francisco',
'Los Angeles',
'San Diego',
],
'price': [8, 12, 10, 16, 20, 18],
}
)
Schema.validate(lf).collect()
| state | city | price |
| --- | --- | --- |
| str | str | i64 |
| "FL" | "Orlando" | 8 |
| "FL" | "Miami" | 12 |
| "FL" | "Tampa" | 10 |
| "CA" | "San Francisco" | 16 |
| "CA" | "Los Angeles" | 20 |
| "CA" | "San Diego" | 18 |
You can also use the check_types()
decorator to
validate polars LazyFrame function annotations at runtime:
from pandera.typing.polars import LazyFrame
@pa.check_types
def function(lf: LazyFrame[Schema]) -> LazyFrame[Schema]:
return lf.filter(pl.col("state").eq("CA"))
function(lf).collect()
| state | city | price |
| --- | --- | --- |
| str | str | i64 |
| "CA" | "San Francisco" | 16 |
| "CA" | "Los Angeles" | 20 |
| "CA" | "San Diego" | 18 |
And of course, you can use the object-based API to define a
DataFrameSchema
:
schema = pa.DataFrameSchema({
"state": pa.Column(str),
"city": pa.Column(str),
"price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20))
})
schema.validate(lf).collect()
| state | city | price |
| --- | --- | --- |
| str | str | i64 |
| "FL" | "Orlando" | 8 |
| "FL" | "Miami" | 12 |
| "FL" | "Tampa" | 10 |
| "CA" | "San Francisco" | 16 |
| "CA" | "Los Angeles" | 20 |
| "CA" | "San Diego" | 18 |
You can also validate polars.DataFrame
objects, which are objects that
execute computations eagerly. Under the hood, pandera
will convert
the polars.DataFrame
to a polars.LazyFrame
before validating it:
df = lf.collect()
schema.validate(df)
| state | city | price |
| --- | --- | --- |
| str | str | i64 |
| "FL" | "Orlando" | 8 |
| "FL" | "Miami" | 12 |
| "FL" | "Tampa" | 10 |
| "CA" | "San Francisco" | 16 |
| "CA" | "Los Angeles" | 20 |
| "CA" | "San Diego" | 18 |
Note
The Data Synthesis Strategies functionality is not yet supported in the polars integration. At this time you can use the polars-native parametric testing functions to generate test data for polars.
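For example, a minimal sketch of this approach might look like the following, using hypothesis together with the polars.testing.parametric utilities. The column/dataframes strategy arguments and the GeneratedModel schema are illustrative assumptions, not part of the pandera API:
import polars as pl
import pandera.polars as pa
from hypothesis import given
from polars.testing.parametric import column, dataframes
class GeneratedModel(pa.DataFrameModel):
    # nullable=True because the generated test data may contain null values
    price: int = pa.Field(nullable=True)
@given(
    df=dataframes(
        cols=[column("price", dtype=pl.Int64)],
        min_size=1,
        max_size=10,
    )
)
def test_generated_data_passes_schema(df: pl.DataFrame) -> None:
    # validate the generated eager DataFrame at the schema and data level
    GeneratedModel.validate(df)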
How it works
Compared to the way pandera handles pandas dataframes, pandera uses the polars
lazy API as much as possible to take advantage of its performance optimization
benefits.
At a high level, this is what happens during schema validation:
1. Apply parsers: add missing columns if add_missing_columns=True, coerce the datatypes if coerce=True, filter columns if strict="filter", and set defaults if default=<value> (see the sketch below for these options in action).
2. Apply checks: run all core, built-in, and custom checks on the data. Checks on metadata are done without .collect() operations, but checks that inspect data values do.
3. Raise an error: if data errors are found, a SchemaError is raised. If validate(..., lazy=True), a SchemaErrors exception is raised with all of the validation errors present in the data.
4. Return validated output: if no data errors are found, the validated object is returned.
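Here's a minimal sketch of the parser options from step 1. The columns and values are illustrative, and exactly which parser options the polars backend supports may depend on your pandera version:
import polars as pl
import pandera.polars as pa
parsing_schema = pa.DataFrameSchema(
    {
        "state": pa.Column(str),
        "price": pa.Column(int, coerce=True),  # coerce Float64 to Int64
        "currency": pa.Column(str, default="USD", nullable=True),
    },
    add_missing_columns=True,  # add "currency" if it's absent, filled with its default
    strict="filter",  # drop columns that aren't declared in the schema
)
lf = pl.LazyFrame({
    "state": ["FL", "CA"],
    "price": [8.0, 16.0],  # floats that will be coerced to integers
    "extra": [1, 2],  # undeclared column that will be filtered out
})
validated = parsing_schema.validate(lf).collect()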
Note
Datatype coercion on pl.LazyFrame objects is done without .collect()
operations, but coercion on pl.DataFrame objects will invoke them, resulting
in more informative error messages since all failure cases can be reported.
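For illustration, here's a sketch of what a failed coercion on an eager DataFrame might look like. The string-to-integer coercion and the shape of the error message are assumptions that may vary by pandera version:
import polars as pl
import pandera.polars as pa
coercing_schema = pa.DataFrameSchema({"price": pa.Column(int, coerce=True)})
df = pl.DataFrame({"price": ["8", "twelve", "10"]})  # "twelve" cannot be coerced
try:
    coercing_schema.validate(df)
except pa.errors.SchemaError as exc:
    # with an eager DataFrame, the error can report the offending values
    print(exc)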
pandera's validation behavior aligns with the way polars handles lazy vs.
eager operations. When you call schema.validate() on a polars.LazyFrame,
pandera will apply all of the parsers and checks that can be done without any
collect() operations. This means that it only performs validations at the
schema level, e.g. column names and data types. However, if you validate a
polars.DataFrame, pandera performs both schema-level and data-level
validations.
Note
Under the hood, pandera will convert polars.DataFrame objects to
polars.LazyFrame objects before validating them. This is done to leverage the
polars lazy API during the validation process. While this feature isn't fully
optimized in the pandera library, this design decision lays the groundwork for
future performance improvements.
LazyFrame Method Chain
import pandera.polars as pa
import polars as pl
schema = pa.DataFrameSchema({"a": pa.Column(int)})
df = (
pl.LazyFrame({"a": [1.0, 2.0, 3.0]})
.cast({"a": pl.Int64})
.pipe(schema.validate) # this only validates schema-level properties
.with_columns(b=pl.lit("a"))
# do more lazy operations
.collect()
)
print(df)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ a   │
│ 2   ┆ a   │
│ 3   ┆ a   │
└─────┴─────┘
import pandera.polars as pa
import polars as pl
class SimpleModel(pa.DataFrameModel):
a: int
df = (
pl.LazyFrame({"a": [1.0, 2.0, 3.0]})
.cast({"a": pl.Int64})
.pipe(SimpleModel.validate) # this only validates schema-level properties
.with_columns(b=pl.lit("a"))
# do more lazy operations
.collect()
)
print(df)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ a   │
│ 2   ┆ a   │
│ 3   ┆ a   │
└─────┴─────┘
DataFrame Method Chain
schema = pa.DataFrameSchema({"a": pa.Column(int)})
df = (
pl.DataFrame({"a": [1.0, 2.0, 3.0]})
.cast({"a": pl.Int64})
.pipe(schema.validate) # this validates schema- and data- level properties
.with_columns(b=pl.lit("a"))
# do more eager operations
)
print(df)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ a   │
│ 2   ┆ a   │
│ 3   ┆ a   │
└─────┴─────┘
class SimpleModel(pa.DataFrameModel):
a: int
df = (
pl.DataFrame({"a": [1.0, 2.0, 3.0]})
.cast({"a": pl.Int64})
.pipe(SimpleModel.validate) # this validates schema- and data- level properties
.with_columns(b=pl.lit("a"))
# do more eager operations
)
print(df)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ a   │
│ 2   ┆ a   │
│ 3   ┆ a   │
└─────┴─────┘
Error Reporting
In the event of a validation error, pandera
will raise a SchemaError
eagerly.
class SimpleModel(pa.DataFrameModel):
a: int
invalid_lf = pl.LazyFrame({"a": pl.Series(["1", "2", "3"], dtype=pl.Utf8)})
try:
SimpleModel.validate(invalid_lf)
except pa.errors.SchemaError as exc:
print(exc)
expected column 'a' to have type Int64, got String
And if you use lazy validation, pandera
will raise a SchemaErrors
exception. This is particularly useful when you want to collect all of the validation errors
present in the data.
Note
Lazy validation in pandera is different from the
lazy API in polars, which is an unfortunate name collision. Lazy validation
means that all parsers and checks are applied to the data before raising
a SchemaErrors
exception. The lazy API
in polars allows you to build a computation graph without actually
executing it in-line, where you call .collect()
to actually execute
the computation.
By default, pl.LazyFrame
validation will only validate schema-level properties:
class ModelWithChecks(pa.DataFrameModel):
a: int
b: str = pa.Field(isin=[*"abc"])
c: float = pa.Field(ge=0.0, le=1.0)
invalid_lf = pl.LazyFrame({
"a": pl.Series(["1", "2", "3"], dtype=pl.Utf8),
"b": ["d", "e", "f"],
"c": [0.0, 1.1, -0.1],
})
ModelWithChecks.validate(invalid_lf, lazy=True)
Traceback (most recent call last):
...
pandera.errors.SchemaErrors: {
"SCHEMA": {
"WRONG_DATATYPE": [
{
"schema": "ModelWithChecks",
"column": "a",
"check": "dtype('Int64')",
"error": "expected column 'a' to have type Int64, got String"
}
]
}
}
By default, pl.DataFrame
validation will validate both schema-level
and data-level properties:
class ModelWithChecks(pa.DataFrameModel):
a: int
b: str = pa.Field(isin=[*"abc"])
c: float = pa.Field(ge=0.0, le=1.0)
invalid_lf = pl.DataFrame({
"a": pl.Series(["1", "2", "3"], dtype=pl.Utf8),
"b": ["d", "e", "f"],
"c": [0.0, 1.1, -0.1],
})
ModelWithChecks.validate(invalid_lf, lazy=True)
Traceback (most recent call last):
...
pandera.errors.SchemaErrors: {
"SCHEMA": {
"WRONG_DATATYPE": [
{
"schema": "ModelWithChecks",
"column": "a",
"check": "dtype('Int64')",
"error": "expected column 'a' to have type Int64, got String"
}
]
},
"DATA": {
"DATAFRAME_CHECK": [
{
"schema": "ModelWithChecks",
"column": "b",
"check": "isin(['a', 'b', 'c'])",
"error": "Column 'b' failed validator number 0: <Check isin: isin(['a', 'b', 'c'])> failure case examples: [{'b': 'd'}, {'b': 'e'}, {'b': 'f'}]"
},
{
"schema": "ModelWithChecks",
"column": "c",
"check": "greater_than_or_equal_to(0.0)",
"error": "Column 'c' failed validator number 0: <Check greater_than_or_equal_to: greater_than_or_equal_to(0.0)> failure case examples: [{'c': -0.1}]"
},
{
"schema": "ModelWithChecks",
"column": "c",
"check": "less_than_or_equal_to(1.0)",
"error": "Column 'c' failed validator number 1: <Check less_than_or_equal_to: less_than_or_equal_to(1.0)> failure case examples: [{'c': 1.1}]"
}
]
}
}
Supported Data Types
pandera
currently supports all of the
polars data types.
Built-in python types like str
, int
, float
, and bool
will be
handled in the same way that polars
handles them:
assert pl.Series([1,2,3], dtype=int).dtype == pl.Int64
assert pl.Series([*"abc"], dtype=str).dtype == pl.Utf8
assert pl.Series([1.0, 2.0, 3.0], dtype=float).dtype == pl.Float64
So the following schemas are equivalent:
schema1 = pa.DataFrameSchema({
"a": pa.Column(int),
"b": pa.Column(str),
"c": pa.Column(float),
})
schema2 = pa.DataFrameSchema({
"a": pa.Column(pl.Int64),
"b": pa.Column(pl.Utf8),
"c": pa.Column(pl.Float64),
})
assert schema1 == schema2
Nested Types
Polars nested data types are also supported via parameterized data types. See the examples below for the different ways to specify this through the object-based and class-based APIs:
schema = pa.DataFrameSchema(
{
"list_col": pa.Column(pl.List(pl.Int64())),
"array_col": pa.Column(pl.Array(pl.Int64(), 3)),
"struct_col": pa.Column(pl.Struct({"a": pl.Utf8(), "b": pl.Float64()})),
},
)
try:
from typing import Annotated # python 3.9+
except ImportError:
from typing_extensions import Annotated
class ModelWithAnnotated(pa.DataFrameModel):
list_col: Annotated[pl.List, pl.Int64()]
array_col: Annotated[pl.Array, pl.Int64(), 3]
struct_col: Annotated[pl.Struct, {"a": pl.Utf8(), "b": pl.Float64()}]
class ModelWithDtypeKwargs(pa.DataFrameModel):
list_col: pl.List = pa.Field(dtype_kwargs={"inner": pl.Int64()})
array_col: pl.Array = pa.Field(dtype_kwargs={"inner": pl.Int64(), "width": 3})
struct_col: pl.Struct = pa.Field(dtype_kwargs={"fields": {"a": pl.Utf8(), "b": pl.Float64()}})
Custom checks
All of the built-in Check
methods are supported
in the polars integration.
To create custom checks, you can create functions that take a PolarsData
named tuple as input and produce a polars.LazyFrame as output. PolarsData
contains two attributes:
- A lazyframe attribute, which contains the polars.LazyFrame object you want to validate.
- A key attribute, which contains the column name you want to validate. This will be None for dataframe-level checks.
Element-wise checks are also supported by setting element_wise=True
. This
will require a function that takes in a single element of the column/dataframe
and returns a boolean scalar indicating whether the value passed.
Warning
Under the hood, element-wise checks use the map_elements function, which is slower than the native polars expressions API.
Column-level Checks
Here's an example of a column-level custom check:
from pandera.polars import PolarsData
def is_positive_vector(data: PolarsData) -> pl.LazyFrame:
"""Return a LazyFrame with a single boolean column."""
return data.lazyframe.select(pl.col(data.key).gt(0))
def is_positive_scalar(data: PolarsData) -> pl.LazyFrame:
"""Return a LazyFrame with a single boolean scalar."""
return data.lazyframe.select(pl.col(data.key).gt(0).all())
def is_positive_element_wise(x: int) -> bool:
"""Take a single value and return a boolean scalar."""
return x > 0
schema_with_custom_checks = pa.DataFrameSchema({
"a": pa.Column(
int,
checks=[
pa.Check(is_positive_vector),
pa.Check(is_positive_scalar),
pa.Check(is_positive_element_wise, element_wise=True),
]
)
})
lf = pl.LazyFrame({"a": [1, 2, 3]})
validated_df = lf.collect().pipe(schema_with_custom_checks.validate)
print(validated_df)
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
from pandera.polars import PolarsData
class ModelWithCustomChecks(pa.DataFrameModel):
a: int
@pa.check("a")
def is_positive_vector(cls, data: PolarsData) -> pl.LazyFrame:
"""Return a LazyFrame with a single boolean column."""
return data.lazyframe.select(pl.col(data.key).gt(0))
@pa.check("a")
def is_positive_scalar(cls, data: PolarsData) -> pl.LazyFrame:
"""Return a LazyFrame with a single boolean scalar."""
return data.lazyframe.select(pl.col(data.key).gt(0).all())
@pa.check("a", element_wise=True)
def is_positive_element_wise(cls, x: int) -> bool:
"""Take a single value and return a boolean scalar."""
return x > 0
validated_df = lf.collect().pipe(ModelWithCustomChecks.validate)
print(validated_df)
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
For column-level checks, the custom check function should return a
polars.LazyFrame
containing a single boolean column or a single boolean scalar.
DataFrame-level Checks
If you need to validate values on an entire dataframe, you can specify a check
at the dataframe level. The expected output is a polars.LazyFrame
containing
multiple boolean columns, a single boolean column, or a scalar boolean.
def col1_gt_col2(data: PolarsData, col1: str, col2: str) -> pl.LazyFrame:
"""Return a LazyFrame with a single boolean column."""
return data.lazyframe.select(pl.col(col1).gt(pl.col(col2)))
def is_positive_df(data: PolarsData) -> pl.LazyFrame:
"""Return a LazyFrame with multiple boolean columns."""
return data.lazyframe.select(pl.col("*").gt(0))
def is_positive_element_wise(x: int) -> bool:
"""Take a single value and return a boolean scalar."""
return x > 0
schema_with_df_checks = pa.DataFrameSchema(
columns={
"a": pa.Column(int),
"b": pa.Column(int),
},
checks=[
pa.Check(col1_gt_col2, col1="a", col2="b"),
pa.Check(is_positive_df),
pa.Check(is_positive_element_wise, element_wise=True),
]
)
lf = pl.LazyFrame({"a": [2, 3, 4], "b": [1, 2, 3]})
validated_df = lf.collect().pipe(schema_with_df_checks.validate)
print(validated_df)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 1   │
│ 3   ┆ 2   │
│ 4   ┆ 3   │
└─────┴─────┘
class ModelWithDFChecks(pa.DataFrameModel):
a: int
b: int
@pa.dataframe_check
def cola_gt_colb(cls, data: PolarsData) -> pl.LazyFrame:
"""Return a LazyFrame with a single boolean column."""
return data.lazyframe.select(pl.col("a").gt(pl.col("b")))
@pa.dataframe_check
def is_positive_df(cls, data: PolarsData) -> pl.LazyFrame:
"""Return a LazyFrame with multiple boolean columns."""
return data.lazyframe.select(pl.col("*").gt(0))
@pa.dataframe_check(element_wise=True)
def is_positive_element_wise(cls, x: int) -> bool:
"""Take a single value and return a boolean scalar."""
return x > 0
validated_df = lf.collect().pipe(ModelWithDFChecks.validate)
print(validated_df)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 1   │
│ 3   ┆ 2   │
│ 4   ┆ 3   │
└─────┴─────┘
Data-level Validation with LazyFrames
As mentioned earlier in this page, by default calling schema.validate
on
a pl.LazyFrame
will only perform schema-level validation checks. If you want
to validate data-level properties on a pl.LazyFrame
, the recommended way
would be to first call .collect()
:
class SimpleModel(pa.DataFrameModel):
a: int
lf: pl.LazyFrame = (
pl.LazyFrame({"a": [1.0, 2.0, 3.0]})
.cast({"a": pl.Int64})
.collect() # convert to pl.DataFrame
.pipe(SimpleModel.validate)
.lazy() # convert back to pl.LazyFrame
# do more lazy operations
)
This syntax is nice because it's clear what's happening just from reading the code. Pandera schemas serve as an apparent point in the method chain that materializes data.
However, if you don't mind a little magic, you can set the
PANDERA_VALIDATION_DEPTH environment variable to SCHEMA_AND_DATA to
validate data-level properties on a polars.LazyFrame. This will be equivalent
to the explicit code above:
export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA
lf: pl.LazyFrame = (
pl.LazyFrame({"a": [1.0, 2.0, 3.0]})
.cast({"a": pl.Int64})
.pipe(SimpleModel.validate) # this will validate schema- and data-level properties
# do more lazy operations
)
Under the hood, the validation process will make .collect()
calls on the
LazyFrame in order to run data-level validation checks, and it will still
return a pl.LazyFrame
after validation is done.