Narwhals
As of 0.32.0, Pandera ships an optional
Narwhals-based validation
backend that powers the Polars, Ibis, and
PySpark SQL integrations behind a single unified code
path. The Narwhals backend is opt-in: by default Pandera continues to use
the native Polars, Ibis, and PySpark backends. The public API
(import pandera.polars as pa, import pandera.ibis as pa,
import pandera.pyspark as pa) is unchanged regardless of which backend is
active.
Enabling the Narwhals backend
To switch the Polars, Ibis, and PySpark SQL integrations onto the
Narwhals-powered backend, install the narwhals extra and set the
PANDERA_USE_NARWHALS_BACKEND environment variable to True before
importing pandera.polars, pandera.ibis, or pandera.pyspark:
pip install 'pandera[narwhals]'
export PANDERA_USE_NARWHALS_BACKEND=True
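Where a shell export is inconvenient (a notebook, for example), the same variable can be set from Python. This is a minimal sketch that assumes the variable only needs to be present before the first pandera import, as described above:

```python
import os

# Must run before any pandera import; the value mirrors the shell export above.
os.environ["PANDERA_USE_NARWHALS_BACKEND"] = "True"

# Import the integration only after the variable is set, e.g.:
# import pandera.polars as pa
```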
You can also enable it programmatically by setting
pandera.config.CONFIG.use_narwhals_backend to True before any
pandera.polars / pandera.ibis / pandera.pyspark schema is constructed:
import pandera.config
pandera.config.CONFIG.use_narwhals_backend = True
import pandera.polars as pa # narwhals backend now registered
The backend choice is locked in the first time a Polars, Ibis, or PySpark
schema is created, because the registration step is cached with lru_cache.
To switch backends in the same process, clear the caches and re-register:
from pandera.backends.polars.register import register_polars_backends
from pandera.backends.ibis.register import register_ibis_backends
from pandera.backends.pyspark.register import register_pyspark_backends
register_polars_backends.cache_clear()
register_ibis_backends.cache_clear()
register_pyspark_backends.cache_clear()
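Why the cache has to be cleared can be seen in a self-contained sketch. The names config, registered, and register_backends are hypothetical stand-ins for illustration, not Pandera's internals; only functools.lru_cache is the real mechanism named above.

```python
from functools import lru_cache

# Hypothetical stand-ins for a config flag and a cached registration step.
config = {"use_narwhals_backend": False}
registered = []


@lru_cache
def register_backends():
    """Record the chosen backend; the body runs once until the cache is cleared."""
    registered.append("narwhals" if config["use_narwhals_backend"] else "native")


register_backends()               # first call: body runs, "native" is recorded
config["use_narwhals_backend"] = True
register_backends()               # cached: body is skipped, the new flag is ignored
register_backends.cache_clear()   # drop the cached result
register_backends()               # body runs again, "narwhals" is recorded
```

After the last call, registered holds one entry per real execution of the body, which is why changing the flag alone has no effect until cache_clear() is called.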
If PANDERA_USE_NARWHALS_BACKEND=True but narwhals is not installed,
schema construction raises an ImportError directing you to install
pandera[narwhals].
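To catch a missing install before enabling the backend, a pre-flight check is possible. This is a sketch using only the standard library; narwhals_available is a hypothetical helper, not part of Pandera.

```python
import importlib.util


def narwhals_available() -> bool:
    """Return True if the narwhals package can be found in this environment."""
    return importlib.util.find_spec("narwhals") is not None


if not narwhals_available():
    # Same remedy the ImportError points to.
    print("narwhals is not installed; run: pip install 'pandera[narwhals]'")
```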
What it is
Narwhals is a lightweight compatibility layer that provides a subset of the Polars expression API on top of multiple underlying DataFrame libraries (Polars, pandas, PyArrow, Modin, cuDF, Dask, Ibis, DuckDB, PySpark, etc.). Pandera uses it to express validation logic — column selection, type coercion, check evaluation, failure-case collection — once, and have it executed natively by each supported engine.
What it changes for you
Unified checks across Polars, Ibis, and PySpark SQL. Built-in checks (isin, in_range, str_matches, etc.) are implemented as Narwhals expressions and run unchanged on Polars LazyFrames, Ibis tables, and PySpark SQL DataFrames when the Narwhals backend is enabled. PySpark SQL is a SQL-lazy backend: element-wise checks are not supported, and row sampling (sample= / tail= parameters) is not supported.
Lazy validation stays lazy. For Polars LazyFrames, Ibis tables, and PySpark SQL DataFrames, Pandera threads validation through the native lazy API: no full-frame .collect() / .execute() is triggered during validation. Only the bounded failure_cases frame is materialized, and only on error.
Custom checks become portable. A check written against pandera.polars typically works against pandera.ibis (and vice versa) as long as it uses Narwhals expressions. The native parameter on Check controls which frame type the check function receives: native=True (the default) passes the native backend frame (e.g. pl.DataFrame, ibis.Table) so the check is backend-specific; setting native=False passes a Narwhals-wrapped frame so the check can run unchanged across all supported backends using only the Narwhals expression API.
PySpark SQL: differences from the native backend
Because the Narwhals backend for PySpark shares its check implementations with the Polars and Ibis backends, several behaviours differ from the native PySpark backend:
SQL-lazy execution. No element-wise checks (no map_batches on SQL-lazy frames), and no row sampling via sample= / tail= parameters.
coerce=True is a no-op. The Narwhals ColumnBackend has no coercion step. Setting coerce=True on a Field or Column performs no coercion; Pandera emits a SchemaWarning per column to make the subsequent WRONG_DATATYPE error understandable rather than silent. Setting coerce=True at the Config level (row-wise auto_coerce dtype) is handled and does not warn. If you rely on coerce=True to convert column dtypes, use the native PySpark backend (PANDERA_USE_NARWHALS_BACKEND=False).
Custom checks using PysparkDataframeColumnObject are incompatible. Custom checks registered via @register_check_method that expect a pyspark_obj: PysparkDataframeColumnObject argument will not work under the Narwhals backend. The Narwhals backend passes a NarwhalsData(frame, key) named tuple to check functions instead, so the custom check signature and body must be rewritten against the Narwhals frame API (or kept on the native backend).
failure_cases rows may be omitted for scalar Polars errors. Schema-level failure cases produced as scalar Polars frames (e.g. from a wrong-dtype check) are still reported in the errors dict but their rows are omitted from the aggregated failure_cases frame. See the Known gaps section for details.
Unified SchemaErrors contract. Like the Polars and Ibis Narwhals backends, the PySpark Narwhals backend raises pandera.errors.SchemaErrors on validation failure (or SchemaError for the first error when lazy=False). This differs from the native PySpark backend, which attaches errors to dataframe.pandera.errors. If you depend on the dataframe.pandera.errors accessor, use the native PySpark backend (PANDERA_USE_NARWHALS_BACKEND=False).
Opting out
The Narwhals backend is off by default, so no action is needed to
continue using the native Polars, Ibis, and PySpark backends. If you
previously opted in and want to switch back, unset the environment
variable (or set it to False):
unset PANDERA_USE_NARWHALS_BACKEND
# or
export PANDERA_USE_NARWHALS_BACKEND=False
The native paths remain fully supported alongside the Narwhals path.
Known gaps
A small number of features are currently not wired through the Narwhals backend. Follow-up milestones track each of the gaps below:
Under the PySpark Narwhals backend, schema-level failure_cases produced as scalar Polars frames (e.g. from a wrong-dtype error) are still reported in the errors dict but their rows are omitted from the aggregated failure_cases frame. This is because scalar Polars frames cannot be converted to PySpark without a live SparkSession at the error-collection site; this gap is tracked for a future release.
Column-level coerce=True is currently a no-op for all Narwhals backends (Polars, Ibis, PySpark SQL). Pandera emits a one-time SchemaWarning per column so the subsequent WRONG_DATATYPE error is understandable rather than silent. Full column-level coercion support is tracked as a follow-up.
coerce for the Ibis backend (deferred; Ibis coerces eagerly today)
add_missing_columns parser and set_default for Column fields
group_by-based checks beyond element-wise and column-wise expressions
Element-wise checks for SQL-lazy backends (Ibis and PySpark SQL). As a consequence, the shared built-in check suite in tests/common/ does not run for the PySpark Narwhals backend (all shared checks are element-wise; running them would produce only skips with no useful coverage signal).
Schema IO (YAML/JSON) for Narwhals-backed schemas
Hypothesis data-synthesis strategies
sample= / tail= row sampling for SQL-lazy backends (Ibis and PySpark SQL)
check_unique (column-level uniqueness) does not produce a per-row boolean check_output, so drop_invalid_rows=True cannot filter rows that fail a uniqueness constraint; those rows remain in the output. This gap is tracked for a future release.
See the Supported DataFrame Libraries page for the user-facing integrations; the Narwhals layer is an implementation detail that keeps them consistent.