Narwhals
As of 0.32.0, Pandera ships an optional
Narwhals-based validation
backend that powers the Polars, Ibis, and
PySpark SQL integrations behind a single unified code
path. The Narwhals backend is opt-in: by default Pandera continues to use
the native Polars, Ibis, and PySpark backends. The public API
(import pandera.polars as pa, import pandera.ibis as pa,
import pandera.pyspark as pa) is unchanged regardless of which backend is
active.
Enabling the Narwhals backend
Install the narwhals extra alongside the backend(s) you use:
pip install 'pandera[narwhals,polars]' # Polars
pip install 'pandera[narwhals,ibis]' # Ibis
pip install 'pandera[narwhals,pyspark]' # PySpark SQL
Then enable it using either of the following options.
Environment variable (process start)
Set PANDERA_USE_NARWHALS_BACKEND to True before starting Python:
export PANDERA_USE_NARWHALS_BACKEND=True
python your_script.py
This value is read when pandera.config is first imported.
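Because the value is read once, the variable has to be in the environment before the interpreter starts. The following is a minimal sketch of launching a child Python process with the variable set. The helper name run_with_narwhals is hypothetical, and the child here only echoes the variable as a stand-in for your_script.py; it does not import pandera.

```python
import os
import subprocess
import sys


def run_with_narwhals(args: list[str]) -> subprocess.CompletedProcess:
    """Run a Python child process with the Narwhals flag set in its environment."""
    env = dict(os.environ)  # copy, so the parent process environment is untouched
    env["PANDERA_USE_NARWHALS_BACKEND"] = "True"
    return subprocess.run(
        [sys.executable, *args],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )


# Stand-in for `python your_script.py`: the child reports the value it sees.
child_code = "import os; print(os.environ['PANDERA_USE_NARWHALS_BACKEND'])"
result = run_with_narwhals(["-c", child_code])
print(result.stdout.strip())
```

Setting the variable on a copied environment keeps the change scoped to the child, which can be convenient in test runners that exercise both backends.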
Programmatic configuration
Call set_config() at any point — before or after importing
pandera.polars, pandera.ibis, or pandera.pyspark:
import pandera
pandera.set_config(use_narwhals_backend=True)
import pandera.polars as pa
See Backend registration for details on when backends are registered and how runtime toggling works.
Advanced: manual re-registration
Prefer set_config() to toggle backends within a process. For
low-level control (for example, in tests), clear the registration caches and
call the register functions with the desired flag:
from pandera.backends.polars.register import register_polars_backends
register_polars_backends.cache_clear()
register_polars_backends(use_narwhals_backend=True)
The same pattern applies to register_ibis_backends and
register_pyspark_backends.
If PANDERA_USE_NARWHALS_BACKEND=True but narwhals is not installed,
schema construction raises an ImportError directing you to install
pandera[narwhals].
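If you would rather fail early than at schema construction, a pre-flight check along these lines is one option. This is a sketch using only the standard library; the helper names is_installed and check_narwhals_available are hypothetical and are not part of pandera, which performs its own check.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def check_narwhals_available() -> None:
    # Hypothetical helper: mirrors, but does not replace, pandera's own ImportError.
    if not is_installed("narwhals"):
        raise ImportError(
            "The Narwhals backend was requested but narwhals is not installed; "
            "install it with: pip install 'pandera[narwhals]'"
        )
```

Calling check_narwhals_available() at application start-up surfaces a missing extra before any schema is built.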
Backend registration
Pandera chooses between the native and Narwhals validation backends through a
registration step that maps each schema class (for example,
DataFrameSchema) to a concrete
backend implementation for a given frame type (for example, polars.DataFrame).
Two behaviours govern how that mapping is established and updated at runtime.
Lazy registration
Validation backends for Polars, Ibis, and PySpark SQL are registered lazily — not when you import a pandera backend module, but the first time a schema needs a backend. Concretely, registration runs when you:
- construct a DataFrameSchema (from pandera.polars, pandera.ibis, or pandera.pyspark), or
- call validate() on a column or schema component that triggers backend lookup.
Until one of those happens, importing pandera.polars, pandera.ibis, or
pandera.pyspark has no effect on which validation backend is active:
import pandera.polars as pa
# CONFIG.use_narwhals_backend is read here — not at import time above
pa.config.set_config(use_narwhals_backend=True)
schema = pa.DataFrameSchema({"name": pa.Column(str)}) # narwhals backends registered
schema.validate(df)
At registration time, pandera reads the current value of
CONFIG.use_narwhals_backend (from the environment variable or a prior
set_config() call) and registers either the native or Narwhals
backend implementations. The register functions are cached with
@lru_cache; the use_narwhals_backend flag is part of the cache key, so
native and Narwhals registrations do not collide.
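The caching behaviour can be illustrated with a toy register function. This is a sketch of how functools.lru_cache treats the flag as part of the cache key, not pandera's actual implementation; REGISTRY, CALLS, and register_backends are hypothetical stand-ins.

```python
from functools import lru_cache

REGISTRY: dict[str, str] = {}  # toy stand-in for a backend registry
CALLS: list[bool] = []         # records each time the function body really runs


@lru_cache
def register_backends(use_narwhals_backend: bool = False) -> None:
    """Toy register function: the flag value is part of the lru_cache key."""
    CALLS.append(use_narwhals_backend)
    REGISTRY["DataFrameSchema"] = "narwhals" if use_narwhals_backend else "native"


register_backends(use_narwhals_backend=False)  # first call: body runs
register_backends(use_narwhals_backend=False)  # same key: cache hit, body skipped
register_backends(use_narwhals_backend=True)   # different key: body runs
register_backends.cache_clear()                # forget all cached calls
register_backends(use_narwhals_backend=True)   # body runs again after the clear
```

Note that lru_cache keys positional and keyword calls separately, so the sketch passes the flag by keyword every time.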
Tip
Because registration is lazy, you can call set_config() after
importing a backend module and before constructing your first schema — no
manual cache clearing is required in that case.
Runtime re-registration
If you change use_narwhals_backend with set_config() after
backends have already been registered, pandera re-registers them
automatically:
1. The global CONFIG.use_narwhals_backend value is updated.
2. Pandera detects which of the Polars / Ibis / PySpark register functions had already run.
3. Registration caches are cleared and existing registry entries for those backends are removed.
4. Only the backends that were previously registered are registered again, now using the new flag value.
5. A UserWarning is emitted to make the swap visible.
import pandera.polars as pa
schema = pa.DataFrameSchema({"age": pa.Column(int)})
schema.validate(df) # uses native Polars backend (default)
pa.config.set_config(use_narwhals_backend=True)
# UserWarning: Re-registered pandera backends after use_narwhals_backend changed.
schema.validate(df) # same schema object, now validated by the Narwhals backend
Existing schema objects continue to work after re-registration. Schemas
do not store a backend reference at construction time; they look up the
registered backend from the global registry on each validate() call.
Re-registration applies only to backends that had already been registered in
the current process. If you call set_config(use_narwhals_backend=True)
before constructing any Polars/Ibis/PySpark schema, no re-registration occurs
— the first lazy registration picks up the updated config silently.
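The re-registration flow and the per-call registry lookup described above can be modelled in a few lines. This is a toy model under stated assumptions, not pandera's code: CONFIG, REGISTRY, REGISTERED, ToySchema, and this set_config are all hypothetical stand-ins that only mimic the documented behaviour.

```python
import warnings

CONFIG = {"use_narwhals_backend": False}
REGISTRY: dict[str, str] = {}  # schema family -> active backend name
REGISTERED: set[str] = set()   # families whose lazy registration has run


def _register(family: str) -> None:
    REGISTRY[family] = "narwhals" if CONFIG["use_narwhals_backend"] else "native"
    REGISTERED.add(family)


def set_config(use_narwhals_backend: bool) -> None:
    changed = CONFIG["use_narwhals_backend"] != use_narwhals_backend
    CONFIG["use_narwhals_backend"] = use_narwhals_backend  # update the global flag
    if not (changed and REGISTERED):
        return  # nothing registered yet: lazy registration picks up the flag later
    previously = sorted(REGISTERED)  # detect which families had already registered
    for family in previously:        # drop their existing registry entries
        REGISTRY.pop(family, None)
    REGISTERED.clear()
    for family in previously:        # re-register only those, with the new flag
        _register(family)
    warnings.warn("Re-registered backends after use_narwhals_backend changed.", UserWarning)


class ToySchema:
    def __init__(self, family: str) -> None:
        self.family = family
        if family not in REGISTERED:  # lazy registration on first construction
            _register(family)

    def validate(self) -> str:
        # No backend is stored on the schema; it is looked up on every call.
        return REGISTRY[self.family]


schema = ToySchema("polars")
before = schema.validate()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    set_config(use_narwhals_backend=True)
after = schema.validate()
```

In this model the same schema object reports a different backend before and after the toggle, and a family that never registered (for example "ibis") is left out of the registry entirely.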
Note
Runtime re-registration is triggered by set_config(), which
updates the global CONFIG. The config_context manager overrides
settings for validation behaviour (for example, validation_depth) but does
not change which validation backend is registered. Use
set_config() (or the environment variable) to switch between
native and Narwhals backends.
What it is
Narwhals is a lightweight compatibility layer that provides a subset of the Polars expression API on top of multiple underlying DataFrame libraries (Polars, pandas, PyArrow, Modin, cuDF, Dask, Ibis, DuckDB, PySpark, etc.). Pandera uses it to express validation logic — column selection, type coercion, check evaluation, failure-case collection — once, and have it executed natively by each supported engine.
What it changes for you
- Unified checks across Polars, Ibis, and PySpark SQL. Built-in checks (isin, in_range, str_matches, etc.) are implemented as Narwhals expressions and run unchanged on Polars LazyFrames, Ibis tables, and PySpark SQL DataFrames when the Narwhals backend is enabled. PySpark SQL is a SQL-lazy backend: element-wise checks are not supported, and row sampling (sample= / tail= parameters) is not supported.
- Lazy validation stays lazy. For Polars LazyFrames, Ibis tables, and PySpark SQL DataFrames, Pandera threads validation through the native lazy API: no full-frame .collect() / .execute() is triggered during validation. Only the bounded failure_cases frame is materialized, and only on error.
- Custom checks become portable. A check written against pandera.polars typically works against pandera.ibis (and vice versa) as long as it uses Narwhals expressions. The native parameter on Check controls which frame type the check function receives: native=True (the default) passes the native backend frame (e.g. pl.DataFrame, ibis.Table) so the check is backend-specific; setting native=False passes a Narwhals-wrapped frame so the check can run unchanged across all supported backends using only the Narwhals expression API.
PySpark SQL: differences from the native backend
Because the Narwhals backend for PySpark shares its check implementations with the Polars and Ibis backends, several behaviours differ from the native PySpark backend:
- SQL-lazy execution. No element-wise checks (no map_batches on SQL-lazy frames), and no row sampling via sample= / tail= parameters.
- coerce=True is a no-op. The Narwhals ColumnBackend has no coercion step. Setting coerce=True on a Field or Column performs no coercion; Pandera emits a SchemaWarning per column to make the subsequent WRONG_DATATYPE error understandable rather than silent. Setting coerce=True at the Config level (row-wise auto_coerce dtype) is handled and does not warn. If you rely on coerce=True to convert column dtypes, use the native PySpark backend (see Opting out).
- Custom checks using PysparkDataframeColumnObject are incompatible. Custom checks registered via @register_check_method that expect a pyspark_obj: PysparkDataframeColumnObject argument will not work under the Narwhals backend. The Narwhals backend passes a NarwhalsData(frame, key) named tuple to check functions instead, so the custom check signature and body must be rewritten against the Narwhals frame API (or kept on the native backend).
- failure_cases rows may be omitted for scalar Polars errors. Schema-level failure cases produced as scalar Polars frames (e.g. from a wrong-dtype check) are still reported in the errors dict but their rows are omitted from the aggregated failure_cases frame. See the Known gaps section for details.
- Unified SchemaErrors contract. Like the Polars and Ibis Narwhals backends, the PySpark Narwhals backend raises pandera.errors.SchemaErrors on validation failure (or SchemaError for the first error when lazy=False). This differs from the native PySpark backend, which attaches errors to dataframe.pandera.errors. If you depend on the dataframe.pandera.errors accessor, use the native PySpark backend (see Opting out).
Opting out
The Narwhals backend is off by default, so no action is needed to
continue using the native Polars, Ibis, and PySpark backends. If you
previously opted in and want to switch back, unset the environment variable
(or set it to False):
unset PANDERA_USE_NARWHALS_BACKEND
# or
export PANDERA_USE_NARWHALS_BACKEND=False
Or call set_config() programmatically:
import pandera
pandera.set_config(use_narwhals_backend=False)
The native paths remain fully supported alongside the Narwhals path.
Known gaps
A small number of features are currently not wired through the Narwhals backend. Follow-up milestones track each of the gaps below:
- Under the PySpark Narwhals backend, schema-level failure_cases produced as scalar Polars frames (e.g. from a wrong-dtype error) are still reported in the errors dict but their rows are omitted from the aggregated failure_cases frame. This is because scalar Polars frames cannot be converted to PySpark without a live SparkSession at the error-collection site; this gap is tracked for a future release.
- Column-level coerce=True is currently a no-op for all Narwhals backends (Polars, Ibis, PySpark SQL). Pandera emits a one-time SchemaWarning per column so the subsequent WRONG_DATATYPE error is understandable rather than silent. Full column-level coercion support is tracked as a follow-up.
- coerce for the Ibis backend (deferred; Ibis coerces eagerly today)
- add_missing_columns parser and set_default for Column fields
- group_by-based checks beyond element-wise and column-wise expressions
- Element-wise checks for SQL-lazy backends (Ibis and PySpark SQL). As a consequence, the shared built-in check suite in tests/common/ does not run for the PySpark Narwhals backend (all shared checks are element-wise; running them would produce only skips with no useful coverage signal).
- Schema IO (YAML/JSON) for Narwhals-backed schemas
- Hypothesis data-synthesis strategies
- sample= / tail= row sampling for SQL-lazy backends (Ibis and PySpark SQL)
- check_unique (column-level uniqueness) does not produce a per-row boolean check_output, so drop_invalid_rows=True cannot filter rows that fail a uniqueness constraint; those rows remain in the output. This gap is tracked for a future release.
See the Supported DataFrame Libraries page for the user-facing integrations; the Narwhals layer is an implementation detail that keeps them consistent.