Data Validation with GeoPandas

New in version 0.9.0

GeoPandas is an extension of pandas that adds support for geospatial data. You can use pandera to validate geopandas.GeoDataFrame and geopandas.GeoSeries objects directly.

Usage

Install pandera with the geopandas extra:

pip install 'pandera[geopandas]'

Import pandera.geopandas as a single entry point. It includes everything from pandera.pandas (Column, Check, Field, DataFrameModel, dtypes, decorators, etc.) plus GeoDataFrameSchema and GeoDataFrameModel.

import pandera.geopandas as pg

For object-based validation, use GeoDataFrameSchema so that GeoDataFrameSchema.validate returns a geopandas.GeoDataFrame. For the class-based API, subclass GeoDataFrameModel, or use DataFrameModel when a plain pandas.DataFrame return type is enough.

import geopandas as gpd
import pandera.geopandas as pg
from shapely.geometry import Polygon

geo_schema = pg.GeoDataFrameSchema({
    "geometry": pg.Column("geometry"),
    "region": pg.Column(str),
})

geo_df = gpd.GeoDataFrame({
    "geometry": [
        Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
        Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0)))
    ],
    "region": ["NA", "SA"]
})

geo_schema.validate(geo_df)
                                  geometry  region
0      POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))      NA
1  POLYGON ((0 0, 0 -1, -1 -1, -1 0, 0 0))      SA

You can also use the GeometryDtype data type in either instantiated or un-instantiated form:

import geopandas as gpd
import pandera.geopandas as pg

# Use ``GeometryDtype`` or ``GeometryDtype()`` interchangeably here.
geo_schema = pg.DataFrameSchema({
    "geometry": pg.Column(gpd.array.GeometryDtype()),
})

GeoDataFrameSchema

GeoDataFrameSchema accepts the same arguments as DataFrameSchema but coerces the validated result to a geopandas.GeoDataFrame when it would otherwise be a plain pandas.DataFrame. See also the GeoPandas API reference.

GeoDataFrameModel

Subclass GeoDataFrameModel instead of DataFrameModel when you want validate(), example(), and empty() to return a geopandas.GeoDataFrame even if the input is a plain pandas.DataFrame. Field definitions, checks, and Config behave the same as for DataFrameModel; only the return type is coerced, so the result can be passed straight to downstream geospatial workflows.

import pandas as pd
import pandera.geopandas as pg
from shapely.geometry import Polygon

from pandera.typing import Series
from pandera.typing.geopandas import GeoSeries


class GeoSchema(pg.GeoDataFrameModel):
    geometry: GeoSeries
    region: Series[str]

    class Config:
        coerce = True


gdf_in = pd.DataFrame(
    {
        "geometry": [
            Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
            Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0))),
        ],
        "region": ["NA", "SA"],
    }
)
validated = GeoSchema.validate(gdf_in)
type(validated).__name__
'GeoDataFrame'

Validate on initialization

Use the GeoDataFrame generic with either DataFrameModel or GeoDataFrameModel:

import pandera.geopandas as pg
from shapely.geometry import Polygon

from pandera.typing import Series
from pandera.typing.geopandas import GeoDataFrame, GeoSeries


class Schema(pg.DataFrameModel):
    geometry: GeoSeries
    region: Series[str]


df = GeoDataFrame[Schema](
    {
        'geometry': [
            Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
            Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0)))
        ],
        'region': ['NA', 'SA']
    }
)
df
                                  geometry  region
0      POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))      NA
1  POLYGON ((0 0, 0 -1, -1 -1, -1 0, 0 0))      SA