Data Validation with GeoPandas
new in 0.9.0
GeoPandas is an extension of Pandas that adds
support for geospatial data. You can use pandera to validate geopandas.GeoDataFrame
and geopandas.GeoSeries objects directly.
Usage
Install pandera with the geopandas extra:
pip install 'pandera[geopandas]'
Import pandera.geopandas as a single entry point: it includes everything
from pandera.pandas (Column, Check, Field, DataFrameModel,
dtypes, decorators, etc.) plus GeoDataFrameSchema and GeoDataFrameModel.
import pandera.geopandas as pg
For object-based validation, use GeoDataFrameSchema so that
GeoDataFrameSchema.validate() returns a geopandas.GeoDataFrame.
For the class-based API, subclass GeoDataFrameModel, or use
DataFrameModel when a plain pandas.DataFrame return type is enough.
import geopandas as gpd
import pandera.geopandas as pg
from shapely.geometry import Polygon
geo_schema = pg.GeoDataFrameSchema({
    "geometry": pg.Column("geometry"),
    "region": pg.Column(str),
})
geo_df = gpd.GeoDataFrame({
    "geometry": [
        Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
        Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0)))
    ],
    "region": ["NA", "SA"]
})
geo_schema.validate(geo_df)
| | geometry | region |
|---|---|---|
| 0 | POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0)) | NA |
| 1 | POLYGON ((0 0, 0 -1, -1 -1, -1 0, 0 0)) | SA |
You can also use the GeometryDtype data type in either instantiated or
un-instantiated form:
import geopandas as gpd
import pandera.geopandas as pg
# Use ``GeometryDtype`` or ``GeometryDtype()`` interchangeably here.
geo_schema = pg.DataFrameSchema({
    "geometry": pg.Column(gpd.array.GeometryDtype()),
})
GeoDataFrameSchema
GeoDataFrameSchema accepts the same arguments as
DataFrameSchema but coerces the validated
result to a geopandas.GeoDataFrame when it would otherwise be a plain
pandas.DataFrame. See also the geopandas section of the API reference.
GeoDataFrameModel
Subclass GeoDataFrameModel instead of DataFrameModel when you want
validate() (as well as example() and empty()) to return a
geopandas.GeoDataFrame even if the input was a plain
pandas.DataFrame. Field definitions, checks, and Config behave the same
as for DataFrameModel; only the return type is coerced for downstream
geospatial workflows.
import pandas as pd
import pandera.geopandas as pg
from shapely.geometry import Polygon
from pandera.typing import Series
from pandera.typing.geopandas import GeoSeries
class GeoSchema(pg.GeoDataFrameModel):
    geometry: GeoSeries
    region: Series[str]

    class Config:
        coerce = True

gdf_in = pd.DataFrame(
    {
        "geometry": [
            Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
            Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0))),
        ],
        "region": ["NA", "SA"],
    }
)
validated = GeoSchema.validate(gdf_in)
type(validated).__name__
'GeoDataFrame'
Validate on initialization
Use the GeoDataFrame generic with either
DataFrameModel or GeoDataFrameModel:
import pandera.geopandas as pg
from shapely.geometry import Polygon
from pandera.typing import Series
from pandera.typing.geopandas import GeoDataFrame, GeoSeries
class Schema(pg.DataFrameModel):
    geometry: GeoSeries
    region: Series[str]

df = GeoDataFrame[Schema](
    {
        'geometry': [
            Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
            Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0)))
        ],
        'region': ['NA','SA']
    }
)
df
| | geometry | region |
|---|---|---|
| 0 | POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0)) | NA |
| 1 | POLYGON ((0 0, 0 -1, -1 -1, -1 0, 0 0)) | SA |