Dropping Invalid Rows

New in version 0.16.0

To use the validation step to remove invalid data, pass drop_invalid_rows=True to the schema object when you create it. When schema.validate() runs and a data-level check fails, the rows that caused the failure are removed from the returned dataframe.

drop_invalid_rows prevents data-level schema errors from being raised and instead removes the rows that cause the failures.

This functionality is available on DataFrameSchema, SeriesSchema, and Column, as well as on DataFrameModel schemas.

Note that this functionality works by identifying the index or multi-index values of the failing rows. If the dataframe's index is not unique, rows that share an index value cannot be told apart, which could result in incorrect rows being dropped.
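The index caveat can be illustrated with plain pandas. The sketch below does not call pandera and is not its actual implementation; it only simulates a label-based drop under the assumption that failing rows are removed by matching index values. The check counter >= 3 and the variable names are illustrative.

```python
import pandas as pd

# Two rows share the index label 0; only the first one violates counter >= 3.
df = pd.DataFrame({"counter": [1, 5, 7]}, index=[0, 0, 1])

# Index labels of the rows that fail the (illustrative) check.
failing_labels = df.index[df["counter"] < 3]

# Dropping by label removes every row carrying a failing label,
# so the valid row with counter == 5 is lost as well.
dropped_by_label = df.loc[~df.index.isin(failing_labels)]
print(dropped_by_label["counter"].tolist())  # [7]

# With a unique index, each label identifies exactly one row,
# so only the failing row is removed.
unique_df = df.reset_index(drop=True)
failing_unique = unique_df.index[unique_df["counter"] < 3]
dropped_unique = unique_df.loc[~unique_df.index.isin(failing_unique)]
print(dropped_unique["counter"].tolist())  # [5, 7]
```

Checking df.index.is_unique before validating, and resetting the index when it is False, is one way to avoid the problem if the original index labels do not need to be preserved.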

Dropping invalid rows with DataFrameSchema:

import pandas as pd
import pandera as pa

from pandera import Check, Column, DataFrameSchema

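# Integer data, so the data-level check below is evaluated per row.
df = pd.DataFrame({"counter": [1, 2, 3]})
schema = DataFrameSchema(
    {"counter": Column(int, checks=[Check(lambda x: x >= 3)])},
    drop_invalid_rows=True,
)

# Rows 0 and 1 fail the check and are dropped from the returned dataframe.
schema.validate(df, lazy=True)

	counter
2	3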

Dropping invalid rows with SeriesSchema:

import pandas as pd
import pandera as pa

from pandera import Check, SeriesSchema

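# Integer data, so the data-level check below is evaluated per element.
series = pd.Series([1, 2, 3])
schema = SeriesSchema(
    int,
    checks=[Check(lambda x: x >= 3)],
    drop_invalid_rows=True,
)

# Elements at index 0 and 1 fail the check and are dropped.
schema.validate(series, lazy=True)

2    3
dtype: int64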

Dropping invalid rows with Column:

import pandas as pd
import pandera as pa

from pandera import Check, Column

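# Integer data, so the data-level check below is evaluated per row.
df = pd.DataFrame({"counter": [1, 2, 3]})
schema = Column(
    int,
    name="counter",
    drop_invalid_rows=True,
    checks=[Check(lambda x: x >= 3)]
)

# Rows 0 and 1 fail the check and are dropped from the returned dataframe.
schema.validate(df, lazy=True)

	counter
2	3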

Dropping invalid rows with DataFrameModel:

import pandas as pd
import pandera as pa

from pandera import DataFrameModel, Field

class MySchema(DataFrameModel):
    counter: int = Field(in_range={"min_value": 3, "max_value": 5})

    class Config:
        drop_invalid_rows = True


MySchema.validate(
    pd.DataFrame({"counter": [1, 2, 3, 4, 5, 6]}), lazy=True
)

	counter
2	3
3	4
4	5

Note

To use drop_invalid_rows=True, lazy=True must be passed to schema.validate(). Lazy validation collects all schema errors together instead of stopping at the first one, which means all invalid rows can be dropped together. This provides a clear API for ensuring that the validated dataframe contains only valid data.