DataFrame Schemas

The DataFrameSchema class enables the specification of a schema that verifies the columns and index of a pandas DataFrame object.

The DataFrameSchema object consists of Columns and an Index.

import pandera as pa

from pandera import Column, DataFrameSchema, Check, Index

schema = DataFrameSchema(
    {
        "column1": Column(pa.Int),
        "column2": Column(pa.Float, Check(lambda s: s < -1.2)),
        # you can provide a list of validators
        "column3": Column(pa.String, [
           Check(lambda s: s.str.startswith("value")),
           Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
        ]),
    },
    index=Index(pa.Int),
    strict=True,
    coerce=True,
)

Column Validation

A Column specifies the properties of a column in a dataframe object. It can optionally be verified for its data type, null values, or duplicate values. The column can be coerced into the specified type, and the required parameter controls whether the column is allowed to be missing.
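
For example, a minimal sketch of a single column definition combining these options (nullable, allow_duplicates, coerce, and required are keyword arguments of the Column constructor):

import pandera as pa

from pandera import Check, Column

column_schema = Column(
    pa.Float,
    Check(lambda s: s >= 0),
    nullable=True,          # allow null values
    allow_duplicates=True,  # allow repeated values
    coerce=True,            # cast the column to float before checks run
    required=False,         # the column may be absent from the dataframe
)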

Column checks allow the column's values to be checked against a user-provided function. Check objects also support grouping by a different column, so that you can make assertions about subsets of the column of interest.
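
A grouped Check receives a dict that maps each group to the corresponding subset of the column being validated. A minimal sketch, assuming a hypothetical boolean grouping column age_less_than_20:

import pandas as pd
import pandera as pa

from pandera import Check, Column, DataFrameSchema

schema = DataFrameSchema({
    "height_in_feet": Column(pa.Float, Check(
        # the check function receives {group_value: column_subset}
        lambda g: g[False].mean() > g[True].mean(),
        groupby="age_less_than_20",
    )),
    "age_less_than_20": Column(pa.Bool),
})

df = pd.DataFrame({
    "height_in_feet": [6.5, 7.0, 6.1, 5.1, 4.0],
    "age_less_than_20": [False, False, False, True, True],
})

schema.validate(df)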

Column Hypotheses enable you to perform statistical hypothesis tests on a DataFrame in either wide or tidy format. See Hypothesis Testing for more details.
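
As a minimal sketch (requires scipy; the column and group names here are hypothetical), a two-sample t-test asserting that the "M" group mean is greater than the "F" group mean:

import pandera as pa

from pandera import Column, DataFrameSchema, Hypothesis

schema = DataFrameSchema({
    "height_in_feet": Column(pa.Float, Hypothesis.two_sample_ttest(
        sample1="M",
        sample2="F",
        groupby="sex",
        relationship="greater_than",
        alpha=0.05,
        equal_var=True,
    )),
    "sex": Column(pa.String),
})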

Null Values in Columns

By default, SeriesSchema/Column objects assume that values are not nullable. In order to accept null values, you need to explicitly specify nullable=True, or else you’ll get an error.

import numpy as np
import pandas as pd
import pandera as pa

from pandera import Check, Column, DataFrameSchema

df = pd.DataFrame({"column1": [5, 1, np.nan]})

non_null_schema = DataFrameSchema({
    "column1": Column(pa.Int, Check(lambda x: x > 0))
})

non_null_schema.validate(df)
Traceback (most recent call last):
...
SchemaError: non-nullable series contains null values: {2: nan}

Note

Due to a known limitation in pandas prior to version 0.24.0, integer arrays cannot contain NaN values, so this schema will return a DataFrame where column1 is of type float. PandasDtype does not currently support the nullable integer array type, but you can still use the "Int64" string alias for nullable integer arrays.

null_schema = DataFrameSchema({
    "column1": Column(pa.Int, Check(lambda x: x > 0), nullable=True)
})

print(null_schema.validate(df))
   column1
0      5.0
1      1.0
2      NaN
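
If you need a true nullable integer column rather than a float fallback, a sketch using the "Int64" string alias from the note above (assumes pandas >= 0.24.0):

df_nullable_int = pd.DataFrame(
    {"column1": pd.array([5, 1, None], dtype="Int64")}
)

null_int_schema = DataFrameSchema({
    "column1": Column("Int64", Check(lambda x: x > 0), nullable=True)
})

null_int_schema.validate(df_nullable_int)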

Coercing Types on Columns

If you specify Column(dtype, ..., coerce=True) as part of the DataFrameSchema definition, calling schema.validate will first coerce the column into the specified dtype before applying validation checks.

import numpy as np
import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema

df = pd.DataFrame({"column1": [1, 2, 3]})
schema = DataFrameSchema({"column1": Column(pa.String, coerce=True)})

validated_df = schema.validate(df)
assert isinstance(validated_df.column1.iloc[0], str)

Note

Note the special case of integer columns, which do not support NaN values. In this case, schema.validate will raise a SchemaError if coerce=True and the column contains null values.

df = pd.DataFrame({"column1": [1., 2., 3, np.nan]})
schema = DataFrameSchema({
    "column1": Column(pa.Int, coerce=True, nullable=True)
})

validated_df = schema.validate(df)
Traceback (most recent call last):
...
pandera.errors.SchemaError: Error while coercing 'column1' to type int64: Cannot convert non-finite values (NA or inf) to integer

The best way to handle this case is to simply specify the column as a Float or Object.

schema_object = DataFrameSchema({
    "column1": Column(pa.Object, coerce=True, nullable=True)
})
schema_float = DataFrameSchema({
    "column1": Column(pa.Float, coerce=True, nullable=True)
})

print(schema_object.validate(df).dtypes)
print(schema_float.validate(df).dtypes)
column1    object
dtype: object
column1    float64
dtype: object

If you want to coerce all of the columns specified in the DataFrameSchema, you can specify the coerce argument with DataFrameSchema(..., coerce=True).
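
For example, a minimal sketch where both columns are coerced by the schema-level flag:

df = pd.DataFrame({"column1": [1, 2, 3], "column2": ["1.0", "2.5", "3.0"]})

schema = DataFrameSchema(
    {
        "column1": Column(pa.String),
        "column2": Column(pa.Float),
    },
    # coerce every column in the schema before running validation checks
    coerce=True,
)

validated_df = schema.validate(df)
assert isinstance(validated_df.column1.iloc[0], str)
assert validated_df.column2.dtype == "float64"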

Required Columns

By default all columns specified in the schema are required, meaning that if a column is missing in the input DataFrame an exception will be thrown. If you want to make a column optional, specify required=False in the column constructor:

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema

df = pd.DataFrame({"column2": ["hello", "pandera"]})
schema = DataFrameSchema({
    "column1": Column(pa.Int, required=False),
    "column2": Column(pa.String)
})

validated_df = schema.validate(df)
print(validated_df)
   column2
0    hello
1  pandera

Since required=True by default, missing columns would raise an error:

schema = DataFrameSchema({
    "column1": Column(pa.Int),
    "column2": Column(pa.String),
})

schema.validate(df)
Traceback (most recent call last):
...
pandera.SchemaError: column 'column1' not in dataframe
   column2
0    hello
1  pandera

Stand-alone Column Validation

In addition to being used in the context of a DataFrameSchema, Column objects can also be used to validate columns in a dataframe on their own:

import pandas as pd
import pandera as pa

df = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": ["a", "b", "c"],
})

column1_schema = pa.Column(pa.Int, name="column1")
column2_schema = pa.Column(pa.String, name="column2")

# pass the dataframe as an argument to the Column object callable
df = column1_schema(df)
validated_df = column2_schema(df)

# or explicitly use the validate method
df = column1_schema.validate(df)
validated_df = column2_schema.validate(df)

# use the DataFrame.pipe method to validate two columns
validated_df = df.pipe(column1_schema).pipe(column2_schema)

For multi-column use cases, the DataFrameSchema is still recommended, but if you have one or a small number of columns to verify, using Column objects by themselves is appropriate.

Column Regex Pattern Matching

If your dataframe has multiple columns that share common statistical properties, you might want to specify a regex pattern that matches a set of meaningfully grouped columns with str names.

import numpy as np
import pandas as pd
import pandera as pa

categories = ["A", "B", "C"]

np.random.seed(100)

dataframe = pd.DataFrame({
    "cat_var_1": np.random.choice(categories, size=100),
    "cat_var_2": np.random.choice(categories, size=100),
    "num_var_1": np.random.uniform(0, 10, size=100),
    "num_var_2": np.random.uniform(20, 30, size=100),
})

schema = pa.DataFrameSchema({
    "num_var_*": pa.Column(
        pa.Float,
        checks=pa.Check.greater_than_or_equal_to(0),
        regex=True,
    ),
    "cat_var_*": pa.Column(
        pa.Category,
        checks=pa.Check.isin(categories),
        coerce=True,
        regex=True,
    ),
})

print(schema.validate(dataframe).head())
  cat_var_1 cat_var_2  num_var_1  num_var_2
0         A         A   6.804147  24.743304
1         A         C   3.684308  22.774633
2         A         C   5.911288  28.416588
3         C         A   4.790627  21.951250
4         C         B   4.504166  28.563142

You can also regex pattern match on pd.MultiIndex columns:

np.random.seed(100)

dataframe = pd.DataFrame({
    ("cat_var_1", "y1"): np.random.choice(categories, size=100),
    ("cat_var_2", "y2"): np.random.choice(categories, size=100),
    ("num_var_1", "x1"): np.random.uniform(0, 10, size=100),
    ("num_var_2", "x2"): np.random.uniform(0, 10, size=100),
})

schema = pa.DataFrameSchema({
    ("num_var_*", "x*"): pa.Column(
        pa.Float,
        checks=pa.Check.greater_than_or_equal_to(0),
        regex=True,
    ),
    ("cat_var_*", "y*"): pa.Column(
        pa.Category,
        checks=pa.Check.isin(categories),
        coerce=True,
        regex=True,
    ),
})

print(schema.validate(dataframe).head())
  cat_var_1 cat_var_2 num_var_1 num_var_2
         y1        y2        x1        x2
0         A         A  6.804147  4.743304
1         A         C  3.684308  2.774633
2         A         C  5.911288  8.416588
3         C         A  4.790627  1.951250
4         C         B  4.504166  8.563142

Handling Dataframe Columns not in the Schema

By default, columns that aren’t specified in the schema aren’t checked. If you want to check that the DataFrame only contains columns in the schema, specify strict=True:

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema

schema = DataFrameSchema(
    {"column1": Column(pa.Int)},
    strict=True)

df = pd.DataFrame({"column2": [1, 2, 3]})

schema.validate(df)
Traceback (most recent call last):
...
SchemaError: column 'column2' not in DataFrameSchema {'column1': <Schema Column: 'None' type=int>}

Validating the order of the columns

For some applications the order of the columns is important. For example:

  • If you want to use selection by position instead of the more common selection by label.

  • Machine learning: Many ML libraries will cast a DataFrame to numpy arrays, for which order becomes crucial.

To validate the order of the Dataframe columns, specify ordered=True:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    columns={"a": pa.Column(pa.Int), "b": pa.Column(pa.Int)}, ordered=True
)
df = pd.DataFrame({"b": [1], "a": [1]})
schema.validate(df)
Traceback (most recent call last):
...
SchemaError: column 'b' out-of-order
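
A DataFrame whose columns appear in the declared order validates as expected:

df_ordered = pd.DataFrame({"a": [1], "b": [1]})
print(schema.validate(df_ordered))
   a  b
0  1  1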

Index Validation

You can also specify an Index in the DataFrameSchema.

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema, Index, Check

schema = DataFrameSchema(
   columns={"a": Column(pa.Int)},
   index=Index(
       pa.String,
       Check(lambda x: x.str.startswith("index_"))))

df = pd.DataFrame(
    data={"a": [1, 2, 3]},
    index=["index_1", "index_2", "index_3"])

print(schema.validate(df))
         a
index_1  1
index_2  2
index_3  3

In the case that the DataFrame index doesn't pass the Check:

df = pd.DataFrame(
    data={"a": [1, 2, 3]},
    index=["foo1", "foo2", "foo3"])

schema.validate(df)
Traceback (most recent call last):
...
SchemaError: <Schema Index> failed element-wise validator 0:
<lambda>
failure cases:
             index  count
failure_case
foo1           [0]      1
foo2           [1]      1
foo3           [2]      1

MultiIndex Validation

pandera also supports multi-index column and index validation.

MultiIndex Columns

Specifying multi-index columns follows the pandas syntax of specifying tuples for each level in the index hierarchy:

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema, Index

schema = DataFrameSchema({
    ("foo", "bar"): Column(pa.Int),
    ("foo", "baz"): Column(pa.String)
})

df = pd.DataFrame({
    ("foo", "bar"): [1, 2, 3],
    ("foo", "baz"): ["a", "b", "c"],
})

print(schema.validate(df))
  foo
  bar baz
0   1   a
1   2   b
2   3   c

MultiIndex Indexes

The MultiIndex class allows you to define multi-index indexes by composing a list of pandera.Index objects.

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema, Index, MultiIndex, Check

schema = DataFrameSchema(
    columns={"column1": Column(pa.Int)},
    index=MultiIndex([
        Index(pa.String,
              Check(lambda s: s.isin(["foo", "bar"])),
              name="index0"),
        Index(pa.Int, name="index1"),
    ])
)

df = pd.DataFrame(
    data={"column1": [1, 2, 3]},
    index=pd.MultiIndex.from_arrays(
        [["foo", "bar", "foo"], [0, 1,2 ]],
        names=["index0", "index1"]
    )
)

print(schema.validate(df))
               column1
index0 index1
foo    0             1
bar    1             2
foo    2             3

Get Pandas Datatypes

Pandas provides a dtype parameter for casting a dataframe to a specific dtype schema. DataFrameSchema provides a dtype property which returns a pandas-style dict, where the keys are column names and the values are their dtypes.

This dict can be passed directly to pandas methods such as DataFrame.astype:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    columns={
      "column1": pa.Column(pa.Int),
      "column2": pa.Column(pa.Category),
      "column3": pa.Column(pa.Bool)
    },
)

df = pd.DataFrame.from_dict(
  {
      "a": {"column1": 1, "column2": "valueA", "column3": True},
      "b": {"column1": 1, "column2": "valueB", "column3": True},
  },
  orient="index"
).astype(schema.dtype).sort_index(axis=1)

print(schema.validate(df))
   column1 column2  column3
a        1  valueA     True
b        1  valueB     True

DataFrameSchema Transformations

Once you've defined a schema, you can then make modifications to it, both at the schema level, such as adding or removing columns and setting or resetting the index, and at the column level, such as changing the data type or checks.

This is useful for re-using schema objects in a data pipeline when additional computation has been done on a dataframe, where the column objects may have changed or perhaps where additional checks may be required.

import pandas as pd
import pandera as pa

data = pd.DataFrame({"col1": range(1, 6)})

schema = pa.DataFrameSchema(
    columns={"col1": pa.Column(pa.Int, pa.Check(lambda s: s >= 0))},
    strict=True)

transformed_schema = schema.add_columns({
    "col2": pa.Column(pa.String, pa.Check(lambda s: s == "value")),
    "col3": pa.Column(pa.Float, pa.Check(lambda x: x == 0.0)),
})

# validate original data
data = schema.validate(data)

# transformation
transformed_data = data.assign(col2="value", col3=0.0)

# validate transformed data
print(transformed_schema.validate(transformed_data))
   col1   col2  col3
0     1  value   0.0
1     2  value   0.0
2     3  value   0.0
3     4  value   0.0
4     5  value   0.0

Similarly, if you want dropped columns to be explicitly validated in a data pipeline:

import pandera as pa

schema = pa.DataFrameSchema(
    columns={
        "col1": pa.Column(pa.Int, pa.Check(lambda s: s >= 0)),
        "col2": pa.Column(pa.String, pa.Check(lambda x: x <= 0)),
        "col3": pa.Column(pa.Object, pa.Check(lambda x: x == 0)),
    },
    strict=True,
)

new_schema = schema.remove_columns(["col2", "col3"])
print(new_schema)
DataFrameSchema(
    columns={
        "col1": "<Schema Column: 'col1' type=int>"
    },
    checks=[],
    index=None,
    coerce=False,
    strict=True
)

If during the course of a data pipeline one of your columns is moved into the index, you can simply update the initial input schema using the set_index() method to create a schema for the pipeline output.

import pandera as pa

from pandera import Column, DataFrameSchema, Check, Index

schema = DataFrameSchema(
    {
        "column1": Column(pa.Int),
        "column2": Column(pa.Float)
    },
    index=Index(pa.Int, name="column3"),
    strict=True,
    coerce=True,
)
print(schema.set_index(["column1"], append=True))
DataFrameSchema(
    columns={
        "column2": "<Schema Column: 'column2' type=float>"
    },
    checks=[],
    index=MultiIndex(
    columns={
        "column3": "<Schema Column: 'column3' type=int>",
        "column1": "<Schema Column: 'column1' type=int>"
    },
    checks=[],
    index=None,
    coerce=False,
    strict=False
),
    coerce=True,
    strict=True
)

The available methods for altering the schema are: add_columns(), remove_columns(), update_columns(), rename_columns(), set_index(), and reset_index().
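
For example, a sketch of the two methods not demonstrated above. The exact keyword names accepted by update_columns() are an assumption here, mirroring the Column constructor:

import pandera as pa

schema = pa.DataFrameSchema({"col1": pa.Column(pa.Int)})

# rename a column in the schema; returns a new, modified schema object
renamed_schema = schema.rename_columns({"col1": "column1"})

# override column properties, e.g. the data type (assumes the update dict
# maps column names to Column constructor keyword arguments)
updated_schema = schema.update_columns({"col1": {"pandas_dtype": pa.Float}})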