Reading Third-Party Schema (new)

new in 0.7.0

Pandera now accepts schemas from other data validation frameworks. This requires a pandera installation with the io extension; see the installation instructions for details.

Frictionless Data Schema

Note

Please see the Frictionless schema documentation for more information on this standard.

pandera.io.from_frictionless_schema(schema)[source]

Create a DataFrameSchema from either a frictionless JSON/YAML schema file saved on disk, or from a frictionless schema already loaded into memory.

Each field in the frictionless schema is converted to a pandera column specification by FrictionlessFieldParser, which maps the field's name, type and constraints to their pandera equivalents.

Parameters

schema (Union[str, Path, Dict, Schema]) – the frictionless schema object (or a string/Path to the location on disk of a schema specification) to parse.

Return type

DataFrameSchema

Returns

a DataFrameSchema in which the frictionless field specs have been converted to pandera column checks and constraints, ready to use like any other pandera schema.

Example

Here, we’re defining a very basic frictionless schema in memory before parsing it and then querying the resulting DataFrameSchema object as per any other Pandera schema:

>>> from pandera.io import from_frictionless_schema
>>>
>>> FRICTIONLESS_SCHEMA = {
...     "fields": [
...         {
...             "name": "column_1",
...             "type": "integer",
...             "constraints": {"minimum": 10, "maximum": 99}
...         },
...         {
...             "name": "column_2",
...             "type": "string",
...             "constraints": {"maxLength": 10, "pattern": r"\S+"}
...         },
...     ],
...     "primaryKey": "column_1"
... }
>>> schema = from_frictionless_schema(FRICTIONLESS_SCHEMA)
>>> schema.columns["column_1"].checks
[<Check in_range: in_range(10, 99)>]
>>> schema.columns["column_1"].required
True
>>> schema.columns["column_1"].unique
True
>>> schema.columns["column_2"].checks
[<Check str_length: str_length(None, 10)>, <Check str_matches: str_matches(re.compile('^\\S+$'))>]
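The function also accepts a path to a schema file on disk. The following is a minimal sketch of that route, assuming the io extension is installed; the file name schema.json and the helper load_from_disk are hypothetical names chosen for illustration, and the pandera import is deferred so the file-writing half still runs without the extension:

```python
import json
import tempfile
from pathlib import Path

# Same schema as in the in-memory example above.
FRICTIONLESS_SCHEMA = {
    "fields": [
        {"name": "column_1", "type": "integer",
         "constraints": {"minimum": 10, "maximum": 99}},
        {"name": "column_2", "type": "string",
         "constraints": {"maxLength": 10, "pattern": r"\S+"}},
    ],
    "primaryKey": "column_1",
}


def load_from_disk(path: str):
    """Hypothetical helper: parse a frictionless schema file with pandera."""
    # Imported lazily so the rest of the sketch runs without the io extension.
    from pandera.io import from_frictionless_schema

    return from_frictionless_schema(path)


with tempfile.TemporaryDirectory() as tmp_dir:
    schema_path = Path(tmp_dir) / "schema.json"  # hypothetical file name
    schema_path.write_text(json.dumps(FRICTIONLESS_SCHEMA, indent=2))
    saved = json.loads(schema_path.read_text())  # what the parser will read

    try:
        schema = load_from_disk(str(schema_path))
        print(sorted(schema.columns))
    except ImportError:
        print("pandera io extension not installed; skipping the load step")
```

Passing the path as a plain string keeps the sketch within the documented Union[str, Path, Dict, Schema] parameter type.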

Under the hood, from_frictionless_schema uses the FrictionlessFieldParser class to parse each frictionless field (column):

class pandera.io.FrictionlessFieldParser(field, primary_keys)[source]

Parses frictionless data schema field specifications so we can convert them to an equivalent Pandera Column schema.

For this implementation, we are using field names, constraints and types but leaving other frictionless parameters out (e.g. foreign keys, type formats, titles, descriptions).

Parameters
  • field – a field object from a frictionless schema.

  • primary_keys – the primary keys from a frictionless schema. These are used to ensure primary key fields are treated properly: no duplicates and no missing values.
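To illustrate how the primary_keys argument affects a field, here is a rough standalone sketch. It is not pandera's implementation: normalise_primary_keys and forced_key_settings are hypothetical helpers, and the only behaviour taken from this page is that a primary key field ends up with nullable False and unique True. A frictionless primaryKey may be a single name (as in the example above) or a list of names, so the sketch normalises both forms first:

```python
from typing import Dict, List, Optional, Union


def normalise_primary_keys(
    primary_key: Optional[Union[str, List[str]]]
) -> List[str]:
    """Hypothetical helper: return the primary key as a list of field names."""
    if primary_key is None:
        return []
    if isinstance(primary_key, str):
        return [primary_key]
    return list(primary_key)


def forced_key_settings(field_name: str, primary_keys: List[str]) -> Dict[str, bool]:
    """Settings a primary key forces on a field; empty for non-key fields."""
    if field_name in primary_keys:
        # Documented behaviour: no missing values and no duplicates.
        return {"nullable": False, "unique": True}
    return {}


keys = normalise_primary_keys("column_1")
key_settings = forced_key_settings("column_1", keys)
other_settings = forced_key_settings("column_2", keys)
print(keys, key_settings, other_settings)
```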

property checks: Optional[Dict]

Convert a set of frictionless schema field constraints into checks.

This parses the standard set of frictionless constraints (listed in the Frictionless schema documentation) and maps them to the equivalent pandera checks.

Return type

Optional[Dict]

Returns

a dictionary of pandera Check objects which capture the standard constraint logic of a frictionless schema field.
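The mapping can be pictured with a small standalone sketch. It does not use pandera and is not the parser's actual code: sketch_checks is a hypothetical function that covers only the three conversions visible in the example output above (minimum plus maximum to in_range, maxLength to str_length, and pattern to an anchored str_matches), and it returns plain Python values rather than Check objects:

```python
from typing import Any, Dict


def sketch_checks(constraints: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical mapping from frictionless constraints to check descriptions."""
    checks: Dict[str, Any] = {}

    # Both bounds present: a single range check, as in in_range(10, 99).
    if "minimum" in constraints and "maximum" in constraints:
        checks["in_range"] = {
            "min_value": constraints["minimum"],
            "max_value": constraints["maximum"],
        }

    # Either length bound present: the missing side stays None,
    # as in str_length(None, 10).
    if "minLength" in constraints or "maxLength" in constraints:
        checks["str_length"] = {
            "min_value": constraints.get("minLength"),
            "max_value": constraints.get("maxLength"),
        }

    # The pattern is anchored so that it must match the whole value,
    # as in the '^\S+$' shown in the example output.
    if "pattern" in constraints:
        checks["str_matches"] = "^" + constraints["pattern"] + "$"

    return checks


column_1_checks = sketch_checks({"minimum": 10, "maximum": 99})
column_2_checks = sketch_checks({"maxLength": 10, "pattern": r"\S+"})
print(column_1_checks)
print(column_2_checks)
```

Other frictionless constraints, and the case where only one of minimum or maximum is given, are deliberately left out because this page does not show how they are converted.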

property coerce: bool

Determine whether values within this field should be coerced.

This currently returns True for all fields within a frictionless schema.

Return type

bool

property dtype: str

Determine what type of field this is, so we can feed that into DataType. If no type is specified in the frictionless schema, we default to string values.

Return type

str

Returns

the pandas-compatible representation of this field type as a string.

property nullable: bool

Determine whether this field can contain missing values.

If a field is a primary key, this will return False.

Return type

bool

property regex: bool

Determine whether this field name should be used for regex matches.

This currently returns False for all fields within a frictionless schema.

Return type

bool

property required: bool

Determine whether this field must exist within the data.

This currently returns True for all fields within a frictionless schema.

Return type

bool

to_pandera_column()[source]

Export this field to a column spec dictionary.

Return type

Dict
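As a rough picture of what such a column spec might contain, the sketch below assembles a dictionary from the properties documented on this page. Everything about it is an assumption made for illustration: sketch_column_spec is a hypothetical function, the key names are not confirmed by this page, the dtype is left as the frictionless type name instead of being translated to a pandas-compatible string, and the handling of the frictionless required and unique constraints for non-key fields is a guess. Only the fixed values (coerce True, required True, regex False), the string default for a missing type, and the primary key rules come from the text above:

```python
from typing import Any, Dict, List


def sketch_column_spec(field: Dict[str, Any], primary_keys: List[str]) -> Dict[str, Any]:
    """Hypothetical stand-in for a column spec dictionary."""
    constraints = field.get("constraints", {})
    is_key = field["name"] in primary_keys
    return {
        "name": field["name"],
        # Documented default: fields without a type are treated as strings.
        "dtype": field.get("type", "string"),
        "coerce": True,    # documented: True for all fields
        "required": True,  # documented: True for all fields
        "regex": False,    # documented: False for all fields
        # Documented for primary keys; the non-key branch is an assumption.
        "nullable": False if is_key else not constraints.get("required", False),
        "unique": True if is_key else constraints.get("unique", False),
    }


key_spec = sketch_column_spec({"name": "column_1", "type": "integer"}, ["column_1"])
untyped_spec = sketch_column_spec({"name": "notes"}, ["column_1"])
print(key_spec)
print(untyped_spec)
```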

property unique: bool

Determine whether values in this field must be unique, i.e. whether duplicate values are disallowed.

If a field is a primary key, this will return True.

Return type

bool