Currently, every file generated or ingested by the Data team is validated against its JSON schema. We use the internally developed opentargets_validator package, which in turn depends on the popular jsonschema library.
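
For reference, the underlying check essentially boils down to a jsonschema call along these lines (file names and the one-document-per-line layout are illustrative, not the exact opentargets_validator implementation):

```python
import json

from jsonschema import ValidationError, validate

# Illustrative file names.
with open("target.schema.json") as schema_file:
    schema = json.load(schema_file)

with open("targets.json") as data_file:
    for line in data_file:  # assuming one JSON document per line
        try:
            validate(instance=json.loads(line), schema=schema)
        except ValidationError as err:
            print(err.message)
```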
The issue with this approach is that it limits the output of our pipelines to JSON, whereas downstream pipelines (ETL) would likely be easier to build if we used, for example, Parquet or a similar format.
The goal of this issue is to find a way of validating the data directly inside PySpark, while still using the JSON schema as the single source of truth (at least for now).
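
One possible approach, sketched below as a starting point rather than a decision: keep the JSON schema file as-is and run the jsonschema validator inside Spark, for example over each partition of the DataFrame. The paths, the schema draft, and the validate_partition helper are illustrative assumptions:

```python
import json

from jsonschema import Draft7Validator
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the JSON schema (path is illustrative) and broadcast it to the executors.
with open("target.schema.json") as schema_file:
    schema = json.load(schema_file)
schema_bc = spark.sparkContext.broadcast(schema)

def validate_partition(rows):
    """Validate every row in a partition and yield human-readable error messages."""
    validator = Draft7Validator(schema_bc.value)
    for row in rows:
        record = row.asDict(recursive=True)
        for error in validator.iter_errors(record):
            yield f"{list(error.absolute_path)}: {error.message}"

# Read the pipeline output; Parquet here, but any Spark-readable format would work.
df = spark.read.parquet("output/targets")

# Collect a sample of validation errors for inspection.
errors = df.rdd.mapPartitions(validate_partition)
for message in errors.take(20):
    print(message)
```

This would let the pipelines write Parquet (or any other Spark-supported format) while the existing JSON schema still defines what a valid record looks like; the trade-off is that the schema is only enforced at validation time rather than by the file format itself.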