
Investigate validation against JSON schema directly from (Py)Spark #2722

@tskir

Description

Currently, we validate every file generated or ingested by the Data team against the JSON schema. We use the internally developed opentargets_validator package, which in turn depends on the popular jsonschema library.
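For reference, the check that opentargets_validator ultimately delegates to jsonschema looks roughly like the sketch below. The file names and the Draft-7 validator choice are illustrative assumptions, not the library's exact internals:

```python
# Rough sketch of plain jsonschema validation over a JSON Lines file.
# File names and the Draft-7 choice are illustrative assumptions.
import json

from jsonschema import Draft7Validator

with open("evidence.schema.json") as fh:
    validator = Draft7Validator(json.load(fh))

# One JSON record per line; report every violation with its line number.
with open("evidence.json") as fh:
    for line_number, line in enumerate(fh, start=1):
        for error in validator.iter_errors(json.loads(line)):
            print(f"line {line_number}: {error.message}")
```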

The problem with this approach is that it locks the output of our pipelines to JSON, whereas downstream pipelines (ETL) would potentially be better served by Parquet or another similar format.

The goal of this issue is to find a way of validating the data directly inside PySpark, while still using the JSON schema as the single source of truth (at least for now).
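A minimal sketch of one possible approach, not a settled design: run jsonschema validation row by row inside a PySpark UDF, serialising each row to JSON only transiently for the check. The schema file, input path, and column handling below are hypothetical placeholders:

```python
# Sketch: validate a Spark DataFrame against a JSON schema via a UDF.
# Paths, column names, and the Draft-7 choice are hypothetical assumptions.
import json

from jsonschema import Draft7Validator
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# The schema is a plain dict, so it ships to executors in the UDF closure.
with open("evidence.schema.json") as fh:  # hypothetical schema file
    schema_dict = json.load(fh)

@F.udf(returnType=T.ArrayType(T.StringType()))
def validation_errors(record_json: str):
    """Return all schema violations for one record (empty list if valid)."""
    # A production version would cache the compiled validator per executor
    # instead of rebuilding it for every row.
    validator = Draft7Validator(schema_dict)
    return [e.message for e in validator.iter_errors(json.loads(record_json))]

df = spark.read.parquet("evidence.parquet")  # hypothetical Parquet input

invalid = (
    df.withColumn("errors", validation_errors(F.to_json(F.struct(*df.columns))))
      .filter(F.size("errors") > 0)
)
invalid.select("errors").show(truncate=False)
```

This would keep the JSON schema as the single source of truth while letting the pipelines read and write Parquet, since the JSON representation exists only inside the UDF for the duration of the check.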

Labels: Data (Relates to Open Targets data team)
