Currently, every file generated or ingested by the Data team is validated against its JSON schema. We use the internally developed opentargets_validator package, which in turn depends on the popular jsonschema library.
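
For reference, the underlying check essentially boils down to a jsonschema call along these lines (file names and the one-document-per-line layout are illustrative, not the exact opentargets_validator implementation):

```python
import json

from jsonschema import ValidationError, validate

# Illustrative file names.
with open("target.schema.json") as schema_file:
    schema = json.load(schema_file)

with open("targets.json") as data_file:
    for line in data_file:  # assuming one JSON document per line
        try:
            validate(instance=json.loads(line), schema=schema)
        except ValidationError as err:
            print(err.message)
```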
The issue with this approach is that it limits the output of our pipelines to JSON, whereas downstream pipelines (ETL) would likely be easier to build if we used, for example, Parquet or a similar format.
The goal of this issue is to find a way of validating the data directly inside PySpark, while still using the JSON schema as the single source of truth (at least for now).
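
One possible approach, sketched below as a starting point rather than a decision: keep the JSON schema file as-is and run the jsonschema validator inside Spark, for example over each partition of the DataFrame. The paths, the schema draft, and the validate_partition helper are illustrative assumptions:

```python
import json

from jsonschema import Draft7Validator
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the JSON schema (path is illustrative) and broadcast it to the executors.
with open("target.schema.json") as schema_file:
    schema = json.load(schema_file)
schema_bc = spark.sparkContext.broadcast(schema)

def validate_partition(rows):
    """Validate every row in a partition and yield human-readable error messages."""
    validator = Draft7Validator(schema_bc.value)
    for row in rows:
        record = row.asDict(recursive=True)
        for error in validator.iter_errors(record):
            yield f"{list(error.absolute_path)}: {error.message}"

# Read the pipeline output; Parquet here, but any Spark-readable format would work.
df = spark.read.parquet("output/targets")

# Collect a sample of validation errors for inspection.
errors = df.rdd.mapPartitions(validate_partition)
for message in errors.take(20):
    print(message)
```

This would let the pipelines write Parquet (or any other Spark-supported format) while the existing JSON schema still defines what a valid record looks like; the trade-off is that the schema is only enforced at validation time rather than by the file format itself.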