To check that the schema is really identical to what the transformations produce, we could load one row from the Hugging Face MMU and compare against it.
We should keep a single mapping of catalog names <-> Hugging Face URLs in Python, use it in `compare.py`, and check the schemas there.
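A minimal sketch of such a mapping, as an assumption of how it could look — only the `manga` entry comes from this issue; the mapping name and helper function are hypothetical:

```python
# Hypothetical mapping of catalog name -> Hugging Face dataset URL.
# Only the "manga" entry is taken from this issue; further entries
# would be added as catalogs are transformed.
CATALOG_TO_HF_URL = {
    "manga": "https://huggingface.co/datasets/MultimodalUniverse/manga",
}


def hf_url_for(catalog: str) -> str:
    """Look up the Hugging Face URL for a catalog, failing loudly if unknown."""
    try:
        return CATALOG_TO_HF_URL[catalog]
    except KeyError:
        raise ValueError(f"No Hugging Face URL registered for catalog {catalog!r}")
```

Keeping this in one place means `compare.py` (and any CI job) can iterate over the mapping instead of hard-coding URLs.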
Here's a link to the Hugging Face MMU:
https://huggingface.co/datasets/MultimodalUniverse/manga
Note that the transformer class also has a method that defines the schema, e.g. https://github.com/UniverseTBD/mmu-hdf-to-hats/blob/main/catalog_functions/btsbot_transformer.py#L114. So one could either load the generated files in the `verify.py` pipeline, or load the transformer class and call its `create_schema` method, and compare the resulting schema against what one receives from Hugging Face.
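One way to sketch the comparison itself, assuming both schemas can be reduced to a plain `{column name: type string}` mapping (the function name and this representation are assumptions, not part of the existing code — with real files one would more likely compare `pyarrow` schemas directly):

```python
def diff_schemas(generated: dict, reference: dict) -> list:
    """Return human-readable differences between two {column: dtype} mappings.

    An empty result means the schemas match.
    """
    problems = []
    for col in sorted(generated.keys() - reference.keys()):
        problems.append(f"extra column in generated schema: {col}")
    for col in sorted(reference.keys() - generated.keys()):
        problems.append(f"missing column in generated schema: {col}")
    for col in sorted(generated.keys() & reference.keys()):
        if generated[col] != reference[col]:
            problems.append(
                f"type mismatch for {col}: {generated[col]} != {reference[col]}"
            )
    return problems


# Illustrative usage with made-up columns:
assert diff_schemas({"ra": "float64"}, {"ra": "float64"}) == []
assert diff_schemas({"ra": "float32"}, {"ra": "float64"}) != []
```

Returning a list of differences rather than a bare boolean makes CI failures much easier to debug, since the log states exactly which columns diverge.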
Making sure that this check runs in CI is also part of this issue.
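A hedged sketch of how the CI hookup could look as a GitHub Actions workflow — the workflow name, trigger, and invocation are all assumptions about this repo's setup:

```yaml
# Hypothetical workflow; the script entry point and dependency file are assumptions.
name: schema-check
on: [push, pull_request]
jobs:
  compare-schemas:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python compare.py  # compare generated schemas against the Hugging Face MMU
```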