Compare schema against Hugging Face MMU datasets #47

@CloseChoice

To check that the schema produced by our transformations is really identical to the published one, we could load a single row from the corresponding Hugging Face MMU dataset and compare its schema against ours.

We should keep a single mapping of catalog names <-> Hugging Face URLs in Python, then import it in compare.py and check the schemas.
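A minimal sketch of what such a mapping could look like. The module layout, the names `CATALOG_TO_HF_REPO`, `hf_repo_for` and `hf_url_for`, and the idea of storing repo ids rather than full URLs are all assumptions for illustration; only the `manga` entry comes from the link in this issue.

```python
# Hypothetical shared module that compare.py could import from.
HF_BASE_URL = "https://huggingface.co/datasets"

# Catalog name -> Hugging Face dataset repo id.
# Only "manga" is taken from this issue; other catalogs still need to be added.
CATALOG_TO_HF_REPO = {
    "manga": "MultimodalUniverse/manga",
}


def hf_repo_for(catalog):
    """Return the Hugging Face repo id registered for a catalog name."""
    try:
        return CATALOG_TO_HF_REPO[catalog]
    except KeyError:
        raise KeyError(
            f"No Hugging Face dataset registered for catalog {catalog!r}"
        ) from None


def hf_url_for(catalog):
    """Return the full Hugging Face dataset URL for a catalog name."""
    return f"{HF_BASE_URL}/{hf_repo_for(catalog)}"
```

Keeping repo ids in the mapping and deriving the URL would let the same entry be used both for links and for loading the dataset by id, and an unregistered catalog fails loudly instead of being skipped silently.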

Here's a link to one of the Hugging Face MMU datasets:
https://huggingface.co/datasets/MultimodalUniverse/manga

Note that the transformer class also has a method that defines the schema, e.g. https://github.com/UniverseTBD/mmu-hdf-to-hats/blob/main/catalog_functions/btsbot_transformer.py#L114. So one could either load the generated files in the verify.py pipeline, or load the transformer class and call its create_schema method, and then compare the resulting schema against the one received from Hugging Face.
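The comparison step itself could be sketched roughly as below. This assumes both schemas can be reduced to a column name -> type string mapping; `schema_to_dict` and `diff_schemas` are hypothetical helper names, not existing functions in the repo. `schema_to_dict` only relies on each field exposing `.name` and `.type`, which is how fields of a pyarrow schema behave, so it should work on the output of create_schema if that returns a pyarrow schema (an assumption). On the Hugging Face side, something like `datasets.load_dataset(repo_id, streaming=True)` and the dataset's `features` would probably be the starting point, but that part is left out here because it needs network access.

```python
def schema_to_dict(fields):
    """Reduce an iterable of fields (objects with .name and .type) to {name: type string}."""
    return {field.name: str(field.type) for field in fields}


def diff_schemas(expected, actual):
    """Return a list of human-readable differences between two {name: type} mappings.

    `expected` is the reference schema (e.g. from Hugging Face),
    `actual` is the schema produced by the transformer.
    An empty list means the schemas match.
    """
    problems = []
    # Columns present in the reference but not in our output.
    for name in sorted(expected.keys() - actual.keys()):
        problems.append(f"missing column: {name}")
    # Columns we produce that the reference does not have.
    for name in sorted(actual.keys() - expected.keys()):
        problems.append(f"unexpected column: {name}")
    # Columns present on both sides but with different types.
    for name in sorted(expected.keys() & actual.keys()):
        if expected[name] != actual[name]:
            problems.append(
                f"type mismatch for {name}: expected {expected[name]}, got {actual[name]}"
            )
    return problems
```

Returning a list of differences rather than a bare boolean means a failing check can print exactly which columns diverge, which is easier to act on than a plain "schemas differ".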

Making sure that this check runs in CI is also part of this issue.
