Until we specify splits explicitly, datasets.load_dataset will mix the "main" and "margin" catalog data, which causes a small amount of data duplication. We should specify README metadata, something like:
---
configs:
- config_name: default
  data_dir: "mmu_vipers_w4/dataset/"
- config_name: margin_10arcs
  data_dir: "mmu_vipers_w4_10arcs/dataset/"
---
https://huggingface.co/docs/hub/datasets-manual-configuration
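
With metadata like the above in place, each catalog could be loaded on its own by passing the config name to load_dataset. A rough sketch, assuming the config names from the snippet above and that "default" holds the main catalog while "margin_10arcs" holds the margin one; the repo_id argument and both helper names are hypothetical and only for illustration:

```python
def config_name_for(include_margin: bool) -> str:
    """Pick the config name from the proposed README metadata.

    Assumption: "default" is the main catalog and "margin_10arcs"
    is the margin catalog, as the names suggest.
    """
    return "margin_10arcs" if include_margin else "default"


def load_catalog(repo_id: str, include_margin: bool = False):
    """Load a single config so main and margin data are not mixed.

    repo_id is a placeholder for the dataset repository on the Hub.
    """
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, name=config_name_for(include_margin))
```

Calling load_catalog(repo_id) would then return only the main catalog, and load_catalog(repo_id, include_margin=True) only the margin one, so the two are never concatenated by accident.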