This repository contains code for converting Multimodal Universe datasets from HDF5 format to HATS format, to eventually ease the use of server-side crossmatching on Hugging Face as part of the MMU-Streaming project.
`./main.py` converts an MMU dataset in HDF5 format to HATS. Example usage:

```shell
uv run python ./main.py \
    --transformer=sdss \
    --input=https://users.flatironinstitute.org/~polymathic/data/MultimodalUniverse/v1/sdss/sdss/ \
    --output=./hats \
    --name=mmu_sdss_sdss \
    --tmp-dir=./tmp \
    --max-rows=8192
```

Run with `--help` to see all available options.
This will create a dataset at ./hats/mmu_sdss_sdss.
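A quick way to sanity-check the result is to count the parquet leaf files in the output tree. The following is a minimal stdlib sketch, assuming the standard HATS partition layout of `dataset/Norder=*/Dir=*/Npix=*.parquet`; the helper name is ours, not part of this repository:

```python
from pathlib import Path

def count_hats_leaves(catalog_dir: str) -> int:
    """Count parquet leaf files in a HATS catalog's dataset/ tree."""
    root = Path(catalog_dir) / "dataset"
    # HATS stores one parquet file per HEALPix pixel, nested as
    # Norder=<order>/Dir=<dir>/Npix=<pixel>.parquet
    return sum(1 for _ in root.glob("Norder=*/Dir=*/Npix=*.parquet"))

# usage: count_hats_leaves("./hats/mmu_sdss_sdss")
```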
While it is technically possible to specify a Hugging Face organization URI as the output
(e.g. --output=hf://datasets/LSDB/ to place the data in the LSDB/mmu_sdss_sdss repository),
in practice this may fail for large datasets due to the commit rate limit of 128 per hour.
It is therefore recommended to create the dataset locally and upload it using the Hugging Face client:
```shell
uvx --from huggingface-hub hf upload-large-folder \
    --repo-type=dataset --num-workers=16 \
    LSDB/mmu_sdss_sdss ./hats/mmu_sdss_sdss
```

Here is an example of the resulting dataset: https://huggingface.co/datasets/LSDB/mmu_sdss_sdss/tree/main
Since Dask runs in parallel, it is notoriously hard to debug. To set breakpoints, it is easiest to use only one runner; you can do that with the `--debug` flag:
```shell
python ./main.py \
    --input=https://users.flatironinstitute.org/~polymathic/data/MultimodalUniverse/v1/sdss/sdss/healpix=583/ \
    --output=./hats \
    --name=mmu_sdss_sdss \
    --tmp-dir=./tmp \
    --max-rows=8192 \
    --debug
```

Note that choosing a concrete healpix partition that is rather small (<1 GB) speeds up the process a lot, since reading via HTTP is typically slower than reading from disk and depends on the individual connection.
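The reason a single runner helps is general: breakpoints set in your terminal don't attach to work shipped off to worker processes. The pattern can be sketched without Dask using the stdlib; the function names below are illustrative, not the actual `main.py` code:

```python
from concurrent.futures import ProcessPoolExecutor

def transform_partition(path: str) -> str:
    # placeholder for the real per-partition work
    return path.upper()

def run(paths, debug: bool = False):
    if debug:
        # Single in-process runner: breakpoint()/pdb work as usual here.
        return [transform_partition(p) for p in paths]
    # Parallel path: a breakpoint inside a worker won't reach your terminal.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(transform_partition, paths))
```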
The idea is to have one transformation class for each catalog. This class should follow the structure outlined in the `catalog_functions.base_transformer.BaseTransformer` class and override its abstract methods.
Note that we need to return a `pyarrow.Table` with exactly the same output as the `<catalog>.py` script. Read how to check this in the verification section below. An example is provided for SDSS; the relevant files are:
- `catalog_functions/sdss_transformer.py`
- `verification/download_sdss.sh`
- `process_sdss_using_datasets.py`
Please note that most of the classes are vibe-coded and not verified yet. The only verified transformation class is SDSS.
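The contract can be sketched with `abc`. The real interface lives in `catalog_functions/base_transformer.py`; the method name below is an illustrative guess, and the `pyarrow.Table` return is replaced by a plain dict so the sketch stays dependency-free:

```python
from abc import ABC, abstractmethod

class BaseTransformer(ABC):
    """Illustrative shape of a per-catalog transformer (names are guesses)."""

    @abstractmethod
    def transform_group(self, group) -> dict:
        """Turn one HDF5 group into a columnar batch.

        The real implementation returns a pyarrow.Table whose schema and
        values must match the <catalog>.py reference script exactly.
        """

class SdssTransformer(BaseTransformer):
    def transform_group(self, group) -> dict:
        # toy example: pass object_ids through unchanged
        return {"object_id": list(group.get("object_id", []))}
```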
There is an example implementation for SDSS. For data generation:

1. Run the datasets-based processing:

   ```shell
   uv run --with-requirements=verification/requirements.in python verification/process_sdss_using_datasets.py
   ```

   This will install `datasets==3.6` and run the processing using datasets, so there is no need to create another venv. Note that you'll need `numpy>1` for the other jobs, so it is not feasible to install from `verification/requirements.in` in your working virtualenv.

2. Run the transform script:

   ```shell
   python transform_scripts/transform_<catalog>_to_parquet.py
   ```

   (Legacy: `python catalog_functions/sdss_transformer.py`.) This script needs to be written first but can be copy-pasted. Adaptations may be needed to the function in the script so that the `object_id`s match.

   Both of these jobs will create their own parquet files in the data folder.

3. Make sure the created files match:

   ```shell
   python verification/compare.py
   ```

4. Add the paths to the files in `verify.py`.

5. Once you've followed all these steps, you can simply do:

   ```shell
   python verify.py <catalog_name>
   ```
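The comparison step boils down to aligning both outputs on `object_id` and checking every column cell-for-cell. A dependency-free sketch of that logic on column-to-values dicts (the actual `verification/compare.py` reads parquet files; the helper name is hypothetical):

```python
def align_and_compare(a: dict, b: dict, key: str = "object_id") -> bool:
    """Compare two column->values mappings after sorting rows by `key`."""
    if set(a) != set(b):
        return False  # column sets differ
    # Row order may differ between the two pipelines, so sort both by key.
    order_a = sorted(range(len(a[key])), key=lambda i: a[key][i])
    order_b = sorted(range(len(b[key])), key=lambda i: b[key][i])
    for col in a:
        if [a[col][i] for i in order_a] != [b[col][j] for j in order_b]:
            return False
    return True
```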
Caveats:

- btsbot contains `test_*`, `train_*`, and `val_*` files
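For catalogs like btsbot that ship per-split files, file discovery needs to cover all three prefixes. A stdlib sketch with a hypothetical helper (not part of this repository):

```python
from pathlib import Path

def list_split_files(data_dir: str) -> dict:
    """Collect btsbot-style test_*/train_*/val_* files, grouped by split."""
    root = Path(data_dir)
    return {split: sorted(p.name for p in root.glob(f"{split}_*"))
            for split in ("test", "train", "val")}
```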