MMU HDF5 to HATS Converter

NeurIPS arXiv License: MIT

This repository contains code for converting Multimodal Universe (MMU) datasets from HDF5 format to HATS format, with the eventual goal of enabling server-side crossmatching on Hugging Face as part of the MMU-Streaming project.

Convert MMU datasets to HATS

./main.py converts an MMU dataset in HDF5 format to HATS. Example usage:

uv run python ./main.py \
  --transformer=sdss \
  --input=https://users.flatironinstitute.org/~polymathic/data/MultimodalUniverse/v1/sdss/sdss/ \
  --output=./hats \
  --name=mmu_sdss_sdss \
  --tmp-dir=./tmp \
  --max-rows=8192

Run with --help to see all available options.

This will create a dataset at ./hats/mmu_sdss_sdss. While it is technically possible to specify a Hugging Face organization URI as the output (e.g. --output=hf://datasets/LSDB/ to place the data in the LSDB/mmu_sdss_sdss repository), in practice this may fail for large datasets due to the commit rate limit of 128 per hour.

It is therefore recommended to create the dataset locally and upload it using the Hugging Face client:

uvx --from huggingface-hub hf upload-large-folder \
  --repo-type=dataset --num-workers=16 \
  LSDB/mmu_sdss_sdss ./hats/mmu_sdss_sdss

Here is an example of a resulting dataset: https://huggingface.co/datasets/LSDB/mmu_sdss_sdss/tree/main
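
As a quick sanity check, the uploaded catalog can be opened lazily with lsdb. This is a minimal sketch, not part of this repository, and assumes lsdb and huggingface_hub are installed and that fsspec can resolve the hf:// protocol:

import lsdb

# Lazily open the uploaded HATS catalog directly from the Hugging Face Hub.
catalog = lsdb.read_hats("hf://datasets/LSDB/mmu_sdss_sdss")
print(catalog)  # shows the partition structure and column schema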

Debugging workflow

Since Dask runs in parallel, it is notoriously hard to debug. To set breakpoints, it is easiest to use only a single worker, which you can do with the --debug flag:

python ./main.py \
  --input=https://users.flatironinstitute.org/~polymathic/data/MultimodalUniverse/v1/sdss/sdss/healpix=583/ \
  --output=./hats \
  --name=mmu_sdss_sdss \
  --tmp-dir=./tmp \
  --max-rows=8192 \
  --debug

Note that choosing a single, rather small HEALPix partition (<1 GB) speeds up the process considerably, since reading via HTTP is typically slower than reading from disk and depends on the individual connection.

Transformation classes

The idea is to have one transformation class per catalog. Each class should follow the structure outlined in the catalog_functions.base_transformer.BaseTransformer class and override its abstract methods. Note that the transformer must return a pyarrow.Table with exactly the same output as the corresponding <catalog>.py script; see the verification section below for how to check this. An example is provided for SDSS; the relevant files are listed below, followed by a rough sketch of what such a class looks like:

  • catalog_functions/sdss_transformer.py
  • verification/download_sdss.sh
  • process_sdss_using_datasets.py
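
The following is an illustrative sketch only; the method name and HDF5 column names are hypothetical placeholders, and the actual abstract methods to implement are defined in catalog_functions.base_transformer.BaseTransformer:

import pyarrow as pa

from catalog_functions.base_transformer import BaseTransformer


class MyCatalogTransformer(BaseTransformer):
    """Hypothetical transformer for a catalog called `mycatalog`."""

    def transform(self, hdf5_group) -> pa.Table:
        # Placeholder method name; override the abstract methods that
        # BaseTransformer actually defines. The returned table must match
        # the output of the corresponding <catalog>.py script exactly.
        return pa.table(
            {
                "object_id": pa.array(hdf5_group["object_id"][:]),
                "flux": pa.array(hdf5_group["flux"][:].tolist()),
            }
        )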

Please note that most of the classes are vibe-coded and not verified yet. The only verified transformation class is SDSS.

Verification of a transformation class

There is an example implementation for SDSS. For data generation:

  1. Run the datasets-based processing:

    uv run --with-requirements=verification/requirements.in python verification/process_sdss_using_datasets.py

    This will install datasets==3.6 and run the processing using datasets; there is no need to create another venv. Note that you'll need numpy>1 for the other jobs, so it is not feasible to install verification/requirements.in into your working virtualenv.

  2. Run the transform script:

    python transform_scripts/transform_<catalog>_to_parquet.py

    (Legacy: python catalog_functions/sdss_transformer.py.) This script needs to be written first, but it can largely be copy-pasted from an existing one. Adaptations to the function in the script may be needed so that the object_ids match.

  3. Both of these jobs will create their own parquet files in the data folder.

  4. Make sure the created files match (a sketch of this kind of check follows the list below):

    python verification/compare.py

  5. Add the paths to the files in verify.py.

  6. Once you've followed all these steps, you can simply do:

    python verify.py <catalog_name>
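
For intuition, the kind of check being performed roughly corresponds to the sketch below; the file paths are placeholders and the actual logic lives in verification/compare.py and verify.py:

import pyarrow.parquet as pq

# Placeholder paths; the real file locations are configured in verify.py.
datasets_table = pq.read_table("data/sdss_from_datasets.parquet").sort_by("object_id")
transformer_table = pq.read_table("data/sdss_from_transformer.parquet").sort_by("object_id")

# Both outputs must agree in schema and content once rows are aligned by object_id.
assert datasets_table.schema.equals(transformer_table.schema), "schema mismatch"
assert datasets_table.equals(transformer_table), "table contents differ"
print("tables match")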

Caveats:

  • btsbot contains test_*, train_*, val_* files
