This is a tracker issue for all catalog.py files.
Each catalog.py file can be picked up individually and should result in exactly one PR. The transform_scripts/transform_<catalog>_to_parquet.py scripts were written by Claude and can contain mistakes, so check them thoroughly.
Here is a reference PR: #53
To start, do the following:
- choose an unchecked script (see list below) from https://github.com/UniverseTBD/mmu-hdf-to-hats/tree/main/catalog_download_scripts
- then check the corresponding catalog on https://users.flatironinstitute.org/~polymathic/data/MultimodalUniverse/v1 and find a small healpix (<100MB is ideal)
- write a download script, e.g. verification/download_sdss.sh using the healpix you found
- write a verification/process_<catalog>_using_datasets.py and execute it using the command uv run --with-requirements=verification/requirements.in python verification/process_<catalog>_using_datasets.py (this is important, since it will run with datasets==3.6, which is the last version to support custom scripts)
- write a catalog_functions/<catalog>_transformer.py
- write a transform_scripts/transform_<catalog>_to_parquet.py and run it (simply using python transform_scripts/transform_<catalog>_to_parquet.py is fine here)
- run python verification/compare.py <path1> <path2>, where the paths are the output paths of process_<catalog>_using_datasets.py and transform_scripts/transform_<catalog>_to_parquet.py
- add the catalog to the CI workflow here
- add the corresponding files in verify.py, see here
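As a rough aid, the three commands from the steps above can be generated for a given catalog. This is only a sketch: the helper name verification_commands and the output paths in the example are hypothetical, and the script paths simply follow the naming pattern used in this issue.

```python
import shlex


def verification_commands(catalog: str, datasets_out: str, parquet_out: str) -> list[str]:
    """Return the three checklist commands for one catalog (hypothetical helper)."""
    process_script = f"verification/process_{catalog}_using_datasets.py"
    transform_script = f"transform_scripts/transform_{catalog}_to_parquet.py"
    return [
        # Must go through uv so that the pinned requirements (datasets==3.6) are used.
        f"uv run --with-requirements=verification/requirements.in python {process_script}",
        # Plain python is fine for the transform script.
        f"python {transform_script}",
        # Compare the two outputs; the paths are quoted in case they contain spaces.
        f"python verification/compare.py {shlex.quote(datasets_out)} {shlex.quote(parquet_out)}",
    ]


if __name__ == "__main__":
    # "out/datasets" and "out/parquet" are placeholder output paths.
    for command in verification_commands("sdss", "out/datasets", "out/parquet"):
        print(command)
```

The helper only prints the commands and does not run them, so each step can still be executed and inspected by hand.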
Problems that can arise:
- in DESI, a negation operator for a boolean column was missing
- float conversion can be problematic because the two pipelines may use different float types
- the object ID matching when adding the coordinates can be off, since object_ids are formatted differently across the catalogs
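The float and object ID problems above can be illustrated with a small standard-library sketch. It is not taken from compare.py or any transformer in this repo: the helper names to_float32 and normalize_object_id are hypothetical, and the normalization rule (decode bytes, strip whitespace, compare as strings) is just one assumption about how IDs might differ.

```python
import math
import struct


def to_float32(value: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def normalize_object_id(object_id) -> str:
    """Hypothetical normalization: decode bytes, strip whitespace, compare as strings."""
    if isinstance(object_id, bytes):
        object_id = object_id.decode("utf-8")
    return str(object_id).strip()


# The same value stored as float32 in one output and float64 in the other
# no longer compares equal, so exact equality checks report a mismatch.
x64 = 0.1
x32 = to_float32(x64)
exact_match = x64 == x32
close_match = math.isclose(x64, x32, rel_tol=1e-6)

# The same object ID may appear as padded bytes in one catalog and as an
# integer in another; normalizing both sides first makes the join line up.
ids_match = normalize_object_id(b"12345 ") == normalize_object_id(12345)

if __name__ == "__main__":
    print(exact_match, close_match, ids_match)
```

When a comparison fails only on float columns, checking the dtypes on both sides before looking for a logic bug in the transform script is usually the quicker route.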
Tracker list:
2dcff3d of this repo)
- [ ] gui.py (not a catalog script)
- [ ] start.py (not a catalog script)