push_to_hub is not concurrency safe (dataset schema corruption) #7600
Description
Describe the bug
Concurrent processes modifying and pushing a dataset can overwrite each others' dataset card, leaving the dataset unusable.
Consider this scenario:
- we have an Arrow dataset
- there are N configs of the dataset
- there are N independent processes operating on each of the individual configs (e.g. adding a column, new_col)
- each process calls push_to_hub on their particular config when they're done processing
- all calls to push_to_hub succeed
- the README.md now has some configs with new_col added and some with new_col missing
Any attempt to load a config (using load_dataset) where new_col is missing will fail because of a schema mismatch between README.md and the Arrow files. Fixing the dataset requires updating README.md by hand with the correct schema for the affected config. In effect, push_to_hub is doing a git push --force (I found this behavior quite surprising).
We have hit this issue every time we run processing jobs over our datasets and have to fix corrupted schemas by hand.
Reading through the code, it seems that specifying a parent_commit hash around here https://github.com/huggingface/datasets/blob/main/src/datasets/arrow_dataset.py#L5794 would give us a normal, non-forced git push and avoid the schema corruption. I'm not familiar enough with the code to know how to determine the commit hash from which the in-memory dataset card was loaded.
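To illustrate the idea, here is a compare-and-swap style card update using huggingface_hub directly. This is a sketch, not datasets internals: the repo id, the helper name, and the retry policy are all hypothetical; the only load-bearing piece is the real parent_commit parameter of HfApi.create_commit, which makes the Hub reject the commit if HEAD moved since we read the card.

```python
from huggingface_hub import CommitOperationAdd, HfApi
from huggingface_hub.utils import HfHubHTTPError

def update_card_safely(repo_id: str, new_readme: bytes, retries: int = 3) -> None:
    """Hypothetical helper: push README.md without force-overwriting
    concurrent edits from sibling config processes."""
    api = HfApi()
    for _ in range(retries):
        # Remember which commit we read the current card from...
        parent = api.repo_info(repo_id, repo_type="dataset").sha
        op = CommitOperationAdd(path_in_repo="README.md", path_or_fileobj=new_readme)
        try:
            # ...and ask the Hub to refuse the commit if HEAD has moved
            # in the meantime, instead of silently clobbering it.
            api.create_commit(
                repo_id,
                repo_type="dataset",
                operations=[op],
                commit_message="Update dataset card",
                parent_commit=parent,
            )
            return
        except HfHubHTTPError:
            # Someone else pushed first: re-read the card, re-merge our
            # config's schema into it, and retry.
            continue
    raise RuntimeError("could not update README.md without conflicts")
```

On conflict, the caller would re-load the remote card and re-apply only its own config's schema before retrying, which is exactly the merge step that a blind force-push skips.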
Steps to reproduce the bug
See above.
Expected behavior
Concurrent edits to disjoint configs of a dataset should never corrupt the dataset schema.
Environment info
- datasets version: 2.20.0
- Platform: Linux-5.15.0-118-generic-x86_64-with-glibc2.35
- Python version: 3.10.14
- huggingface_hub version: 0.30.2
- PyArrow version: 19.0.1
- Pandas version: 2.2.2
- fsspec version: 2023.9.0