Closed
Describe the bug
If you `map` a dataset and save it to a specific `cache_file_name` with a specific `num_proc`, and then call `map` again with that same existing `cache_file_name` but a different `num_proc`, the dataset will be re-mapped.
Steps to reproduce the bug
- Download a dataset
import datasets
dataset = datasets.load_dataset("ylecun/mnist")
Generating train split: 100%|██████████| 60000/60000 [00:00<00:00, 116429.85 examples/s]
Generating test split: 100%|██████████| 10000/10000 [00:00<00:00, 103310.27 examples/s]
- `map` and cache it with a specific `num_proc`
cache_file_name="./cache/train.map"
dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=2)
Map (num_proc=2): 100%|██████████| 60000/60000 [00:01<00:00, 53764.03 examples/s]
- `map` it with a different `num_proc` and the same `cache_file_name` as before
dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=3)
Map (num_proc=3): 100%|██████████| 60000/60000 [00:00<00:00, 65377.12 examples/s]
Expected behavior
If I specify an existing `cache_file_name`, I don't expect a different `num_proc` from the one used to generate it to cause the dataset to be re-mapped.
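One possible explanation for the re-map, offered as a sketch rather than a confirmed diagnosis: when `num_proc` is greater than 1, each worker appears to write its own shard file, with the rank and the process count embedded in the file name. The suffix pattern `_{rank:05d}_of_{num_proc:05d}` and the helper `shard_cache_names` below are assumptions made for illustration, not something this report verifies. Under that assumption, the files written with `num_proc=2` can never match the names looked up with `num_proc=3`:

```python
import os

# Assumed per-shard suffix pattern (hypothetical here; not confirmed by this report).
SUFFIX_TEMPLATE = "_{rank:05d}_of_{num_proc:05d}"


def shard_cache_names(cache_file_name: str, num_proc: int) -> list[str]:
    """Return the cache file names a multi-process map would look for,
    assuming one shard file per worker with the suffix pattern above."""
    if num_proc <= 1:
        # Single-process case: the name is used as given.
        return [cache_file_name]
    base, ext = os.path.splitext(cache_file_name)
    return [
        base + SUFFIX_TEMPLATE.format(rank=rank, num_proc=num_proc) + ext
        for rank in range(num_proc)
    ]


first = shard_cache_names("./cache/train.map", num_proc=2)
second = shard_cache_names("./cache/train.map", num_proc=3)

print(first)
print(second)
# No file name is shared between the two runs, so the second run finds no cache hit.
print(set(first) & set(second))
```

If this is indeed the cause, listing the contents of `./cache` after the first `map` call should show the per-shard files rather than a single `train.map`.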
Environment info
$ datasets-cli env
- `datasets` version: 3.3.2
- Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.35
- Python version: 3.10.16
- `huggingface_hub` version: 0.29.1
- PyArrow version: 19.0.1
- Pandas version: 2.2.3
- `fsspec` version: 2024.12.0