
Dataset.map ignores existing caches and remaps when run with a different num_proc #7433

Closed
@ringohoffman

Description


Describe the bug

If you `map` a dataset and save it to a specific `cache_file_name` with a specific `num_proc`, and then call `map` again with that same existing `cache_file_name` but a different `num_proc`, the dataset is re-mapped instead of being loaded from the cache.

Steps to reproduce the bug

1. Download a dataset:

```python
import datasets

dataset = datasets.load_dataset("ylecun/mnist")
```

```
Generating train split: 100%|██████████| 60000/60000 [00:00<00:00, 116429.85 examples/s]
Generating test split: 100%|██████████| 10000/10000 [00:00<00:00, 103310.27 examples/s]
```
2. `map` and cache it with a specific `num_proc`:

```python
cache_file_name = "./cache/train.map"
dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=2)
```

```
Map (num_proc=2): 100%|██████████| 60000/60000 [00:01<00:00, 53764.03 examples/s]
```
3. `map` it again with a different `num_proc` and the same `cache_file_name` as before:

```python
dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=3)
```

```
Map (num_proc=3): 100%|██████████| 60000/60000 [00:00<00:00, 65377.12 examples/s]
```

Expected behavior

If I specify an existing `cache_file_name`, I don't expect using a different `num_proc` than the one that was used to generate it to cause the dataset to be re-mapped.
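
A likely explanation, sketched here rather than quoted from the library internals: when `num_proc > 1`, `Dataset.map` derives one cache file per worker from `cache_file_name` via its `suffix_template` parameter (default `"_{rank:05d}_of_{num_proc:05d}"`), so the set of shard file names it looks for depends on `num_proc`. The helper below is an illustrative reimplementation of that naming scheme, not the actual `datasets` code:

```python
import os

def shard_cache_files(cache_file_name: str, num_proc: int) -> list[str]:
    """Sketch of how per-worker cache file names are derived from
    cache_file_name, mirroring Dataset.map's default suffix_template
    "_{rank:05d}_of_{num_proc:05d}" (illustrative, not the real code)."""
    if num_proc is None or num_proc <= 1:
        return [cache_file_name]
    base, ext = os.path.splitext(cache_file_name)
    return [
        f"{base}_{rank:05d}_of_{num_proc:05d}{ext}"
        for rank in range(num_proc)
    ]

print(shard_cache_files("./cache/train.map", 2))
# ['./cache/train_00000_of_00002.map', './cache/train_00001_of_00002.map']
print(shard_cache_files("./cache/train.map", 3))
# ['./cache/train_00000_of_00003.map', './cache/train_00001_of_00003.map',
#  './cache/train_00002_of_00003.map']
```

Because the `num_proc=2` and `num_proc=3` name sets share no files, the second `map` call finds none of the cache files it expects on disk and recomputes everything, even though the same data is already cached under the other shard names.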

Environment info

$ datasets-cli env

- `datasets` version: 3.3.2
- Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.35
- Python version: 3.10.16
- `huggingface_hub` version: 0.29.1
- PyArrow version: 19.0.1
- Pandas version: 2.2.3
- `fsspec` version: 2024.12.0
