Memory leak when streaming

### Describe the bug

I am trying to use a dataset with `streaming=True`. The problem is that RAM usage keeps growing until it is no longer sustainable.

I understand that Hugging Face keeps data in RAM during streaming, and that the more workers the dataloader has, the more shards are held in RAM at once. The problem is that RAM usage is not constant: it increases after each new shard is loaded.

### Steps to reproduce the bug

Run this code and watch your RAM usage: after each shard of 255 examples, it increases.
```py
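from datasets import load_dataset
from torch.utils.data import DataLoader

# Stream the dataset instead of downloading it in full
dataset = load_dataset("WaveGenAI/dataset", streaming=True)

# Shards are split across 3 worker processes
dataloader = DataLoader(dataset["train"], num_workers=3)

# Iterate without keeping any example; only the counter is printed
for i, data in enumerate(dataloader):
    print(i, end="\r")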
```
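
To put numbers on the growth instead of watching a system monitor, a rough sketch like the one below could be wrapped around the loop. It is not part of `datasets` or PyTorch: the helper names (`rss_bytes`, `child_pids`, `total_rss_mib`, `with_memory_log`) are made up for illustration, it only works on Linux because it reads `/proc`, and it assumes the DataLoader workers are direct children of the main process. Summing RSS across processes counts shared pages more than once, so the trend matters more than the absolute value.

```py
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")


def rss_bytes(pid):
    """Resident set size of one process, read from /proc (Linux only)."""
    try:
        with open(f"/proc/{pid}/statm") as f:
            # Second field of statm is the number of resident pages
            return int(f.read().split()[1]) * PAGE_SIZE
    except (OSError, IndexError, ValueError):
        return 0  # process exited or is not readable


def child_pids(parent):
    """Direct children of `parent`, found by scanning /proc/*/stat."""
    children = []
    for name in os.listdir("/proc"):
        if not name.isdigit():
            continue
        try:
            with open(f"/proc/{name}/stat") as f:
                # After the closing paren of the command name: state, ppid, ...
                ppid = int(f.read().rsplit(")", 1)[1].split()[1])
        except (OSError, IndexError, ValueError):
            continue
        if ppid == parent:
            children.append(int(name))
    return children


def total_rss_mib():
    """RSS of this process plus its direct children (e.g. workers), in MiB."""
    pid = os.getpid()
    total = rss_bytes(pid) + sum(rss_bytes(c) for c in child_pids(pid))
    return total / 2**20


def with_memory_log(iterable, every=255):
    """Yield items unchanged, printing total RSS every `every` items."""
    for i, item in enumerate(iterable):
        if i % every == 0:
            print(f"step {i}: {total_rss_mib():.0f} MiB")
        yield item
```

With this in place, the loop above could become `for i, data in enumerate(with_memory_log(dataloader)):`. The default of 255 matches the shard size mentioned above, so each printed line roughly corresponds to one shard.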

### Expected behavior

RAM usage should stay constant (only 3 shards loaded in RAM at a time).

### Environment info

- `datasets` version: 3.0.1
- Platform: Linux-6.10.5-arch1-1-x86_64-with-glibc2.40
- Python version: 3.12.4
- `huggingface_hub` version: 0.26.0
- PyArrow version: 17.0.0
- Pandas version: 2.2.3
- `fsspec` version: 2024.6.1

Memory leak when streaming #7269

Describe the bug

Steps to reproduce the bug

Expected behavior

Environment info

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Memory leak when streaming #7269

Description

Describe the bug

Steps to reproduce the bug

Expected behavior

Environment info

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions