load_dataset() of images from a single directory where train.png image exists #7201

@SagiPolaczek

Describe the bug

Hey!
Firstly, thanks for maintaining such a framework!

I ran into a small issue while loading a custom image+text captioning dataset. All of my images were in a single directory, and one of them was named train.png. The loaded dataset then contained only that image.

I guess it's related to "train" being a split name, but it's definitely unexpected behavior :)
Unfortunately I don't have time to submit a proper PR. I'm attaching a toy example to reproduce the issue.
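
To make my guess concrete, here is a minimal sketch of how a filename-based split heuristic could end up selecting only train.png. This is an assumption for illustration: `SPLIT_KEYWORDS` and `guess_split` are hypothetical names, and this is not the actual pattern logic used inside `datasets`.

```python
import re
from typing import Optional

# Hypothetical, simplified stand-in for a split-inference heuristic.
# NOT the real patterns used by `datasets`; for illustration only.
SPLIT_KEYWORDS = ("train", "test", "validation")


def guess_split(file_name: str) -> Optional[str]:
    """Return a split keyword if it appears as a delimited token in the file stem."""
    stem = file_name.rsplit(".", 1)[0]
    # Split the stem on anything that is not a letter or digit, e.g. "_" or "-".
    tokens = re.split(r"[^A-Za-z0-9]+", stem)
    for keyword in SPLIT_KEYWORDS:
        if keyword in tokens:
            return keyword
    return None


# The files from my toy directory.
files = ["metadata.csv", "pika.png", "pika_pika.png", "train.png"]
matched = {name: guess_split(name) for name in files}

# If a loader keeps only the files that match a split keyword whenever at
# least one file matches, the "train" split would contain just this file.
selected = [name for name, split in matched.items() if split == "train"]
```

Under that assumption, `selected` holds only train.png, which is consistent with the single-row dataset I got.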

Thanks,
Sagi

Steps to reproduce the bug

All of the steps I'm attaching are in a fresh env :)

(base) sagipolaczek@Sagis-MacBook-Pro ~ % conda activate hf_issue_env
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % python --version
Python 3.10.15
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % pip list | grep datasets
datasets           3.0.1
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % ls -la Documents/hf_datasets_issue          
total 352
drwxr-xr-x   6 sagipolaczek  staff     192 Oct  7 11:59 .
drwx------@ 23 sagipolaczek  staff     736 Oct  7 11:46 ..
-rw-r--r--@  1 sagipolaczek  staff      72 Oct  7 11:59 metadata.csv
-rw-r--r--@  1 sagipolaczek  staff  160154 Oct  6 18:00 pika.png
-rw-r--r--@  1 sagipolaczek  staff    5495 Oct  6 12:02 pika_pika.png
-rw-r--r--@  1 sagipolaczek  staff    1753 Oct  6 11:50 train.png
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % cat Documents/hf_datasets_issue/metadata.csv
file_name,text
train.png,A train
pika.png,Pika
pika_pika.png,Pika Pika!


(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % python                                      
Python 3.10.15 (main, Oct  3 2024, 02:33:33) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="Documents/hf_datasets_issue/")
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['image', 'text'],
        num_rows: 1
    })
})
>>> dataset["train"][0]
{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=354x84 at 0x10B50FD90>, 'text': 'A train'}

### DELETING the `train.png` sample (its metadata.csv row and the image file) ###
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % vim Documents/hf_datasets_issue/metadata.csv
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % rm Documents/hf_datasets_issue/train.png 
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % python                                      
Python 3.10.15 (main, Oct  3 2024, 02:33:33) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="Documents/hf_datasets_issue/")
Generating train split: 2 examples [00:00, 65.99 examples/s]
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['image', 'text'],
        num_rows: 2
    })
})
>>> dataset["train"]
Dataset({
    features: ['image', 'text'],
    num_rows: 2
})
>>> dataset["train"][0],dataset["train"][1]
({'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=2356x1054 at 0x10DD11E70>, 'text': 'Pika'}, {'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=343x154 at 0x10E258C70>, 'text': 'Pika Pika!'})

Expected behavior

I would expect to get a dataset that includes the train.png sample along with the other data points.
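
A possible workaround, as a sketch I have not verified on this setup: pass the files explicitly through the `data_files` argument of `load_dataset` so that every file is assigned to one split instead of relying on inference from file names. The helper names `build_data_files` and `load_all_as_train` are hypothetical, and I'm assuming that an explicit mapping bypasses the file-name-based split detection.

```python
from pathlib import Path


def build_data_files(data_dir: str) -> dict:
    """Explicitly map every regular file in data_dir to a single 'train' split."""
    root = Path(data_dir)
    # Include metadata.csv as well, so the captions can still be picked up.
    files = sorted(str(p) for p in root.iterdir() if p.is_file())
    return {"train": files}


def load_all_as_train(data_dir: str):
    """Load the folder with an explicit split mapping (assumed workaround)."""
    # Imported lazily so build_data_files works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("imagefolder", data_files=build_data_files(data_dir))
```

With my toy directory this would be called as `load_all_as_train("Documents/hf_datasets_issue/")`, and I would hope to see `num_rows: 3`.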

Environment info

I've attached it in the example:

Python 3.10.15
datasets 3.0.1
