Describe the bug
Hey!
Firstly, thanks for maintaining such a framework!
I ran into a small issue while loading a custom image+text captioning dataset. All of my images were in a single directory, and one of them was named train.png. The loaded dataset then contained only that image.
I guess it's related to "train" being a split name, but it's definitely unexpected behavior :)
Unfortunately I don't have time to submit a proper PR. I'm attaching a toy example to reproduce the issue.
Thanks,
Sagi
Steps to reproduce the bug
All of the steps I'm attaching are in a fresh env :)
(base) sagipolaczek@Sagis-MacBook-Pro ~ % conda activate hf_issue_env
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % python --version
Python 3.10.15
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % pip list | grep datasets
datasets 3.0.1
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % ls -la Documents/hf_datasets_issue
total 352
drwxr-xr-x 6 sagipolaczek staff 192 Oct 7 11:59 .
drwx------@ 23 sagipolaczek staff 736 Oct 7 11:46 ..
-rw-r--r--@ 1 sagipolaczek staff 72 Oct 7 11:59 metadata.csv
-rw-r--r--@ 1 sagipolaczek staff 160154 Oct 6 18:00 pika.png
-rw-r--r--@ 1 sagipolaczek staff 5495 Oct 6 12:02 pika_pika.png
-rw-r--r--@ 1 sagipolaczek staff 1753 Oct 6 11:50 train.png
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % cat Documents/hf_datasets_issue/metadata.csv
file_name,text
train.png,A train
pika.png,Pika
pika_pika.png,Pika Pika!
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % python
Python 3.10.15 (main, Oct 3 2024, 02:33:33) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="Documents/hf_datasets_issue/")
>>> dataset
DatasetDict({
train: Dataset({
features: ['image', 'text'],
num_rows: 1
})
})
>>> dataset["train"][0]
{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=354x84 at 0x10B50FD90>, 'text': 'A train'}
### DELETING `train.png` sample ###
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % vim Documents/hf_datasets_issue/metadata.csv
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % rm Documents/hf_datasets_issue/train.png
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % python
Python 3.10.15 (main, Oct 3 2024, 02:33:33) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="Documents/hf_datasets_issue/")
Generating train split: 2 examples [00:00, 65.99 examples/s]
>>> dataset
DatasetDict({
train: Dataset({
features: ['image', 'text'],
num_rows: 2
})
})
>>> dataset["train"]
Dataset({
features: ['image', 'text'],
num_rows: 2
})
>>> dataset["train"][0],dataset["train"][1]
({'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=2356x1054 at 0x10DD11E70>, 'text': 'Pika'}, {'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=343x154 at 0x10E258C70>, 'text': 'Pika Pika!'})
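To spot this kind of collision before loading, here is a minimal sketch of a check that flags files whose name (without extension) is exactly a split keyword. Both the helper `find_split_like_files` and the keyword list `SPLIT_KEYWORDS` are hypothetical and only my guess at what might be treated as a split name; the actual matching rules in `datasets` may be broader.

```python
from pathlib import Path

# Assumed keyword list for illustration; the list used by `datasets` may differ.
SPLIT_KEYWORDS = ("train", "test", "validation", "valid", "val", "dev", "eval")


def find_split_like_files(data_dir):
    """Return names of files in data_dir whose stem is exactly a split keyword."""
    hits = []
    for path in sorted(Path(data_dir).iterdir()):
        # Compare the file name without its extension, case-insensitively.
        if path.is_file() and path.stem.lower() in SPLIT_KEYWORDS:
            hits.append(path.name)
    return hits
```

On the directory above, `find_split_like_files("Documents/hf_datasets_issue")` would flag only `train.png`, which matches the file that ended up alone in the dataset.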
Expected behavior
I expected to get a dataset that includes the train.png sample (along with the other data points).
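A possible workaround, which I have not verified against datasets 3.0.1, is to pass the files explicitly through the `data_files` argument of `load_dataset` so that no split has to be inferred from file names. The sketch below assumes that doing so bypasses the inference and that `metadata.csv` is still picked up when it is listed among the files; the helpers `explicit_train_files` and `load_without_split_inference` are hypothetical names of mine.

```python
from pathlib import Path


def explicit_train_files(data_dir):
    """Map every non-hidden file in data_dir to a single 'train' split."""
    files = sorted(
        str(p)
        for p in Path(data_dir).iterdir()
        if p.is_file() and not p.name.startswith(".")
    )
    return {"train": files}


def load_without_split_inference(data_dir):
    # Imported lazily so the helper above can be used without `datasets` installed.
    from datasets import load_dataset

    # Assumption: explicit data_files avoids guessing splits from file names.
    return load_dataset("imagefolder", data_files=explicit_train_files(data_dir))
```

With this, `load_without_split_inference("Documents/hf_datasets_issue/")` would hopefully return all three samples in the train split, including train.png.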
Environment info
I've attached it in the example:
Python 3.10.15
datasets 3.0.1