Commit cc7e856

add contribution instructions for prototype datasets (#5133)
* add contribution instructions for prototype datasets * cleanup * fix links * Update torchvision/prototype/datasets/_builtin/README.md
1 parent e65a857 commit cc7e856

10 files changed: +169 -19 lines changed

Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
# How to add new built-in prototype datasets

As the name implies, the datasets are still in a prototype state and thus subject to rapid change. This in turn means that this document will also change a lot.

If you hit a blocker while adding a dataset, please have a look at another similar dataset to see how it is implemented there. If you can't resolve it yourself, feel free to send a draft PR in order for us to help you out.

Finally, `from torchvision.prototype import datasets` is implied below.

## Implementation

Before we start with the actual implementation, you should create a module in `torchvision/prototype/datasets/_builtin` that hints at the dataset you are going to add. For example `caltech.py` for `caltech101` and `caltech256`. In that module create a class that inherits from `datasets.utils.Dataset` and overrides at least the three methods that are discussed in detail below:

```python
import io
from typing import Any, Callable, Dict, List, Optional

import torch
from torchdata.datapipes.iter import IterDataPipe
from torchvision.prototype.datasets.utils import Dataset, DatasetInfo, DatasetConfig, OnlineResource

class MyDataset(Dataset):
    def _make_info(self) -> DatasetInfo:
        ...

    def resources(self, config: DatasetConfig) -> List[OnlineResource]:
        ...

    def _make_datapipe(
        self,
        resource_dps: List[IterDataPipe],
        *,
        config: DatasetConfig,
        decoder: Optional[Callable[[io.IOBase], torch.Tensor]],
    ) -> IterDataPipe[Dict[str, Any]]:
        ...
```

### `_make_info(self)`

The `DatasetInfo` carries static information about the dataset. There are two required fields:
- `name`: Name of the dataset. This will be used to load the dataset with `datasets.load(name)`. Should only contain lowercase characters.
- `type`: Field of the `datasets.utils.DatasetType` enum. This is used to select the default decoder in case the user doesn't pass one. There are currently only two options: `IMAGE` and `RAW` ([see below](#what-is-the-datasettyperaw-and-when-do-i-use-it) for details).

There are more optional parameters that can be passed:

- `dependencies`: Collection of third-party dependencies that are needed to load the dataset, e.g. `("scipy",)`. Their availability will be automatically checked if a user tries to load the dataset. Within the implementation, import these packages lazily to avoid missing dependencies at import time.
- `categories`: Sequence of human-readable category names for each label. The index of each category has to match the corresponding label returned in the dataset samples. [See below](#how-do-i-handle-a-dataset-that-defines-many-categories) how to handle cases with many categories.
- `valid_options`: Configures valid options that can be passed to the dataset. It should be `Dict[str, Sequence[str]]`. The options are accessible through the `config` namespace in the other two functions. The first value of each sequence is taken as the default if the user passes no option to `torchvision.prototype.datasets.load()`.
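
The default rule for `valid_options` can be illustrated with a small, library-free sketch. The helpers `default_options` and `resolve_options` are hypothetical and not part of torchvision; they only mirror the rule stated above that the first value of each sequence acts as the default, and the validation shown is an assumption about reasonable behavior.

```python
from typing import Dict, Sequence


def default_options(valid_options: Dict[str, Sequence[str]]) -> Dict[str, str]:
    """Pick the first entry of every option sequence as its default."""
    return {name: values[0] for name, values in valid_options.items()}


def resolve_options(valid_options: Dict[str, Sequence[str]], **user_options: str) -> Dict[str, str]:
    """Overlay user choices on the defaults, rejecting unknown names or values."""
    options = default_options(valid_options)
    for name, value in user_options.items():
        if name not in valid_options:
            raise ValueError(f"Unknown option '{name}'")
        if value not in valid_options[name]:
            raise ValueError(f"Invalid value '{value}' for option '{name}'")
        options[name] = value
    return options


# Hypothetical options for illustration only.
valid_options = {"split": ("train", "test"), "year": ("2017", "2014")}
print(resolve_options(valid_options))
print(resolve_options(valid_options, split="test"))
```

Putting the preferred value first in each sequence keeps the most common configuration reachable without any arguments.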

### `resources(self, config)`

Returns `List[datasets.utils.OnlineResource]` of all the files that need to be present locally before the dataset with a specific `config` can be built. The download will happen automatically.

Currently, the following `OnlineResource`'s are supported:

- `HttpResource`: Used for files that are directly exposed through HTTP(s) and only requires the URL.
- `GDriveResource`: Used for files that are hosted on GDrive and requires the GDrive ID as well as the `file_name`.
- `ManualDownloadResource`: Used for files that are not publicly accessible and requires instructions on how to download them manually. If the file does not exist, an error will be raised with the supplied instructions.

Although optional in general, all resources used in the built-in datasets should include a [SHA256](https://en.wikipedia.org/wiki/SHA-2) checksum for security. It will be automatically checked after the download. You can compute the checksum with system utilities or this snippet:

```python
import hashlib

def sha256sum(path, chunk_size=1024 * 1024):
    checksum = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            checksum.update(chunk)
    print(checksum.hexdigest())
```

### `_make_datapipe(resource_dps, *, config, decoder)`

This method is the heart of the dataset: it transforms the raw data into a usable form. A major difference compared to the current stable datasets is that everything is performed through `IterDataPipe`'s. From the perspective of someone who is working with them rather than on them, `IterDataPipe`'s behave just like generators, i.e. you can't do anything with them besides iterating.

Of course, there are some common building blocks that should suffice in 95% of the cases. The most used are:

- `Mapper`: Apply a callable to every item in the datapipe.
- `Filter`: Keep only items that satisfy a condition.
- `Demultiplexer`: Split a datapipe into multiple ones.
- `IterKeyZipper`: Merge two datapipes into one.

All of them can be imported `from torchdata.datapipes.iter`. In addition, use `functools.partial` in case a callable needs extra arguments. If the provided `IterDataPipe`'s are not sufficient for the use case, it is also not complicated to add one. See the MNIST or CelebA datasets for example.
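
Since datapipes behave like generators from a user's point of view, the building blocks can be approximated in plain Python. The sketch below is not the torchdata API: `mapper`, `keep`, and `demux` are hypothetical stand-ins for `Mapper`, `Filter`, and `Demultiplexer`, and the file list is made up. It also shows `functools.partial` supplying an extra argument to a callable.

```python
import functools
from typing import Any, Callable, Iterable, Iterator, List


def mapper(items: Iterable[Any], fn: Callable[[Any], Any]) -> Iterator[Any]:
    """Generator stand-in for `Mapper`: apply `fn` to every item."""
    for item in items:
        yield fn(item)


def keep(items: Iterable[Any], predicate: Callable[[Any], bool]) -> Iterator[Any]:
    """Generator stand-in for `Filter`: keep items that satisfy `predicate`."""
    for item in items:
        if predicate(item):
            yield item


def demux(items: Iterable[Any], num_instances: int, classifier_fn: Callable[[Any], int]) -> List[list]:
    """Eager stand-in for `Demultiplexer`: route every item into one of `num_instances` lists."""
    outputs: List[list] = [[] for _ in range(num_instances)]
    for item in items:
        outputs[classifier_fn(item)].append(item)
    return outputs


def has_suffix(item, *, suffix):
    path, _ = item
    return path.endswith(suffix)


def classify_split(item):
    path, _ = item
    return 0 if path.startswith("train/") else 1


def to_sample(item):
    # The final items are dictionaries with `str` keys.
    path, data = item
    return {"path": path, "size": len(data)}


# Made-up (path, payload) pairs standing in for the content of a resource.
files = [("train/cat/1.jpg", b"aa"), ("train/readme.txt", b"b"), ("test/dog/2.jpg", b"cccc")]

images = keep(files, functools.partial(has_suffix, suffix=".jpg"))
train, test = demux(images, 2, classify_split)
train_samples = list(mapper(train, to_sample))
test_samples = list(mapper(test, to_sample))
print(train_samples, test_samples)
```

Unlike this eager `demux`, a real datapipe stays lazy, so the sketch only conveys the data flow, not the memory behavior.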

`_make_datapipe()` receives `resource_dps`, which is a list of datapipes that has a 1-to-1 correspondence with the return value of `resources()`. In case of archives with regular suffixes (`.tar`, `.zip`, ...), the datapipe will contain tuples comprised of the path and the handle for every file in the archive. Otherwise the datapipe will only contain one such tuple for the file specified by the resource.
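
To give a feel for the `(path, handle)` tuples, here is a minimal sketch that uses only the standard library. The generator `iter_archive` is a hypothetical stand-in for what a resource datapipe yields for a `.tar` archive; the actual loading code in torchvision may differ.

```python
import io
import tarfile
from typing import BinaryIO, Iterator, Tuple


def iter_archive(archive: BinaryIO) -> Iterator[Tuple[str, BinaryIO]]:
    """Yield a (path, handle) tuple for every regular file in a tar archive."""
    with tarfile.open(fileobj=archive, mode="r:*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


# Build a tiny in-memory archive so the example needs no files on disk.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for name, payload in [("data/a.txt", b"first"), ("data/b.txt", b"second")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
buffer.seek(0)

# Each handle has to be read while the archive is still open, i.e. during iteration.
contents = [(path, handle.read()) for path, handle in iter_archive(buffer)]
print(contents)
```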

Since the datapipes are iterable in nature, some datapipes feature an in-memory buffer, e.g. `IterKeyZipper` and `Grouper`. There are two issues with that:
1. If not used carefully, this can easily overflow the host memory, since most datasets will not fit in completely.
2. This can lead to unnecessarily long warm-up times when data is buffered that is only needed at runtime.

Thus, all buffered datapipes should be used as early as possible, e.g. zipping two datapipes of file handles rather than trying to zip already loaded images.
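
The buffering can be sketched with a plain generator. `key_zip` below is a hypothetical, simplified stand-in for a key-based zipper and not the `IterKeyZipper` implementation: it reads ahead in the second iterable until it finds a matching key and parks everything it passed in a dictionary. Whatever sits in that dictionary occupies memory, which is why zipping lightweight handles is preferable to zipping decoded images.

```python
import pathlib
from typing import Any, Callable, Dict, Hashable, Iterable, Iterator, Tuple


def key_zip(
    source: Iterable[Any],
    reference: Iterable[Any],
    key_fn: Callable[[Any], Hashable],
    ref_key_fn: Callable[[Any], Hashable],
) -> Iterator[Tuple[Any, Any]]:
    """Pair every item of `source` with the `reference` item that has the same key."""
    buffer: Dict[Hashable, Any] = {}
    reference_iter = iter(reference)
    for item in source:
        key = key_fn(item)
        # Read ahead until the matching item shows up; unmatched items stay buffered.
        while key not in buffer:
            try:
                ref = next(reference_iter)
            except StopIteration:
                raise KeyError(f"No reference item with key {key!r}") from None
            buffer[ref_key_fn(ref)] = ref
        yield item, buffer.pop(key)


def stem(item):
    path, _ = item
    return pathlib.PurePosixPath(path).stem


# Made-up (path, handle) pairs; the annotations arrive in a different order than the images.
images = [("img/1.jpg", "handle-1"), ("img/2.jpg", "handle-2"), ("img/3.jpg", "handle-3")]
annotations = [("ann/3.txt", "ann-3"), ("ann/1.txt", "ann-1"), ("ann/2.txt", "ann-2")]

pairs = list(key_zip(images, annotations, stem, stem))
print(pairs)
```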

There are two special datapipes that are not used through their class, but through the functions `hint_sharding` and `hint_shuffling`. As the names imply, they only hint at the part of the datapipe graph where sharding and shuffling should take place, but are no-ops by default. They can be imported from `torchvision.prototype.datasets.utils._internal` and are required in each dataset.

Finally, each item in the final datapipe should be a dictionary with `str` keys. There is no standardization of the names (yet!).

## FAQ

### What is the `DatasetType.RAW` and when do I use it?

`DatasetType.RAW` marks a dataset that provides decoded, i.e. raw pixel values, rather than encoded image files such as `.jpg` or `.png`. This is usually only the case for small datasets, since it requires a lot more disk space. The default decoder `datasets.decoder.raw` is only a sentinel and should not be called directly. The decoding should look something like

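```python
from torchvision.prototype.datasets.decoder import raw

# `Image` and `image_buffer_from_raw` are helpers used by the built-in datasets;
# their imports are omitted here.
image = ...

if decoder is raw:
    # The sentinel is never called; the raw pixel values are wrapped directly.
    image = Image(image)
else:
    image_buffer = image_buffer_from_raw(image)
    image = decoder(image_buffer) if decoder else image_buffer
```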

For examples, have a look at the MNIST, CIFAR, or SEMEION datasets.

### How do I handle a dataset that defines many categories?

As a rule of thumb, `datasets.utils.DatasetInfo(..., categories=)` should only be set directly for ten categories or fewer. If more categories are needed, you can add a `$NAME.categories` file to the `_builtin` folder in which each line specifies a category. If `$NAME` matches the name of the dataset (which it definitely should!) it will be automatically loaded if `categories=` is not set.

In case the categories can be generated from the dataset files, e.g. the dataset follows an image folder approach where each folder denotes the name of the category, the dataset can override the `_generate_categories` method. It gets passed the `root` path to the resources, but they have to be manually loaded, e.g. `self.resources(config)[0].load(root)`. The method should return a sequence of strings representing the category names. To generate the `$NAME.categories` file, run `python -m torchvision.prototype.datasets.generate_category_files $NAME`.
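
The image-folder case can be sketched without torchvision. The functions `generate_categories` and `write_categories_file`, the folder names, and the file name `mydataset.categories` are all hypothetical; the sketch only assumes the file format described above, with one category per line.

```python
import csv
import pathlib
import tempfile
from typing import List


def generate_categories(root: pathlib.Path) -> List[str]:
    """Derive the sorted category names from the sub-folders of `root`."""
    return sorted(path.name for path in root.iterdir() if path.is_dir())


def write_categories_file(path: pathlib.Path, categories: List[str]) -> None:
    """Write one category per line, always terminated by a plain newline."""
    with open(path, "w", newline="") as file:
        writer = csv.writer(file, lineterminator="\n")
        for category in categories:
            writer.writerow((category,))


with tempfile.TemporaryDirectory() as tmp:
    # Fake an extracted image-folder dataset with one folder per category.
    root = pathlib.Path(tmp) / "images"
    for name in ("dog", "cat", "bird"):
        (root / name).mkdir(parents=True)

    categories = generate_categories(root)
    categories_file = pathlib.Path(tmp) / "mydataset.categories"
    write_categories_file(categories_file, categories)
    written = categories_file.read_text()

print(categories)
print(written)
```

Sorting the names keeps the index of each category stable across runs, which matters because the index has to match the label in the samples.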

### What if a resource file forms an I/O bottleneck?

In general, we are ok with small performance hits of iterating archives rather than their extracted content. However, if the performance hit becomes significant, the archives can still be decompressed or extracted. To do this, the `decompress: bool` and `extract: bool` flags can be used for every `OnlineResource` individually. For more complex cases, each resource also accepts a `preprocess` callable that gets passed a `pathlib.Path` of the raw file and should return `pathlib.Path` of the preprocessed file or folder.
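
A `preprocess` callable could look roughly like the following sketch. Only the contract comes from the text above (a `pathlib.Path` goes in, a `pathlib.Path` comes out); the function name `extract_zip`, the target folder layout, and the demo archive are assumptions, and how the resource invokes the callable is not shown here.

```python
import pathlib
import tempfile
import zipfile


def extract_zip(path: pathlib.Path) -> pathlib.Path:
    """Extract the archive next to itself and return the resulting folder."""
    folder = path.with_suffix("")
    with zipfile.ZipFile(path) as archive:
        archive.extractall(folder)
    return folder


with tempfile.TemporaryDirectory() as tmp:
    # Create a small archive that stands in for a downloaded resource.
    raw_file = pathlib.Path(tmp) / "data.zip"
    with zipfile.ZipFile(raw_file, "w") as archive:
        archive.writestr("a.txt", "hello")

    preprocessed = extract_zip(raw_file)
    folder_name = preprocessed.name
    extracted_text = (preprocessed / "a.txt").read_text()

print(folder_name, extracted_text)
```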

torchvision/prototype/datasets/_builtin/caltech.py

Lines changed: 8 additions & 2 deletions
@@ -136,8 +136,11 @@ def _make_datapipe(
         return Mapper(dp, functools.partial(self._collate_and_decode_sample, decoder=decoder))
 
     def _generate_categories(self, root: pathlib.Path) -> List[str]:
-        dp = self.resources(self.default_config)[0].load(pathlib.Path(root) / self.name)
+        resources = self.resources(self.default_config)
+
+        dp = resources[0].load(root)
         dp = Filter(dp, self._is_not_background_image)
+
         return sorted({pathlib.Path(path).parent.name for path, _ in dp})
 
 
@@ -189,6 +192,9 @@ def _make_datapipe(
         return Mapper(dp, functools.partial(self._collate_and_decode_sample, decoder=decoder))
 
     def _generate_categories(self, root: pathlib.Path) -> List[str]:
-        dp = self.resources(self.default_config)[0].load(pathlib.Path(root) / self.name)
+        resources = self.resources(self.default_config)
+
+        dp = resources[0].load(root)
         dir_names = {pathlib.Path(path).parent.name for path, _ in dp}
+
         return [name.split(".")[1] for name in sorted(dir_names)]

torchvision/prototype/datasets/_builtin/cifar.py

Lines changed: 4 additions & 1 deletion
@@ -92,9 +92,12 @@ def _make_datapipe(
         return Mapper(dp, functools.partial(self._collate_and_decode, decoder=decoder))
 
     def _generate_categories(self, root: pathlib.Path) -> List[str]:
-        dp = self.resources(self.default_config)[0].load(pathlib.Path(root) / self.name)
+        resources = self.resources(self.default_config)
+
+        dp = resources[0].load(root)
         dp = Filter(dp, path_comparator("name", self._META_FILE_NAME))
         dp = Mapper(dp, self._unpickle)
+
         return cast(List[str], next(iter(dp))[self._CATEGORIES_KEY])
 
 

torchvision/prototype/datasets/_builtin/coco.py

Lines changed: 1 addition & 1 deletion
@@ -238,7 +238,7 @@ def _generate_categories(self, root: pathlib.Path) -> Tuple[Tuple[str, str]]:
         config = self.default_config
         resources = self.resources(config)
 
-        dp = resources[1].load(pathlib.Path(root) / self.name)
+        dp = resources[1].load(root)
         dp = Filter(
             dp,
             functools.partial(self._filter_meta_files, split=config.split, year=config.year, annotations="instances"),

torchvision/prototype/datasets/_builtin/imagenet.py

Lines changed: 2 additions & 1 deletion
@@ -177,7 +177,8 @@ def _make_datapipe(
 
     def _generate_categories(self, root: pathlib.Path) -> List[Tuple[str, ...]]:
         resources = self.resources(self.default_config)
-        devkit_dp = resources[1].load(root / self.name)
+
+        devkit_dp = resources[1].load(root)
         devkit_dp = Filter(devkit_dp, path_comparator("name", "meta.mat"))
 
         meta = next(iter(devkit_dp))[1]

torchvision/prototype/datasets/_builtin/mnist.py

Lines changed: 15 additions & 5 deletions
@@ -4,7 +4,7 @@
 import operator
 import pathlib
 import string
-from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple, cast, BinaryIO
+from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple, cast, BinaryIO, Union, Sequence
 
 import torch
 from torchdata.datapipes.iter import (
@@ -78,7 +78,7 @@ def __iter__(self) -> Iterator[torch.Tensor]:
 
 
 class _MNISTBase(Dataset):
-    _URL_BASE: str
+    _URL_BASE: Union[str, Sequence[str]]
 
     @abc.abstractmethod
     def _files_and_checksums(self, config: DatasetConfig) -> Tuple[Tuple[str, str], Tuple[str, str]]:
@@ -90,8 +90,15 @@ def resources(self, config: DatasetConfig) -> List[OnlineResource]:
             labels_sha256,
         ) = self._files_and_checksums(config)
 
-        images = HttpResource(f"{self._URL_BASE}/{images_file}", sha256=images_sha256)
-        labels = HttpResource(f"{self._URL_BASE}/{labels_file}", sha256=labels_sha256)
+        url_bases = self._URL_BASE
+        if isinstance(url_bases, str):
+            url_bases = (url_bases,)
+
+        images_urls = [f"{url_base}/{images_file}" for url_base in url_bases]
+        images = HttpResource(images_urls[0], sha256=images_sha256, mirrors=images_urls[1:])
+
+        labels_urls = [f"{url_base}/{labels_file}" for url_base in url_bases]
+        labels = HttpResource(labels_urls[0], sha256=labels_sha256, mirrors=labels_urls[1:])
 
         return [images, labels]
 
@@ -151,7 +158,10 @@ def _make_info(self) -> DatasetInfo:
             ),
         )
 
-    _URL_BASE = "http://yann.lecun.com/exdb/mnist"
+    _URL_BASE: Union[str, Sequence[str]] = (
+        "http://yann.lecun.com/exdb/mnist",
+        "https://ossci-datasets.s3.amazonaws.com/mnist/",
+    )
     _CHECKSUMS = {
         "train-images-idx3-ubyte.gz": "440fcabf73cc546fa21475e81ea370265605f56be210a4024d2ca8f203523609",
         "train-labels-idx1-ubyte.gz": "3552534a0a558bbed6aed32b30c495cca23d567ec52cac8be1a0730e8010255c",

torchvision/prototype/datasets/_builtin/sbd.py

Lines changed: 3 additions & 1 deletion
@@ -156,7 +156,9 @@ def _make_datapipe(
         return Mapper(dp, functools.partial(self._collate_and_decode_sample, config=config, decoder=decoder))
 
     def _generate_categories(self, root: pathlib.Path) -> Tuple[str, ...]:
-        dp = self.resources(self.default_config)[0].load(pathlib.Path(root) / self.name)
+        resources = self.resources(self.default_config)
+
+        dp = resources[0].load(root)
         dp = Filter(dp, path_comparator("name", "category_names.m"))
         dp = LineReader(dp)
         dp = Mapper(dp, bytes.decode, input_col=1)

torchvision/prototype/datasets/generate_category_files.py

Lines changed: 5 additions & 4 deletions
@@ -11,7 +11,7 @@
 
 
 def main(*names, force=False):
-    root = pathlib.Path(datasets.home())
+    home = pathlib.Path(datasets.home())
 
     for name in names:
         path = BUILTIN_DIR / f"{name}.categories"
@@ -20,13 +20,14 @@ def main(*names, force=False):
 
         dataset = find(name)
         try:
-            categories = dataset._generate_categories(root)
+            categories = dataset._generate_categories(home / name)
         except NotImplementedError:
             continue
 
-        with open(path, "w", newline="") as file:
+        with open(path, "w") as file:
+            writer = csv.writer(file, lineterminator="\n")
             for category in categories:
-                csv.writer(file).writerow((category,) if isinstance(category, str) else category)
+                writer.writerow((category,) if isinstance(category, str) else category)
 
 
 def parse_args(argv=None):

torchvision/prototype/datasets/utils/_dataset.py

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@
 import itertools
 import os
 import pathlib
-from typing import Any, Callable, Dict, List, Optional, Sequence, Union, Tuple
+from typing import Any, Callable, Dict, List, Optional, Sequence, Union, Tuple, Collection
 
 import torch
 from torch.utils.data import IterDataPipe
@@ -33,7 +33,7 @@ def __init__(
         name: str,
         *,
         type: Union[str, DatasetType],
-        dependencies: Sequence[str] = (),
+        dependencies: Collection[str] = (),
         categories: Optional[Union[int, Sequence[str], str, pathlib.Path]] = None,
         citation: Optional[str] = None,
         homepage: Optional[str] = None,

torchvision/prototype/datasets/utils/_resource.py

Lines changed: 2 additions & 2 deletions
@@ -136,14 +136,14 @@ def _check_sha256(self, path: pathlib.Path, *, chunk_size: int = 1024 * 1024) ->
 
 class HttpResource(OnlineResource):
     def __init__(
-        self, url: str, *, file_name: Optional[str] = None, mirrors: Optional[Sequence[str]] = None, **kwargs: Any
+        self, url: str, *, file_name: Optional[str] = None, mirrors: Sequence[str] = (), **kwargs: Any
     ) -> None:
         super().__init__(file_name=file_name or pathlib.Path(urlparse(url).path).name, **kwargs)
         self.url = url
         self.mirrors = mirrors
 
     def _download(self, root: pathlib.Path) -> None:
-        for url in itertools.chain((self.url,), self.mirrors or ()):
+        for url in itertools.chain((self.url,), self.mirrors):
             try:
                 download_url(url, str(root), filename=self.file_name, md5=None)
                 # TODO: make this more precise
