Description
🚀 The feature
The proposed feature aims to extend the current datasets API with datasets geared towards the task of Stereo Matching. Its main use case is to provide a unified way of consuming classic Stereo Matching datasets such as:
Other considered dataset additions are: Sintel, FallingThings, InStereo2K, ETH3D, Holopix50k. A high-level preview of the dataset interface would be:
class StereoMatchingDataset(Dataset):
    def __init__(self, ...):
        # constructor code / dataset specific code

    def __getitem__(self, index: int) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
        # processing code
        # ...
        # imgs: Tuple[Tensor, Tensor]
        # disparities: Tuple[Tensor, Tensor]
        # occlusion_masks: Tuple[Tensor, Tensor]
        if self.transforms is not None:
            imgs, disparities, occlusion_masks = self.transforms(imgs, disparities, occlusion_masks)
        img_left = imgs[0]
        img_right = imgs[1]
        disparity = disparities[0]
        occlusion_mask = occlusion_masks[0]
        return img_left, img_right, disparity, occlusion_mask

Motivation, pitch
This API addition would cut down the engineering time required for people who want to work on, experiment with, or evaluate Stereo Matching models, or who want easy access to stereo image data.
Throughout the literature, recent methods (1, 2) make use of multiple datasets that all have different formatting or specifications. A unified dataset API would streamline working with several data sources at the same time.
Alternatives
The official repo for RAFT-Stereo provides similar functionality for the datasets on which the network proposed in the paper was trained / evaluated. The proposed StereoMatchingDataset API would be largely similar to it, whilst following idiomatic torchvision conventions.
Additional context
Stereo Matching task formulation.
Throughout the literature, the task of stereo matching commonly requires a reference image (traditionally the left image), its stereo pair (traditionally the right image), the disparity map between the two images (traditionally the left->right disparity), and an occlusion / validity mask for pixels from the reference image that do not have a correspondent in the stereo pair (traditionally left->right). The proposed API would serve data to the user in the following manner:
Proposal 1.
class StereoMatchingDataset(Dataset):
    def __init__(self, ...):
        # constructor code / dataset specific code

    def __getitem__(self, index: int) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
        # processing code
        # ...
        # imgs: Tuple[Tensor, Tensor]
        # disparities: Tuple[Tensor, Tensor]
        # occlusion_masks: Tuple[Tensor, Tensor]
        if self.transforms is not None:
            imgs, disparities, occlusion_masks = self.transforms(imgs, disparities, occlusion_masks)
        img_left = imgs[0]
        img_right = imgs[1]
        disparity = disparities[0]
        occlusion_mask = occlusion_masks[0]
        return img_left, img_right, disparity, occlusion_mask

The above interface for data consumption is more aligned with the larger dataset ecosystem in torchvision, where a dataset provides all the tensors required to perform training. However, this approach assumes that the user / algorithm does not require the right disparity map or the right occlusion mask. An alternative would be to modify the interface so that the user can access the right-channel annotations:
Proposal 2.
def __getitem__(self, index: int) -> Tuple[Tuple, Tuple, Tuple]:
    # processing code
    # ...
    # imgs: Tuple[Tensor, Tensor]
    # disparities: Tuple[Tensor, Tensor]
    # occlusion_masks: Tuple[Tensor, Tensor]
    if self.transforms is not None:
        imgs, disparities, occlusion_masks = self.transforms(imgs, disparities, occlusion_masks)
    return imgs, disparities, occlusion_masks

# in user land, the API feeds all the available data to the user
for (imgs, disparities, occlusion_masks) in stereo_dataloader:
    # however, the user becomes responsible for deconstructing the batch in order to get
    # the classic task definition data
    img_l, img_r, disparity, occlusion_mask = imgs[0], imgs[1], disparities[0], occlusion_masks[0]
    # ...

User feedback would be highly appreciated, as it is unlikely that one person can be aware of all the use-cases / methods in Stereo Matching. Some preliminary pros and cons for each proposal:
Proposal 1
Pros:
- Provides strong guarantees about the data (not all datasets provide disparities for both views, e.g. ETH3D).
- Follows the common specification of the Stereo Matching task.
- The user is provided with a familiar / idiomatic experience and receives all the necessary data for training with no need for additional data handling / manipulation.
Cons:
- It can restrict data access to the user with no out of the box way of recovering it (i.e. right disparity maps / occlusion masks). This would render the API unusable for some use-cases (if there are any).
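One possible way to soften this con, sketched here only as an illustration: if a dataset keeps both views internally (the Proposal 2 layout), the Proposal 1 view can be derived by a thin wrapper, while the reverse is not possible. The names `ToyPairDataset` and `LeftReferenceView` are hypothetical and not part of torchvision, and plain strings stand in for tensors so that the snippet runs without torch installed.

```python
from typing import Any, Optional, Sequence, Tuple

# Stand-in for torch.Tensor (assumption: any tensor-like object would work here).
TensorLike = Any
Pair = Tuple[TensorLike, Optional[TensorLike]]
PairSample = Tuple[Pair, Pair, Pair]
LeftSample = Tuple[TensorLike, TensorLike, TensorLike, TensorLike]


class ToyPairDataset:
    """Hypothetical Proposal 2 style dataset backed by in-memory samples."""

    def __init__(self, samples: Sequence[PairSample]) -> None:
        self.samples = list(samples)

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int) -> PairSample:
        # Returns (imgs, disparities, occlusion_masks), each a (left, right) pair.
        return self.samples[index]


class LeftReferenceView:
    """Hypothetical wrapper exposing Proposal 1 style 4-tuples."""

    def __init__(self, dataset: ToyPairDataset) -> None:
        self.dataset = dataset

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, index: int) -> LeftSample:
        imgs, disparities, occlusion_masks = self.dataset[index]
        # Keep only the left (reference) annotations, as in Proposal 1.
        return imgs[0], imgs[1], disparities[0], occlusion_masks[0]


pairs = ToyPairDataset([
    (("img_l", "img_r"), ("disp_l", "disp_r"), ("occ_l", "occ_r")),
])
left_view = LeftReferenceView(pairs)
sample = left_view[0]
```

With this layout the right-channel annotations stay reachable through `pairs[0]`, while `left_view[0]` yields the classic four-element sample.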
Proposal 2
Pros:
- The user gets all the data (left / right annotations instead of just left)
Cons:
- Breaks away from the standard of other dataset APIs in torchvision.
- Forces the user to check their data / data merging strategies (e.g. using ETH3D would yield None for the right channel annotations).
- Users need to manually unpack the data into tensors that are provided to models / losses.
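To make the data-checking con concrete, the following is a minimal sketch of the kind of check a user might have to write under Proposal 2 when merging datasets. It assumes that a missing right-channel annotation is represented as `None`, as described above for ETH3D. The helper names are hypothetical, and strings stand in for tensors so that the snippet runs without torch installed.

```python
from typing import Any, List, Optional, Sequence, Tuple

# (left, right) pair; the right entry may be None when a dataset lacks it.
Pair = Tuple[Any, Optional[Any]]
PairSample = Tuple[Pair, Pair, Pair]


def has_right_annotations(sample: PairSample) -> bool:
    """Return True if both the right disparity and right occlusion mask exist."""
    _, disparities, occlusion_masks = sample
    return disparities[1] is not None and occlusion_masks[1] is not None


def indices_with_right_annotations(samples: Sequence[PairSample]) -> List[int]:
    """Collect indices of samples that can be used by right-view-aware methods."""
    return [i for i, s in enumerate(samples) if has_right_annotations(s)]


# Toy "merged" data: one fully annotated sample and one left-only sample.
merged = [
    (("l0", "r0"), ("dl0", "dr0"), ("ol0", "or0")),
    (("l1", "r1"), ("dl1", None), ("ol1", None)),
]
usable = indices_with_right_annotations(merged)
```

In a real pipeline, indices like `usable` could be passed to `torch.utils.data.Subset` to restrict training to fully annotated samples.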