[RFC] Stereo Matching Datasets API

### 🚀 The feature

The proposed feature aims to **extend the current datasets API** with datasets geared towards the task of Stereo Matching. Its main use case is providing **a unified way of consuming classic Stereo Matching datasets** such as:

- [Middlebury2014](https://vision.middlebury.edu/stereo/data/scenes2014/)
- [Kitti](http://www.cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=stereo)
- [SceneFlow](https://lmb.informatik.uni-freiburg.de/resources/datasets/SceneFlowDatasets.en.html)

Other datasets considered for addition are: Sintel, FallingThings, InStereo2K, ETH3D, Holopix50k. A high-level preview of the dataset interface would be:

```python3
class StereoMatchingDataset(Dataset):
    def __init__(self, ...):
        # constructor code / dataset specific code

    def __getitem__(self, index: int) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
        # processing code
        # ...

        # imgs: Tuple[Tensor, Tensor]
        # disparities: Tuple[Tensor, Tensor]
        # occlusion_masks: Tuple[Tensor, Tensor]

        if self.transforms is not None:
            imgs, disparities, occlusion_masks = self.transforms(imgs, disparities, occlusion_masks)

        img_left = imgs[0]
        img_right = imgs[1]
        disparity = disparities[0]
        occlusion_mask = occlusion_masks[0]

        return img_left, img_right, disparity, occlusion_mask
```

### Motivation, pitch

This API addition would cut down the engineering time required for people who want to work on, experiment with, or evaluate Stereo Matching models, or who want easy access to stereo image data.

Throughout the literature, recent methods ([1](https://openaccess.thecvf.com/content/CVPR2022/papers/Li_Practical_Stereo_Matching_via_Cascaded_Recurrent_Network_With_Adaptive_Correlation_CVPR_2022_paper.pdf), [2](https://arxiv.org/pdf/2109.07547.pdf)) make use of multiple datasets that all have different formats or specifications. A unified dataset API would streamline working with several data sources at the same time.

### Alternatives

The official repo for [RAFT-Stereo](https://github.com/princeton-vl/RAFT-Stereo/blob/main/core/stereo_datasets.py) provides similar functionality for the datasets on which the network proposed in the paper was trained / evaluated. The proposed `StereoMatchingDataset` API would be largely similar to it, while following idiomatic `torchvision` conventions.

### Additional context

### Stereo Matching task formulation.
**Commonly** throughout the literature, the task of stereo matching requires `a reference image` (traditionally the left image), `its stereo pair` (traditionally the right image), `the disparity map` (traditionally the left->right disparity) between the two images, and `an occlusion / validity mask` for pixels from the reference image that do not have a correspondent in the stereo pair (traditionally left->right). The proposed API would serve data to the user in the following manner:

### Proposal 1.
```python3
class StereoMatchingDataset(Dataset):
    def __init__(self, ...):
        # constructor code / dataset specific code

    def __getitem__(self, index: int) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
        # processing code
        # ...

        # imgs: Tuple[Tensor, Tensor]
        # disparities: Tuple[Tensor, Tensor]
        # occlusion_masks: Tuple[Tensor, Tensor]

        if self.transforms is not None:
            imgs, disparities, occlusion_masks = self.transforms(imgs, disparities, occlusion_masks)

        img_left = imgs[0]
        img_right = imgs[1]
        disparity = disparities[0]
        occlusion_mask = occlusion_masks[0]

        return img_left, img_right, disparity, occlusion_mask
```

The above interface for data consumption is more aligned with the larger dataset ecosystem in `torchvision`, where a dataset provides all the tensors required to perform training. **However**, this approach makes the **assumption** that the user / algorithm does not require the right disparity map or the right occlusion mask. An alternative to this assumption would be a modification of the interface such that the user is able to access the right-channel annotations:

### Proposal 2.
```python3
def __getitem__(self, index: int) -> Tuple[Tuple, Tuple, Tuple]:
    # processing code
    # ...

    # imgs: Tuple[Tensor, Tensor]
    # disparities: Tuple[Tensor, Tensor]
    # occlusion_masks: Tuple[Tensor, Tensor]

    if self.transforms is not None:
        imgs, disparities, occlusion_masks = self.transforms(imgs, disparities, occlusion_masks)

    return imgs, disparities, occlusion_masks

# in user land, the API feeds all the available data to the user
for (imgs, disparities, occlusion_masks) in stereo_dataloader:
    # however, the user becomes responsible for deconstructing the batch in order to get
    # the classic task definition data
    img_l, img_r, disparity, occlusion_mask = imgs[0], imgs[1], disparities[0], occlusion_masks[0]
    # ...
```
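To illustrate that the Proposal 2 layout is a superset of the Proposal 1 layout, the left-view sample could be recovered with a thin wrapper. The following is only a minimal sketch and not part of the proposed API: `LeftViewOnly` and `ToyStereoDataset` are hypothetical names, and the toy dataset returns strings in place of tensors so that the snippet runs without any dependencies.

```python
from typing import Any, Optional, Tuple

Pair = Tuple[Any, Optional[Any]]  # (left, right); right may be None


class ToyStereoDataset:
    """Hypothetical stand-in for a Proposal 2 dataset (strings instead of tensors)."""

    def __len__(self) -> int:
        return 2

    def __getitem__(self, index: int) -> Tuple[Pair, Pair, Pair]:
        imgs = (f"img_left_{index}", f"img_right_{index}")
        # Assumption for illustration: no right-view annotations are available.
        disparities = (f"disp_left_{index}", None)
        occlusion_masks = (f"mask_left_{index}", None)
        return imgs, disparities, occlusion_masks


class LeftViewOnly:
    """Hypothetical wrapper that exposes the Proposal 1 sample layout."""

    def __init__(self, dataset) -> None:
        self.dataset = dataset

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, index: int) -> Tuple[Any, Any, Any, Any]:
        imgs, disparities, occlusion_masks = self.dataset[index]
        # Keep both images, but only the left-view annotations.
        return imgs[0], imgs[1], disparities[0], occlusion_masks[0]


sample = LeftViewOnly(ToyStereoDataset())[1]
print(sample)
```

The reverse direction is not possible: a dataset that only returns the Proposal 1 tuple has already discarded the right-view annotations.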

User feedback would be highly appreciated, as no single person is likely to be aware of all the use cases / methods in Stereo Matching. Some preliminary pros and cons for each proposal:

### Proposal 1
**_Pros_**:
- Provides **strong guarantees** about the data (not all datasets provide disparities for both views, e.g. ETH3D)
- Follows **the common specification** of the Stereo Matching task.
- The user is provided with a **familiar / idiomatic experience** and receives all the necessary data for training with **no need for additional data handling / manipulation**

**_Cons_**:
- It **can restrict data access** for the user, with no out-of-the-box way of recovering the hidden data (i.e. right disparity maps / occlusion masks). This would render the API unusable for some use cases (if there are any).

### Proposal 2
**_Pros_**:
- The user **gets all the data** (left / right annotations instead of just left)

**_Cons_**:
- **Breaks away from the standard** of other dataset APIs in `torchvision`.
- **Forces the user to check** their data / data merging strategies (e.g. using ETH3D would yield `None` for the right channel annotations)
- Users need to **manually unpack** the data into `tensors` that are provided to `models / losses`.
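
The `None` case mentioned above is what user code would have to guard against under Proposal 2. A minimal sketch of such a guard, assuming the `(imgs, disparities, occlusion_masks)` layout from Proposal 2; `has_right_annotations` is a hypothetical helper and strings stand in for tensors:

```python
from typing import Any, Optional, Tuple

Pair = Tuple[Any, Optional[Any]]  # (left, right); right is None when unavailable
Sample = Tuple[Pair, Pair, Pair]  # (imgs, disparities, occlusion_masks)


def has_right_annotations(sample: Sample) -> bool:
    """Return True only if both right-view annotations are present."""
    _, disparities, occlusion_masks = sample
    return disparities[1] is not None and occlusion_masks[1] is not None


# Toy samples: the first has annotations for both views, the second only for the left view.
full = (("img_l", "img_r"), ("disp_l", "disp_r"), ("mask_l", "mask_r"))
left_only = (("img_l", "img_r"), ("disp_l", None), ("mask_l", None))

usable = [s for s in (full, left_only) if has_right_annotations(s)]
print(len(usable))  # 1
```

A method that needs both views could filter or skip samples this way when mixing datasets, whereas under Proposal 1 no such check would be needed.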




cc @pmeier @YosuaMichael
