use enums in prototype datasets for demux #5189

pmeier · 2022-01-11T10:24:45Z

Addresses #5115 (comment). I feel like this would be a net negative. It doesn't enhance readability. In contrast, now I need to do an extra step to xref the classifier function with respect to the returned datapipes of the Demultiplexer.

cc @pmeier @bjuncek

facebook-github-bot · 2022-01-11T10:24:52Z

💊 CI failures summary and remediations

As of commit b0be39d (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

NicolasHug · 2022-01-11T10:41:39Z

Thanks for trying this @pmeier !

To clarify my question / suggestion in https://github.com/pytorch/vision/pull/5115/files#r778741734, I was wondering if we could use enums (or hardcoded constants, or something else) precisely to avoid xref-ing.

I would completely agree that there is no need for such structures when the hardcoded values (0, 1, 2...) are only used in one single place, i.e. in the _classify_archive function. My suggestion relates to cases where these constants are used more than once. For example, DTD's _filter_images() still uses a hard-coded value of 2 in this PR, but perhaps using DTDDemux.Images would be clearer (and avoid one xref)?

pmeier · 2022-01-11T10:53:11Z

I would completely agree that there is no need for such structures when the hardcoded values (0, 1, 2...) are only used in one single place, i.e. in the _classify_archive function.

Having them only in one place is by far the most common case. Demultiplexer is usually used to split one large dataset archive into its componets and this is only done once throughout _make_datapipe.

The only exception to this is to generate static information. So in _generate_categories we may (or may not) need different files or only a single stream of files compared to _make_datapipe. The latter is what _filter_images in the DTD dataset does.

Given that someone maintaining a dataset will in 99% of the cases work on _make_datapipe rather than _generate_categories, IMO the xref trade (looking up the enum in the classifier function vs looking up the int value) cannot be justified.

NicolasHug · 2022-01-11T11:13:14Z

What are your thoughts on

def _filter_images(self, data: Tuple[str, Any]) -> bool:
    return self._classify_archive(data) == 2

vs

def _filter_images(self, data: Tuple[str, Any]) -> bool:
    return self._classify_archive(data) == DTDDemux.Images

Would you agree that a non-expert here could potentially wonder where this hard-coded 2 comes from, and what it corresponds to? By non-expert, I mean someone who's not an expert in this new dataset design. I consider myself a non-expert. Would you agree that DTDDemux.Images is easier to search for in a code-base than 2?

Also, what are your thoughts in general on seeing the same hard-coded constant in various parts of the code (e.g. 2), vs seeing the same constant variable e.g. DTDDemux.Images or _IMAGES? Would you agree that one is more robust to changes than the other? Would you agree that one conveys more meaning than the other?

I think I misunderstand what you mean by "looking up the enum in the classifier function".
Are you saying that one needs to lookup what DTDDemux.Images corresponds to when they're reading _classify_archive()? If that's the case then I'm not sure I agree; from my understanding, one benefit of enums is precisely that one does not need to bother about the underlying int value.

pmeier · 2022-01-11T12:20:11Z

one benefit of enums is precisely that one does not need to bother about the underlying int value.

The values returned by the classifier function are not arbitrary, but rather indices that indicate into which datapipe an item gets sorted. For example:

num_dps = 3
dps = Demultiplexer(dp, num_dps, classifier_fn)
assert len(dps) == num_dps

If classifier_fn returns 0, the item will get sorted into dps[0]. The same is true for 1 and dps[1] as well as 2 and dps[2]. If the classifier_fn returns anything but an integer with 0 <= idx < num_dps, the iteration will fail.

Thus,

def _filter_images(self, data: Tuple[str, Any]) -> bool:
    return self._classify_archive(data) == 2

dp = Filter(dp, _filter_images)

is roughly equivalent to

dp = Demultiplexer(dp, 3, self._classify_archive(data))[2]

The difference between the two is that Filter drops items for with the filter function returns False, whereas all items of unused datapipes from Demultiplexer will be kept in memory.

NicolasHug · 2022-01-11T12:43:58Z

Thanks for clarifying @pmeier . I'm still not sure I understand how using an enum adds an additional xref though. When one reads (or write) e.g.

        splits_dp, joint_categories_dp, images_dp = Demultiplexer(
            archive_dp, 3, self._classify_archive, drop_none=True, buffer_size=INFINITE_BUFFER_SIZE
)

they have to know somehow that splits, categories and images correspond respectively to 0, 1, and 2. Purely in terms of xref, whether they look up this info in _filter_images() or in the enum class makes no difference, IMHO.

In any case, I don't believe it's worth spending much more time here. I do believe that it's good practice to avoid hardcoding the same value (arbitrary or not) in different places in the code. But perhaps there are things that you see and that I don't see yet. I'll leave it up to you to decide.

Thanks for opening the PR and for trying my suggestion.

pmeier · 2022-01-11T12:56:27Z

Compare

def classifiy_int(path):
    if path.parent.name == "labels":
        if path.name == "labels_joint_anno.txt":
            return 1

        return 0
    elif path.parents[1].name == "images":
        return 2


splits_dp, joint_categories_dp, images_dp = Demultiplexer(archive_dp, 3, classify_int)

with

def classify_enum(path):
    if path.parent.name == "labels":
        if path.name == "labels_joint_anno.txt":
            return DTDDemux.JOINT_CATEGORIES

        return DTDDemux.SPLITS
    elif path.parents[1].name == "images":
        return DTDDemux.IMAGES


splits_dp, joint_categories_dp, images_dp = Demultiplexer(archive_dp, 3, classify_enum)

In the latter case, you have to look at the actual fields of DTDDemux to know if the mapping between classify_enum and splits_dp, joint_categories_dp, images_dp is correct. This defeats the purpose of enums.

NicolasHug · 2022-01-11T13:04:17Z

Sorry, I disagree with the premise that we need to check classify_enum when we're calling Demultiplexer. All we have to assume is that classify_enum does what it's supposed to do, which is to map splits to splits, categories to categories, and images to images. Assuming that other functions do their job properly is a reasonable assumption.

pmeier · 2022-01-11T13:34:17Z

All we have to assume is that classify_enum does what it's supposed to do, which is to map splits to splits, categories to categories, and images to images.

This assumption is wrong. It does not map element to element, but rather element to index. And you need this index to correctly assign the demultiplexed datapipes.

If you disagree, please tell me how you would assign splits_dp, joint_categories_dp and images_dp in the correct order from Demultiplexer(archive_dp, 3, classify_enum) without looking at the actual values of the enum.

NicolasHug · 2022-01-11T13:39:55Z

It does not map element to element, but rather element to index. And you need this index to correctly assign the demultiplexed datapipes.

Fully agreed, I should have written (addition in bolds):

All we have to assume is that classify_enum does what it's supposed to do, which is to map splits to the splits index, categories to the categories index, and images to the images index

This does not change anything to what I wrote above though.

please tell me how you would assign splits_dp, joint_categories_dp and images_dp in the correct order from Demultiplexer(archive_dp, 3, classify_enum) without looking at the actual values of the enum.

We agree here as well: we have to look at the enum values to call Demultiplexer.

But we don't have to look at classify_enum(). Which is why I don't understand where the "extra" xref comes from.

NicolasHug · 2022-01-11T13:43:04Z

I'll stop commenting on this PR for good now, as I'm not sure we're getting anywhere.

NicolasHug

Thanks a lot @pmeier !

pmeier · 2022-01-11T14:45:15Z

But we don't have to look at classify_enum(). Which is why I don't understand where the "extra" xref comes from.

I finally get what you mean. I've removed the enums from all datasets that do not re-use the values. For the others, I now agree that enums make this more readable.

Summary: * use enums in prototype datasets for demux * use enum for category generation * revert enum usage for single use constants Reviewed By: NicolasHug Differential Revision: D33618173 fbshipit-source-id: a4ab9349905806f2cd0c701c4b59bc1ab0ad14ae

use enums in prototype datasets for demux

9e5556c

pmeier added module: datasets prototype labels Jan 11, 2022

pytorch-probot bot added the ciflow/default label Jan 11, 2022

facebook-github-bot added the cla signed label Jan 11, 2022

pmeier added 2 commits January 11, 2022 15:39

use enum for category generation

0488e71

revert enum usage for single use constants

b0be39d

pmeier requested a review from NicolasHug January 11, 2022 14:43

NicolasHug approved these changes Jan 11, 2022

View reviewed changes

pmeier merged commit 1c63096 into pytorch:main Jan 11, 2022

pmeier deleted the datasets/enum-demux branch January 11, 2022 16:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

use enums in prototype datasets for demux #5189

use enums in prototype datasets for demux #5189

Uh oh!

pmeier commented Jan 11, 2022 •

edited by pytorch-probot bot

Loading

Uh oh!

facebook-github-bot commented Jan 11, 2022 •

edited

Loading

Uh oh!

NicolasHug commented Jan 11, 2022

Uh oh!

pmeier commented Jan 11, 2022

Uh oh!

NicolasHug commented Jan 11, 2022

Uh oh!

pmeier commented Jan 11, 2022

Uh oh!

NicolasHug commented Jan 11, 2022

Uh oh!

pmeier commented Jan 11, 2022

Uh oh!

NicolasHug commented Jan 11, 2022

Uh oh!

pmeier commented Jan 11, 2022

Uh oh!

NicolasHug commented Jan 11, 2022 •

edited

Loading

Uh oh!

NicolasHug commented Jan 11, 2022

Uh oh!

NicolasHug left a comment

Uh oh!

pmeier commented Jan 11, 2022

Uh oh!

Uh oh!

use enums in prototype datasets for demux #5189

use enums in prototype datasets for demux #5189

Uh oh!

Conversation

pmeier commented Jan 11, 2022 • edited by pytorch-probot bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

facebook-github-bot commented Jan 11, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

💊 CI failures summary and remediations

Uh oh!

NicolasHug commented Jan 11, 2022

Uh oh!

pmeier commented Jan 11, 2022

Uh oh!

NicolasHug commented Jan 11, 2022

Uh oh!

pmeier commented Jan 11, 2022

Uh oh!

NicolasHug commented Jan 11, 2022

Uh oh!

pmeier commented Jan 11, 2022

Uh oh!

NicolasHug commented Jan 11, 2022

Uh oh!

pmeier commented Jan 11, 2022

Uh oh!

NicolasHug commented Jan 11, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NicolasHug commented Jan 11, 2022

Uh oh!

NicolasHug left a comment

Choose a reason for hiding this comment

Uh oh!

pmeier commented Jan 11, 2022

Uh oh!

Uh oh!

pmeier commented Jan 11, 2022 •

edited by pytorch-probot bot

Loading

facebook-github-bot commented Jan 11, 2022 •

edited

Loading

NicolasHug commented Jan 11, 2022 •

edited

Loading