Skip to content

use enums in prototype datasets for demux #5189

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jan 11, 2022

Conversation

pmeier
Copy link
Collaborator

@pmeier pmeier commented Jan 11, 2022

Addresses #5115 (comment). I feel like this would be a net negative. It doesn't enhance readability. In contrast, now I need to do an extra step to xref the classifier function with respect to the returned datapipes of the Demultiplexer.

cc @pmeier @bjuncek

@facebook-github-bot
Copy link

facebook-github-bot commented Jan 11, 2022

💊 CI failures summary and remediations

As of commit b0be39d (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@NicolasHug
Copy link
Member

Thanks for trying this @pmeier !

To clarify my question / suggestion in https://github.com/pytorch/vision/pull/5115/files#r778741734, I was wondering if we could use enums (or hardcoded constants, or something else) precisely to avoid xref-ing.

I would completely agree that there is no need for such structures when the hardcoded values (0, 1, 2...) are only used in one single place, i.e. in the _classify_archive function. My suggestion relates to cases where these constants are used more than once. For example, DTD's _filter_images() still uses a hard-coded value of 2 in this PR, but perhaps using DTDDemux.Images would be clearer (and avoid one xref)?

@pmeier
Copy link
Collaborator Author

pmeier commented Jan 11, 2022

I would completely agree that there is no need for such structures when the hardcoded values (0, 1, 2...) are only used in one single place, i.e. in the _classify_archive function.

Having them only in one place is by far the most common case. Demultiplexer is usually used to split one large dataset archive into its componets and this is only done once throughout _make_datapipe.

The only exception to this is to generate static information. So in _generate_categories we may (or may not) need different files or only a single stream of files compared to _make_datapipe. The latter is what _filter_images in the DTD dataset does.

Given that someone maintaining a dataset will in 99% of the cases work on _make_datapipe rather than _generate_categories, IMO the xref trade (looking up the enum in the classifier function vs looking up the int value) cannot be justified.

@NicolasHug
Copy link
Member

What are your thoughts on

def _filter_images(self, data: Tuple[str, Any]) -> bool:
    return self._classify_archive(data) == 2

vs

def _filter_images(self, data: Tuple[str, Any]) -> bool:
    return self._classify_archive(data) == DTDDemux.Images

Would you agree that a non-expert here could potentially wonder where this hard-coded 2 comes from, and what it corresponds to? By non-expert, I mean someone who's not an expert in this new dataset design. I consider myself a non-expert. Would you agree that DTDDemux.Images is easier to search for in a code-base than 2?

Also, what are your thoughts in general on seeing the same hard-coded constant in various parts of the code (e.g. 2), vs seeing the same constant variable e.g. DTDDemux.Images or _IMAGES? Would you agree that one is more robust to changes than the other? Would you agree that one conveys more meaning than the other?


I think I misunderstand what you mean by "looking up the enum in the classifier function".
Are you saying that one needs to lookup what DTDDemux.Images corresponds to when they're reading _classify_archive()? If that's the case then I'm not sure I agree; from my understanding, one benefit of enums is precisely that one does not need to bother about the underlying int value.

@pmeier
Copy link
Collaborator Author

pmeier commented Jan 11, 2022

one benefit of enums is precisely that one does not need to bother about the underlying int value.

The values returned by the classifier function are not arbitrary, but rather indices that indicate into which datapipe an item gets sorted. For example:

num_dps = 3
dps = Demultiplexer(dp, num_dps, classifier_fn)
assert len(dps) == num_dps

If classifier_fn returns 0, the item will get sorted into dps[0]. The same is true for 1 and dps[1] as well as 2 and dps[2]. If the classifier_fn returns anything but an integer with 0 <= idx < num_dps, the iteration will fail.

Thus,

def _filter_images(self, data: Tuple[str, Any]) -> bool:
    return self._classify_archive(data) == 2

dp = Filter(dp, _filter_images)

is roughly equivalent to

dp = Demultiplexer(dp, 3, self._classify_archive(data))[2]

The difference between the two is that Filter drops items for with the filter function returns False, whereas all items of unused datapipes from Demultiplexer will be kept in memory.

@NicolasHug
Copy link
Member

Thanks for clarifying @pmeier . I'm still not sure I understand how using an enum adds an additional xref though. When one reads (or write) e.g.

        splits_dp, joint_categories_dp, images_dp = Demultiplexer(
            archive_dp, 3, self._classify_archive, drop_none=True, buffer_size=INFINITE_BUFFER_SIZE
)

they have to know somehow that splits, categories and images correspond respectively to 0, 1, and 2. Purely in terms of xref, whether they look up this info in _filter_images() or in the enum class makes no difference, IMHO.

In any case, I don't believe it's worth spending much more time here. I do believe that it's good practice to avoid hardcoding the same value (arbitrary or not) in different places in the code. But perhaps there are things that you see and that I don't see yet. I'll leave it up to you to decide.

Thanks for opening the PR and for trying my suggestion.

@pmeier
Copy link
Collaborator Author

pmeier commented Jan 11, 2022

Compare

def classifiy_int(path):
    if path.parent.name == "labels":
        if path.name == "labels_joint_anno.txt":
            return 1

        return 0
    elif path.parents[1].name == "images":
        return 2


splits_dp, joint_categories_dp, images_dp = Demultiplexer(archive_dp, 3, classify_int)

with

def classify_enum(path):
    if path.parent.name == "labels":
        if path.name == "labels_joint_anno.txt":
            return DTDDemux.JOINT_CATEGORIES

        return DTDDemux.SPLITS
    elif path.parents[1].name == "images":
        return DTDDemux.IMAGES


splits_dp, joint_categories_dp, images_dp = Demultiplexer(archive_dp, 3, classify_enum)

In the latter case, you have to look at the actual fields of DTDDemux to know if the mapping between classify_enum and splits_dp, joint_categories_dp, images_dp is correct. This defeats the purpose of enums.

@NicolasHug
Copy link
Member

Sorry, I disagree with the premise that we need to check classify_enum when we're calling Demultiplexer. All we have to assume is that classify_enum does what it's supposed to do, which is to map splits to splits, categories to categories, and images to images. Assuming that other functions do their job properly is a reasonable assumption.

@pmeier
Copy link
Collaborator Author

pmeier commented Jan 11, 2022

All we have to assume is that classify_enum does what it's supposed to do, which is to map splits to splits, categories to categories, and images to images.

This assumption is wrong. It does not map element to element, but rather element to index. And you need this index to correctly assign the demultiplexed datapipes.

If you disagree, please tell me how you would assign splits_dp, joint_categories_dp and images_dp in the correct order from Demultiplexer(archive_dp, 3, classify_enum) without looking at the actual values of the enum.

@NicolasHug
Copy link
Member

NicolasHug commented Jan 11, 2022

It does not map element to element, but rather element to index. And you need this index to correctly assign the demultiplexed datapipes.

Fully agreed, I should have written (addition in bolds):

All we have to assume is that classify_enum does what it's supposed to do, which is to map splits to the splits index, categories to the categories index, and images to the images index

This does not change anything to what I wrote above though.

please tell me how you would assign splits_dp, joint_categories_dp and images_dp in the correct order from Demultiplexer(archive_dp, 3, classify_enum) without looking at the actual values of the enum.

We agree here as well: we have to look at the enum values to call Demultiplexer.

But we don't have to look at classify_enum(). Which is why I don't understand where the "extra" xref comes from.

@NicolasHug
Copy link
Member

I'll stop commenting on this PR for good now, as I'm not sure we're getting anywhere.

@pmeier pmeier requested a review from NicolasHug January 11, 2022 14:43
Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @pmeier !

@pmeier
Copy link
Collaborator Author

pmeier commented Jan 11, 2022

But we don't have to look at classify_enum(). Which is why I don't understand where the "extra" xref comes from.

I finally get what you mean. I've removed the enums from all datasets that do not re-use the values. For the others, I now agree that enums make this more readable.

@pmeier pmeier merged commit 1c63096 into pytorch:main Jan 11, 2022
@pmeier pmeier deleted the datasets/enum-demux branch January 11, 2022 16:03
facebook-github-bot pushed a commit that referenced this pull request Jan 17, 2022
Summary:
* use enums in prototype datasets for demux

* use enum for category generation

* revert enum usage for single use constants

Reviewed By: NicolasHug

Differential Revision: D33618173

fbshipit-source-id: a4ab9349905806f2cd0c701c4b59bc1ab0ad14ae
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants