use enums in prototype datasets for demux #5189
Thanks for trying this @pmeier ! To clarify my question / suggestion in https://github.com/pytorch/vision/pull/5115/files#r778741734, I was wondering if we could use enums (or hardcoded constants, or something else) precisely to avoid xref-ing. I would completely agree that there is no need for such structures when the hardcoded values (0, 1, 2...) are only used in one single place, i.e. in the
Having them only in one place is by far the most common case. The only exception to this is to generate static information. So in Given that someone maintaining a dataset will in 99% of the cases work on
What are your thoughts on

    def _filter_images(self, data: Tuple[str, Any]) -> bool:
        return self._classify_archive(data) == 2

vs

    def _filter_images(self, data: Tuple[str, Any]) -> bool:
        return self._classify_archive(data) == DTDDemux.Images

Would you agree that a non-expert here could potentially wonder where this hard-coded Also, what are your thoughts in general on seeing the same hard-coded constant in various parts of the code (e.g. I think I misunderstand what you mean by "looking up the enum in the classifier function".
The values returned by the classifier function are not arbitrary, but rather indices that indicate into which datapipe an item gets sorted. For example:

    num_dps = 3
    dps = Demultiplexer(dp, num_dps, classifier_fn)
    assert len(dps) == num_dps

If Thus,

    def _filter_images(self, data: Tuple[str, Any]) -> bool:
        return self._classify_archive(data) == 2

    dp = Filter(dp, _filter_images)

is roughly equivalent to

    dp = Demultiplexer(dp, 3, self._classify_archive(data))[2]

The difference between the two is that
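To make the index semantics concrete, here is a minimal pure-Python sketch. The `demux` helper, the `classify` function, and the file names are hypothetical stand-ins invented for illustration; this is not the real `Demultiplexer`, which is lazy and buffered. The sketch only shows that the classifier's return value selects the output position, so filtering on `== 2` yields the same items as taking output `[2]`.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def demux(
    items: Iterable[T],
    num_instances: int,
    classifier_fn: Callable[[T], Optional[int]],
) -> List[List[T]]:
    # Hypothetical eager stand-in for a demultiplexer: one output list per index.
    buckets: List[List[T]] = [[] for _ in range(num_instances)]
    for item in items:
        idx = classifier_fn(item)
        if idx is None:
            continue  # comparable in spirit to drop_none=True
        buckets[idx].append(item)
    return buckets


def classify(name: str) -> Optional[int]:
    # Illustrative classifier: the returned integer is an output index.
    if name.endswith(".txt"):
        return 0
    if name.endswith(".jpg"):
        return 2
    return None


names = ["a.txt", "b.jpg", "c.md", "d.jpg"]
buckets = demux(names, 3, classify)
filtered = [name for name in names if classify(name) == 2]

# Selecting output 2 and filtering on "== 2" give the same items.
assert buckets[2] == filtered
```

In this sketch the demux visits every item once and fills all outputs in a single pass, whereas the filter only keeps one class of items.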
Thanks for clarifying @pmeier . I'm still not sure I understand how using an

    splits_dp, joint_categories_dp, images_dp = Demultiplexer(
        archive_dp, 3, self._classify_archive, drop_none=True, buffer_size=INFINITE_BUFFER_SIZE
    )

they have to know somehow that splits, categories and images correspond respectively to 0, 1, and 2. Purely in terms of xref, whether they look up this info in

In any case, I don't believe it's worth spending much more time here. I do believe that it's good practice to avoid hardcoding the same value (arbitrary or not) in different places in the code. But perhaps there are things that you see and that I don't see yet. I'll leave it up to you to decide. Thanks for opening the PR and for trying my suggestion.
Compare

    def classify_int(path):
        if path.parent.name == "labels":
            if path.name == "labels_joint_anno.txt":
                return 1
            return 0
        elif path.parents[1].name == "images":
            return 2

    splits_dp, joint_categories_dp, images_dp = Demultiplexer(archive_dp, 3, classify_int)

with

    def classify_enum(path):
        if path.parent.name == "labels":
            if path.name == "labels_joint_anno.txt":
                return DTDDemux.JOINT_CATEGORIES
            return DTDDemux.SPLITS
        elif path.parents[1].name == "images":
            return DTDDemux.IMAGES

    splits_dp, joint_categories_dp, images_dp = Demultiplexer(archive_dp, 3, classify_enum)

In the latter case, you have to look at the actual fields of
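For reference, a sketch of what an enum backing `classify_enum` could look like. The member values are an assumption reconstructed from the integers returned by `classify_int` in this thread, the explicit `return None` is added for clarity, and the paths are illustrative only. Because `IntEnum` members are integers, they can still serve as output indices, and `len(DTDDemux)` could stand in for the hard-coded `3`.

```python
import enum
import pathlib
from typing import Optional


class DTDDemux(enum.IntEnum):
    # Assumed values, mirroring the integers used by classify_int above.
    SPLITS = 0
    JOINT_CATEGORIES = 1
    IMAGES = 2


def classify_enum(path: pathlib.PurePosixPath) -> Optional[DTDDemux]:
    if path.parent.name == "labels":
        if path.name == "labels_joint_anno.txt":
            return DTDDemux.JOINT_CATEGORIES
        return DTDDemux.SPLITS
    elif path.parents[1].name == "images":
        return DTDDemux.IMAGES
    return None  # anything else is left unclassified


# Illustrative paths; the directory layout here is assumed.
image_path = pathlib.PurePosixPath("dtd/images/banded/banded_0002.jpg")
split_path = pathlib.PurePosixPath("dtd/labels/train1.txt")
joint_path = pathlib.PurePosixPath("dtd/labels/labels_joint_anno.txt")

# IntEnum members compare equal to, and index like, plain integers.
assert classify_enum(image_path) == 2
outputs = ["splits", "joint_categories", "images"]
assert outputs[classify_enum(image_path)] == "images"
assert len(DTDDemux) == 3
```

The order in which the unpacked datapipes are named still has to match the member values, which is the cross-referencing step debated in this thread.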
Sorry, I disagree with the premise that we need to check
This assumption is wrong. It does not map element to element, but rather element to index. And you need this index to correctly assign the demultiplexed datapipes. If you disagree, please tell me how you would assign
Fully agreed, I should have written (addition in bold):
This does not change anything to what I wrote above though.
We agree here as well: we have to look at the enum values to call But we don't have to look at
I'll stop commenting on this PR for good now, as I'm not sure we're getting anywhere.
Thanks a lot @pmeier !
I finally get what you mean. I've removed the enums from all datasets that do not re-use the values. For the others, I now agree that enums make this more readable.
Summary:
* use enums in prototype datasets for demux
* use enum for category generation
* revert enum usage for single use constants
Reviewed By: NicolasHug
Differential Revision: D33618173
fbshipit-source-id: a4ab9349905806f2cd0c701c4b59bc1ab0ad14ae
Addresses #5115 (comment). I feel like this would be a net negative. It doesn't enhance readability. In contrast, now I need to do an extra step to xref the classifier function with respect to the returned datapipes of the Demultiplexer.
cc @pmeier @bjuncek