ViTDet object detection + segmentation implementation #7690
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/7690. Note: links to docs will display an error until the docs builds have completed. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
I updated this PR so that the implementation more closely resembles the initial ViT implementation in torchvision. I have also updated the first post accordingly, to avoid unnecessary reading :p The only difference made in this PR now is that (the rest of the TODOs still stand)
I trained a COCO model using ViT-B as backbone with the following command:
And got the following results:
This configuration should get approximately

One thing to note is that I trained on a single GPU with batchsize=4, whereas they trained on 64 GPUs (1 image per GPU). I'm not sure what the effect of this is, since I don't have 64 GPUs at my disposal. If someone has the resources to train with batchsize=64, I would be very interested to see how it performs. In the meantime I will try to use this model some more to see if I can improve on these results.
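(For reference on the batch-size gap above: a common heuristic, not something verified in this PR, is the linear scaling rule from Goyal et al.'s "Accurate, Large Minibatch SGD" — scale the learning rate proportionally with the effective batch size. A minimal sketch, with illustrative numbers:)

```python
def scaled_lr(base_lr: float, base_batch_size: int, actual_batch_size: int) -> float:
    """Linear scaling rule heuristic: scale the learning rate in proportion
    to the effective batch size. Whether this fully compensates for training
    with batch size 4 instead of 64 here is untested."""
    return base_lr * actual_batch_size / base_batch_size


# illustrative: going from 64 images per step down to 4
lr = scaled_lr(base_lr=1e-4, base_batch_size=64, actual_batch_size=4)
```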
Is there any update on how to fix this? I would really like to have a working ViTDet torchvision implementation.
None that I have found. I modified the implementation to match detectron2's (to the point where both networks output the same features, given the same input and RNG seed), but surprisingly the results are even worse. I don't have the numbers at hand at the moment, but I will continue to look into this. If you're interested, feel free to give it a go and see what performance you get.
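(The cross-implementation check described above can be sketched roughly as follows — the model handles and tolerance are placeholders, not the actual comparison code used in this PR:)

```python
import torch


def features_match(model_a, model_b, input_shape=(1, 3, 224, 224), atol=1e-5):
    """Run two implementations on the same seeded random input and compare
    their outputs. Assumes both models take a single tensor and return a
    single tensor; the real backbones return richer structures."""
    torch.manual_seed(0)
    x = torch.randn(input_shape)
    model_a.eval()
    model_b.eval()
    with torch.no_grad():
        out_a, out_b = model_a(x), model_b(x)
    return torch.allclose(out_a, out_b, atol=atol)
```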
I'm slowly making progress on this, but I am not completely there yet. Is there still interest from the torchvision maintainers in merging this at some point? @pmeier, can I ask you for your feedback? Or alternatively, can you let me know who is best to ask?
The latest changes did have an impact on the COCO evaluation score:
Though 0.380 still isn't the expected 0.424. I suspect the relative positional embedding in the multi-head attention might explain this difference (it cannot be added using the Attention layer from torch). The easiest solution would be to implement a custom Attention layer in torchvision, à la detectron2.
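(For readers unfamiliar with the mechanism: detectron2's ViTDet attention adds a decomposed relative positional bias — separate learnable embeddings for relative offsets along the height and width axes — to the attention logits. A self-contained sketch of the idea; the class name, parameter names, and exact placement of the bias are illustrative, not this PR's code:)

```python
import torch
import torch.nn as nn


class AttentionWithRelPos(nn.Module):
    """Multi-head self-attention with decomposed relative positional
    embeddings, in the spirit of detectron2's ViTDet attention layer."""

    def __init__(self, dim, num_heads, input_size):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.input_size = input_size
        h, w = input_size
        # one learnable embedding per possible relative offset along each axis
        self.rel_pos_h = nn.Parameter(torch.zeros(2 * h - 1, self.head_dim))
        self.rel_pos_w = nn.Parameter(torch.zeros(2 * w - 1, self.head_dim))

    def forward(self, x):  # x: (B, H*W, dim)
        B, N, C = x.shape
        H, W = self.input_size
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)  # each (B, heads, N, head_dim)
        attn = (q * self.scale) @ k.transpose(-2, -1)   # (B, heads, N, N)

        # decomposed relative position bias: separate height and width terms
        idx_h = torch.arange(H)[:, None] - torch.arange(H)[None, :] + (H - 1)
        idx_w = torch.arange(W)[:, None] - torch.arange(W)[None, :] + (W - 1)
        Rh = self.rel_pos_h[idx_h]  # (H, H, head_dim)
        Rw = self.rel_pos_w[idx_w]  # (W, W, head_dim)
        r_q = q.reshape(B, self.num_heads, H, W, self.head_dim)
        bias_h = torch.einsum("bnhwc,hkc->bnhwk", r_q, Rh)  # (B, heads, H, W, H)
        bias_w = torch.einsum("bnhwc,wkc->bnhwk", r_q, Rw)  # (B, heads, H, W, W)
        attn = attn.view(B, self.num_heads, H, W, H, W)
        attn = attn + bias_h[..., :, None] + bias_w[..., None, :]
        attn = attn.view(B, self.num_heads, N, N).softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```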
Good news, the accuracy has gone up significantly by changing the attention layer. The main difference should be that it uses a relative positional embedding. The score I am getting on COCO now is:
That 0.421 is awfully close to the 0.424 reported in their paper. I will update the first post with the TODOs that are still left to implement. Considering there seems to be little to no interest in this, I will stop development here, as this was all I needed (a working ViTDet in torchvision).
I found some bugs in the learning rate decay; with those fixed, the results are:
The segmentation results are identical to those in the paper, and the bbox results are nearly identical. 🥳
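(The learning rate decay in question is ViTDet's layer-wise LR decay: earlier transformer blocks get smaller learning rates, multiplied by `decay_rate ** (depth_from_top)`. A sketch in the spirit of detectron2's `get_vit_lr_decay_rate`; the parameter-name patterns like `.layers.<i>.` are assumptions about module naming, not the actual torchvision paths:)

```python
def layerwise_lr_factor(param_name, decay_rate=0.7, num_layers=12):
    """Per-parameter LR multiplier for layer-wise learning rate decay:
    the patch embedding gets the smallest factor, transformer block i gets
    decay_rate ** (num_layers - i), and everything else (heads, neck)
    gets the full base learning rate."""
    layer_id = num_layers + 1  # default: non-backbone params use the base LR
    if "patch_embed" in param_name or "pos_embedding" in param_name:
        layer_id = 0
    elif ".layers." in param_name:
        # e.g. 'backbone.encoder.layers.3.mlp.0.weight' -> transformer block 3
        layer_id = int(param_name.split(".layers.")[1].split(".")[0]) + 1
    return decay_rate ** (num_layers + 1 - layer_id)
```

Each optimizer parameter group would then use `base_lr * layerwise_lr_factor(name)`.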
Any update on this? Would be really cool to have ViTDet in torchvision :)
If necessary I can rebase this PR, but I haven't heard from any torchvision maintainer yet, so I will wait ^^
Makes sense. @datumbox @pmeier @NicolasHug @fmassa sorry for the ping if you are on holiday! Can someone maybe leave a short comment on whether this PR has a chance of being considered? It would be really cool to have ViTDet in torchvision.
It would be great to have this model in Torchvision's model zoo.
Thanks for this great contribution! Any update on the release of this code?
I am curious to know too :). @NicolasHug apologies for tagging, but you seem to be actively working on torchvision from the Meta AI group. Are you in a position to help guide this PR to a mergeable state, or do you know someone who is? I still think it would be a good addition to have.
@hgaiser I'm really sorry, I appreciate the work, but we've been unable to prioritize model authoring in torchvision for a while. We won't be adding new models in the foreseeable future. I think the best way for you to make this available would be to publish it through
@NicolasHug thanks for the response, it is what it is :(.
@hgaiser Please let us know in this thread if you plan to release this work in another hub or repo. That would be awesome.
At the moment, I don't have any plans to release it anywhere else. To anyone interested, feel free to pick it up. Things might change in the future, but for now I have no time to work on this.
This PR implements ViTDet, as per #7630. I needed this implementation regardless of the feedback from the torchvision maintainers, but I figured it makes sense to try to merge it upstream. The implementation borrows heavily from the one in detectron2. There is still some work to do, but since there is no feedback on whether this will ever be merged, I will pause development at this stage.
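(The core idea borrowed from detectron2 is ViTDet's "simple feature pyramid": multi-scale feature maps are built from the single-scale output of a plain ViT backbone using transposed convolutions, identity, and pooling. A minimal sketch of that idea — the class name, channel widths, and the activation between deconvs are illustrative, not this PR's exact code:)

```python
import torch
import torch.nn as nn


class SimpleFeaturePyramid(nn.Module):
    """Sketch of ViTDet's simple feature pyramid: scales 4x, 2x, 1x, and
    0.5x are all derived from the final ViT feature map, instead of from
    intermediate CNN stages as in a classic FPN."""

    def __init__(self, in_channels=768, out_channels=256):
        super().__init__()
        self.scale_ops = nn.ModuleList([
            nn.Sequential(  # 4x upsample: two stride-2 transposed convs
                nn.ConvTranspose2d(in_channels, in_channels // 2, 2, stride=2),
                nn.GELU(),
                nn.ConvTranspose2d(in_channels // 2, in_channels // 4, 2, stride=2),
            ),
            nn.ConvTranspose2d(in_channels, in_channels // 2, 2, stride=2),  # 2x
            nn.Identity(),                                                   # 1x
            nn.MaxPool2d(kernel_size=2, stride=2),                           # 0.5x
        ])
        dims = [in_channels // 4, in_channels // 2, in_channels, in_channels]
        # 1x1 laterals project every scale to a common channel width
        self.lateral = nn.ModuleList([nn.Conv2d(d, out_channels, 1) for d in dims])

    def forward(self, x):  # x: (B, C, H, W), the final ViT feature map
        return [lat(op(x)) for op, lat in zip(self.scale_ops, self.lateral)]
```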
Discussion points
- The implementation currently lives in the mask_rcnn.py file, since they are so much alike. Should I put it in a separate vitdet.py file instead?

Current status
A training with the following command:
Achieves the following result:
The segmentation results are identical to the results from their paper.
TODOs
My main intention in opening this PR is to let the torchvision maintainers provide their feedback and opinion. @fmassa I'm not sure if you are still working on these things, but I'm tagging you since we worked together on the RetinaNet implementation :).