Potential Malfunctioning of GDriveReader for torchtext datasets (or files with larger sizes in general) #468
Comments
GDrive recently introduced a virus scan warning for large files like the one from the link above. This was solved in pytorch/vision#5645. This is a good use case to speed up what we already agreed on in pytorch/vision#6060 (review).
Thanks @pmeier for the quick response. I guess the solution in pytorch/vision#5645 could potentially be upstreamed to torchdata to fix this issue? I am curious whether there is enough bandwidth and time to do that before the release. @ejguan, @NivekT, any suggestions here?
There is a PR trying to fix it: #442.
Oh that's great!! :) Let me know and I will test it out once we land the PR.
It's landed now. Do you want to test it using the TorchData main branch? This patch didn't get into today's nightly.
Closing it for now. LMK if the error persists with the patch.
Thanks @ejguan for letting me know. Seems to be working fine now :)
🐛 Describe the bug
Currently, none of the torchtext datasets with a GDrive URL can be downloaded. The reason is that the confirm token is always None.
The logic to download large datasets is borrowed from the tensor2tensor library here. It is also proposed in various answers here.
The confirm token comes from response.cookies.items(). Unfortunately, I noticed that for all the URLs we have in torchtext, this returns an empty list.
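For reference, the cookie-based lookup described above can be sketched roughly as follows. This is a minimal illustration, not the actual torchdata or torchtext code: the helper name `get_confirm_token` is hypothetical, it takes a plain mapping instead of a live `requests` response so it runs without network access, and the `download_warning` cookie prefix is an assumption based on the commonly shared workaround.

```python
from typing import Mapping, Optional


def get_confirm_token(cookies: Mapping[str, str]) -> Optional[str]:
    """Hypothetical helper: return the confirm token from a cookie mapping.

    Assumes the token is stored in a cookie whose name starts with
    "download_warning", as in the widely shared workaround.
    """
    for name, value in cookies.items():
        if name.startswith("download_warning"):
            return value
    # No matching cookie: this is the situation reported in this issue.
    return None


# With a warning cookie present, the token is found.
print(get_confirm_token({"download_warning_abc": "t0k3n"}))
# With an empty cookie jar, the lookup falls through to None.
print(get_confirm_token({}))
```

When the cookie jar is empty, as observed for the torchtext URLs, the function can only return None, which matches the behaviour reported here.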
Eventually this leads to the following error (since the response above does not contain "content-disposition"):
Internal error: headers don't contain content-disposition.
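The failing check can be pictured with a small sketch like the one below. It is an assumption about the shape of the check, not the library's actual implementation: `require_content_disposition` is a hypothetical name, and the headers are passed as a plain mapping so the example is self-contained.

```python
from typing import Mapping


def require_content_disposition(headers: Mapping[str, str]) -> str:
    """Hypothetical check: return the content-disposition header or raise.

    HTTP header names are case-insensitive, so the keys are lower-cased
    before the lookup.
    """
    normalized = {key.lower(): value for key, value in headers.items()}
    if "content-disposition" not in normalized:
        # An HTML warning page returned instead of the file has no such header.
        raise RuntimeError(
            "Internal error: headers don't contain content-disposition."
        )
    return normalized["content-disposition"]


# A file response carries the header; an HTML page typically does not.
print(require_content_disposition(
    {"Content-Disposition": 'attachment; filename="data.zip"'}
))
try:
    require_content_disposition({"Content-Type": "text/html; charset=utf-8"})
except RuntimeError as err:
    print(err)
```

If GDrive responds with an HTML page rather than the file itself, a check of this kind fails even though the request succeeded.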
As an aside (although not relevant to solving the current problem), in torchtext we earlier reported this error as
Internal error: confirm_token was not found in Google drive link.
I am not sure if something has changed recently regarding the confirm token, and I wonder if we have any potential workarounds or fixes for this problem.
cc: @NivekT, @Nayef211
Versions
Latest from main