Potential Malfunctioning of GDriveReader for torchtext datasets (or files with larger sizes in general) #468

Closed
parmeet opened this issue May 25, 2022 · 7 comments

Comments

@parmeet
Contributor

parmeet commented May 25, 2022

🐛 Describe the bug

Currently, none of the torchtext datasets with a GDrive URL can download their files. The reason is that the confirm token is always None.

The logic to download large datasets is borrowed from the tensor2tensor library here. This approach is also proposed in various answers here.

The confirm token comes from response.cookies.items(). Unfortunately, I noticed that for all the URLs we have in torchtext this returns an empty list.

import requests

session = requests.Session()
# URL of the DBpedia dataset
URL = "https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbQ2Vic1kxMmZZQ1k"
response = session.get(URL, stream=True)
response.cookies.items()  # returns an empty list

Eventually this leads to the following error (since the response above does not contain a "content-disposition" header):
Internal error: headers don't contain content-disposition.
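
For readers unfamiliar with the confirm-token lookup, here is a minimal sketch of the two checks described above. The helper names are hypothetical, and the "download_warning" cookie prefix is an assumption based on the commonly cited workaround; it is not taken from the torchdata source.

```python
from typing import Iterable, Mapping, Optional, Tuple


def find_confirm_token(cookies: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return the value of the first cookie whose name starts with
    'download_warning' (assumed prefix), or None if there is no such cookie."""
    for name, value in cookies:
        if name.startswith("download_warning"):
            return value
    return None


def has_content_disposition(headers: Mapping[str, str]) -> bool:
    """Check for a content-disposition header, ignoring case."""
    return any(key.lower() == "content-disposition" for key in headers)
```

With a requests response, these would be called as find_confirm_token(response.cookies.items()) and has_content_disposition(response.headers). In the failing case reported here, the first returns None and the second returns False.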

As an aside (although not relevant to solving the current problem), in torchtext we earlier reported this error as: Internal error: confirm_token was not found in Google drive link.

I am not sure if something has changed recently regarding the confirm token, and I wonder if we have any potential workarounds or fixes for this problem?

cc: @NivekT , @Nayef211

Versions

Latest from main

@pmeier
Contributor

pmeier commented May 26, 2022

GDrive recently introduced a virus scan warning for large files like the one from the link above. This was solved in pytorch/vision#5645. This is a good use case to speed up what we already agreed on in pytorch/vision#6060 (review).
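
To illustrate the idea, here is a rough sketch of how a downloader might detect the warning page and retry. It is not the code from pytorch/vision#5645. The helper names, the page-title pattern, and the confirm="t" parameter are all assumptions about how Google Drive behaved at the time, and they may no longer hold.

```python
import itertools
import re
from typing import Dict, Iterator, Optional

GDRIVE_URL = "https://drive.google.com/uc"


def gdrive_page_title(first_chunk: bytes) -> Optional[str]:
    """Extract the '<title>Google Drive - ...</title>' suffix, if present.
    The title format is an assumption, not a documented API."""
    text = first_chunk.decode("utf-8", errors="ignore")
    match = re.search(r"<title>Google Drive - (?P<title>.+?)</title>", text)
    return match["title"] if match else None


def build_params(file_id: str, first_chunk: bytes) -> Dict[str, str]:
    """Add a confirm parameter only when the first chunk looks like the
    virus scan warning page (the value 't' is an assumption)."""
    params = {"id": file_id, "export": "download"}
    if gdrive_page_title(first_chunk) == "Virus scan warning":
        params["confirm"] = "t"
    return params


def iter_file_chunks(file_id: str, chunk_size: int = 32 * 1024) -> Iterator[bytes]:
    """Yield the file's bytes, retrying once if the warning page is served."""
    import requests  # imported lazily so the helpers above work without it

    session = requests.Session()
    params = {"id": file_id, "export": "download"}
    response = session.get(GDRIVE_URL, params=params, stream=True)
    chunks = response.iter_content(chunk_size=chunk_size)
    first_chunk = next(chunks, b"")

    retry_params = build_params(file_id, first_chunk)
    if "confirm" in retry_params:
        response = session.get(GDRIVE_URL, params=retry_params, stream=True)
        return response.iter_content(chunk_size=chunk_size)
    # No warning page: put the chunk we already consumed back in front.
    return itertools.chain([first_chunk], chunks)
```

Inspecting the first chunk rather than the cookies matters here because, as reported above, the cookie jar is empty when the warning page is served.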

@parmeet
Contributor Author

parmeet commented May 26, 2022

Thanks @pmeier for the quick response. I guess the solution in pytorch/vision#5645 could potentially be upstreamed to torchdata to fix this issue? I am curious whether there is enough bandwidth and time to do that before the release. @ejguan, @NivekT any suggestions here?

@ejguan
Contributor

ejguan commented May 26, 2022

There is a PR trying to fix it. #442

@parmeet
Contributor Author

parmeet commented May 26, 2022

There is a PR trying to fix it. #442

Oh that's great!! :). Let me know and I will test it out once we land the PR.

@ejguan
Contributor

ejguan commented May 26, 2022

It's landed now. Do you want to test it using TorchData main branch? This patch didn't get into nightly today.

@ejguan
Contributor

ejguan commented May 26, 2022

Closing it for now. LMK if the error persists with the patch.

@ejguan ejguan closed this as completed May 26, 2022
@parmeet
Contributor Author

parmeet commented May 26, 2022

It's landed now. Do you want to test it using TorchData main branch? This patch didn't get into nightly today.

Thanks @ejguan for letting me know. Seems to be working fine now :).
