Description
🐛 Describe the bug
Currently, none of the torchtext datasets hosted at a GDrive URL can be downloaded. The reason is that the confirm token is always None.
The logic to download large datasets is borrowed from the tensor2tensor library here. The same approach is also proposed in various answers here.
The confirm token comes from response.cookies.items(). Unfortunately, for all the URLs we have in torchtext, this returns an empty list.
import requests
session = requests.Session()
# URL of the dbpedia dataset
URL = "https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbQ2Vic1kxMmZZQ1k"
response = session.get(URL, stream=True)
response.cookies.items()  # returns an empty list
Eventually this leads to the following error (since the response above does not contain a "content-disposition" header):
Internal error: headers don't contain content-disposition.
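For context, the two checks involved can be sketched roughly as below. This is only an illustration, not the actual torchtext code: the helper names find_confirm_token and has_content_disposition are hypothetical, and the "download_warning" cookie prefix is an assumption taken from the commonly cited workaround for large GDrive files.

```python
from typing import Iterable, Mapping, Optional, Tuple


def find_confirm_token(cookies: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return the value of the first cookie whose name starts with
    'download_warning' (assumed prefix), or None if no such cookie exists."""
    for name, value in cookies:
        if name.startswith("download_warning"):
            return value
    return None


def has_content_disposition(headers: Mapping[str, str]) -> bool:
    """Case-insensitive check for a Content-Disposition header."""
    return any(name.lower() == "content-disposition" for name in headers)


# With the empty cookie list observed in this report, no token is found.
print(find_confirm_token([]))  # None
```

With requests, these would be called as find_confirm_token(response.cookies.items()) and has_content_disposition(response.headers). In the failing case described here, the first returns None and the second returns False.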
As an aside (not relevant to solving the current problem): earlier in torchtext we reported this error as Internal error: confirm_token was not found in Google drive link.
I am not sure whether something has changed recently regarding the confirm token, and I wonder if there are any potential workarounds or fixes for this problem.
Versions
Latest from main