Potential Malfunctioning of GDriveReader for torchtext datasets (or files with larger sizes in general) #468

Closed
@parmeet

Description

🐛 Describe the bug

Currently, none of the torchtext datasets hosted at a GDrive URL can be downloaded. The reason is that the confirm token is always None.

The logic to download large datasets is borrowed from tensor2tensor library here. This is also proposed in various answers here.

The confirm token comes from response.cookies.items(). Unfortunately, for all the URLs we have in torchtext, this returns an empty list.

import requests
session = requests.Session()
# URL of the DBpedia dataset
URL = "https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbQ2Vic1kxMmZZQ1k"
response = session.get(URL, stream=True)
response.cookies.items()  # returns an empty list

Eventually, this leads to the following error (since the response above does not contain a "content-disposition" header):
Internal error: headers don't contain content-disposition.
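
For reference, here is a minimal sketch of the cookie-based token lookup described above. It assumes the commonly cited convention that the token is stored in a cookie whose name starts with "download_warning"; the helper name find_confirm_token and the sample cookie are hypothetical and only illustrate the idea, not the exact GDriveReader code.

```python
from typing import Iterable, Optional, Tuple

# Assumption: cookie name prefix used by the commonly cited workaround.
WARNING_COOKIE_PREFIX = "download_warning"


def find_confirm_token(cookie_items: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return the value of the first cookie whose name starts with the prefix, else None."""
    for name, value in cookie_items:
        if name.startswith(WARNING_COOKIE_PREFIX):
            return value
    return None


# With an empty cookie list, as observed in this report, no token is found.
print(find_confirm_token([]))

# A hypothetical cookie of the expected shape yields its value.
print(find_confirm_token([("download_warning_abc", "t0k3n")]))
```

In practice, the pairs would come from response.cookies.items(); since that list is empty for the URLs above, a lookup of this kind has nothing to match.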

As an aside (not relevant to solving the current problem): in torchtext, we previously reported this error as "Internal error: confirm_token was not found in Google drive link."

I am not sure whether something has changed recently regarding the confirm token. Are there any potential workarounds or fixes for this problem?

cc: @NivekT , @Nayef211

Versions

Latest from main
