CelebA download is broken #5705

Closed
pmeier opened this issue Mar 30, 2022 · 12 comments
@pmeier
Collaborator

pmeier commented Mar 30, 2022

The download of all CelebA files except identity_CelebA.txt is broken. For example, the URL to download img_align_celeba.zip resolves to https://drive.google.com/uc?id=0B7EVK8r0v71pZjFTYXZWM3FlRnM&export=download. This link is publicly accessible, but you have to be logged into Google. Otherwise you'll see a 404 page.

I'll check whether it is possible to get a general download link from the ID.

cc @pmeier @YosuaMichael

@pmeier pmeier self-assigned this Mar 30, 2022
@pmeier
Collaborator Author

pmeier commented Mar 30, 2022

It seems the author has intentionally restricted the visibility of these files for some reason. If that is true, I think there is no way for us to provide an automatic download.

@pmeier
Collaborator Author

pmeier commented May 19, 2022

It seems something on the GDrive side has changed. The link above now gives a 403 (forbidden) page instead of a 404 page when you are logged in. When you are not logged in, you are prompted to log in and afterwards see the same 403 page.

Exporting the link manually gives https://drive.google.com/file/d/0B7EVK8r0v71pZjFTYXZWM3FlRnM/view. While logged in, you get transferred to the download page. If you are not logged in, you get the login prompt but land on a page stating that access needs to be requested.

For other datasets, e.g. Caltech101, both link variants are equivalent:

We prefer the first, because it has fewer redirects and checks.
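As a small illustration, the two link shapes quoted in this thread can be built from a file ID as sketched below. The helper name `gdrive_links` is hypothetical and not part of torchvision; the URL patterns are the ones shown above for the CelebA file.

```python
from typing import Tuple


def gdrive_links(file_id: str) -> Tuple[str, str]:
    """Return the two Google Drive link variants for one file ID.

    Hypothetical helper for illustration only.
    """
    # Variant 1: the direct download form that the CelebA URL resolves to.
    direct = f"https://drive.google.com/uc?id={file_id}&export=download"
    # Variant 2: the form obtained by exporting the link manually.
    viewer = f"https://drive.google.com/file/d/{file_id}/view"
    return direct, viewer


direct, viewer = gdrive_links("0B7EVK8r0v71pZjFTYXZWM3FlRnM")
print(direct)
print(viewer)
```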

In any case, it seems that CelebA is no longer automatically downloadable unless you are logged in. Thus, I propose we disable the download functionality, preferably before the upcoming release. @NicolasHug Thoughts?

@NicolasHug
Member

Thanks for reminding me and for your investigations @pmeier !

What do you mean by "being logged in"? Is it logged in from a browser?

@pmeier
Collaborator Author

pmeier commented May 19, 2022

What do you mean by "being logged in"? Is it logged in from a browser?

Yes. In particular, we need the session cookies. A possible way is to ask users to export the cookies from the browser to a file that we can read in. But this is far from an "automated approach".

AFAIK, there is no way to login from the command line or through env vars or the like.
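For reference, the cookie-file idea mentioned above could look roughly like the following sketch. It assumes the user has exported a Netscape-format `cookies.txt` from their browser; the function name `opener_from_cookie_file` is hypothetical, and nothing like this exists in torchvision. Only standard library calls are used.

```python
import http.cookiejar
import urllib.request


def opener_from_cookie_file(path):
    """Build a urllib opener that sends cookies from a Netscape-format file.

    Hypothetical helper: the cookie file is assumed to be exported manually
    from a browser session that is logged into Google.
    """
    jar = http.cookiejar.MozillaCookieJar(path)
    # Session cookies are marked as discardable, so keep them explicitly.
    jar.load(ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar
```

A request would then go through `opener.open(url)` instead of a plain `urllib.request.urlopen(url)`. Whether Google Drive accepts such a request is not confirmed here, and the manual export step remains.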

@NicolasHug
Member

Thanks for the deets. I agree we should deactivate it.

@pmeier
Collaborator Author

pmeier commented May 20, 2022

Closed in #6052.

@pmeier pmeier closed this as completed May 20, 2022
@NicolasHug
Member

Should we keep it open? Ultimately we'll want to put back the download feature, if the Gdrive becomes available again?

@datumbox
Contributor

That's reasonable. Should we also ping the owner (send an email or something) to let them know and ask them if they can open up or rehost the dataset?

@datumbox datumbox reopened this May 20, 2022
@pmeier
Collaborator Author

pmeier commented May 20, 2022

Ultimately we'll want to put back the download feature, if the Gdrive becomes available again?

I don't think it will come back. This is not a limitation by GDrive, but as explained in #5705 (comment) a conscious decision by the author to limit access. I've contacted them twice and asked to revert it, but got no response.

If we want to keep it open, we should have some kind of scheduled test or the like that checks whether the download is publicly accessible again. Otherwise we'll just forget about this and end up with a stale issue. At least I will forget to regularly check whether the author has changed the permissions.
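A scheduled check of this kind might be sketched as follows. The function names `fetch_status` and `is_publicly_accessible` are hypothetical. The sketch assumes that an anonymous request returning HTTP 200 means the file is public; since a login page could also be served with status 200 after a redirect, a real check would probably need to inspect the response as well.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=30):
    """Return the HTTP status code of an anonymous (cookie-less) GET request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx responses; the code is what we want here.
        return error.code


def is_publicly_accessible(url, fetch=fetch_status):
    """Treat the file as public only if the anonymous request returns 200.

    `fetch` is injectable so the logic can be tested without network access.
    """
    return fetch(url) == 200
```

A scheduled job could call `is_publicly_accessible(...)` with the download link and report when the result flips to `True`.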

@datumbox
Contributor

Thanks for clarifying Philip. Let me close again then.

@NicolasHug
Member

If we're not expecting to put it back, then we might want to deprecate the download parameter instead of keeping it around
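To make the suggestion concrete, deprecating a constructor parameter in Python usually looks something like the sketch below. `ExampleDataset` is a hypothetical stand-in, not torchvision's actual `CelebA` class, and the warning text is made up for illustration.

```python
import warnings


class ExampleDataset:
    """Hypothetical dataset class used only to illustrate a deprecation."""

    def __init__(self, root, download=None):
        # None means "not passed", so only explicit use triggers the warning.
        if download is not None:
            warnings.warn(
                "The 'download' parameter is deprecated because the files "
                "can no longer be downloaded automatically.",
                DeprecationWarning,
                stacklevel=2,
            )
        self.root = root
```

Using `None` as the default lets the class tell an explicit `download=False` apart from the parameter being omitted.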

@jchwenger

Hi there,

It seems the situation is even worse now?

Trying to download it like so:

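# dset is assumed to be torchvision.datasets, as the traceback below suggests
import torchvision.datasets as dset

dataset = dset.CelebA(
    root='datasets/celeba',
    download=True,
)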

... results in the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-15-75c5cdc7e903> in <cell line: 1>()
----> 1 dataset = dset.CelebA(
      2     root='datasets/celeba',
      3     download=True,
      4 )

2 frames
/usr/local/lib/python3.10/dist-packages/torchvision/datasets/utils.py in download_file_from_google_drive(file_id, root, filename, md5)
    244 
    245         if api_response == "Quota exceeded":
--> 246             raise RuntimeError(
    247                 f"The daily quota of the file {filename} is exceeded and it "
    248                 f"can't be downloaded. This is a limitation of Google Drive "

RuntimeError: The daily quota of the file img_align_celeba.zip is exceeded and it can't be downloaded. This is a limitation of Google Drive and can only be overcome by trying again later.

It would be really nice if that could be available out of the box somehow. Perhaps indeed disable the download feature, or point to some other way of obtaining the dataset (Huggingface?)?

Thanks for reading in any case!
