zipfile.BadZipFile exception when downloading WIDERFace (traced to download_file_from_google_drive function) #5615

Closed
@josh-gleason

Description

🐛 Describe the bug

While trying to download the WIDERFace dataset using the following code:

from torchvision.datasets.widerface import WIDERFace
w = WIDERFace('.', split='train', download=True)

I ran into the following error:

$ python download_wider.py
Traceback (most recent call last):
  File "download_wider.py", line 2, in <module>
    w = WIDERFace('.', split='train', download=True)
  File "/home/josh/venv/lib/python3.7/site-packages/torchvision/datasets/widerface.py", line 72, in __init__
    self.download()
  File "/home/josh/venv/lib/python3.7/site-packages/torchvision/datasets/widerface.py", line 191, in download
    extract_archive(filepath)
  File "/home/josh/venv/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 409, in extract_archive
    extractor(from_path, to_path, compression)
  File "/home/josh/venv/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 284, in _extract_zip
    from_path, "r", compression=_ZIP_COMPRESSION_MAP[compression] if compression else zipfile.ZIP_STORED
  File "/usr/lib/python3.7/zipfile.py", line 1225, in __init__
    self._RealGetContents()
  File "/usr/lib/python3.7/zipfile.py", line 1292, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

Stepping through the code I found that when it reaches torchvision/datasets/utils.py:243:

243:        _save_response_content(itertools.chain((first_chunk,), response_content_generator), fpath)

the value of first_chunk contains:

b'<!DOCTYPE html><html><head><title>Google Drive - Virus scan warning</title><meta http-equiv="content-type" content="text/html; charset=utf-8"/><style nonce="lakwxKwRcFErCkEI/ksjXg">/* Copyright 2022 Google Inc. All Rights Reserved. */\n.goog-inline-block{position:relative;display:-moz-inline-box;display:inline-block}* html .goog-inline-block,*:first-child+html .goog-inline-block{display:inline}.goog-link-button{position:relative;color:#15c;text-decoration:underline;cursor:pointer}.goog-link-button-disabled{color:#ccc;text-decoration:none;cursor:default}body{color:#222;font:normal 13px/1.4 arial,sans-serif;margin:0}.grecaptcha-badge{visibility:hidden}.uc-main{padding-top:50px;text-align:center}#uc-dl-icon{display:inline-block;margin-top:16px;padding-right:1em;vertical-align:top}#uc-text{display:inline-block;max-width:68ex;text-align:left}.uc-error-caption,.uc-warning-caption{color:#222;font-size:16px}#uc-download-link{text-decoration:none}.uc-name-size a{color:#15c;text-decoration:none}.uc-name-size a:visited{color:#61c;text-decoration:none}.uc-name-size a:active{color:#d14836;text-decoration:none}.uc-footer{color:#777;font-size:11px;padding-bottom:5ex;padding-top:5ex;text-align:center}.uc-footer a{color:#15c}.uc-footer a:visited{color:#61c}.uc-footer a:active{color:#d14836}.uc-footer-divider{color:#ccc;width:100%}</style><link rel="icon" href="null"/></head><body><div class="uc-main"><div id="uc-dl-icon" class="image-container"><div class="drive-sprite-aux-download-file"></div></div><div id="uc-text"><p class="uc-warning-caption">Google Drive can\'t scan this file for viruses.</p><p class="uc-warning-subcaption"><span class="uc-name-size"><a href="/open?id=15hGDLhsx8bLgLcIRD5DhYt5iBxnjNF1M">WIDER_train.zip</a> (1.4G)</span> is too large for Google to scan for viruses. 
Would you still like to download this file?</p><form id="downloadForm" action="https://docs.google.com/uc?export=download&amp;id=15hGDLhsx8bLgLcIRD5DhYt5iBxnjNF1M&amp;confirm=t" method="post"><input type="submit" id="uc-download-link" class="goog-inline-block jfk-button jfk-button-action" value="Download anyway"/></form></div></div><div class="uc-footer"><hr class="uc-footer-divider"></div></body></html>'

and it is the only chunk that response_content_generator yields.

This HTML page is then written to WIDER_train.zip (which is supposed to be a zip archive), and that causes the BadZipFile exception.
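A minimal sketch of how this case could be caught before extraction, assuming the first chunk of the response and the saved file path are both available. The helper names `looks_like_html` and `is_probably_zip` are hypothetical and not part of torchvision; only the standard-library `zipfile.is_zipfile` call is real.

```python
import zipfile


def looks_like_html(first_chunk: bytes) -> bool:
    """Heuristic (hypothetical helper): does the first chunk look like an HTML page?"""
    head = first_chunk.lstrip()[:64].lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def is_probably_zip(path: str) -> bool:
    """Hypothetical helper: True if the file on disk has a readable zip structure."""
    return zipfile.is_zipfile(path)
```

With the chunk shown above, `looks_like_html(first_chunk)` would return True, so the download could be rejected before anything is written to WIDER_train.zip.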


I'm not sure whether this is the result of a recent change in Google Drive, something specific to my platform, or something else. Either way, since the download relies on the response from Google Drive, which is likely to change again in the future, it would be really nice to show a more meaningful error than the bare BadZipFile exception when things like this happen.

For example, wouldn't it be great if we had an error message that said something meaningful like

Oops, we tried to download the google drive file from

https://docs.google.com/uc?export=download&id=15hGDLhsx8bLgLcIRD5DhYt5iBxnjNF1M

but the file contains unexpected information and we don't know why.

You can download this file manually and place it at ./widerface/WIDER_train.zip to circumvent this issue.

rather than having the user step through the code to find the correct URL and destination path?
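As a rough sketch of that idea, the extraction step could translate BadZipFile into an error that names the URL and the destination path. The function `extract_zip_or_explain` and its parameters are hypothetical, not torchvision's API; the sketch assumes the caller knows the download URL.

```python
import zipfile


def extract_zip_or_explain(archive_path: str, to_path: str, url: str) -> None:
    """Extract a zip archive; on BadZipFile, raise an error that says what to do next.

    Hypothetical helper for illustration only.
    """
    try:
        with zipfile.ZipFile(archive_path, "r") as zf:
            zf.extractall(to_path)
    except zipfile.BadZipFile as exc:
        raise RuntimeError(
            f"Tried to download {url} but the file saved at {archive_path} "
            "is not a valid zip archive (the server may have returned an HTML page). "
            f"You can download the file manually and place it at {archive_path}."
        ) from exc
```

Chaining with `from exc` keeps the original BadZipFile traceback available for debugging while the top-level message stays actionable.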

Also, why is this being unzipped before the hash is even checked? That seems like a potential security issue.
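The ordering I would expect looks roughly like the sketch below: compute the digest first and only open the archive if it matches. `file_digest` and `verify_then_extract` are hypothetical names, and the MD5 default is an assumption made for illustration; this is not a description of what torchvision currently does.

```python
import hashlib
import zipfile


def file_digest(path: str, algorithm: str = "md5", chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large archives are not read into memory at once."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_then_extract(archive_path: str, to_path: str, expected: str, algorithm: str = "md5") -> None:
    """Refuse to extract unless the archive's digest matches the expected value."""
    actual = file_digest(archive_path, algorithm)
    if actual != expected:
        raise RuntimeError(
            f"Checksum mismatch for {archive_path}: expected {expected}, got {actual}. "
            "Not extracting."
        )
    with zipfile.ZipFile(archive_path, "r") as zf:
        zf.extractall(to_path)
```

With this ordering, an HTML page saved as WIDER_train.zip would fail the checksum comparison and never reach the zip parser.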

Versions

Collecting environment information...
PyTorch version: 1.11.0+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.13.4
Libc version: glibc-2.26

Python version: 3.7.5 (default, Dec  9 2021, 17:04:37)  [GCC 8.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-99-generic-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: False
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti

Nvidia driver version: 510.47.03
cuDNN version: 8.0.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.5
[pip3] torch==1.11.0+cpu
[pip3] torchaudio==0.11.0+cpu
[pip3] torchvision==0.12.0+cpu
[conda] Could not collect

cc @pmeier
