Skip to content

Returned "path" of HTTPReader and GDriveReader diverges #451

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
pmeier opened this issue May 24, 2022 · 5 comments
Open

Returned "path" of HTTPReader and GDriveReader diverges #451

pmeier opened this issue May 24, 2022 · 5 comments

Comments

@pmeier
Copy link
Contributor

pmeier commented May 24, 2022

HTTPReader returns the URL for the "path"

return url, StreamWrapper(r.raw)

while GDriveReader returns the file name

return filename[0], StreamWrapper(response.raw)

Since OnlineReader determines at runtime whether to call the HTTP or GDrive download

parts = urllib.parse.urlparse(url)
if re.match(r"(drive|docs)[.]google[.]com", parts.netloc):
yield _get_response_from_google_drive(url, timeout=self.timeout)
else:
yield _get_response_from_http(url, timeout=self.timeout)

the "path" of the yielded tuples is impossible to predict:

from torchdata.datapipes.iter import IterableWrapper, OnlineReader

dp = IterableWrapper(
    [
        "https://raw.githubusercontent.com/pytorch/data/main/LICENSE",
        "https://drive.google.com/uc?export=download&id=1GO-BHUYRuvzr1Gtp2_fqXRsr9TIeYbhV",
    ]
)
dp = OnlineReader(dp)

for path, _ in dp:
    print(path)
https://raw.githubusercontent.com/pytorch/data/main/LICENSE
torchvision.txt

We should align the two. My vote is out to align based on the file name. Still, returning the URL could also be useful if redirect logic as discussed in pytorch/vision#6060 (review) is added to the HTTPReader.

@NivekT
Copy link
Contributor

NivekT commented May 24, 2022

I agree that we should align the two and the default being the file name is likely better than URL.

I wonder if we should add an optional argument which can be either:

  1. a boolean that determines whether it will return the full URL or the file name
  2. a callable that maps URL to whatever the users want, with default being a function that extract the file name (last part of the URL)

I think adding 2 to _get_response_from_http, HttpReader, and OnlineReader makes sense, but it may be over-engineering so I'm open to other suggestions. We probably do not want to change GDriveReader since returning a URL for GDrive is unlikely to be what users want.

cc: @ejguan

@ejguan
Copy link
Contributor

ejguan commented May 24, 2022

I agree that we should align the two and the default being the file name is likely better than URL.

I am not sure about it. If any url contains directory path, idk this is the idea behavior. For example

dp = IterableWrapper(
    [
        "https://abc.com/folder1/file1.txt",
        "https://abc.com/folder2/file1.txt",
    ]
)

Returning filename by default becomes a problem for users as there will be duplicate filenames.
So I would prefer option 2 but not enabled it by default for HttpReader. And, in order to align the behavior within OnlineReader, we might have to provide this default functions. This makes the behavior more deverged.

I do have a question about the use case of OnlineReader. Do we actually have such Dataset held in different remote sources? @pmeier

One potential use case might be multi-model. But, for multi-model with different data sources, they also need different pipeline to run pre-processing. Then, IMO, it makes more sense to have a separate Reader for each data source and run a few operations after each Reader, then combine these pipelines together.

@ejguan
Copy link
Contributor

ejguan commented May 25, 2022

Another thing is the other related DataPipes are returning URL not filename such as FSSpec and IoPath.

I understand Google Drive is special case because the URLs don't contain any file name or file path. So, in order to have an aligned result from OnlineReader, it seems we have to return filename and provide a function to retrieve file path/name for urls not for google drive.

@pmeier
Copy link
Contributor Author

pmeier commented May 26, 2022

I do have a question about the use case of OnlineReader. Do we actually have such Dataset held in different remote sources?

I don't think so, no. But that shouldn't be the issue TBH. OnlineReader is about convenience. The contract the user enters is "Here is a datapipe of URLs. Please give me back a datapipe of streams of the URL data as well as some information identifying each stream (URL / path / ...)". As a user I'm not interested what is happening in the backend, i.e. if I need a different functionality for downloading from GDrive rather than a plain HTTP object.

This is the same argument I'm making in pytorch/vision#6060 (comment) for loading of archives. In both cases as user I'm willing to trade specific control for convenience.

Note that I'm not saying that we shouldn't have the individual classes. If one for example only wants to perform plain HTTP requests, they can use the HttpReader which has no special handling for GDrive.

@ejguan
Copy link
Contributor

ejguan commented May 26, 2022

Understood about the convenience for users. I am more concern about how to maintain this OnlineReader.

So, to achieve a common ground, we might need to:

  • Add option for users to define filepath_fn for urls returned by all Readers (disabled by default due to BC)
  • Add the same option for OnlineReader but provide a default method to return filename. Since we can extend OnlineReader to handle more types of URLs such as S3, do we want to let users to provide a dictionary of functions to extract files based on URL types?
  • Change the document
  • Add warning for users the behavior is going to be changed for Readers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants