Returned "path" of `HTTPReader` and `GDriveReader` diverges #451

pmeier · 2022-05-24T07:34:29Z

HTTPReader returns the URL for the "path"

data/torchdata/datapipes/iter/load/online.py

Line 46 in 13b574c

return url, StreamWrapper(r.raw)

while GDriveReader returns the file name

data/torchdata/datapipes/iter/load/online.py

Line 129 in 13b574c

return filename[0], StreamWrapper(response.raw)

Since OnlineReader determines at runtime whether to call the HTTP or GDrive download

data/torchdata/datapipes/iter/load/online.py

Lines 198 to 203 in 13b574c

    
parts = urllib.parse.urlparse(url)
if re.match(r"(drive|docs)[.]google[.]com", parts.netloc):
    yield _get_response_from_google_drive(url, timeout=self.timeout)
else:
    yield _get_response_from_http(url, timeout=self.timeout)

the "path" of the yielded tuples is impossible to predict:

from torchdata.datapipes.iter import IterableWrapper, OnlineReader

dp = IterableWrapper(
    [
        "https://raw.githubusercontent.com/pytorch/data/main/LICENSE",
        "https://drive.google.com/uc?export=download&id=1GO-BHUYRuvzr1Gtp2_fqXRsr9TIeYbhV",
    ]
)
dp = OnlineReader(dp)

for path, _ in dp:
    print(path)

https://raw.githubusercontent.com/pytorch/data/main/LICENSE
torchvision.txt

We should align the two. My vote is to align on the file name. Still, returning the URL could also be useful if redirect logic, as discussed in pytorch/vision#6060 (review), is added to the HTTPReader.

NivekT · 2022-05-24T19:30:40Z

I agree that we should align the two and the default being the file name is likely better than URL.

I wonder if we should add an optional argument which can be either:

1. a boolean that determines whether it will return the full URL or the file name
2. a callable that maps the URL to whatever the user wants, with the default being a function that extracts the file name (the last part of the URL)

I think adding option 2 to _get_response_from_http, HttpReader, and OnlineReader makes sense, but it may be over-engineering, so I'm open to other suggestions. We probably do not want to change GDriveReader, since returning a URL for GDrive is unlikely to be what users want.
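A minimal sketch of what option 2 could look like, using only the standard library. The names `filename_from_url`, `resolve_path`, and `path_fn` are hypothetical and not part of torchdata; the sketch only shows the mapping from URL to the returned "path", not the download itself.

```python
import posixpath
from typing import Callable, Optional
from urllib.parse import unquote, urlparse


def filename_from_url(url: str) -> str:
    """Hypothetical default: return the last segment of the URL path."""
    return posixpath.basename(unquote(urlparse(url).path))


def resolve_path(url: str, path_fn: Optional[Callable[[str], str]] = filename_from_url) -> str:
    """Map a URL to the "path" element of the yielded tuple.

    Passing path_fn=None is an assumed way to keep the current behavior
    of returning the URL unchanged.
    """
    return url if path_fn is None else path_fn(url)


license_url = "https://raw.githubusercontent.com/pytorch/data/main/LICENSE"
print(resolve_path(license_url))                # LICENSE
print(resolve_path(license_url, path_fn=None))  # the full URL
```

Note that such a default would not help for Google Drive links: the path of `https://drive.google.com/uc?export=download&id=...` is just `/uc`, so the file name there still has to come from the response.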

cc: @ejguan

ejguan · 2022-05-24T20:58:28Z

> I agree that we should align the two and the default being the file name is likely better than URL.

I am not sure about that. If a URL contains a directory path, I don't know whether this is the ideal behavior. For example:

dp = IterableWrapper(
    [
        "https://abc.com/folder1/file1.txt",
        "https://abc.com/folder2/file1.txt",
    ]
)

Returning the file name by default becomes a problem for users because there will be duplicate file names.
So I would prefer option 2, but not enabled by default for HttpReader. And, in order to align the behavior within OnlineReader, we might have to provide this default function there. That makes the behavior diverge even more.
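To make the collision concrete, here is a small standard-library sketch. The helpers `filename_only` and `host_and_path` are hypothetical illustrations of two possible mappings, not torchdata functions.

```python
import posixpath
from urllib.parse import urlparse

urls = [
    "https://abc.com/folder1/file1.txt",
    "https://abc.com/folder2/file1.txt",
]


def filename_only(url: str) -> str:
    # Keep only the last path segment; the directory information is lost.
    return posixpath.basename(urlparse(url).path)


def host_and_path(url: str) -> str:
    # Keep the host and the full path, so distinct URLs stay distinct.
    parts = urlparse(url)
    return parts.netloc + parts.path


names = [filename_only(u) for u in urls]
paths = [host_and_path(u) for u in urls]

print(names)  # ['file1.txt', 'file1.txt']
print(paths)  # ['abc.com/folder1/file1.txt', 'abc.com/folder2/file1.txt']
```

Anything downstream that keys on the returned "path" (for example a cache or an on-disk save) would see the two files as the same entry under the first mapping.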

I do have a question about the use case of OnlineReader. Do we actually have any such dataset held in different remote sources? @pmeier

One potential use case might be multi-modal data. But for multi-modal data with different data sources, each source also needs a different pipeline for pre-processing. Then, IMO, it makes more sense to have a separate Reader for each data source, run a few operations after each Reader, and then combine these pipelines.

ejguan · 2022-05-25T14:44:44Z

Another thing is that the other related DataPipes, such as FSSpec and IoPath, return the URL, not the file name.

I understand Google Drive is a special case because its URLs don't contain any file name or file path. So, in order to have an aligned result from OnlineReader, it seems we have to return the file name and provide a function to retrieve the file path/name for URLs that are not Google Drive links.

pmeier · 2022-05-26T11:45:18Z

> I do have a question about the use case of OnlineReader. Do we actually have any such dataset held in different remote sources?

I don't think so, no. But that shouldn't be the issue TBH. OnlineReader is about convenience. The contract the user enters is "Here is a datapipe of URLs. Please give me back a datapipe of streams of the URL data as well as some information identifying each stream (URL / path / ...)". As a user I'm not interested in what is happening in the backend, i.e. whether downloading from GDrive needs different functionality than a plain HTTP request.

This is the same argument I'm making in pytorch/vision#6060 (comment) for loading of archives. In both cases, as a user I'm willing to trade specific control for convenience.

Note that I'm not saying that we shouldn't have the individual classes. If one for example only wants to perform plain HTTP requests, they can use the HttpReader which has no special handling for GDrive.

ejguan · 2022-05-26T13:46:49Z

Understood about the convenience for users. I am more concerned about how to maintain this OnlineReader.

So, to reach common ground, we might need to:

1. Add an option for users to define a filepath_fn for URLs returned by all Readers (disabled by default for BC).
2. Add the same option to OnlineReader, but provide a default function that returns the file name. Since we can extend OnlineReader to handle more types of URLs such as S3, do we want to let users provide a dictionary of functions to extract file paths based on URL type?
3. Update the documentation.
4. Add a warning for users that the behavior of the Readers is going to change.
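One possible shape for the "dictionary of functions" idea, sketched with the standard library only. Everything here (`url_kind`, `DEFAULT_FILEPATH_FNS`, `resolve`, the `"http"`/`"s3"`/`"gdrive"` keys) is a hypothetical illustration and not an existing torchdata API; the Google Drive check reuses the regular expression quoted from `online.py` above.

```python
import posixpath
import re
from typing import Callable, Dict, Optional
from urllib.parse import urlparse


def url_kind(url: str) -> str:
    """Classify a URL into one of the (assumed) supported source types."""
    parts = urlparse(url)
    if parts.scheme == "s3":
        return "s3"
    if re.match(r"(drive|docs)[.]google[.]com", parts.netloc):
        return "gdrive"
    return "http"


def basename(url: str) -> str:
    # Last segment of the URL path, e.g. "file1.txt".
    return posixpath.basename(urlparse(url).path)


# Assumed defaults: no entry for "gdrive", because the file name there
# comes from the response rather than from the URL.
DEFAULT_FILEPATH_FNS: Dict[str, Callable[[str], str]] = {
    "http": basename,
    "s3": basename,
}


def resolve(url: str, filepath_fns: Optional[Dict[str, Callable[[str], str]]] = None) -> str:
    """Pick the function registered for this URL type; fall back to the URL."""
    fns = DEFAULT_FILEPATH_FNS if filepath_fns is None else filepath_fns
    fn = fns.get(url_kind(url))
    return url if fn is None else fn(url)


print(resolve("https://abc.com/folder1/file1.txt"))  # file1.txt
print(resolve("s3://bucket/key/data.csv"))           # data.csv
```

Passing an empty dictionary in this sketch disables all mappings, which would correspond to the BC-preserving "disabled by default" behavior for the individual Readers.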
