Imagine I have a really large file that I want to read in chunks to avoid memory overflow:
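A minimal sketch of such a pipeline, assuming torchdata's `IterableWrapper`, `FileOpener`, and `StreamReader` datapipes and a hypothetical file name:

```python
from torchdata.datapipes.iter import FileOpener, IterableWrapper, StreamReader

# Hypothetical file name for illustration
dp = IterableWrapper(["really_large_file.bin"])
dp = FileOpener(dp, mode="b")             # yields (path, stream) tuples
dp = StreamReader(dp, chunk=1024 * 1024)  # yields (path, bytes) in 1 MiB chunks

for path, chunk in dp:
    ...  # process one chunk at a time instead of loading the whole file
```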
Now, it might be useful to also check the hash of the file. Naively, one could simply attach a `HashChecker` to the datapipe, as sketched below. Unfortunately this leads to a checksum error.
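Assuming the pipeline from above and a placeholder expected digest:

```python
from torchdata.datapipes.iter import HashChecker

# Each (path, bytes) chunk emitted by StreamReader is checked individually,
# so the digest of a single chunk is compared against the digest of the
# whole file ("<sha256 hex digest>" is a placeholder)
dp = HashChecker(dp, {"really_large_file.bin": "<sha256 hex digest>"})
```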
This happens because if the input is a `bytes` object, it will be taken as the sole item for the hash computation:

data/torchdata/datapipes/iter/util/hashchecker.py, lines 73 to 76 in 13b574c
In contrast, if the input is a stream, it will be iterated and fully used for the computation:
data/torchdata/datapipes/iter/util/hashchecker.py, lines 79 to 82 in 13b574c
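Paraphrasing the two referenced branches (a sketch of the behavior described above, not the verbatim source):

```python
def _update_hash(hash_func, data):
    # Rough paraphrase of the two branches in hashchecker.py
    if isinstance(data, bytes):
        # bytes: taken as the sole item for the hash computation
        hash_func.update(data)
    else:
        # stream: iterated until exhausted; __iter__ is the common
        # interface for all streams
        for chunk in data:
            hash_func.update(chunk)
```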
Thus, placing the `HashChecker` before the `StreamReader` gives the desired behavior here:
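Reusing the hypothetical names from the sketches above:

```python
dp = IterableWrapper(["really_large_file.bin"])
dp = FileOpener(dp, mode="b")
# The raw stream is hashed before it is split into chunks
dp = HashChecker(dp, {"really_large_file.bin": "<sha256 hex digest>"})
dp = StreamReader(dp, chunk=1024 * 1024)
```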
However, this has several downsides:

1. If the stream is not seekable, e.g. an HTTP response, there is nothing left for the `StreamReader` to read after the `HashChecker` is finished.
2. We can't control how the stream is iterated. As the code comment implies, `__iter__` is chosen since it is a common interface for all streams. However, the chunks returned by it have to be separated by a `b"\n"`. Thus, when iterating over arbitrary binary streams we might read the whole file at once, which defeats the chunked behavior we want.
3. We read from the stream twice, since the data read by the `HashChecker` is not cached anywhere and the `StreamReader` has to do it all over again.

Since the `hash_func` can be updated, would it be possible to introduce a cache based on the file name in case we encounter `bytes`? Something along the lines of the following.
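A hypothetical sketch of the idea inside `HashChecker`'s iteration logic; the helper name and the cache are assumptions, not existing torchdata behavior:

```python
import hashlib

def _check_hashes(source_datapipe, hash_dict):
    # Hypothetical cache: file name -> running hash object
    cache = {}
    for file_name, data in source_datapipe:
        if isinstance(data, bytes):
            # Accumulate chunks of the same file into one running hash
            # instead of treating each chunk as a complete file
            hash_func = cache.setdefault(file_name, hashlib.sha256())
            hash_func.update(data)
        else:
            ...  # stream handling as before
        yield file_name, data
```

Deciding when a file is exhausted, i.e. when the accumulated digest can finally be compared against `hash_dict`, would still need to be worked out.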