Supporting range queries #766

Open
jakirkham opened this issue Sep 22, 2021 · 12 comments
Comments

@jakirkham

Not even sure if this makes sense. Still thinking through this myself. So maybe this is just a starting point for this discussion. Though this has been coming up in a few places.

Is there a way with fsspec to perform range queries? Or could there be?

Basically thinking about this from the Zarr side, where we are increasingly interested in being able to select out portions of chunks. Range queries would be useful for selecting out such a portion.

cc @joshmoore @rabernat

Some related discussion in these issues:

@rabernat
Contributor

I'm pretty sure this is already well supported in fsspec.

https://filesystem-spec.readthedocs.io/en/latest/features.html#file-buffering-and-random-access

@normanrz

@jakirkham
Author

Interesting, thanks. I was searching for "range query" but wasn't really seeing anything (though maybe I was overlooking something).

@rabernat
Contributor

I think because "range" or "range request" is an HTTP-specific term. Fsspec tends to use filesystem-inspired terminology.

@martindurant
Member

We support fetching parts of a file, often using the Range header in HTTP, for essentially all implementations. This works via the file-like interface (with optional buffering using various strategies) or via the top-level cat method. Newly (#744), you can get multiple ranges from multiple files concurrently, for async backends.
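
A minimal sketch of those access paths, assuming fsspec is installed. It uses the in-memory filesystem as a stand-in for a remote store, and the path `/range-demo/chunk.bin` is made up for illustration. On an HTTP-backed filesystem the same calls would typically be served with Range requests.

```python
import fsspec

# In-memory filesystem used as a stand-in for a remote store (assumption).
fs = fsspec.filesystem("memory")
path = "/range-demo/chunk.bin"  # hypothetical path
data = bytes(range(256))
fs.pipe(path, data)  # write the example bytes

# 1. Fetch a byte range directly with cat_file(start=..., end=...).
part = fs.cat_file(path, start=16, end=32)

# 2. Use the file-like interface: seek, then read.
with fs.open(path, "rb") as f:
    f.seek(100)
    via_file = f.read(8)

# 3. Fetch several ranges in one call with cat_ranges(paths, starts, ends).
parts = fs.cat_ranges([path, path], [0, 200], [4, 204])
```

With an async backend, `cat_ranges` can issue the requests concurrently; the in-memory filesystem here simply serves them one by one.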

@jakirkham
Author

Thanks Martin! That's helpful. So I guess this is one piece of the puzzle. Just need to figure out how this fits with the other pieces

@rabernat
Contributor

The crux with caterva, IMO, is passing an fsspec file-like-object to caterva. Unlike blosc, caterva-python wants to manage the i/o itself: you give it a file name (string), and caterva opens / closes the file. So there is currently no way to leverage fsspec's partial read capability.

I think we should be looking at h5py as a model. h5py somehow manages to allow you to open file-like Python objects and then pass these objects down to the lower-level HDF5 C layer.
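
To illustrate why accepting file-like objects matters, here is a small standard-library sketch. It says nothing about how h5py or caterva work internally: `zipfile` is used only as an example of a reader that accepts a Python file-like object, and `CountingFile` is a hypothetical wrapper that counts the bytes the reader actually asks for.

```python
import io
import zipfile


class CountingFile(io.RawIOBase):
    """Read-only, seekable wrapper that counts the bytes actually read."""

    def __init__(self, raw):
        super().__init__()
        self._raw = raw
        self.bytes_read = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        return self._raw.seek(offset, whence)

    def tell(self):
        return self._raw.tell()

    def readinto(self, buffer):
        data = self._raw.read(len(buffer))
        buffer[: len(data)] = data
        self.bytes_read += len(data)
        return len(data)


# Build an uncompressed archive: one large member, one small one.
backing = io.BytesIO()
with zipfile.ZipFile(backing, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("big.bin", bytes(100_000))
    zf.writestr("small.txt", b"hello")
archive_size = len(backing.getvalue())

# The reader drives seek/read itself, so only the needed bytes are touched.
counted = CountingFile(backing)
with zipfile.ZipFile(counted) as zf:
    small = zf.read("small.txt")
```

Because the reader goes through `seek` and `read` on the object it is handed, `counted.bytes_read` stays far below `archive_size`. A library that insists on opening a file name itself gives the storage layer no such hook.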

@martindurant
Member

related, @jakirkham : zarr cannot currently read portions of a key, specifically for the case where the storage target is not compressed. I believe it can read selective blosc blocks (and zstd, in particular, would be very doable). Such functionality would be very helpful in a number of access patterns.
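
For the uncompressed case, the byte range of a partial read can be computed directly from the chunk layout. A minimal sketch, assuming a C-ordered chunk and a selection of whole rows along the first axis; `row_byte_range` is a hypothetical helper, not part of Zarr, and `io.BytesIO` stands in for any seekable file-like object.

```python
import io
import struct


def row_byte_range(chunk_shape, itemsize, row_start, row_stop):
    """Return [start, end) bytes covering rows row_start:row_stop
    of a C-ordered, uncompressed chunk."""
    row_nbytes = itemsize
    for dim in chunk_shape[1:]:
        row_nbytes *= dim
    return row_start * row_nbytes, row_stop * row_nbytes


# A 4 x 3 chunk of little-endian int32 values 0..11, stored uncompressed.
chunk_shape = (4, 3)
raw = struct.pack("<12i", *range(12))
store = io.BytesIO(raw)  # stand-in for a seekable remote file (assumption)

# Read only rows 1 and 2 instead of the whole chunk.
start, end = row_byte_range(chunk_shape, 4, 1, 3)
store.seek(start)
partial = store.read(end - start)
rows = struct.unpack("<6i", partial)
```

Selections that are not contiguous in memory (for example, a column slice) would need several ranges rather than one.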

@rabernat : I don't know how h5py achieves this either, exactly, but I assume it must be compiled against the python interpreter and really asks it to call the (dynamic) methods on the objects passed. A similar issue in rasterio: rasterio/rasterio#2141

@jakirkham
Author

@rjzamora, it sounds like you have done similar things with tabular data loading on GCP recently IIUC. Would be great to hear a bit more about how you accomplished this along with any pointers to relevant PRs 🙂

@rjzamora
Contributor

> it sounds like you have done similar things with tabular data loading on GCP recently IIUC

Yes - cudf#9265 was recently merged as a temporary workaround for the fact that cudf cannot seek/read from an fsspec file-like object. Before that PR was merged, cudf would always read the entire remote file into a host memory buffer, even for partial IO. The “simple” workaround was to transfer only the necessary byte ranges into the local buffer (in parallel). Martin’s cat_ranges PR was not used in the cudf change, but it probably will be in the near future. The new cat_ranges API makes it easy to efficiently transfer a specific set of byte ranges with a single line of code. The only logic that the downstream library needs to worry about is the calculation of the specific byte ranges to pass to cat_ranges.
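
A rough sketch of the "transfer only the necessary byte ranges into a local buffer, in parallel" idea, using only the standard library. This is not the cudf implementation: a temporary local file stands in for the remote object, and `read_range` and `fetch_into_buffer` are hypothetical helpers. With fsspec, the per-range reads could be replaced by a single `cat_ranges` call.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_range(path, start, end):
    """Read bytes [start, end) from a file (stand-in for a remote fetch)."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)


def fetch_into_buffer(path, size, ranges, max_workers=4):
    """Fill a zeroed buffer of `size` bytes with only the requested ranges.

    Ranges are assumed to lie within [0, size).
    """
    buf = bytearray(size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        chunks = pool.map(lambda r: read_range(path, *r), ranges)
        for (start, end), chunk in zip(ranges, chunks):
            buf[start:end] = chunk
    return buf


# Example data with no zero bytes, so untouched regions are easy to spot.
data = bytes((i % 251) + 1 for i in range(1000))
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    tmp_path = tmp.name

ranges = [(0, 10), (500, 520)]
buf = fetch_into_buffer(tmp_path, len(data), ranges)
os.remove(tmp_path)
```

The buffer has the full file size, so offsets computed by the downstream reader remain valid, but only the listed ranges were actually transferred.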

If you are working with a library that is able to read/seek from an fsspec file-like object, then the best approach is likely to gather the known byte ranges with cat_ranges, and then to open the remote file with the new "parts" caching strategy. Note that I plan to add this optimization to Dask for read_parquet and byte_range-based read_csv.

@jakirkham
Author

Thanks for that insight Rick! 😄

Would be interested to see how you approach Dask support 🙂

@takluyver
Contributor

> I don't know how h5py achieves this either, exactly, but I assume it must be compiled against the python interpreter and really asks it to call the (dynamic) methods on the objects passed.

Yup. HDF5 has a notion of 'file drivers' which can be written outside HDF5's own code. It's not that dissimilar to fsspec, but in C. h5py implements a driver to wrap a Python object, you can see the code here:

https://github.com/h5py/h5py/blob/3c093b37ee935e66bedbbbd97d7996750b0d4246/h5py/h5fd.pyx#L121-L220

It's not really a great approach, though, because HDF5 isn't really expecting a file driver to call back into a dynamic language and do all kinds of stuff. I think it's pretty unusual in general to use a file driver that's not part of HDF5. We've now got a bunch of warnings in our docs after coming across scenarios where you can cause a segfault or a deadlock using this.
