Supporting range queries #766
I'm pretty sure this is already well supported in fsspec. https://filesystem-spec.readthedocs.io/en/latest/features.html#file-buffering-and-random-access |
Looks like the |
Interesting, thanks. I was searching for "range query" but wasn't really seeing anything (though maybe I was overlooking something). |
I think because "range" or "range request" is an HTTP-specific term. Fsspec tends to use filesystem-inspired terminology. |
We support fetching parts of a file, often using Range in HTTP, for essentially all implementations. This works via the file-like interface (with optional buffering of various strategies) or via the |
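To make the comment above concrete, here is a minimal sketch of partial reads with fsspec. It assumes the in-memory filesystem as a stand-in for a remote store, and the path is hypothetical. The calls used (`open` with `seek`/`read`, `cat_file` with `start`/`end`, and `cat_ranges`) are my reading of the generic fsspec API; the truncated sentence above may have named a different method.

```python
import fsspec

# Assumption: the in-memory filesystem stands in for S3/GCS/HTTP in this sketch.
fs = fsspec.filesystem("memory")
path = "/range-demo/data.bin"  # hypothetical path
fs.pipe(path, bytes(range(256)))  # store 256 bytes: 0, 1, ..., 255

# 1. File-like interface: seek to an offset, then read a fixed number of bytes.
with fs.open(path, "rb") as f:
    f.seek(100)
    via_file = f.read(16)

# 2. Direct byte-range fetch from a single file (end is exclusive).
via_cat_file = fs.cat_file(path, start=100, end=116)

# 3. Several byte ranges in one call: parallel lists of paths, starts, and ends.
via_cat_ranges = fs.cat_ranges([path, path], [0, 200], [4, 204])
```

On an HTTP-backed filesystem these reads would typically be served with Range requests, while the calling code stays the same.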
Thanks Martin! That's helpful. So I guess this is one piece of the puzzle. Just need to figure out how this fits with the other pieces |
The crux with caterva, IMO, is passing an fsspec file-like object to caterva. Unlike blosc, caterva-python wants to manage the I/O itself: you give it a file name (string), and caterva opens and closes the file. So there is currently no way to leverage fsspec's partial-read capability. I think we should be looking at h5py as a model. h5py somehow manages to let you open file-like Python objects and then pass these objects down to the lower-level HDF5 C layer. |
related, @jakirkham : zarr cannot currently read portions of a key, specifically for the case where the storage target is not compressed. I believe it can read selective blosc blocks (and zstd, in particular, would be very doable). Such functionality would be very helpful in a number of access patterns. @rabernat : I don't know how h5py achieves this either, exactly, but I assume it must be compiled against the python interpreter and really asks it to call the (dynamic) methods on the objects passed. A similar issue in rasterio: rasterio/rasterio#2141 |
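For the uncompressed case mentioned above, the byte range for part of a chunk can be computed directly from the chunk layout. The sketch below is only an illustration, not how zarr does or would implement it: it assumes a hypothetical C-order 2D chunk of little-endian int32 values, uses `io.BytesIO` as a stand-in for the stored key, and `row_byte_range` is a made-up helper name.

```python
import io
import struct


def row_byte_range(row_start, row_stop, n_cols, itemsize):
    """Byte range [start, stop) covering rows row_start:row_stop of a C-order 2D chunk."""
    row_nbytes = n_cols * itemsize
    return row_start * row_nbytes, row_stop * row_nbytes


# Hypothetical uncompressed chunk: 4 rows x 3 columns of little-endian int32.
n_rows, n_cols, itemsize = 4, 3, 4
values = list(range(n_rows * n_cols))
chunk = io.BytesIO(struct.pack(f"<{len(values)}i", *values))  # stand-in for a stored key

# Rows 1 and 2 are contiguous in C order, so one seek plus one read is enough.
start, stop = row_byte_range(1, 3, n_cols, itemsize)
chunk.seek(start)
raw = chunk.read(stop - start)
rows = struct.unpack(f"<{(stop - start) // itemsize}i", raw)
```

A selection along the last axis would not be contiguous and would need one range per row, which is where a multi-range fetch becomes useful.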
@rjzamora, it sounds like you have done similar things with tabular data loading on GCP recently IIUC. Would be great to hear a bit more about how you accomplished this along with any pointers to relevant PRs 🙂 |
Yes - cudf#9265 was recently merged as a temporary workaround for the fact that cudf cannot seek/read from an fsspec file-like object. Before that PR was merged, cudf would always read the entire remote file into a host memory buffer, even for partial IO. The “simple” workaround was to transfer only the necessary byte ranges into the local buffer (in parallel). Martin’s cat_ranges PR was not used in the cudf change, but it probably will be in the near future. The new If you are working with a library that is able to read/seek from an fsspec file-like object, then the best approach is likely to gather known byte ranges with |
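The workaround described above (copy only the needed byte ranges into a local buffer, in parallel) can be sketched roughly as follows. This is not the code from the cudf PR: it uses a local temporary file in place of a remote object, and `read_range` and `fill_sparse_buffer` are hypothetical helper names.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_range(path, start, end):
    """Read bytes [start, end) from a file; stand-in for a remote range request."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)


def fill_sparse_buffer(path, ranges, size, max_workers=4):
    """Return a zero-filled buffer of `size` bytes with only `ranges` populated."""
    buf = bytearray(size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Reads run concurrently; pool.map yields results in input order.
        parts = pool.map(lambda r: read_range(path, r[0], r[1]), ranges)
        for (start, _end), data in zip(ranges, parts):
            buf[start:start + len(data)] = data
    return buf


payload = bytes(range(256)) * 4  # 1024 bytes of sample data
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
try:
    ranges = [(0, 8), (500, 516), (1000, 1024)]
    buf = fill_sparse_buffer(tmp.name, ranges, len(payload))
finally:
    os.remove(tmp.name)
```

A reader that only touches the requested offsets can then treat `buf` as if it were the whole file, while the bytes outside the ranges are never transferred.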
Thanks for that insight Rick! 😄 Would be interested to see how you approach Dask support 🙂 |
Yup. HDF5 has a notion of 'file drivers' which can be written outside HDF5's own code. It's not that dissimilar to fsspec, but in C. h5py implements a driver to wrap a Python object, you can see the code here: It's not really a great approach, though, because HDF5 isn't really expecting a file driver to call back into a dynamic language and do all kinds of stuff. I think it's pretty unusual in general to use a file driver that's not part of HDF5. We've now got a bunch of warnings in our docs after coming across scenarios where you can cause a segfault or a deadlock using this. |
Not even sure if this makes sense. Still thinking through this myself. So maybe this is just a starting point for this discussion. Though this has been coming up in a few places.
Is there a way with fsspec to perform range queries? Or could there be?
Basically thinking about this from the Zarr side where we are increasingly interested in being able to select out portions of chunks. For this range queries would be useful for selecting out this portion.
cc @joshmoore @rabernat
Some related discussion in these issues: