Supporting range queries #766

jakirkham · 2021-09-22T18:41:34Z

Not even sure if this makes sense. Still thinking through this myself. So maybe this is just a starting point for this discussion. Though this has been coming up in a few places.

Is there a way with fsspec to perform range queries? Or could there be?

Basically thinking about this from the Zarr side, where we are increasingly interested in being able to select out portions of chunks. Range queries would be useful for selecting out just those portions.

cc @joshmoore @rabernat

Some related discussion in these issues:


rabernat · 2021-09-22T18:47:08Z

I'm pretty sure this is already well supported in fsspec.

https://filesystem-spec.readthedocs.io/en/latest/features.html#file-buffering-and-random-access

normanrz · 2021-09-22T18:53:27Z

Looks like the read_block method does that:
https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=read_block#fsspec.spec.AbstractFileSystem.read_block
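A minimal sketch of what that looks like, assuming fsspec is installed. It uses the in-memory filesystem so it runs without network access; the path and the data are made up for illustration.

```python
import fsspec

# In-memory filesystem, so the sketch runs without any remote storage.
fs = fsspec.filesystem("memory")
path = "/range-demo/chunk.bin"  # hypothetical path, for illustration only
fs.pipe(path, bytes(range(100)))  # 100 bytes with values 0, 1, ..., 99

# Read 5 bytes starting at byte offset 10.
block = fs.read_block(path, 10, 5)
print(list(block))
```

The same call should work on other backends (for example S3 or HTTP), where it would typically translate into a partial read instead of a full download.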

jakirkham · 2021-09-22T18:54:00Z

Interesting, thanks. Was searching for "range query", but wasn't really seeing anything (though maybe I was overlooking something)

rabernat · 2021-09-22T18:55:31Z

I think because "range" or "range request" is an HTTP-specific term. Fsspec tends to use filesystem-inspired terminology.
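To make the terminology mapping concrete, here is a small sketch of how a filesystem-style half-open byte interval (start inclusive, end exclusive) corresponds to an HTTP `Range` header, whose byte positions are inclusive on both ends. The helper name `range_header` is hypothetical and not part of fsspec.

```python
def range_header(start: int, end: int) -> dict:
    """Build an HTTP Range header for the half-open byte interval [start, end).

    Hypothetical helper for illustration; not an fsspec API.
    """
    if start < 0 or end <= start:
        raise ValueError("need 0 <= start < end")
    # HTTP byte ranges include the last byte position, hence end - 1.
    return {"Range": f"bytes={start}-{end - 1}"}


print(range_header(10, 15))
```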

martindurant · 2021-09-22T19:02:10Z

We support fetching parts of a file, often using Range in HTTP, for essentially all implementations. This works via the file-like interface (with optional buffering of various strategies) or via the cat top-level method. Newly ( #744 ), you can get multiple ranges from multiple files concurrently, for async backends.
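A rough sketch of the three routes, assuming fsspec is installed and using the in-memory filesystem with made-up paths. On this backend the ranges are fetched one after another; the concurrency only applies to async backends.

```python
import fsspec

fs = fsspec.filesystem("memory")
# Hypothetical files, for illustration only.
fs.pipe("/range-demo/a.bin", b"0123456789")
fs.pipe("/range-demo/b.bin", b"abcdefghij")

# 1. File-like interface: seek to an offset, then read a number of bytes.
with fs.open("/range-demo/a.bin", "rb") as f:
    f.seek(3)
    via_file = f.read(4)

# 2. cat_file with start/end byte offsets (end is exclusive).
via_cat = fs.cat_file("/range-demo/a.bin", start=3, end=7)

# 3. cat_ranges: several ranges from several files in one call.
parts = fs.cat_ranges(
    ["/range-demo/a.bin", "/range-demo/b.bin"],  # paths
    [0, 5],                                      # starts
    [2, 8],                                      # ends (exclusive)
)

print(via_file, via_cat, parts)
```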

jakirkham · 2021-09-22T19:05:55Z

Thanks Martin! That's helpful. So I guess this is one piece of the puzzle. Just need to figure out how this fits with the other pieces

rabernat · 2021-09-22T19:11:28Z

The crux with caterva, IMO, is passing an fsspec file-like-object to caterva. Unlike blosc, caterva-python wants to manage the i/o itself: you give it a file name (string), and caterva opens / closes the file. So there is currently no way to leverage fsspec's partial read capability.

I think we should be looking at h5py as a model. h5py somehow manages to allow you to open file-like Python objects and then pass these objects down to the lower-level HDF5 C layer.

martindurant · 2021-09-22T19:46:03Z

related, @jakirkham : zarr cannot currently read portions of a key, specifically for the case where the storage target is not compressed. I believe it can read selective blosc blocks (and zstd, in particular, would be very doable). Such functionality would be very helpful in a number of access patterns.
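For the uncompressed case, the byte range of a selection can be computed directly from the chunk layout. A minimal sketch, assuming a 2D chunk stored uncompressed in C (row-major) order with no header bytes; the helper `row_byte_range` is hypothetical and not part of zarr.

```python
def row_byte_range(chunk_shape, itemsize, row_start, row_stop):
    """Byte interval [start, end) covering rows row_start:row_stop of a chunk.

    Assumes an uncompressed, C-ordered 2D chunk. Hypothetical helper,
    not a zarr API.
    """
    n_rows, n_cols = chunk_shape
    if not 0 <= row_start < row_stop <= n_rows:
        raise ValueError("row selection out of bounds")
    row_bytes = n_cols * itemsize  # bytes occupied by one full row
    return row_start * row_bytes, row_stop * row_bytes


# Rows 10:12 of a (100, 50) chunk of 8-byte floats.
start, end = row_byte_range((100, 50), 8, 10, 12)
print(start, end)
```

The resulting pair could then be passed as the start and end of a partial read, so that only the selected rows are transferred.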

@rabernat : I don't know how h5py achieves this either, exactly, but I assume it must be compiled against the python interpreter and really asks it to call the (dynamic) methods on the objects passed. A similar issue in rasterio: rasterio/rasterio#2141

jakirkham · 2021-09-24T15:53:04Z

@rjzamora, it sounds like you have done similar things with tabular data loading on GCP recently IIUC. Would be great to hear a bit more about how you accomplished this along with any pointers to relevant PRs 🙂

rjzamora · 2021-09-24T16:28:55Z

it sounds like you have done similar things with tabular data loading on GCP recently IIUC

Yes - cudf#9265 was recently merged as a temporary workaround for the fact that cudf cannot seek/read from an fsspec file-like object. Before that PR was merged, cudf would always read the entire remote file into a host memory buffer, even for partial IO. The “simple” workaround was to transfer only the necessary byte ranges into the local buffer (in parallel). Martin’s cat_ranges PR was not used in the cudf change, but it probably will be in the near future. The new cat_ranges API makes it easy to efficiently transfer a specific set of byte ranges with a single line of code. The only logic that the downstream library needs to worry about is the calculation of the specific byte ranges to pass to cat_ranges.
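As an illustration of the byte-range calculation a downstream library might do before calling cat_ranges, here is a sketch that merges overlapping or nearly adjacent ranges so that fewer requests are issued. The helper `coalesce_ranges` and its `max_gap` parameter are assumptions made for this example; this is not the code used in cudf or fsspec.

```python
def coalesce_ranges(ranges, max_gap=0):
    """Merge half-open (start, end) byte ranges separated by at most max_gap bytes.

    Hypothetical helper for illustration; not taken from cudf or fsspec.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


wanted = [(100, 200), (0, 10), (10, 20), (205, 300)]
merged = coalesce_ranges(wanted, max_gap=8)

# Split into the parallel lists of offsets that a range-fetching call expects.
starts = [s for s, _ in merged]
ends = [e for _, e in merged]
print(merged, starts, ends)
```

Merging trades a few extra transferred bytes (the gaps) for fewer round trips, which tends to matter more on high-latency object stores.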

If you are working with a library that is able to read/seek from an fsspec file-like object, then the best approach is likely to gather known byte ranges with cat_ranges, and then to open the remote file with the new “parts” caching strategy. Note that I plan to add this optimization to Dask for read_parquet and byte_range-based read_csv.

jakirkham · 2021-09-24T17:45:26Z

Thanks for that insight Rick! 😄

Would be interested to see how you approach Dask support 🙂

takluyver · 2022-12-20T16:17:25Z

I don't know how h5py achieves this either, exactly, but I assume it must be compiled against the python interpreter and really asks it to call the (dynamic) methods on the objects passed.

Yup. HDF5 has a notion of 'file drivers' which can be written outside HDF5's own code. It's not that dissimilar to fsspec, but in C. h5py implements a driver to wrap a Python object, you can see the code here:

https://github.com/h5py/h5py/blob/3c093b37ee935e66bedbbbd97d7996750b0d4246/h5py/h5fd.pyx#L121-L220

It's not really a great approach, though, because HDF5 isn't really expecting a file driver to call back into a dynamic language and do all kinds of stuff. I think it's pretty unusual in general to use a file driver that's not part of HDF5. We've now got a bunch of warnings in our docs after coming across scenarios where you can cause a segfault or a deadlock using this.

jakirkham mentioned this issue Sep 22, 2021

Caterva inside Zarr zarr-developers/zarr-python#713
