It would be great if:
- you could select columns when reading from parquet, or, even better, select from the schema hierarchy in general for more deeply structured datasets (see the sketch after this list)
- you allowed reading row-group X from a parquet dataset; this would make it possible to distribute the work across threads or even a cluster. Of course, the reader would need to reveal how many row-groups the file contains
- some to_buffers kind of method existed to expose the internal buffers of an arrow structure, in the order defined in the arrow docs, together with the corresponding from_buffers (see the second sketch below)
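A minimal sketch of what the first two points could look like, borrowing the shape of pyarrow's existing ParquetFile API (num_row_groups, read_row_group, columns=) and a made-up file name purely as a stand-in for whatever API this library ends up exposing:

```python
import pyarrow.parquet as pq

# Open the file lazily; no row-groups are read yet.
pf = pq.ParquetFile("data.parquet")

# The reader exposes how many row-groups the file contains, so the work
# can be sharded across threads or cluster workers, one row-group each.
n = pf.num_row_groups

# Read a single row-group, materialising only the columns we need.
# (The request above goes further: selecting subfields from the schema
# hierarchy, not just top-level columns.)
table = pf.read_row_group(0, columns=["id", "value"])
```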
Doing all of this would essentially answer what is envisaged in dask/fastparquet#931: getting what we really need out of arrow without the cruft. It would interoperate nicely with awkward, for example.
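For the to_buffers/from_buffers point, a rough sketch of the round trip, using pyarrow's Array.buffers() and Array.from_buffers() only to illustrate the shape of the API being asked for:

```python
import pyarrow as pa

arr = pa.array([1, None, 3], type=pa.int64())

# The buffers come back in the order defined by the Arrow spec:
# validity bitmap first, then the values buffer.
bufs = arr.buffers()

# Rebuild an equivalent array from the same buffers, zero-copy.
rebuilt = pa.Array.from_buffers(pa.int64(), len(arr), bufs)
assert rebuilt.equals(arr)
```

awkward's own from_buffers-style constructors work from a similar flat set of buffers, which is where the interop mentioned above would happen.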
Other nice-to-haves (and I realise you wish to keep the scope as small as possible), sketched below:
- parquet filters
- str and dt compute functions
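To make those concrete, this is roughly what the two look like in pyarrow today (the column names year, name and timestamp are made up for the example); a slimmed-down equivalent is what is being suggested:

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Row filtering with predicate push-down: row-groups whose statistics
# cannot match the predicate are skipped entirely.
table = pq.read_table("data.parquet", filters=[("year", ">=", 2020)])

# String kernel: upper-case a utf8 column.
upper = pc.utf8_upper(table["name"])

# Datetime kernel: extract the year from a timestamp column.
years = pc.year(table["timestamp"])
```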