Dataframe codec #452
Comments
Zarr can store complex dtypes, the pure numpy equivalent of dataframes, which is probably not what you are after. Zarr can also store groups of one-dimensional columns, which together look like a dataframe. However, the question is why you want to use zarr for this. Zarr is for n-dimensional arrays, and distinguishes itself from parquet by being able to index and chunk in each dimension, whereas parquet is one-dimensional; but your data is one-dimensional too. Yes, you could write a codec to store chunks of a dataframe in zarr chunks. However, parquet itself is a chunked format, so you don't need zarr to get there. If you want to use parquet already, why not use it by itself? What advantage do you want from zarr?
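To make the column-per-array layout mentioned above concrete, here is a minimal sketch assuming the zarr-python 2.x API; the group path, column names, and example data are made up.

```python
import pandas as pd
import zarr

df = pd.DataFrame({"id": [1, 2, 3], "score": [0.1, 0.2, 0.3], "name": ["a", "b", "c"]})

# One 1-D zarr array per column; string columns use the variable-length str dtype.
root = zarr.open_group("columns.zarr", mode="w")
for col in df.columns:
    values = df[col].to_numpy()
    if values.dtype == object:
        root.create_dataset(col, data=values.astype(str), dtype=str)
    else:
        root.create_dataset(col, data=values)

# Reassemble a dataframe from the group of column arrays.
df2 = pd.DataFrame({name: root[name][:] for name in root.array_keys()})
```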
So the "naive" idea I would like to explore (whether it's here or in custom code) is to encode a dataframe in Parquet and store the bytes in a Zarr. Now, on the reason why I would like to do that: I am storing a complex dataset, and zarr, beyond its n-dimensional array features, is also very convenient for storing literally anything (as long as you have a codec for it). My use case is to store a metadata dataframe (a dataframe is simply the most convenient and fastest option for my use case) that will come alongside the "true data", which is a set of n-dimensional arrays. By doing that, we only have to deal with a single dataset file instead of two (and everything becomes simpler). I totally understand this can be outside the scope of numcodecs and zarr, so all good, and I can close.
Is the metadata so small that using the pickle/JSON/msgpack codecs makes sense? Then your dataframe can look like a one-dimensional object-type array.
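As an illustration of that suggestion, here is a minimal sketch using numcodecs' MsgPack object codec with the zarr-python 2.x API; the array name and example data are made up.

```python
import numcodecs
import numpy as np
import pandas as pd
import zarr

metadata = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Store each row as a dict in a one-dimensional object array encoded with MsgPack.
records = np.empty(len(metadata), dtype=object)
records[:] = metadata.to_dict(orient="records")

root = zarr.open_group("meta.zarr", mode="w")
root.create_dataset(
    "metadata_msgpack",
    data=records,
    dtype=object,
    object_codec=numcodecs.MsgPack(),
)

# Reading back yields dicts, which pandas can turn into a dataframe again.
df2 = pd.DataFrame(list(root["metadata_msgpack"][:]))
```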
Yes, I will explore JSON/msgpack indeed, and it might do the job. But some metadata might be large (never ultra-large, likely), which is why I really liked the Parquet codec. The other reason is more of a "conceptual" one: if going with JSON/msgpack, then I need extra logic in the code to load the data and then convert it back into a dataframe.
Oh, I finally see what you are after :) No, I don't think that zarr itself will ever return anything other than arrays (it's in the name!). Those could be made into dataframes separately.
OK, I understand, and it's all good to me. Thanks!
For the record, I'm cross-referencing that rather old PR zarr-developers/zarr-python#84 (in case anyone is looking into the history of this).
As you can see, the idea was dropped due to the implementation of parquet IO for pandas with fastparquet. :)
For the record (again!), saving Parquet bytes in zarr works very well and shows the same performance as working on a raw Parquet file. While not surprising, I needed to check it before moving forward.

```python
import io

import pandas as pd

# `metadata` is an existing pandas DataFrame; `root` is an open zarr group.

# Save metadata as Parquet bytes in a one-element dataset
df_buffer = io.BytesIO()
metadata.to_parquet(df_buffer)
root.create_dataset("metadata_parquet", data=[df_buffer.getbuffer().tobytes()], dtype=bytes)

# Read the Parquet bytes back and rebuild the dataframe
df_buffer = io.BytesIO()
df_buffer.write(root["metadata_parquet"][0])
metadata = pd.read_parquet(df_buffer)
assert isinstance(metadata, pd.DataFrame)
```

Sizes on disk are the same and loading times are also the same. The only downside is that by doing [...]. Hope it helps!
Sorry for chiming in late, but I find this personally interesting (i.e. for NGFF): a question that fairly regularly comes up is whether or not one starts mixing Parquet files into Zarr files. This is at least a new avenue for consideration. @hadim, thanks for the investigation & (ongoing) record-keeping.
I am looking for a way to store a dataframe (or many) into a Zarr. It seems this is currently not possible; see zarr-developers/community#31 for context.
I was wondering whether you think a dataframe codec based on Parquet could be useful to have in numcodecs. I am not sure whether it fits the scope of numcodecs, though.
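For illustration only, here is a rough sketch of what such a Parquet-based codec could look like against the numcodecs Codec interface (a codec_id plus encode/decode); the class name and codec_id are invented, and this is not an existing numcodecs codec. As noted in the discussion above, zarr's array pipeline would still hand back arrays rather than dataframes, so something like this would mostly be a convenience for manual encode/decode.

```python
import io

import pandas as pd
from numcodecs.abc import Codec
from numcodecs.registry import register_codec


class ParquetDataFrame(Codec):
    """Hypothetical codec turning a pandas DataFrame into Parquet bytes and back."""

    codec_id = "parquet_dataframe"  # invented identifier

    def encode(self, buf):
        # buf is expected to be a pandas DataFrame
        out = io.BytesIO()
        buf.to_parquet(out)
        return out.getvalue()

    def decode(self, buf, out=None):
        # buf holds the raw Parquet bytes produced by encode()
        return pd.read_parquet(io.BytesIO(bytes(buf)))


register_codec(ParquetDataFrame)
```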