Dataframe codec #452
Comments
Zarr can store complex dtypes, the pure numpy equivalent of dataframes, which is probably not what you are after. Zarr can also store groups of one-dimensional columns, which together look like a dataframe. However, the question is why you want to use zarr for this. Zarr is for n-dimensional arrays, and distinguishes itself from parquet by being able to index and chunk in each dimension, whereas parquet is one-dimensional; but your data is one-dimensional too. Yes, you could write a codec to store chunks of a dataframe in zarr chunks. However, parquet itself is a chunked format, so you don't need zarr to get there. If you want to use parquet already, why not use it by itself? What advantage do you want from zarr?
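To make the column-per-array layout mentioned above concrete, here is a minimal sketch assuming the zarr-python 2.x API; the group path, column names, and example data are made up.

```python
import pandas as pd
import zarr

df = pd.DataFrame({"id": [1, 2, 3], "score": [0.1, 0.2, 0.3], "name": ["a", "b", "c"]})

# One 1-D zarr array per column; string columns use the variable-length str dtype.
root = zarr.open_group("columns.zarr", mode="w")
for col in df.columns:
    values = df[col].to_numpy()
    if values.dtype == object:
        root.create_dataset(col, data=values.astype(str), dtype=str)
    else:
        root.create_dataset(col, data=values)

# Reassemble a dataframe from the group of column arrays.
df2 = pd.DataFrame({name: root[name][:] for name in root.array_keys()})
```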
So the "naive" idea I would like to explore (whether it's here or in custom code) is to encode a dataframe in Parquet and store the bytes in a Zarr. Now, on the reason why I would like to do that: I am storing a complex dataset, and zarr, beyond its n-dimensional array features, is also very convenient for storing literally anything (as long as you have a codec for it). My use case is to store a metadata dataframe (a dataframe is simply the most convenient and fastest option for my use case) that will come alongside the "true data", which is a set of n-dimensional arrays. By doing that, we only have to deal with a single dataset file instead of two (and everything becomes simpler). I totally understand this can be outside the scope of numcodecs and zarr, so all good, and I can close.
Is the metadata so small that using the pickle/JSON/msgpack codecs makes sense? Then your dataframe can look like a one-dimensional object-type array.
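As an illustration of that suggestion, here is a minimal sketch using numcodecs' MsgPack object codec with the zarr-python 2.x API; the array name and example data are made up.

```python
import numcodecs
import numpy as np
import pandas as pd
import zarr

metadata = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Store each row as a dict in a one-dimensional object array encoded with MsgPack.
records = np.empty(len(metadata), dtype=object)
records[:] = metadata.to_dict(orient="records")

root = zarr.open_group("meta.zarr", mode="w")
root.create_dataset(
    "metadata_msgpack",
    data=records,
    dtype=object,
    object_codec=numcodecs.MsgPack(),
)

# Reading back yields dicts, which pandas can turn into a dataframe again.
df2 = pd.DataFrame(list(root["metadata_msgpack"][:]))
```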
Yes, I will explore JSON/msgpack indeed, and it might do the job. But some metadata might be large (never ultra-large, likely), which is why I really liked the Parquet codec. The other reason is more of a "conceptual" one: if going with JSON/msgpack, then I need extra logic in the code to load the data and then convert it back into a dataframe.
Oh, I finally see what you are after :) No, I don't think that zarr itself will ever return anything other than arrays (it's in the name!). Those could be made into dataframes separately.
OK, I understand, and it's all good to me. Thanks!
For the record, I'm cross-referencing that rather old PR zarr-developers/zarr-python#84 (in case anyone is looking into the history of this).
As you can see, the idea was dropped due to the implementation of parquet IO for pandas with fastparquet. :)
For the record (again!), saving Parquet bytes in zarr works very well and shows the same performance as working on a raw Parquet file. While not surprising, I needed to check it before moving forward.

```python
import io

import pandas as pd

# `metadata` is an existing pandas DataFrame; `root` is an open zarr group.

# Save metadata as Parquet bytes in a one-element dataset
df_buffer = io.BytesIO()
metadata.to_parquet(df_buffer)
root.create_dataset("metadata_parquet", data=[df_buffer.getbuffer().tobytes()], dtype=bytes)

# Read the Parquet bytes back and rebuild the dataframe
df_buffer = io.BytesIO()
df_buffer.write(root["metadata_parquet"][0])
metadata = pd.read_parquet(df_buffer)
assert isinstance(metadata, pd.DataFrame)
```

Sizes on disk are the same and loading times are also the same. The only downside is that by doing [...]. Hope it helps!
Sorry for chiming in late, but I find this personally interesting (i.e. for NGFF): a question that fairly regularly comes up is whether or not one starts mixing Parquet files into Zarr files. This is at least a new avenue for consideration. @hadim, thanks for the investigation & (ongoing) record-keeping.
I am looking for a way to store a dataframe (or many) into a Zarr. It seems this is currently not possible; see zarr-developers/community#31 for context.
I was wondering whether you think a dataframe codec based on Parquet could be useful to have in numcodecs. I am not sure whether it fits the scope of numcodecs, though.
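For illustration only, here is a rough sketch of what such a Parquet-based codec could look like against the numcodecs Codec interface (a codec_id plus encode/decode); the class name and codec_id are invented, and this is not an existing numcodecs codec. As noted in the discussion above, zarr's array pipeline would still hand back arrays rather than dataframes, so something like this would mostly be a convenience for manual encode/decode.

```python
import io

import pandas as pd
from numcodecs.abc import Codec
from numcodecs.registry import register_codec


class ParquetDataFrame(Codec):
    """Hypothetical codec turning a pandas DataFrame into Parquet bytes and back."""

    codec_id = "parquet_dataframe"  # invented identifier

    def encode(self, buf):
        # buf is expected to be a pandas DataFrame
        out = io.BytesIO()
        buf.to_parquet(out)
        return out.getvalue()

    def decode(self, buf, out=None):
        # buf holds the raw Parquet bytes produced by encode()
        return pd.read_parquet(io.BytesIO(bytes(buf)))


register_codec(ParquetDataFrame)
```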