Add "no compressor" as a compressor #58

Closed
dstansby opened this issue May 13, 2025 · 7 comments · Fixed by #67

No description provided.

K-Meech commented May 15, 2025

I've added reading/writing a zarr array with no compressor on my current branch, but I'm seeing some odd values for the compression ratio.

For example, running the script below with zarr-python v3 gives a compression ratio of less than 1 (about 0.5)! The same happens with zarr-python v2.

I think the issue here is that our dev image has a shape of 100 x 100 x 100, which doesn't fit exactly into chunks of 64 x 64 x 64. nbytes seems to use the 100 x 100 x 100 shape for its calculation, i.e. (100 * 100 * 100 * 64) / 8 = 8000000 bytes, but I guess the real shape of the stored array is 128 x 128 x 128 (as it's two 64-element chunks along each dimension), i.e. (128 * 128 * 128 * 64) / 8 = 16777216 bytes. This is much closer to the value nbytes_stored gives. @dstansby - is this maybe a bug in zarr-python?

import pathlib

import numpy as np
import zarr

image = np.random.rand(100, 100, 100)
store_path = pathlib.Path("tests/tmp/data")

zarr_array = zarr.create_array(
    store=store_path,
    shape=image.shape,
    chunks=(64, 64, 64),
    dtype=image.dtype,
    compressors=None,
    zarr_format=2,
    fill_value=0,
    config={"write_empty_chunks": True},
)
zarr_array[:] = image

nbytes = zarr_array.nbytes 
nbytes_stored = zarr_array.nbytes_stored()
compression_ratio = nbytes / nbytes_stored

print(zarr_array.info_complete())
print("array shape:", zarr_array.shape)
print("array dtype:", zarr_array.dtype)
print("nbytes:", nbytes)
print("nbytes stored:", nbytes_stored)
print("compression ratio:", compression_ratio)

K-Meech commented May 15, 2025

I guess it depends on how we're defining compression ratio - if it's (size of the image stored at its original shape with no compression) / (size of the image stored as a zarr array), then it is possible to have a compression ratio of less than 1. In this case the uncompressed zarr array really is taking up more space than a plain 100 x 100 x 100 array would.
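
(For reference, with float64 data the numbers work out as 100 * 100 * 100 * 8 bytes = 8000000 vs 128 * 128 * 128 * 8 bytes = 16777216, giving 8000000 / 16777216 ≈ 0.48 - consistent with the ~0.5 ratio reported above.)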

@dstansby

It's certainly a known issue for zarr-python 2: zarr-developers/zarr-python#2174. I'm not sure whether it's a known issue for zarr-python 3, so a reproducible example and an issue on zarr-python would be very welcome!

I thought that to get around this we were measuring the size of the array directly, as the size of the folder it's written to, instead of relying on nbytes_stored?
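
As a rough illustration of what I mean (just a sketch reusing image and store_path from the script above, not the actual benchmark code):

import pathlib

def directory_size_bytes(path: pathlib.Path) -> int:
    # Sum the sizes of all files under the store directory -
    # i.e. what the array actually occupies on disk.
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

# image and store_path as defined in the script above
compression_ratio = image.nbytes / directory_size_bytes(store_path)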

K-Meech commented May 15, 2025

We're only using that for tensorstore at the moment - the zarr-python benchmarks use: compression_ratio = zarr_array.nbytes / zarr_array.nbytes_stored

Even so, tensorstore is still reporting a compression ratio below 1 (0.5, the same as zarr-python). It seems that nbytes_stored isn't the issue here, but rather nbytes (as it uses the shape of the image put in, 100 x 100 x 100, rather than the shape of the final zarr array, 128 x 128 x 128)?

K-Meech commented May 15, 2025

Something like nbytes = zarr_array.nchunks * zarr_array.chunks[0] * zarr_array.chunks[1] * zarr_array.chunks[2] * zarr_array.dtype.itemsize seems to work, vs the current zarr_array.size * zarr_array.dtype.itemsize
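
Spelled out (just a sketch using the zarr_array from the script above; math.prod is only there to generalise over dimensions):

import math

# Logical size: what nbytes currently reports, based on the array shape.
logical_nbytes = zarr_array.size * zarr_array.dtype.itemsize

# Chunk-grid size: the bytes needed if every chunk is stored in full.
chunk_grid_nbytes = zarr_array.nchunks * math.prod(zarr_array.chunks) * zarr_array.dtype.itemsize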

@dstansby

Ah right, I think that makes sense. Because every chunk has to be stored in full, you always store the smallest multiple of the chunk size that covers your actual data size. So for 100 x 100 x 100 it makes sense that not compressing ends up with more bytes on disk than the raw data, and hence a compression ratio < 1.

Since OME-Zarr is used for big data, this effect is less relevant in practice (an extra 100 elements of padding on the end of, say, ~2000 elements per dimension is a much smaller fraction). So I would say we just update the dev image to be exactly 128 x 128 x 128, so it doesn't end up with a potentially confusing < 1 compression ratio. Thoughts?
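
For concreteness, keeping the 64 x 64 x 64 chunking from the script above, that would just mean generating the dev image as something like:

image = np.random.rand(128, 128, 128)  # an exact multiple of the 64-element chunks, so no padded chunks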

K-Meech commented May 15, 2025

Sounds reasonable to me! I'll update the size to 128x128x128 as part of my PR.
