Performance compared to PyTables #330
Hi @sd2k, I think it's probably because of the different sizes of the chunks along the first dimension. You did not specify the chunk length in pytables, which means it will guess a value based on the expected size of the array. If you create the zarr array with a comparably small chunk length along the first dimension, you should see similar performance.
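A minimal sketch of what that could look like, with a placeholder path, shape and chunk length (the Blosc settings just mirror the blosc-blosclz compressor mentioned in the original report):

```python
import zarr
from numcodecs import Blosc

# Placeholder path, shape and chunk length. A small chunk length along the
# first dimension means a single-row read only decompresses a few rows.
z = zarr.open(
    'data.zarr', mode='w',
    shape=(100000, 20000), dtype='i8',
    chunks=(4, None),                 # 4 rows per chunk, each chunk spans all columns
    compressor=Blosc(cname='blosclz', clevel=5),
)
```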
Yep, specifying a smaller chunk length resulted in pretty similar performance for the synthetic dataset. I'll try a similar thing on my real data this evening. Thanks @alimanfoo!
Cool, no worries. FWIW if you are reading 1 row at a time then a chunk length of 1 along the first dimension will give the fastest reads, at the cost of many more chunks overall.
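As a sketch (same hypothetical shape as above), a chunk length of 1 means each single-row read touches exactly one chunk:

```python
# Hypothetical array: chunk length 1 along the first dimension, so z[i]
# touches exactly one chunk; the trade-off is many more chunks overall.
z = zarr.open('rows.zarr', mode='w', shape=(100000, 20000),
              dtype='i8', chunks=(1, None))
row = z[42]   # single-row read hits a single chunk
```

Smaller chunks waste less decompression per row but add per-chunk overhead, which is why a small value such as 4 can be a reasonable compromise.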
Ace, thanks for the recommendation. I've settled for something low like 4 to strike a bit of a balance and it's looking good! And thanks for writing zarr :)
I recently found both zarr and PyTables (finally, a stable replacement for CSVs...) and was wondering if I'm doing something wrong in my choice of chunk shapes here. My data is roughly a 100000 x 20000 int64 array, fairly compressible, and I need to access it from multiple processes (I'm using PyTorch, which spawns multiple workers). I only really need to access a full row at a time, so I've been setting the chunk size to None on the second dimension.
However, my reads seem to be about 30x slower in zarr than in PyTables, despite using the same compressor/filter (blosc-blosclz). I can't quite reproduce this magnitude of difference using synthetic data, but in the example below zarr is about 8x slower than PyTables.
Am I doing something wrong, or is this expected?
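The original comparison script isn't reproduced here; a rough sketch of the kind of benchmark described (synthetic int64 data, blosc-blosclz in both libraries, row-at-a-time reads), with placeholder sizes and chunk shapes, might look like:

```python
import numpy as np
import tables
import zarr
from numcodecs import Blosc
from timeit import default_timer as timer

# Placeholder synthetic data, smaller than the real 100000 x 20000 array.
data = np.random.randint(0, 100, size=(10000, 2000), dtype='i8')

# zarr array: placeholder chunk length along the first dimension, full rows
# along the second (chunk size None), blosclz compression.
z = zarr.array(data, chunks=(64, None),
               compressor=Blosc(cname='blosclz', clevel=5))

# PyTables CArray with the matching blosc:blosclz filter; the chunk shape is
# left for PyTables to guess.
h5 = tables.open_file('bench.h5', mode='w')
t = h5.create_carray(h5.root, 'data', obj=data,
                     filters=tables.Filters(complevel=5, complib='blosc:blosclz'))

def time_row_reads(arr, n_rows=1000):
    """Time reading the first n_rows rows one at a time."""
    start = timer()
    for i in range(n_rows):
        _ = arr[i]
    return timer() - start

print('zarr    :', time_row_reads(z))
print('pytables:', time_row_reads(t))
h5.close()
```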
Output:
zarr/numcodecs/tables installed using conda.