Suppressing ZipFile duplication warning #129


Closed
jakirkham opened this issue Feb 24, 2017 · 20 comments

@jakirkham (Member) commented Feb 24, 2017

Python's ZipFile allows writing duplicate entries and does so by default. Each time a duplicate is written, it issues a UserWarning. This occurs for each file and gets a bit noisy. As there doesn't seem to be a standard way of solving this, I would recommend that we simply suppress this warning. Combining this with a resolution to issue ( https://github.com/alimanfoo/zarr/issues/128 ) would ensure that deduplication occurs anyway, so the warning is no longer relevant.

Edit: Added link to Python bug after the fact.
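
For reference, a minimal sketch of the kind of suppression I have in mind, using only the standard library (the message filter assumes CPython's current "Duplicate name: ..." wording):

import warnings
import zipfile

with warnings.catch_warnings():
    # Hide only the duplicate-entry warning issued by ZipFile.write/writestr.
    warnings.filterwarnings('ignore', message='Duplicate name',
                            category=UserWarning)
    with zipfile.ZipFile('example.zip', mode='w') as zf:
        zf.writestr('0.0', b'first')
        zf.writestr('0.0', b'second')  # duplicate entry, no warning printed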

@alimanfoo (Member) commented:

FWIW I think this needs some consideration. If duplicate files are being written into a zip file, and this is happening often, then it is likely that something rather sub-optimal is happening. In the pathological case, a user could be storing many multiples of the actual data for an array without realising, then wonder why the zip file is so large.

Writing directly to a zip file is really only efficient if the array or arrays being stored in the zip store are written only once, and write operations can be perfectly aligned with chunk boundaries, in which case no duplicate chunk files will ever get created. This can be achieved if an array is created with zarr.array(data=data, ...) or with something like z = zarr.zeros(...); z[:] = data, as in either case zarr internally aligns the write operations with chunk boundaries.
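
For illustration, a rough sketch of the write-once, chunk-aligned pattern described above (file and array names are just examples):

import numpy as np
import zarr

data = np.arange(1000000)

# One whole-array write: zarr aligns the write operations with chunk
# boundaries internally, so each chunk entry is written exactly once.
store = zarr.ZipStore('data.zip', mode='w')
z = zarr.array(data, chunks=100000, store=store)
store.close()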

If the use case requires that data are written and then overwritten, and/or that write operations cannot be aligned with chunk boundaries, then a better approach is probably to initially store the data using DirectoryStore. Then when all writing has finished, the containing directory can be stored into a zip file using the standard command line zip utility. The resulting zip file can then be read directly by zarr without having to unpack. This should be at least as efficient in general as deduplicating a zip file on close by copying to a new zip file.
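
A rough sketch of that workflow (paths are illustrative; the zip step assumes the Info-ZIP command line tool):

import zarr

# Write and overwrite freely against a directory store.
store = zarr.DirectoryStore('data.zarr')
z = zarr.zeros(1000000, chunks=100000, store=store, overwrite=True)
z[:] = 42
z[10:20] = 7  # unaligned overwrite; harmless in a directory store

# When all writing has finished, zip up the directory from the shell.
# Entries should be stored relative to the directory root so that zarr
# sees the same keys as in the directory store, e.g.:
#
#   cd data.zarr && zip -r ../data.zip . && cd ..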

@jakirkham (Member Author) commented:

It's a fair point, honestly. That said, I really do like the idea of operating on a single file that includes all of the array data, and zip is nice because it is easy to inspect. I don't feel wedded to it, though, given the issues I'm already running into while playing with it. Is there some other reasonable storage type we could add to Zarr that wouldn't have these limitations?

@alimanfoo (Member) commented:

I don't know of anything better. Tar is apparently worse, as it doesn't support random access. It would be possible (if a little twisted) to use an HDF5 file, although you'd lose the ability to do multi-threaded reads (which, surprisingly, seem to work on a zip store). cc @mrocklin.

@mrocklin (Contributor) commented:

When I looked into this a long while ago, I found that yes, there are other single-file compression formats out there that support random access, but none seemed commonplace. Generally speaking, writing variable-sized byte blocks into a single file is a hard problem.

Another alternative would be an embedded key-value database. Zict has a MutableMapping wrapper for LMDB: https://github.com/dask/zict/blob/master/zict/lmdb.py This would be a single directory rather than a single file, but it balances large writes and many small writes well.
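
A rough sketch of what that might look like, assuming zict's LMDB mapping (which requires the lmdb package) accepts the str keys and bytes values zarr uses; untested:

import numpy as np
import zarr
from zict import LMDB

# A MutableMapping backed by a single LMDB directory.
store = LMDB('data.lmdb')
z = zarr.array(np.arange(1000), chunks=100, store=store)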

@alimanfoo (Member) commented:

Maybe shelve is an option? It supports the MutableMapping interface so you could probably just use a Shelf as a store without needing to write any new code...

@alimanfoo (Member) commented:

Ha, @mrocklin you get much kudos for advocating the MutableMapping interface...

Python 3.5.3 | packaged by conda-forge | (default, Jan 23 2017, 19:01:48) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import zarr
>>> zarr.__version__
'2.1.4'
>>> import numpy as np
>>> import shelve
>>> store = shelve.open('shelf')
>>> z = zarr.array(data=np.arange(1000), store=store, chunks=100)
>>> np.all(z[:] == np.arange(1000))
True
>>> sorted(store)
['.zarray', '.zattrs', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

@mrocklin (Contributor) commented:

Hooray standard interfaces!

@alimanfoo (Member) commented:

Looks like shelve supports multi-threaded reads...

In [1]: import zarr

In [2]: import shelve

In [4]: store = shelve.open('shelf')

In [5]: z = zarr.zeros(10000000000, dtype='i4', chunks=1000000, store=store, overwrite=True)

In [7]: %time z[:] = 42
CPU times: user 18.1 s, sys: 808 ms, total: 18.9 s
Wall time: 5.55 s

In [8]: !ls -lh shelf*
-rw-r--r-- 1 aliman aliman   25 Feb 24 23:29 shelf.bak
-rw-r--r-- 1 aliman aliman 199M Feb 24 23:29 shelf.dat
-rw-r--r-- 1 aliman aliman 259K Feb 24 23:29 shelf.dir

In [9]: z
Out[9]: 
Array((10000000000,), int32, chunks=(1000000,), order=C)
  nbytes: 37.3G; initialized: 10000/10000
  compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
  store: DbfilenameShelf

In [10]: import dask.array as da

In [11]: d = da.from_array(z, chunks=z.chunks)

In [14]: %time d.mean().compute()
CPU times: user 1min 13s, sys: 1.25 s, total: 1min 14s
Wall time: 12.6 s
Out[14]: 42.0

@alimanfoo (Member) commented:

A BerkeleyDB hash table would probably be another option.
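
Along these lines, perhaps, with the standard-library dbm module standing in for a proper BerkeleyDB binding such as bsddb3; a loose sketch, not tested against zarr's store requirements:

import dbm

import numpy as np
import zarr

# dbm gives a persistent, dict-like mapping of keys to bytes values.
store = dbm.open('data.db', 'c')  # 'c' creates the database if needed
z = zarr.array(np.arange(1000), chunks=100, store=store)
print(np.all(z[:] == np.arange(1000)))
store.close()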

@jakirkham (Member Author) commented:

The fact that Zarr uses a MutableMapping seems like a very useful thing. Not that I have looked into this at all, but I wonder if there are any key-value stores that would work well here.

@mrocklin (Contributor) commented Feb 24, 2017 via email

@alimanfoo (Member) commented:

Yes, any key-value store should be an option.

@jakirkham (Member Author) commented:

Thanks for the feedback. I'll give this some more thought.

@alimanfoo (Member) commented Feb 25, 2017 via email

@jakirkham (Member Author) commented:

I added a little bit of Python code to zip up the directories after they are written, in such a way that Zarr can still load them. This is a good enough near-term solution for my needs. I would be willing to contribute the utility function, or perhaps add another store, if there is interest.
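
For reference, the utility looks roughly like this (simplified; names are illustrative):

import os
import zipfile

def zip_zarr_dir(dirpath, zippath):
    # Store entries relative to the directory root so that zarr sees the
    # same keys as it would in a DirectoryStore. ZIP_STORED is used since
    # chunks are already compressed by zarr.
    with zipfile.ZipFile(zippath, mode='w', compression=zipfile.ZIP_STORED,
                         allowZip64=True) as zf:
        for root, dirs, files in os.walk(dirpath):
            for fn in files:
                fullpath = os.path.join(root, fn)
                zf.write(fullpath, arcname=os.path.relpath(fullpath, dirpath))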

@alimanfoo (Member) commented Feb 27, 2017 via email

@jakirkham (Member Author) commented:

Opened issue ( https://github.com/alimanfoo/zarr/issues/137 ) to keep track of this idea.

@jakirkham (Member Author) commented:

Forgot to mention that create_group, create_dataset, and open_group will add an empty .zattrs entry to start with. Thus, if the attributes need to be set or modified afterwards, this will create duplicate .zattrs entries in a zip file. Have raised issue ( https://github.com/alimanfoo/zarr/issues/121 ) to allow attrs to be specified in these creation functions.
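
A rough illustration of how the duplicates arise (with the zarr versions current at the time):

import zarr

store = zarr.ZipStore('example.zip', mode='w')
root = zarr.group(store=store)    # writes an empty .zattrs entry
root.attrs['units'] = 'metres'    # rewrites .zattrs -> duplicate entry
store.close()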

@alimanfoo (Member) commented Oct 18, 2017 via email
