-
-
Notifications
You must be signed in to change notification settings - Fork 330
Copy functions #217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Copy functions #217
Conversation
Hi @jakirkham, hopefully this addresses several requirements related to copying data, including copying between different zarr stores, and copying data between h5py and zarr. This is the last piece of work I was planning to include in v2.2 before release candidate. If you get a chance to take a look, any comments welcome. |
zarr/convenience.py
Outdated
@@ -348,3 +350,392 @@ def load(store): | |||
elif contains_group(store, path=None): | |||
grp = Group(store=store, path=None) | |||
return LazyLoader(grp) | |||
|
|||
|
|||
class _LogWriter(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is neat. I wonder if we should start using it with other things.
Also might be worth giving this a look.
zarr/convenience.py
Outdated
kws.setdefault('shuffle', True) | ||
else: | ||
# zarr -> zarr; preserve compression options by default | ||
kws.setdefault('compressor', source.compressor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we pick up filters
as well? Also what about things like fill_value
and order
? Anything else like this that we might need to get?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I've add tests and implementation to pass through filters, fill_value and order when copying zarr-to-zarr.
zarr/convenience.py
Outdated
kws.setdefault('compressor', source.compressor) | ||
|
||
# create new dataset in destination | ||
ds = dest.create_dataset(name, shape=source.shape, dtype=source.dtype, **kws) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on raising if dest
is a Datastet/Array? Admittedly this would happen already, but it might be worth providing a cleaner error than AttributeError
and a better message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now raises ValueError if dest is not a group.
zarr/convenience.py
Outdated
|
||
# copy data - N.B., if dest is h5py this will load all data into memory | ||
log('{} -> {}'.format(source.name, ds.name)) | ||
ds[:] = source |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth trying read_direct
to avoid memory issues. Not sure how well it works with a non-NumPy array, but it is an easy thing to explore that should help this issue.
Edit: Though I guess we would need read_direct
on our end to ensure we can do the same thing with Zarr Arrays. Admittedly we might have code that is pretty close to that already. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic here has been rewritten to copy chunk-by-chunk.
zarr/convenience.py
Outdated
elif root or not shallow: | ||
# copy a group | ||
|
||
# creat new group in destination |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: creat
-> create
zarr/convenience.py
Outdated
# copy a group | ||
|
||
# creat new group in destination | ||
grp = dest.create_group(name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on using require_group
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
zarr/convenience.py
Outdated
without_attrs=without_attrs, **create_kws) | ||
|
||
|
||
def tree(grp, expand=False, level=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
zarr/convenience.py
Outdated
# zarr -> h5py; use some vaguely sensible defaults | ||
kws.setdefault('compression', 'gzip') | ||
kws.setdefault('compression_opts', 1) | ||
kws.setdefault('shuffle', True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this seems like the only reasonable choice (besides no compression) to ensure compatibility in a wide array of situations. After all users can choose something different if they would like.
grp : Group | ||
Zarr or h5py group. | ||
expand : bool, optional | ||
Only relevant for HTML representation. If True, tree will be fully expanded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry this is a bit off topic and I may have forgotten the answer, do we warn if expand
is set to True
for the text representation? If not, we might want to look into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revisiting this, it doesn't look like this would be easy to do as we don't know what representation will be used when expand
is set.
zarr/convenience.py
Outdated
# copy a group | ||
|
||
# creat new group in destination | ||
grp = dest.create_group(name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before we starting creating groups or datasets/Arrays, we may want to do an initially pass of dest
to make sure there will be no conflicts when writing. That way we avoid doing a partial copy and then erroring out with incomplete state.
zarr/tests/test_convenience.py
Outdated
assert len(dest) == 2 | ||
assert 'foo' in dest | ||
assert 'bar/baz' not in dest | ||
assert 'bar/qux' in dest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to have some error tests. For example, what happens when a Group or value is already present? Do we have any incomplete copies for cases like this or is dest left unchanged?
zarr/tests/test_convenience.py
Outdated
assert a.compressor == spam.compressor | ||
assert_array_equal(a[:], spam[:]) | ||
assert 'foo' not in eggs | ||
assert 'bar' not in eggs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to have some error tests here as well. For example, what happens when a Group or Dataset/Array is already present? Do we have any incomplete copies for cases like this or is dest left unchanged?
Thanks @jakirkham, all great comments.
|
Thanks for tackling this @alimanfoo. Very excited to see this coming together. Also excited to see that 2.2 is fast approaching. Tried to add some comments and questions above. The biggest things on my mind ATM can be summarized in three questions. First how do we ensure that copies of Datasets/Arrays are done in a memory efficient manner (as they whole thing may not fit)? Second how do we protect against copying erroring out part way and leaving the destination in an incomplete state (related how do we test this)? Third how do we handle more complex compression/filtering configurations? The last one may be a bit more of conscious choice between what is built in and what should be up to end users and could benefit from some docs/advice along these lines. |
I have a proposal for how to handle a couple of these points. Re copying without using too much memory, I've modified the array copy code to perform a chunk-by-chunk copy. Re protecting against an incomplete copy, I have added two parameters If you want to check whether a copy will proceed to completion without errors, you can use In [21]: source = zarr.group()
In [22]: dest = zarr.group()
In [23]: source.create_dataset('foo/bar/baz', data=np.arange(100))
Out[23]: <zarr.core.Array '/foo/bar/baz' (100,) int64>
In [24]: source.create_dataset('foo/spam', data=np.arange(1000))
Out[24]: <zarr.core.Array '/foo/spam' (1000,) int64>
In [25]: dest.create_dataset('foo/spam', data=np.arange(1000))
Out[25]: <zarr.core.Array '/foo/spam' (1000,) int64>
In [26]: zarr.copy(source['foo'], dest, log=sys.stdout, dry_run=True)
copy /foo
copy /foo/bar
---------------------------------------------------------------------------
...
ValueError: an object 'spam' already exists in destination '/foo' Alternatively, if you want to replace any arrays in the destination, use In [28]: zarr.copy(source['foo'], dest, log=sys.stdout, if_exists='replace', dry_run=True)
copy /foo
copy /foo/bar
copy /foo/spam (1000,) int64
dry run: 3 copy, 0 skip Another alternative is In [29]: zarr.copy(source['foo'], dest, log=sys.stdout, if_exists='skip', dry_run=True)
copy /foo
copy /foo/bar
skip /foo/spam (1000,) int64
dry run: 2 copy, 1 skip A final alternative is I still have to add more tests covering all these options, but thought I'd post this now to check this all sounds reasonable. |
self.source_h5py = True | ||
self.dest_h5py = True | ||
self.new_source = temp_h5f | ||
self.new_dest = temp_h5f |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if I overlooked it, but was there a ZarrToZarr case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sorry, it's the super-class, i.e., TestCopy implements zarr-to-zarr. I could make this more obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
|
||
def temp_h5f(): | ||
fn = tempfile.mktemp() | ||
atexit.register(os.remove, fn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of a silly question, but should we be closing this first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean closing the HDF5 file at exit, before removing the file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes probably should. Easiest thing might be to call close() inside the test class tearDown rather than try to handle at exit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
More generally I think your copying proposal is reasonable. Kind of would like to see default behavior perform the dry run first and then copy (if there were no exceptions in the dry run). Not sure if this would be a new option in this context or what. Though I guess this is easy for an end user to do as well. What do you think? |
I started implementing a dry run as part of the copy, but then it seemed less complicated and more flexible if the user can have control over being able to do a dry run with whatever other options they like. Also I like the way rsync works and so have been modelling the options around that. |
Believe that you are right especially given that users are likely dealing with slow storage mediums (e.g. NFS). |
OK, I've added a tutorial section and release notes, I think this is good to go. Happy holidays @jakirkham, I'll merge this in the new year and go for 2.2 release. |
Thanks @alimanfoo. Happy Holidays! Should be able to take another look on the 2nd if you’d like. Though am happy to trust your judgment here. |
docs/tutorial.rst
Outdated
|
||
If you have some data in an HDF5 file and would like to copy some or all of it | ||
into a Zarr group, or vice-versa, the :func:`zarr.convenience.copy` and | ||
:func:`zarr.convenience.copyall` functions can be used. Here's an example |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be copy_all
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed.
docs/tutorial.rst
Outdated
>>> source.close() | ||
|
||
If rather than copying a single group or dataset you would like to copy all | ||
groups and datasets, use :func:`zarr.convenience.copyall`, e.g.:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be copy_all
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed.
Merging soon if no further comments. |
Thanks @alimanfoo. Meant to give this a closer look today, but I was too slow. 😄 That said, this looked pretty good the last time I went through it. Happy to see it in. |
This PR adds some convenience functions for copying data between zarr groups or stores, or for copying data between h5py and zarr. The following new functions are proposed:
zarr.copy(source, dest, ...)
- copy thesource
h5py or zarr group or array into thedest
h5py or zarr group.zarr.copy_all(source, dest, ...)
- copy all children of thesource
h5py or zarr group into thedest
h5py or zarr group.zarr.copy_store(source, dest, ...)
- copy all (key, value) pairs from thesource
store to thedest
store.Further details and examples are given in the function docstrings.
Resolves #87, resolves #113, resolves #137.
TODO:
POSTPONED:
delete=True/False
option analogous to rsync?