Copy functions #217

Merged: alimanfoo merged 25 commits into master from copy-20171208 on Jan 2, 2018

Conversation

@alimanfoo (Member) commented Dec 9, 2017

This PR adds some convenience functions for copying data between zarr groups or stores, or for copying data between h5py and zarr. The following new functions are proposed:

  • zarr.copy(source, dest, ...) - copy the source h5py or zarr group or array into the dest h5py or zarr group.
  • zarr.copy_all(source, dest, ...) - copy all children of the source h5py or zarr group into the dest h5py or zarr group.
  • zarr.copy_store(source, dest, ...) - copy all (key, value) pairs from the source store to the dest store.

Further details and examples are given in the function docstrings.
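As a rough usage sketch (hypothetical file and group names; see the docstrings for the real details):

import h5py
import zarr

# copy a single group or array from an HDF5 file into a zarr group
source = h5py.File('data.h5', mode='r')
dest = zarr.open_group('data.zarr', mode='w')
zarr.copy(source['foo'], dest)

# copy all children of the HDF5 root group into the zarr group
zarr.copy_all(source, dest)

# copy raw (key, value) pairs between two zarr stores
store_src = zarr.DirectoryStore('data.zarr')
store_dst = zarr.ZipStore('data.zip', mode='w')
zarr.copy_store(store_src, store_dst)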

Resolves #87, resolves #113, resolves #137.

TODO:

  • consider using require_group to allow for destination groups that already exist
  • bring test coverage up
  • address all @jakirkham review comments
  • tutorial section
  • release notes

POSTPONED:

  • consider Group.copy() method?
  • consider delete=True/False option analogous to rsync?

@alimanfoo alimanfoo added the in progress Someone is currently working on this label Dec 9, 2017
@alimanfoo alimanfoo added this to the v2.2 milestone Dec 9, 2017
@alimanfoo (Member Author):

Hi @jakirkham, hopefully this addresses several requirements related to copying data, including copying between different zarr stores, and copying data between h5py and zarr. This is the last piece of work I was planning to include in v2.2 before the release candidate. If you get a chance to take a look, any comments are welcome.

@alimanfoo alimanfoo added the enhancement New features or improvements label Dec 9, 2017
@@ -348,3 +350,392 @@ def load(store):
elif contains_group(store, path=None):
grp = Group(store=store, path=None)
return LazyLoader(grp)


class _LogWriter(object):
@jakirkham (Member):

This is neat. I wonder if we should start using it with other things.

Also might be worth giving this a look.

kws.setdefault('shuffle', True)
else:
# zarr -> zarr; preserve compression options by default
kws.setdefault('compressor', source.compressor)
@jakirkham (Member):

Should we pick up filters as well? Also what about things like fill_value and order? Anything else like this that we might need to get?

@alimanfoo (Member Author):

Good catch, I've added tests and implementation to pass through filters, fill_value and order when copying zarr-to-zarr.
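For illustration, a minimal sketch of that pass-through (names taken from the surrounding diff; not necessarily the exact code added):

# zarr -> zarr; preserve array configuration by default
kws.setdefault('compressor', source.compressor)
kws.setdefault('filters', source.filters)
kws.setdefault('fill_value', source.fill_value)
kws.setdefault('order', source.order)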

kws.setdefault('compressor', source.compressor)

# create new dataset in destination
ds = dest.create_dataset(name, shape=source.shape, dtype=source.dtype, **kws)
@jakirkham (Member):

Thoughts on raising if dest is a Dataset/Array? Admittedly this would happen already, but it might be worth providing a cleaner error than AttributeError and a better message.

@alimanfoo (Member Author):

Now raises ValueError if dest is not a group.


# copy data - N.B., if dest is h5py this will load all data into memory
log('{} -> {}'.format(source.name, ds.name))
ds[:] = source
@jakirkham (Member) commented Dec 9, 2017:

Might be worth trying read_direct to avoid memory issues. Not sure how well it works with a non-NumPy array, but it is an easy thing to explore that should help this issue.

Edit: Though I guess we would need read_direct on our end to ensure we can do the same thing with Zarr Arrays. Admittedly we might have code that is pretty close to that already. Thoughts?

@alimanfoo (Member Author):

The logic here has been rewritten to copy chunk-by-chunk.
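For reference, a minimal sketch of the chunk-by-chunk idea for a 1-D array (illustrative only, not the actual implementation):

def copy_chunks_1d(source, ds):
    # copy in chunk-sized blocks so only one chunk is held in memory at a time
    chunk_len = ds.chunks[0]
    for start in range(0, source.shape[0], chunk_len):
        stop = min(start + chunk_len, source.shape[0])
        ds[start:stop] = source[start:stop]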

elif root or not shallow:
# copy a group

# creat new group in destination
@jakirkham (Member):

nit: creat -> create

# copy a group

# creat new group in destination
grp = dest.create_group(name)
@jakirkham (Member):

Thoughts on using require_group instead?

@alimanfoo (Member Author):

Done.

without_attrs=without_attrs, **create_kws)


def tree(grp, expand=False, level=None):
@jakirkham (Member):

👍

# zarr -> h5py; use some vaguely sensible defaults
kws.setdefault('compression', 'gzip')
kws.setdefault('compression_opts', 1)
kws.setdefault('shuffle', True)
@jakirkham (Member):

Yeah, this seems like the only reasonable choice (besides no compression) to ensure compatibility in a wide array of situations. After all, users can choose something different if they would like.

grp : Group
Zarr or h5py group.
expand : bool, optional
Only relevant for HTML representation. If True, tree will be fully expanded.
@jakirkham (Member):

Sorry, this is a bit off topic and I may have forgotten the answer: do we warn if expand is set to True for the text representation? If not, we might want to look into it.

@jakirkham (Member):

Revisiting this, it doesn't look like this would be easy to do as we don't know what representation will be used when expand is set.

# copy a group

# creat new group in destination
grp = dest.create_group(name)
@jakirkham (Member):

Before we start creating groups or datasets/Arrays, we may want to do an initial pass over dest to make sure there will be no conflicts when writing. That way we avoid doing a partial copy and then erroring out with incomplete state.

assert len(dest) == 2
assert 'foo' in dest
assert 'bar/baz' not in dest
assert 'bar/qux' in dest
@jakirkham (Member) commented Dec 9, 2017:

Would be good to have some error tests. For example, what happens when a Group or value is already present? Do we have any incomplete copies for cases like this or is dest left unchanged?

assert a.compressor == spam.compressor
assert_array_equal(a[:], spam[:])
assert 'foo' not in eggs
assert 'bar' not in eggs
@jakirkham (Member):

Would be good to have some error tests here as well. For example, what happens when a Group or Dataset/Array is already present? Do we have any incomplete copies for cases like this or is dest left unchanged?

@alimanfoo (Member Author) commented Dec 9, 2017 via email

@jakirkham (Member):

Thanks for tackling this @alimanfoo. Very excited to see this coming together. Also excited to see that 2.2 is fast approaching.

Tried to add some comments and questions above. The biggest things on my mind ATM can be summarized in three questions. First, how do we ensure that copies of Datasets/Arrays are done in a memory-efficient manner (as the whole thing may not fit)? Second, how do we protect against copying erroring out part way and leaving the destination in an incomplete state (and, relatedly, how do we test this)? Third, how do we handle more complex compression/filtering configurations? The last one may be more of a conscious choice between what is built in and what should be up to end users, and could benefit from some docs/advice along these lines.

@alimanfoo (Member Author):

I have a proposal for how to handle a couple of these points.

Re copying without using too much memory, I've modified the array copy code to perform a chunk-by-chunk copy.

Re protecting against an incomplete copy, I have added two parameters if_exists and dry_run which can be used in several ways.

If you want to check whether a copy will proceed to completion without errors, you can use if_exists='raise' (default) with dry_run=True. This means an error is raised if an array already exists in the destination, and no actual copying will be done (dry run). E.g.:

In [21]: source = zarr.group()

In [22]: dest = zarr.group()

In [23]: source.create_dataset('foo/bar/baz', data=np.arange(100))
Out[23]: <zarr.core.Array '/foo/bar/baz' (100,) int64>

In [24]: source.create_dataset('foo/spam', data=np.arange(1000))
Out[24]: <zarr.core.Array '/foo/spam' (1000,) int64>

In [25]: dest.create_dataset('foo/spam', data=np.arange(1000))
Out[25]: <zarr.core.Array '/foo/spam' (1000,) int64>

In [26]: zarr.copy(source['foo'], dest, log=sys.stdout, dry_run=True)
copy /foo
copy /foo/bar
---------------------------------------------------------------------------
...
ValueError: an object 'spam' already exists in destination '/foo'

Alternatively, if you want to replace any arrays in the destination, use if_exists='replace', e.g.:

In [28]: zarr.copy(source['foo'], dest, log=sys.stdout, if_exists='replace', dry_run=True)
copy /foo
copy /foo/bar
copy /foo/spam (1000,) int64
dry run: 3 copy, 0 skip

Another alternative is if_exists='skip', which will skip any arrays that exist in the destination:

In [29]: zarr.copy(source['foo'], dest, log=sys.stdout, if_exists='skip', dry_run=True)
copy /foo
copy /foo/bar
skip /foo/spam (1000,) int64
dry run: 2 copy, 1 skip

A final alternative is if_exists='skip_initialized', which will skip any arrays that exist in the destination and which have all chunks initialized. This provides a way to restart a copy that broke in the middle for some reason (e.g., disk full, network error, ...) without having to re-copy any array that was previously fully copied. This mode is only available when copying to zarr, as I can't find any way in h5py to query the number of chunks initialized in an array.

I still have to add more tests covering all these options, but thought I'd post this now to check this all sounds reasonable.
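For what it's worth, a rough sketch of how the skip_initialized check could work on the zarr side (relying on zarr's Array.nchunks_initialized and nchunks properties; not necessarily the exact logic used in the PR):

def fully_initialized(arr):
    # True if every chunk of a zarr array has been written at least once
    return arr.nchunks_initialized == arr.nchunks

# during the copy, an existing destination array would only be skipped when
# if_exists == 'skip_initialized' and fully_initialized(dest_array) is True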

@alimanfoo alimanfoo changed the title WIP copy functions Copy functions Dec 11, 2017
self.source_h5py = True
self.dest_h5py = True
self.new_source = temp_h5f
self.new_dest = temp_h5f
@jakirkham (Member):

Sorry if I overlooked it, but was there a ZarrToZarr case?

@alimanfoo (Member Author):

Yes, sorry, it's the super-class, i.e., TestCopy implements zarr-to-zarr. I could make this more obvious.

@jakirkham (Member):

Thanks.


def temp_h5f():
fn = tempfile.mktemp()
atexit.register(os.remove, fn)
@jakirkham (Member):

Kind of a silly question, but should we be closing this first?

@alimanfoo (Member Author):

Do you mean closing the HDF5 file at exit, before removing the file?

@jakirkham (Member):

Yep.

@alimanfoo (Member Author):

Yes, probably should. Easiest thing might be to call close() inside the test class tearDown rather than trying to handle it at exit.
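As a rough illustration of that approach (hypothetical fixture and class names; a sketch only):

import atexit
import os
import tempfile
import unittest

import h5py

class TestCopyHDF5(unittest.TestCase):

    def setUp(self):
        fn = tempfile.mktemp()
        atexit.register(os.remove, fn)  # remove the temp file at exit
        self.h5f = h5py.File(fn, mode='w')

    def tearDown(self):
        # close the HDF5 file explicitly rather than relying on atexit
        self.h5f.close()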

@jakirkham (Member):

SGTM

@jakirkham (Member):

More generally, I think your copying proposal is reasonable. Kind of would like to see the default behavior perform the dry run first and then copy (if there were no exceptions in the dry run). Not sure if this would be a new option in this context or what. Though I guess this is easy for an end user to do as well. What do you think?

@alimanfoo (Member Author):

I started implementing a dry run as part of the copy, but it seemed less complicated and more flexible to give the user control over doing a dry run with whatever other options they like. Also, I like the way rsync works and so have been modelling the options around that.

@jakirkham (Member):

Believe that you are right, especially given that users are likely dealing with slow storage media (e.g. NFS).

@alimanfoo (Member Author):

OK, I've added a tutorial section and release notes; I think this is good to go. Happy holidays @jakirkham, I'll merge this in the new year and go for the 2.2 release.

@jakirkham (Member):

Thanks @alimanfoo. Happy Holidays! Should be able to take another look on the 2nd if you’d like. Though am happy to trust your judgment here.


If you have some data in an HDF5 file and would like to copy some or all of it
into a Zarr group, or vice-versa, the :func:`zarr.convenience.copy` and
:func:`zarr.convenience.copyall` functions can be used. Here's an example
@jakirkham (Member):

Should be copy_all?

@alimanfoo (Member Author):

Thanks, fixed.

>>> source.close()

If rather than copying a single group or dataset you would like to copy all
groups and datasets, use :func:`zarr.convenience.copyall`, e.g.::
@jakirkham (Member):

Should be copy_all?

@alimanfoo (Member Author):

Thanks, fixed.

@alimanfoo (Member Author):

Merging soon if no further comments.

@alimanfoo alimanfoo merged commit e25d843 into master Jan 2, 2018
@alimanfoo alimanfoo deleted the copy-20171208 branch January 2, 2018 18:18
@jakirkham (Member):

Thanks @alimanfoo. Meant to give this a closer look today, but I was too slow. 😄 That said, this looked pretty good the last time I went through it. Happy to see it in.

Labels: enhancement (New features or improvements); in progress (Someone is currently working on this)
Projects: None yet
Development: successfully merging this pull request may close these issues: "Store conversion methods", "ENH: copy method HDF5 to Zarr"
2 participants