Copy functions #217

Merged: alimanfoo merged 25 commits into master from copy-20171208 on Jan 2, 2018

Conversation

@alimanfoo (Member) commented Dec 9, 2017

This PR adds some convenience functions for copying data between zarr groups or stores, or for copying data between h5py and zarr. The following new functions are proposed:

  • zarr.copy(source, dest, ...) - copy the source h5py or zarr group or array into the dest h5py or zarr group.
  • zarr.copy_all(source, dest, ...) - copy all children of the source h5py or zarr group into the dest h5py or zarr group.
  • zarr.copy_store(source, dest, ...) - copy all (key, value) pairs from the source store to the dest store.

Further details and examples are given in the function docstrings.
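As a rough usage sketch (hypothetical file and group names; see the docstrings for the real details):

import h5py
import zarr

# copy a single group or array from an HDF5 file into a zarr group
source = h5py.File('data.h5', mode='r')
dest = zarr.open_group('data.zarr', mode='w')
zarr.copy(source['foo'], dest)

# copy all children of the HDF5 root group into the zarr group
zarr.copy_all(source, dest)

# copy raw (key, value) pairs between two zarr stores
store_src = zarr.DirectoryStore('data.zarr')
store_dst = zarr.ZipStore('data.zip', mode='w')
zarr.copy_store(store_src, store_dst)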

Resolves #87, resolves #113, resolves #137.

TODO:

  • consider using require_group to allow for destination groups that already exist
  • bring test coverage up
  • address all @jakirkham review comments
  • tutorial section
  • release notes

POSTPONED:

  • consider Group.copy() method?
  • consider delete=True/False option analogous to rsync?

@alimanfoo alimanfoo added the in progress Someone is currently working on this label Dec 9, 2017
@alimanfoo alimanfoo added this to the v2.2 milestone Dec 9, 2017
@alimanfoo (Member Author):

Hi @jakirkham, hopefully this addresses several requirements related to copying data, including copying between different zarr stores, and copying data between h5py and zarr. This is the last piece of work I was planning to include in v2.2 before the release candidate. If you get a chance to take a look, any comments are welcome.

@alimanfoo alimanfoo added the enhancement New features or improvements label Dec 9, 2017
@@ -348,3 +350,392 @@ def load(store):
elif contains_group(store, path=None):
grp = Group(store=store, path=None)
return LazyLoader(grp)


class _LogWriter(object):
@jakirkham (Member):

This is neat. I wonder if we should start using it with other things.

Also might be worth giving this a look.

kws.setdefault('shuffle', True)
else:
# zarr -> zarr; preserve compression options by default
kws.setdefault('compressor', source.compressor)
@jakirkham (Member):

Should we pick up filters as well? Also what about things like fill_value and order? Anything else like this that we might need to get?

@alimanfoo (Member Author):

Good catch, I've added tests and implementation to pass through filters, fill_value and order when copying zarr-to-zarr.
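For illustration, a minimal sketch of that pass-through (names taken from the surrounding diff; not necessarily the exact code added):

# zarr -> zarr; preserve array configuration by default
kws.setdefault('compressor', source.compressor)
kws.setdefault('filters', source.filters)
kws.setdefault('fill_value', source.fill_value)
kws.setdefault('order', source.order)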

kws.setdefault('compressor', source.compressor)

# create new dataset in destination
ds = dest.create_dataset(name, shape=source.shape, dtype=source.dtype, **kws)
@jakirkham (Member):

Thoughts on raising if dest is a Dataset/Array? Admittedly this would happen already, but it might be worth providing a cleaner error than AttributeError and a better message.

@alimanfoo (Member Author):

Now raises ValueError if dest is not a group.


# copy data - N.B., if dest is h5py this will load all data into memory
log('{} -> {}'.format(source.name, ds.name))
ds[:] = source
@jakirkham (Member) commented Dec 9, 2017:

Might be worth trying read_direct to avoid memory issues. Not sure how well it works with a non-NumPy array, but it is an easy thing to explore that should help this issue.

Edit: Though I guess we would need read_direct on our end to ensure we can do the same thing with Zarr Arrays. Admittedly we might have code that is pretty close to that already. Thoughts?

@alimanfoo (Member Author):

The logic here has been rewritten to copy chunk-by-chunk.
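For reference, a minimal sketch of the chunk-by-chunk idea for a 1-D array (illustrative only, not the actual implementation):

def copy_chunks_1d(source, ds):
    # copy in chunk-sized blocks so only one chunk is held in memory at a time
    chunk_len = ds.chunks[0]
    for start in range(0, source.shape[0], chunk_len):
        stop = min(start + chunk_len, source.shape[0])
        ds[start:stop] = source[start:stop]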

elif root or not shallow:
# copy a group

# creat new group in destination
@jakirkham (Member):

nit: creat -> create

# copy a group

# creat new group in destination
grp = dest.create_group(name)
@jakirkham (Member):

Thoughts on using require_group instead?

@alimanfoo (Member Author):

Done.

without_attrs=without_attrs, **create_kws)


def tree(grp, expand=False, level=None):
@jakirkham (Member):

👍

# zarr -> h5py; use some vaguely sensible defaults
kws.setdefault('compression', 'gzip')
kws.setdefault('compression_opts', 1)
kws.setdefault('shuffle', True)
@jakirkham (Member):

Yeah, this seems like the only reasonable choice (besides no compression) to ensure compatibility in a wide array of situations. After all, users can choose something different if they would like.

grp : Group
Zarr or h5py group.
expand : bool, optional
Only relevant for HTML representation. If True, tree will be fully expanded.
@jakirkham (Member):

Sorry, this is a bit off topic and I may have forgotten the answer: do we warn if expand is set to True for the text representation? If not, we might want to look into it.

@jakirkham (Member):

Revisiting this, it doesn't look like this would be easy to do as we don't know what representation will be used when expand is set.

# copy a group

# creat new group in destination
grp = dest.create_group(name)
@jakirkham (Member):

Before we start creating groups or datasets/Arrays, we may want to do an initial pass over dest to make sure there will be no conflicts when writing. That way we avoid doing a partial copy and then erroring out with incomplete state.

assert len(dest) == 2
assert 'foo' in dest
assert 'bar/baz' not in dest
assert 'bar/qux' in dest
@jakirkham (Member) commented Dec 9, 2017:

Would be good to have some error tests. For example, what happens when a Group or value is already present? Do we have any incomplete copies for cases like this or is dest left unchanged?

assert a.compressor == spam.compressor
assert_array_equal(a[:], spam[:])
assert 'foo' not in eggs
assert 'bar' not in eggs
@jakirkham (Member):

Would be good to have some error tests here as well. For example, what happens when a Group or Dataset/Array is already present? Do we have any incomplete copies for cases like this or is dest left unchanged?

@alimanfoo (Member Author) commented Dec 9, 2017 via email

@jakirkham (Member):

Thanks for tackling this @alimanfoo. Very excited to see this coming together. Also excited to see that 2.2 is fast approaching.

Tried to add some comments and questions above. The biggest things on my mind ATM can be summarized in three questions. First, how do we ensure that copies of Datasets/Arrays are done in a memory-efficient manner (as the whole thing may not fit)? Second, how do we protect against copying erroring out part way and leaving the destination in an incomplete state (and, relatedly, how do we test this)? Third, how do we handle more complex compression/filtering configurations? The last one may be more of a conscious choice between what is built in and what should be up to end users, and could benefit from some docs/advice along these lines.

@alimanfoo (Member Author):

I have a proposal for how to handle a couple of these points.

Re copying without using too much memory, I've modified the array copy code to perform a chunk-by-chunk copy.

Re protecting against an incomplete copy, I have added two parameters if_exists and dry_run which can be used in several ways.

If you want to check whether a copy will proceed to completion without errors, you can use if_exists='raise' (default) with dry_run=True. This means an error is raised if an array already exists in the destination, and no actual copying will be done (dry run). E.g.:

In [21]: source = zarr.group()

In [22]: dest = zarr.group()

In [23]: source.create_dataset('foo/bar/baz', data=np.arange(100))
Out[23]: <zarr.core.Array '/foo/bar/baz' (100,) int64>

In [24]: source.create_dataset('foo/spam', data=np.arange(1000))
Out[24]: <zarr.core.Array '/foo/spam' (1000,) int64>

In [25]: dest.create_dataset('foo/spam', data=np.arange(1000))
Out[25]: <zarr.core.Array '/foo/spam' (1000,) int64>

In [26]: zarr.copy(source['foo'], dest, log=sys.stdout, dry_run=True)
copy /foo
copy /foo/bar
---------------------------------------------------------------------------
...
ValueError: an object 'spam' already exists in destination '/foo'

Alternatively, if you want to replace any arrays in the destination, use if_exists='replace', e.g.:

In [28]: zarr.copy(source['foo'], dest, log=sys.stdout, if_exists='replace', dry_run=True)
copy /foo
copy /foo/bar
copy /foo/spam (1000,) int64
dry run: 3 copy, 0 skip

Another alternative is if_exists='skip', which will skip any arrays that exist in the destination:

In [29]: zarr.copy(source['foo'], dest, log=sys.stdout, if_exists='skip', dry_run=True)
copy /foo
copy /foo/bar
skip /foo/spam (1000,) int64
dry run: 2 copy, 1 skip

A final alternative is if_exists='skip_initialized', which will skip any arrays that exist in the destination and which have all chunks initialized. This provides a way to restart a copy that broke in the middle for some reason (e.g., disk full, network error, ...) without having to re-copy any array that was previously fully copied. This mode is only available when copying to zarr, as I can't find any way in h5py to query the number of chunks initialized in an array.

I still have to add more tests covering all these options, but thought I'd post this now to check this all sounds reasonable.
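For what it's worth, a rough sketch of how the skip_initialized check could work on the zarr side (relying on zarr's Array.nchunks_initialized and nchunks properties; not necessarily the exact logic used in the PR):

def fully_initialized(arr):
    # True if every chunk of a zarr array has been written at least once
    return arr.nchunks_initialized == arr.nchunks

# during the copy, an existing destination array would only be skipped when
# if_exists == 'skip_initialized' and fully_initialized(dest_array) is True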

@alimanfoo alimanfoo changed the title WIP copy functions Copy functions Dec 11, 2017
self.source_h5py = True
self.dest_h5py = True
self.new_source = temp_h5f
self.new_dest = temp_h5f
@jakirkham (Member):

Sorry if I overlooked it, but was there a ZarrToZarr case?

@alimanfoo (Member Author):

Yes, sorry, it's the super-class, i.e., TestCopy implements zarr-to-zarr. I could make this more obvious.

@jakirkham (Member):

Thanks.


def temp_h5f():
fn = tempfile.mktemp()
atexit.register(os.remove, fn)
@jakirkham (Member):

Kind of a silly question, but should we be closing this first?

@alimanfoo (Member Author):

Do you mean closing the HDF5 file at exit, before removing the file?

@jakirkham (Member):

Yep.

@alimanfoo (Member Author):

Yes, probably should. Easiest thing might be to call close() inside the test class tearDown rather than trying to handle it at exit.
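As a rough illustration of that approach (hypothetical fixture and class names; a sketch only):

import atexit
import os
import tempfile
import unittest

import h5py

class TestCopyHDF5(unittest.TestCase):

    def setUp(self):
        fn = tempfile.mktemp()
        atexit.register(os.remove, fn)  # remove the temp file at exit
        self.h5f = h5py.File(fn, mode='w')

    def tearDown(self):
        # close the HDF5 file explicitly rather than relying on atexit
        self.h5f.close()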

@jakirkham (Member):

SGTM

@jakirkham (Member):

More generally, I think your copying proposal is reasonable. Kind of would like to see the default behavior perform the dry run first and then copy (if there were no exceptions in the dry run). Not sure if this would be a new option in this context or what. Though I guess this is easy for an end user to do as well. What do you think?

@alimanfoo (Member Author):

I started implementing a dry run as part of the copy, but it seemed less complicated and more flexible to give the user control over doing a dry run with whatever other options they like. Also, I like the way rsync works and so have been modelling the options around that.

@jakirkham (Member):

Believe that you are right, especially given that users are likely dealing with slow storage media (e.g. NFS).

@alimanfoo (Member Author):

OK, I've added a tutorial section and release notes; I think this is good to go. Happy holidays @jakirkham, I'll merge this in the new year and go for the 2.2 release.

@jakirkham (Member):

Thanks @alimanfoo. Happy Holidays! Should be able to take another look on the 2nd if you’d like. Though am happy to trust your judgment here.


If you have some data in an HDF5 file and would like to copy some or all of it
into a Zarr group, or vice-versa, the :func:`zarr.convenience.copy` and
:func:`zarr.convenience.copyall` functions can be used. Here's an example
@jakirkham (Member):

Should be copy_all?

@alimanfoo (Member Author):

Thanks, fixed.

>>> source.close()

If rather than copying a single group or dataset you would like to copy all
groups and datasets, use :func:`zarr.convenience.copyall`, e.g.::
@jakirkham (Member):

Should be copy_all?

@alimanfoo (Member Author):

Thanks, fixed.

@alimanfoo (Member Author):

Merging soon if no further comments.

@alimanfoo alimanfoo merged commit e25d843 into master Jan 2, 2018
@alimanfoo alimanfoo deleted the copy-20171208 branch January 2, 2018 18:18
@jakirkham (Member):

Thanks @alimanfoo. Meant to give this a closer look today, but I was too slow. 😄 That said, this looked pretty good the last time I went through it. Happy to see it in.

Labels: enhancement (New features or improvements); in progress (Someone is currently working on this)
Projects: None yet
Development: successfully merging this pull request may close these issues: "Store conversion methods", "ENH: copy method HDF5 to Zarr"
2 participants