Skip to content

AsType Filter #96

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jan 4, 2017
Merged

AsType Filter #96

merged 6 commits into from
Jan 4, 2017

Conversation

jakirkham
Copy link
Member

Fixes https://github.com/alimanfoo/zarr/issues/94

Allows data to be stored with one data type. When read in with this filter, it is automatically converted to another type specified. Based largely off the Delta filter, but tweaked a bit. Added a few tests to make sure this is working correctly.

@jakirkham
Copy link
Member Author

FWICT both based on the tests and playing with it. This does what I would want. Namely store data on disk as one type and read it in as another (e.g. int16 to float32).

enc = np.empty_like(arr, dtype=self.dtype)

# Convert and copy
enc[...] = arr.astype(self.astype)
Copy link
Member

@alimanfoo alimanfoo Dec 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be enc[...] = arr.astype(self.dtype).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed.

@alimanfoo
Copy link
Member

I'm not sure this will work as intended with just this filter added to the array filters. I think the problem is that within Array.__getitem__ the output numpy array is set up assuming the dtype from the zarr array metadata. So even if the filters do a dtype conversion, the actual numpy array that will get returned from __getitem__ will generally be of the same dtype as the parent zarr array.

This is why I think you also need to create a view, because you need to also modify the dtype attribute on the parent zarr array, as well as adding in the filter to perform the astype conversion.

I think that if you were to call the view() method on the zarr Array class, adding in the extra AsType filter to the list of filters, and changing the dtype, you should get the desired behaviour. I.e., something like:

z1 = zarr.zeros(1000, chunks=100, dtype='i2')
z2 = z1.view(filters=[zarr.AsType('i2', 'f4')], dtype='f4')

However I haven't tested this, there may be other issues we'd need to resolve to make this work.


def astype(self, dtype):
filt = AsType(dtype=self._dtype, astype=dtype)
return self.view(filters=[filt], dtype=dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will work if self has no filters, but I think would need to check if there are any filters on the self array, and if so make a new filters list by copying the existing list and prepending the new AsType filter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. That would explain the test failure I was seeing. 😆

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now fixed in the latest changes.

@jakirkham
Copy link
Member Author

Thanks for the suggestion above about using view. Was looking at that a little bit last night, but wasn't sure whether I should be going that route or not. The nudge in the right direction helped.

Figured that wrapping it in an astype method should make the interaction with the filters a little nicer and a little more accessible. It's not a one-to-one of the h5py astype function, but I'm not sure that it is desirable for it to be. This is a bit more like NumPy's astype so hopefully a bit more familiar to end users.

@alimanfoo
Copy link
Member

alimanfoo commented Dec 7, 2016 via email

@jakirkham
Copy link
Member Author

Alright, finishing a couple more things ATM. Almost done.

@alimanfoo
Copy link
Member

No rush, whenever you're happy.


filters = []
if self._filters:
filters.extend(self._filters)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry had just realized _filters could be None. So had to update this like so. Please let me know if this is ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok to me.

astype = np.dtype(np.float32)

store = dict()
init_array(store, shape=shape, chunks=10, dtype=dtype)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this test to handle the case where self._filters is None.


data = np.arange(np.prod(shape), dtype=dtype).reshape(shape)

z1 = self.create_array(shape=shape, chunks=chunks, dtype=dtype)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this case for when there are other filters.

----------
dtype : string or dtype
NumPy dtype.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So one could technically write to disk when using astype. I don't know if we should add a note or warning about this. Maybe something similar to what view has. Please let me know if you have any thoughts on this.

@jakirkham
Copy link
Member Author

jakirkham commented Dec 7, 2016

Cool.

Sorry had found a bug when I saw you post that. So wanted to make sure I got it fixed first. Have since fixed it and added a test for that case as noted in the diff comments.

I'm happy with this as is. Just not sure if we should warn people about using astype much in the same way as there is a note about view. Otherwise this seems to be ok.

Copy link
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. May need some discussion around the init arguments.

If `astype` is an integer data type, please ensure that it is
sufficiently large to store encoded values. No checks are made and data
may become corrupted due to integer overflow if `astype` is too small.
Note also that the encoded data for each chunk includes the absolute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence beginning "Note also..." comes from the DeltaFilter IIRC and shouldn't apply here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped.


Notes
-----
If `astype` is an integer data type, please ensure that it is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this note applies here, at least not exactly. Probably what is more relevant is the casting mode used within numpy.astype, as it is possible to use different casting modes, at least some of which would prevent anything like integer overflow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, this was copy-pasted. I think we should have some note, but not the one here currently.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced with a better note.


codec_id = 'astype'

def __init__(self, dtype, astype=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth adding a casting argument here and storing that for use later within the call to numpy.astype.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about something like:

Notes
-----
This method returns a new Array object which is a view on the same underlying 
chunk data. Modifying any data via the view will also modify data in the original 
array. This is an experimental feature and care may be needed to avoid data 
corruption or unexpected loss of precision.

See Also
--------
Array.view

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So at least initially I like the idea of casting as an argument. However, my concern is this makes conversion asymmetric, but in a way that might not be obvious.

For example, say we have some int initially and we want to use astype to convert it to some float. No matter what conversion I chose the conversion from int to float works fine. However, if one tries to set a value in the view, this won't work unless the casting is unsafe.

We can come up with other types where a different casting constraint sets this threshold of whether it proceeds or not. Though instead of getting an unclear error about type conversion, maybe the better parameter instead of casting is simply whether we make the view read_only or not.

What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I took most of the original message, but have tweaked it a bit as I have disabled writing on the view provided by astype.

enc = np.empty_like(arr, dtype=self.dtype)

# Convert and copy
enc[...] = arr.astype(self.dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you had a casting property you could use that here, i.e., arr.astype(self.dtype, casting=self.casting).

copy_needed = True

# Convert and copy
dec[...] = enc.astype(self.astype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly could do enc.astype(self.astype, self.casting.

config['id'] = self.codec_id
config['dtype'] = self.dtype.str
config['astype'] = self.astype.str
return config
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you did add a casting property, would need to also include it in the config.


filters = []
if self._filters:
filters.extend(self._filters)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok to me.

dtype : dtype
Data type used to store the data.
astype : dtype, optional
Data type of the data read in.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On reflection there may be potential for some confusion here. In this filter currently dtype is the data type used for encoded data, and astype is the data type used for decoded data. This has come about naturally because the primary use case was decoding. However, these argument names are used the other way around on other filters, e.g., from the Delta filter::

    Parameters
    ----------
    dtype : dtype
        Data type to use for decoded data.
    astype : dtype, optional
        Data type to use for encoded data.

I think I'd suggest switching these two arguments around for consistency with other filters.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH when I saw Delta. The naming felt unclear and a bit confusing. That being said, I completely agree that having this opposite to Delta doesn't help either. Personally, I'd rather change the names of these to something else entirely like from_dtype and to_dtype or perhaps data_dtype and view_dtype. Though any names that make it clear what these dtypes are applied seem ok IMHO. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to encode_dtype and decode_dtype.

@jakirkham
Copy link
Member Author

So I responded in the comments to some of the big points. I'll try to summarize it here.

The names of the type arguments are not great in either order. Instead of flipping them, I'd prefer we come up with clearer names. This should avoid confusing alternatives to other filters or confusing what is going on with this filter.

Casting is an interesting idea. Though it mainly seems to set a constraint on backpropagation of data. Perhaps instead we should expose the read_only flag and let users set that. We could also default it to True, which avoids some surprises out of the box.

@alimanfoo
Copy link
Member

alimanfoo commented Dec 8, 2016 via email

@jakirkham
Copy link
Member Author

Ok, so I have renamed the types to encode_dtype and decode_dtype for just AsType. Does that work?

FWIW I'd rather not change __init__ param names for other filter classes

Of course, wasn't planning on it. Though maybe they can be updated at some future date when a breaking change is planned.

I'm not sure if this is helping at all, I guess just trying to explain these two different points of view.

Yes, it makes sense. Codecs in general focus on how to store new data; whereas, with astype we are thinking more about how to work with existing data.

Regarding casting...But it might be worth thinking ahead a bit here and doing a bit of future-proofing.

Yeah, I don't claim to have the right answer with casting. Just wanted to point out that astype in this context is a bit different than NumPy's because we are keeping a view of the old data.

My inclination would be to leave it alone until we get a better idea of how people wish to use this. We can even add a note that this method is experimental and subject to change in the future.

Separately it might be worth adding a read_only argument to Array.astype() so the user can ask for a read-only view.

Sure this seems totally reasonable. One could even go so far as setting read_only=True by default. Should avoid some surprises.

filters.extend(self._filters)
filters.insert(0, AsType(encode_dtype=self._dtype, decode_dtype=dtype))

return self.view(filters=filters, dtype=dtype, read_only=True)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I'm thinking about it. Honestly it seems pretty reasonable to me to just set read_only to True for now on the view. I am having trouble seeing a good use case for using this to write to disk. Especially as there are easier and clearer ways to do this. If someone asks, it is pretty easy to relax this constraint later. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add this avoids some of the concerns raised about using casting as this allows it to have the typical meaning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the hiatus. I'm very happy to set read_only=True by default.


codec_id = 'astype'

def __init__(self, encode_dtype, decode_dtype=None):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm starting to question my choice to leave decode_dtype as None by default. We were borrowing from Delta before, but Delta still does something if these are the same. However, for AsType this is an identity in that case. May try to drop this default.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@jakirkham
Copy link
Member Author

So I have convinced myself that making the view returned by astype simply read-only is a good thing. This is because of all the confusing and concerning points raised in our discussion when considering write support. Also if one wants to write data with a different type, there is already a good mechanism for that. Namely one can convert their data to the type used by the Zarr Array and write it in. This even still works when considering Zarr Arrays and this version of the astype function. Please let me know your thoughts on this.

If we agree on astype being read-only always, then I see no harm in adding a casting argument that behaves exactly the same as NumPy's. Further it's meaning should be just as clear. Looking at NumPy's other arguments for astype, most of them (e.g. subok and copy) really don't make sense here. So I suspect we won't want to support them.

However, the order argument actually does make sense particularly with Zarr. That said, I'm a little unsure if there is more than say just passing the order argument through. Also on the filter front, I could see value in that being its own filter independent of AsType (maybe AsType should be renamed or not?). The other reason I bring order up is this argument comes before casting in NumPy's astype method. So I don't know how this impacts support of casting near term. Would be curious to hear your thoughts on this as well.

ref: https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.astype.html

@alimanfoo
Copy link
Member

alimanfoo commented Dec 15, 2016 via email

Copy link
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies again for the hiatus. A few minor suggestions but looks good.

Parameters
----------
encode_dtype : dtype
Data type used to store the data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to change this to "Data type to use for encoded data."

encode_dtype : dtype
Data type used to store the data.
decode_dtype : dtype, optional
Data type of the data read in.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to change this to "Data type to use for decoded data."

>>> z = f.encode(y)
>>> z
array([100, 102, 104, 106, 108, 110, 112, 114, 116, 118], dtype=int8)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this example, can I suggest to use 'i4' instead of 'i8'. It's just that that 'i8' and 'int8' look similar and so there is potential for confusion for users less familiar with NumPy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively we could just use np.int8 instead of i1 and such to avoid any confusion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went ahead and switch to np.int8 and the like. Please let me know if this is acceptable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's great.

enc = np.empty_like(arr, dtype=self.encode_dtype)

# Convert and copy
enc[...] = arr.astype(self.encode_dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might mean two copies are being done here? I.e., could you remove a copy if you replaced lines 534 and 537 with just:

enc = arr.astype(self.encode_dtype)

...?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you just mean we don't need the np.empty call above? Agreed. Will drop it.


# handle output
if copy_needed:
out = _buffer_copy(dec, out)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is an opportunity to remove a data copy here if out is None but can't immediately see the best way right now. Not essential but maybe worth it if straightforward to do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't that happen already? We don't change out here so I think that just works, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, the logic is a bit tortuous here, I'm having trouble myself reconstructing everything after not looking at it for a while. I think this can actually be simplified as follows (which will also remove a memory copy in the case where out is None):

        # view encoded data as 1D array
        enc = _ndarray_from_buffer(buf, self.encode_dtype)

        # apply decoding
        dec = enc.astype(self.decode_dtype)

        # handle output
        out = _buffer_copy(dec, out)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I understand, are you suggesting just cutting the "setup decoded output" section basically? I think I copied it from the Delta filter. Though you are probably right that it is overkill.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this is now cleaned up.

def __repr__(self):
r = '%s(encode_dtype=%s' % (type(self).__name__, self.encode_dtype)
if self.decode_dtype != self.encode_dtype:
r += ', decode_dtype=%s' % self.decode_dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to show both encode_dtype and decode_dtype in repr even if they are identical, as both are mandatory for this filter.

@alimanfoo alimanfoo modified the milestone: v2.2 Dec 15, 2016
@jakirkham
Copy link
Member Author

Sorry for the long turnaround. Have been out for a bit.

Thanks for the feedback. Have implemented most of it verbatim. Please let me know if there is anything else.

@alimanfoo
Copy link
Member

Thanks, all looks great to me. OK to merge?

@alimanfoo
Copy link
Member

One last thing before I forget, could you add the new astype method into the API docs for the Array class, under docs/api/core.rst. Also could you add the new AsType codec class into the API docs under docs/api/codecs.rst.

@jakirkham
Copy link
Member Author

Good catch. Have added them. Also tried generating the documentation locally to make sure it shows up ok.

Side note: I don't know if this fits with your style of documentation, but one can generate these auto command lines automatically with a few lines of code.

@jakirkham
Copy link
Member Author

Thanks for the good review.

Nothing more needed on my end.

At some point we should revisit casting, but am happy enough with this change as is. Shouldn't be hard to add support for it later.

@alimanfoo
Copy link
Member

alimanfoo commented Jan 4, 2017 via email

@alimanfoo alimanfoo merged commit eef8940 into zarr-developers:master Jan 4, 2017
@alimanfoo
Copy link
Member

Thanks a lot for seeing this one through, it's a nice feature. Just to let you know, when I come to do other work for the 2.2 release I will refactor the AsType codec class upstream to the numcodecs package along with all other codec classes.

@jakirkham
Copy link
Member Author

Thanks for all the tips along the way.

Sure make sense. Let me know if you need any help refactoring it.

@jakirkham jakirkham deleted the astype_filter branch January 4, 2017 15:27
@alimanfoo alimanfoo mentioned this pull request Oct 24, 2017
4 tasks
@alimanfoo alimanfoo added enhancement New features or improvements release notes done Automatically applied to PRs which have release notes. labels Nov 20, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New features or improvements release notes done Automatically applied to PRs which have release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants