Skip to content

Incorrect default fill value causes byte arrays to become numeric when write_empty_chunks=False #965

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
benjeffery opened this issue Feb 15, 2022 · 16 comments · Fixed by #1001

Comments

@benjeffery
Copy link
Contributor

Minimal example:

import zarr
a = zarr.create((1,), dtype=bytes)
a[0] = b''
assert a[0] == b''

Traceback (most recent call last):
  File "example.py", line 5, in <module>
    assert a[0] == b''
AssertionError

The value of a[0] is actually 0, when it should be b''.

Found by one of our users at tskit-dev/tsinfer#628 this bug was introduced in the latest release (v2.11.0) in this commit: f461eb7 when the default for write_empty_chunks was changed to False. The default fill_value for arrays created via zarr.creation.create is 0, so when an empty, unwritten chunk is re-created the previous value b'' becomes 0. I assume this fill_value should be None.

@jakirkham
Copy link
Member

cc @d-v-b @jni (for vis)

@jni
Copy link
Contributor

jni commented Feb 15, 2022

To clarify: the default should be "None", but behind the scenes, there should be a different default value for each dtype? Is there a complete mapping of dtype->default fill value for all dtypes supported by zarr-python?

@jakirkham
Copy link
Member

Maybe this comment ( #966 (comment) ) answers some questions?

As to what this should be, there is a case for making it None here as it is elsewhere, but there are tests that check this is 0

@vyasr
Copy link
Contributor

vyasr commented Mar 3, 2022

A similar underlying issue also manifests in an even more surprising way for object arrays. I say similar because I suspect that it's not exactly the same issue, but rather something closely related: my guess is that the truthiness of an object is being used to determine whether a chunk is empty or not, and for object arrays that may give in incorrect results since it results in a loss of information on what type of object is present. I don't think @jni's suggestion of maintaining a mapping of dtype->default fill values is feasible for arbitrary object arrays, so in that case Zarr may need to bite the bullet and always write out "empty" chunks. That's just a guess though, and maybe it is in fact the same issue described here. I'm happy to open a separate issue if others think it's worthwhile.

import zarr
import numcodecs

dataset = zarr.create(
    shape=1, dtype="object", object_codec=numcodecs.JSON()
)
dataset[0] = {}
print(dataset[0])

dataset = zarr.create(
    shape=1, dtype="object", object_codec=numcodecs.JSON(), fill_value={}
)
dataset[0] = {}
print(dataset[0])

Output on zarr 2.10.3:

{}
{}

Output on zarr 2.11.0:

0
{}

@jni
Copy link
Contributor

jni commented Mar 4, 2022

I don't think @jni's suggestion of maintaining a mapping of dtype->default fill values is feasible for arbitrary object arrays

Ok, so we could have a dict, if the dtype is in the dict, use it, if not, write_empty_chunks goes back to True?

@d-v-b
Copy link
Contributor

d-v-b commented Mar 4, 2022

Ok, so we could have a dict, if the dtype is in the dict, use it, if not, write_empty_chunks goes back to True?

I could see this working if we widen the type of the write_empty_chunks to Union[bool, Literal['auto']], where auto does the behavior you describe

@joshmoore
Copy link
Member

With @benjeffery's #966 closed, is anyone working on one of the suggestions above?

@jni
Copy link
Contributor

jni commented Mar 10, 2022

Not me I'm afraid... 😞

@jni
Copy link
Contributor

jni commented Mar 20, 2022

@joshmoore I'm not sure how much fire this is causing. If reverting the default value change from #853 is a quick fix, perhaps we should go with that until this issue can be resolved "the right way". What do you think?

@joshmoore
Copy link
Member

Ah, interesting. If we don't hear from any objections or alternative proposals, I'd be happy to push that out quickly.

@vyasr
Copy link
Contributor

vyasr commented Mar 25, 2022

I'm happy with a temporary reversion of the default. Making write_empty_chunks=False makes sense to me, but not at the expense of the ability to faithfully represent non-numeric values. My 2c: until a suitable solution is found I think it's fine to require users who want the space savings to opt in by setting the flag.

@jakirkham
Copy link
Member

jakirkham commented Mar 25, 2022

Is it only with object or bytes?

If so, maybe there is a middle path where we do a partial reversion. IOW set write_empty_chunks=None if we are working with objects or bytes we replace None with False otherwise we set write_empty_chunks to True.

This way we can keep the performance benefits for those interested in them without causing issues for types that are poorly handled currently. We could always improve this over time by moving the remaining types to True once handled correctly.

Thoughts? 🙂

@vyasr
Copy link
Contributor

vyasr commented Mar 31, 2022

I think that's effectively @jni's solution of using a dict of known conversions, right? That sounded like a good solution, but in lieu of having that implemented soon I figured reverting the change would be a near-0 energy barrier way to avoid breaking workflows until someone got around to implementing the proper solution. Just my opinion though, I'm not a stakeholder on this project 🙂

@tomwhite
Copy link
Contributor

tomwhite commented Apr 1, 2022

My 2c: until a suitable solution is found I think it's fine to require users who want the space savings to opt in by setting the flag.

I agree with this. That's not to say that it can't be made the default in the future, but it probably needs time to bed in.

(I've also encountered problems with this change which mean I need to pin dependencies to avoid Zarr 2.11.0. See zarr-developers/zarr-specs#194)

@jakirkham
Copy link
Member

Fair enough if someone would like to send a PR to change to the default value, happy to review

@vyasr
Copy link
Contributor

vyasr commented Apr 1, 2022

@jni @joshmoore @jakirkham I created a patch in #1001.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants