Incorrect default fill value causes byte arrays to become numeric when write_empty_chunks=False
#965
Comments
To clarify: the default should be "None", but behind the scenes, there should be a different default value for each dtype? Is there a complete mapping of dtype->default fill value for all dtypes supported by zarr-python?
Maybe this comment ( #966 (comment) ) answers some questions? As to what this should be, there is a case for making it
A similar underlying issue also manifests in an even more surprising way for object arrays. I say similar because I suspect that it's not exactly the same issue, but rather something closely related: my guess is that the truthiness of an object is being used to determine whether a chunk is empty or not, and for object arrays that may give incorrect results, since it loses the information about what type of object is present. I don't think @jni's suggestion of maintaining a mapping of dtype->default fill values is feasible for arbitrary object arrays, so in that case Zarr may need to bite the bullet and always write out "empty" chunks. That's just a guess though, and maybe it is in fact the same issue described here. I'm happy to open a separate issue if others think it's worthwhile.
Output on zarr 2.10.3:
Output on zarr 2.11.0:
Ok, so we could have a dict, if the dtype is in the dict, use it, if not,
I could see this working if we widen the type of the
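For readers following along, the dict idea discussed above could look roughly like the sketch below. Everything in it is hypothetical: the name `DEFAULT_FILL_BY_KIND`, the helper `default_fill_value`, the choice to key on NumPy dtype kind codes, and the fallback to `None` for unknown kinds are all assumptions made for illustration. The fallback that was actually proposed is not visible in this thread, and none of this is zarr-python's API.

```python
# Hypothetical mapping from NumPy dtype "kind" codes to a default fill value.
# These names and values are illustrative assumptions, not part of zarr-python.
DEFAULT_FILL_BY_KIND = {
    "b": False,  # boolean
    "i": 0,      # signed integer
    "u": 0,      # unsigned integer
    "f": 0.0,    # floating point
    "S": b"",    # fixed-length byte string
    "U": "",     # fixed-length unicode string
}


def default_fill_value(kind):
    """Return the assumed default fill value for a dtype kind code.

    Kinds that are not in the mapping (for example "O" for object arrays)
    fall back to None here. That fallback is an assumption of this sketch.
    """
    return DEFAULT_FILL_BY_KIND.get(kind)


print(default_fill_value("S"))
print(default_fill_value("O"))
```

Keying on the kind code keeps the table small, but it cannot say anything useful about object arrays, which is the case raised earlier in the thread.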
With @benjeffery's #966 closed, is anyone working on one of the suggestions above?
Not me I'm afraid... 😞
@joshmoore I'm not sure how much fire this is causing. If reverting the default value change from #853 is a quick fix, perhaps we should go with that until this issue can be resolved "the right way". What do you think?
Ah, interesting. If we don't hear any objections or alternative proposals, I'd be happy to push that out quickly.
I'm happy with a temporary reversion of the default. Making
Is it only with If so, maybe there is a middle path where we do a partial reversion. IOW set This way we can keep the performance benefits for those interested in them without causing issues for types that are poorly handled currently. We could always improve this over time by moving the remaining types to Thoughts? 🙂
I think that's effectively @jni's solution of using a dict of known conversions, right? That sounded like a good solution, but in lieu of having that implemented soon I figured reverting the change would be a near-0 energy barrier way to avoid breaking workflows until someone got around to implementing the proper solution. Just my opinion though, I'm not a stakeholder on this project 🙂
I agree with this. That's not to say that it can't be made the default in the future, but it probably needs time to bed in. (I've also encountered problems with this change which mean I need to pin dependencies to avoid Zarr 2.11.0. See zarr-developers/zarr-specs#194)
Fair enough. If someone would like to send a PR to change the default value, happy to review.
@jni @joshmoore @jakirkham I created a patch in #1001.
Minimal example:
The value of `a[0]` is actually `0`, when it should be `b''`.

Found by one of our users at tskit-dev/tsinfer#628, this bug was introduced in the latest release (`v2.11.0`) in commit f461eb7, when the default for `write_empty_chunks` was changed to `False`. The default `fill_value` for arrays created via `zarr.creation.create` is `0`, so when an empty, unwritten chunk is re-created the previous value `b''` becomes `0`. I assume this `fill_value` should be `None`.