Skip to content

Make create_array signatures consistent #2819

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Feb 12, 2025

This PR ensures that the various invocations of create_array are consistent. For reference, we have 4 ways to call create_array:

  • zarr.core.array.create_array (the actual function that does stuff)
  • zarr.api.synchronous.create_array (synchronous wrapper around the async function)
  • zarr.core.group.AsyncGroup.create_array (method on AsyncGroup class that invokes create_array)
  • zarr.core.group.Group.create_array (synchronous wrapper around the AsyncGroup method)

All of these functions should have consistent signatures, but in main they don't, and a big part of this is missing tests. So this PR adds some tests that check that certain pairs of functions have identical parameters (we can't force the return types to match, because the async functions will return coroutines). To make the tests pass, this PR also ensures that all of the invocations of create_array have parameters that are consistent.

I say "consistent" because, when invoking Group.create_array, that method does not take a store argument, because we have one already from the group instance. Also, Group.create_array was recently given an extra keyword argument (compressor) that we need to deprecate. So the test that compares zarr.core.array.create_array with zarr.core.group.AsyncGroup.create_array only checks that all of create_array parameters are present in the group method.

closes #2810

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Feb 12, 2025
@d-v-b
Copy link
Contributor Author

d-v-b commented Feb 12, 2025

worth noting that this changes some defaults (we were using a default fill_value of None in some places, and 0 in other places, for example). When in doubt, I picked the default that seemed more defaulty.

@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Feb 12, 2025
@LDeakin
Copy link
Member

LDeakin commented Feb 13, 2025

worth noting that this changes some defaults (we were using a default fill_value of None in some places, and 0 in other places, for example). When in doubt, I picked the default that seemed more default.

fill_value = None seems a more sensible default to me, with some docs somewhere explaining how this maps to a default fill value for each data type. 0 needs conversion for any non-integer data type, and this is exactly what caused #2792 (comment) in zarr-python 2.

@d-v-b
Copy link
Contributor Author

d-v-b commented Feb 13, 2025

worth noting that this changes some defaults (we were using a default fill_value of None in some places, and 0 in other places, for example). When in doubt, I picked the default that seemed more default.

fill_value = None seems a more sensible default to me, with some docs somewhere explaining how this maps to a default fill value for each data type. 0 needs conversion for any non-integer data type, and this is exactly what caused #2792 (comment) in zarr-python 2.

That works for me. In main I think we have a mix of 0 and None floating around, I think I chose 0 for this PR because it seemed like the majority, but I'm happy to switch to None here

@d-v-b
Copy link
Contributor Author

d-v-b commented Feb 27, 2025

with 642272d the default fill value is None for the array creation funcs.

Copy link
Contributor

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for this in general - I left one request for a more verbose changelog entry, and one question.

Copy link
Contributor

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changelog update, looks great to me. I left one more suggestion - I did a bit of testing and this doesn't seem to change the fill value that's set on the arrays (which is good, but I think worth reassuring users)

@d-v-b
Copy link
Contributor Author

d-v-b commented Feb 28, 2025

Thanks for the changelog update, looks great to me. I left one more suggestion - I did a bit of testing and this doesn't seem to change the fill value that's set on the arrays (which is good, but I think worth reassuring users)

how did you check this? because for zarr v2 data, it should make a difference -- None is a valid v2 fill value, and it's not the same as 0

@dstansby
Copy link
Contributor

I didn't check for v2 🙈 . If it does make a difference, that should be made very clear in the changelog (and we should think about putting this PR in a non-bugfix release since it's a breaking change?)

@LDeakin
Copy link
Member

LDeakin commented Feb 28, 2025

how did you check this? because for zarr v2 data, it should make a difference -- None is a valid v2 fill value, and it's not the same as 0

I didn't check for v2 🙈 . If it does make a difference, that should be made very clear in the changelog (and we should think about putting this PR in a non-bugfix release since it's a breaking change?)

I'd hope that since the below PRs, the metadata of the Zarr V2 data with fill_values=None will have a null fill value but actual stored fill values will match the V3 default rather than being completely undefined.

@d-v-b
Copy link
Contributor Author

d-v-b commented Feb 28, 2025

@LDeakin that was my impression as well. I think for v2 data the contents of the array metadata will be sensitive to the None / 0 distinction, but at runtime the actual array data should be the same in zarr-python. but this might vary in other implementations.

@LDeakin
Copy link
Member

LDeakin commented Feb 28, 2025

I think for v2 data the contents of the array metadata will be sensitive to the None / 0 distinction, but at runtime the actual array data should be the same in zarr-python. but this might vary in other implementations

Indeed! To be honest, I think null fill values in Zarr V2 were a bit of a misstep anyway, mainly because undefined values in partially written chunks cannot be distinguished from real data. This could be a good opportunity for zarr-python to stop writing null fill values altogether in Zarr V2 metadata and just write the default fill value. What do you think?

@d-v-b
Copy link
Contributor Author

d-v-b commented Feb 28, 2025

The nullable fill value is also clunky because zarr v2 supports the python Object dtype, and None is a valid instance of the Object dtype, which makes it impossible to distinguish "no fill value" from "the fill value is the literal None" for arrays with that dtype (or any other dtype that admits None as a value).

We would have to get a measure of the impact on users, but if it doesn't inconvenience a lot of people then I would definitely support dropping the creation of arrays with a null fill value, at least for numeric types

@dstansby dstansby self-requested a review March 4, 2025 19:52
Copy link
Contributor

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be a bit of a pain, but I'm going to put a request changes on this until we've resolved how we're going to bump the version number (and/or update our versioning policy) for this change. See #2819 (comment) for more context.

@d-v-b
Copy link
Contributor Author

d-v-b commented Mar 4, 2025

that works for me! would you like to open an issue about the versioning policy, or should I?

@dstansby
Copy link
Contributor

dstansby commented Mar 4, 2025

I can 👍

@d-v-b d-v-b moved this from Done to In progress in zarr-python release 3.1.0 Apr 20, 2025
@d-v-b d-v-b added this to the 3.1.0 milestone May 9, 2025
@d-v-b
Copy link
Contributor Author

d-v-b commented May 22, 2025

it turns out that the default fill value of 0 is problematic for #2874, so I'd like to elevate the priority of this PR. Given our new versioning policy, do you think this is good to go after I sort out the merge conflicts @dstansby?

@dstansby
Copy link
Contributor

Is changing the default value of the fill_value argument a breaking change? Either way that should be made clearer in the changelog entry - and if it is breaking, we need a release plan for 3.1 before merging this, so we don't accidentally end up in a place where stuff targeted for 3.1 ends up blocking bug fix releases for a long time while all the work for 3.1 is finished.

@d-v-b
Copy link
Contributor Author

d-v-b commented May 22, 2025

ultimately I think whether these changes are breaking depends on whether you believe it's a bug that different definitions of the same function have different default values.

@dstansby
Copy link
Contributor

Does changing the values affect any behaviour, or any metadata that gets written to zarr arrays or groups? If not, then I'd say this isn't breaking. If it does, then I'd say it is breaking.

@d-v-b
Copy link
Contributor Author

d-v-b commented May 22, 2025

I think there are two questions:

  • Is the current situation (the same function with different default values) bugged? I believe the answer to this is "yes".
  • Is it possible to fix that bug without changing behavior of zarr-python? I believe the answer to this is "no".

Because this library produces data, we will inevitably have bugs that affect how data is produced. Therefore, fixing those bugs will change how data is produced. If the bug fix takes us from "writes invalid data" to "writes valid data", that's a clear win. But if the bug fix takes us from "writes valid data" to "writes slightly different valid data", then things are less clear cut. I think we are in this second scenario.

For simplicity lets say that these changes are breaking, and thus a good fit for a 3.1 release. it might be time to think about a 3.1 branch!

@dstansby
Copy link
Contributor

I think my confusion/questions are stemming from the short explanation of what's changed in the changelog entry. I think the default argument would benefit from being being added to it's own breaking or bugfix entry (whichever is more appropriate), and expanding to explain how this changes what metadata is written to files, and because it's a different fill value, how that changes the data values returned from emtpy portions of an array with fill_value=None vs. fill_value=0.

@d-v-b
Copy link
Contributor Author

d-v-b commented May 23, 2025

how that changes the data values returned from emtpy portions of an array with fill_value=None vs. fill_value=0.

there is no change here, zarr-python uses an array of zeros (or the zeros-equivalent, per data type) in either case.

@dstansby
Copy link
Contributor

I still need to wrap my head around the fill value changes a bit - in the meantime, perhaps adding the new arguments could be done in a separate PR, since that's definitely backwards compatible and probably something we could put in a 3.0.x release?

@d-v-b
Copy link
Contributor Author

d-v-b commented May 27, 2025

i would rather not split this PR into pieces.

@dstansby
Copy link
Contributor

Ah, because that would ruin the automatic testing you added - gotcha. Will try and give this a proper re-review soon.

Copy link
Contributor

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 had another think about this, and I think it's good as long as the changelog is fixed so it's visible. I would like someone else to review this though, just to check the default changes to the fill value are okay.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since "chore type" changelong entries don't get rendered (e.g., see https://zarr.readthedocs.io/en/stable/release-notes.html#misc), I'd recommend splitting this up into a "feature" entry for the updated signatures, and a separate "feature" for the change in default fill_value arguments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

Group.create_array does not take a data keyword argument
3 participants