-
-
Notifications
You must be signed in to change notification settings - Fork 330
Fix fill_value serialization issues #2802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
dcherian
merged 18 commits into
zarr-developers:main
from
moradology:fix/v2meta_fill_serialization
Apr 4, 2025
Merged
Changes from 9 commits
Commits
Show all changes
18 commits
Select commit
Hold shift + click to select a range
fd43cbf
Fix fill_value serialization of NaN
moradology 6301b15
Round trip serialization for array metadata v2/v3
moradology 260cfbc
Merge branch 'main' into fix/v2meta_fill_serialization
dcherian 19d61c9
Unify metadata v2 fill value parsing
moradology e3c7659
Merge branch 'main' into fix/v2meta_fill_serialization
moradology 29faec3
Test structured fill_value parsing
moradology 54920df
Merge branch 'main' into fix/v2meta_fill_serialization
moradology 921d2fa
Merge branch 'main' into fix/v2meta_fill_serialization
moradology a59f9ac
Merge branch 'main' into fix/v2meta_fill_serialization
moradology 042d815
Remove redundancies, fix integral handling
moradology 1388a3b
Reorganize structured fill parsing
moradology 66aa4d5
Merge branch 'main' into fix/v2meta_fill_serialization
moradology c46b27c
Bump up hypothesis deadline
moradology 11e2520
Remove hypothesis deadline
moradology 45606ec
Merge branch 'main' into fix/v2meta_fill_serialization
moradology 097faa4
Merge branch 'main' into fix/v2meta_fill_serialization
moradology 5c6166f
Update tests/test_v2.py
moradology 3651f4b
Merge branch 'main' into fix/v2meta_fill_serialization
dcherian File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
Fix `fill_value` serialization for `NaN` in `ArrayV2Metadata` and add property-based testing of round-trip serialization |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,24 +2,25 @@ | |
|
||
import base64 | ||
import warnings | ||
from collections.abc import Iterable | ||
from collections.abc import Iterable, Mapping, Sequence | ||
from enum import Enum | ||
from functools import cached_property | ||
from typing import TYPE_CHECKING, TypedDict, cast | ||
from typing import TYPE_CHECKING, Any, TypedDict, cast | ||
|
||
import numcodecs.abc | ||
|
||
from zarr.abc.metadata import Metadata | ||
|
||
if TYPE_CHECKING: | ||
from typing import Any, Literal, Self | ||
from typing import Literal, Self | ||
|
||
import numpy.typing as npt | ||
|
||
from zarr.core.buffer import Buffer, BufferPrototype | ||
from zarr.core.common import ChunkCoords | ||
|
||
import json | ||
import numbers | ||
from dataclasses import dataclass, field, fields, replace | ||
|
||
import numcodecs | ||
|
@@ -108,6 +109,29 @@ | |
return None | ||
|
||
def to_buffer_dict(self, prototype: BufferPrototype) -> dict[str, Buffer]: | ||
def _serialize_fill_value(fv: Any) -> JSON: | ||
moradology marked this conversation as resolved.
Show resolved
Hide resolved
|
||
if self.fill_value is None: | ||
pass | ||
elif self.dtype.kind in "SV": | ||
# There's a relationship between self.dtype and self.fill_value | ||
# that mypy isn't aware of. The fact that we have S or V dtype here | ||
# means we should have a bytes-type fill_value. | ||
fv = base64.standard_b64encode(cast(bytes, self.fill_value)).decode("ascii") | ||
moradology marked this conversation as resolved.
Show resolved
Hide resolved
|
||
elif isinstance(fv, np.datetime64): | ||
if np.isnat(fv): | ||
fv = "NaT" | ||
moradology marked this conversation as resolved.
Show resolved
Hide resolved
|
||
else: | ||
fv = np.datetime_as_string(fv) | ||
elif isinstance(fv, numbers.Real): | ||
float_fv = float(fv) | ||
if np.isnan(float_fv): | ||
fv = "NaN" | ||
elif np.isinf(float_fv): | ||
fv = "Infinity" if float_fv > 0 else "-Infinity" | ||
elif isinstance(fv, numbers.Complex): | ||
fv = [_serialize_fill_value(fv.real), _serialize_fill_value(fv.imag)] | ||
return cast(JSON, fv) | ||
|
||
def _json_convert( | ||
o: Any, | ||
) -> Any: | ||
|
@@ -146,41 +170,39 @@ | |
raise TypeError | ||
|
||
zarray_dict = self.to_dict() | ||
zarray_dict["fill_value"] = _serialize_fill_value(zarray_dict["fill_value"]) | ||
zattrs_dict = zarray_dict.pop("attributes", {}) | ||
json_indent = config.get("json_indent") | ||
return { | ||
ZARRAY_JSON: prototype.buffer.from_bytes( | ||
json.dumps(zarray_dict, default=_json_convert, indent=json_indent).encode() | ||
json.dumps( | ||
zarray_dict, default=_json_convert, indent=json_indent, allow_nan=False | ||
).encode() | ||
), | ||
ZATTRS_JSON: prototype.buffer.from_bytes( | ||
json.dumps(zattrs_dict, indent=json_indent).encode() | ||
json.dumps(zattrs_dict, indent=json_indent, allow_nan=False).encode() | ||
), | ||
} | ||
|
||
@classmethod | ||
def from_dict(cls, data: dict[str, Any]) -> ArrayV2Metadata: | ||
# make a copy to protect the original from modification | ||
def from_dict(cls, data: dict[str, JSON]) -> ArrayV2Metadata: | ||
# Make a copy to protect the original from modification. | ||
_data = data.copy() | ||
# check that the zarr_format attribute is correct | ||
# Check that the zarr_format attribute is correct. | ||
_ = parse_zarr_format(_data.pop("zarr_format")) | ||
dtype = parse_dtype(_data["dtype"]) | ||
|
||
if dtype.kind in "SV": | ||
fill_value_encoded = _data.get("fill_value") | ||
if fill_value_encoded is not None: | ||
fill_value = base64.standard_b64decode(fill_value_encoded) | ||
_data["fill_value"] = fill_value | ||
|
||
# zarr v2 allowed arbitrary keys here. | ||
# We don't want the ArrayV2Metadata constructor to fail just because someone put an | ||
# extra key in the metadata. | ||
# zarr v2 allowed arbitrary keys in the metadata. | ||
# Filter the keys to only those expected by the constructor. | ||
expected = {x.name for x in fields(cls)} | ||
# https://github.com/zarr-developers/zarr-python/issues/2269 | ||
# handle the renames | ||
expected |= {"dtype", "chunks"} | ||
|
||
# check if `filters` is an empty sequence; if so use None instead and raise a warning | ||
if _data["filters"] is not None and len(_data["filters"]) == 0: | ||
filters = _data.get("filters") | ||
if ( | ||
isinstance(filters, Sequence) | ||
and not isinstance(filters, (str, bytes)) | ||
and len(filters) == 0 | ||
): | ||
msg = ( | ||
"Found an empty list of filters in the array metadata document. " | ||
"This is contrary to the Zarr V2 specification, and will cause an error in the future. " | ||
|
@@ -191,18 +213,11 @@ | |
|
||
_data = {k: v for k, v in _data.items() if k in expected} | ||
|
||
return cls(**_data) | ||
return cls(**cast(Mapping[str, Any], _data)) | ||
moradology marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
def to_dict(self) -> dict[str, JSON]: | ||
zarray_dict = super().to_dict() | ||
|
||
if self.dtype.kind in "SV" and self.fill_value is not None: | ||
# There's a relationship between self.dtype and self.fill_value | ||
# that mypy isn't aware of. The fact that we have S or V dtype here | ||
# means we should have a bytes-type fill_value. | ||
fill_value = base64.standard_b64encode(cast(bytes, self.fill_value)).decode("ascii") | ||
zarray_dict["fill_value"] = fill_value | ||
|
||
_ = zarray_dict.pop("dtype") | ||
dtype_json: JSON | ||
# In the case of zarr v2, the simplest i.e., '|VXX' dtype is represented as a string | ||
|
@@ -300,7 +315,27 @@ | |
return data | ||
|
||
|
||
def parse_fill_value(fill_value: object, dtype: np.dtype[Any]) -> Any: | ||
def parse_structured_fill_value(fill_value: Any, dtype: np.dtype[Any]) -> Any: | ||
"""Handle structured dtype/fill value pairs""" | ||
try: | ||
if isinstance(fill_value, list): | ||
fill_value = tuple(fill_value) | ||
moradology marked this conversation as resolved.
Show resolved
Hide resolved
|
||
if isinstance(fill_value, tuple): | ||
fill_value = np.array([fill_value], dtype=dtype)[0] | ||
elif isinstance(fill_value, bytes): | ||
fill_value = np.frombuffer(fill_value, dtype=dtype)[0] | ||
elif isinstance(fill_value, str): | ||
decoded = base64.standard_b64decode(fill_value) | ||
fill_value = np.frombuffer(decoded, dtype=dtype)[0] | ||
else: | ||
fill_value = np.array(fill_value, dtype=dtype)[()] | ||
except Exception as e: | ||
msg = f"Fill_value {fill_value} is not valid for dtype {dtype}." | ||
raise ValueError(msg) from e | ||
return fill_value | ||
|
||
|
||
def parse_fill_value(fill_value: Any, dtype: np.dtype[Any]) -> Any: | ||
""" | ||
Parse a potential fill value into a value that is compatible with the provided dtype. | ||
|
||
|
@@ -317,13 +352,13 @@ | |
""" | ||
|
||
if fill_value is None or dtype.hasobject: | ||
# no fill value | ||
pass | ||
elif dtype.fields is not None: | ||
moradology marked this conversation as resolved.
Show resolved
Hide resolved
|
||
fill_value = parse_structured_fill_value(fill_value, dtype) | ||
elif not isinstance(fill_value, np.void) and fill_value == 0: | ||
# this should be compatible across numpy versions for any array type, including | ||
# structured arrays | ||
fill_value = np.zeros((), dtype=dtype)[()] | ||
|
||
elif dtype.kind == "U": | ||
# special case unicode because of encoding issues on Windows if passed through numpy | ||
# https://github.com/alimanfoo/zarr/pull/172#issuecomment-343782713 | ||
|
@@ -332,6 +367,18 @@ | |
raise ValueError( | ||
f"fill_value {fill_value!r} is not valid for dtype {dtype}; must be a unicode string" | ||
) | ||
elif dtype.kind in "SV" and isinstance(fill_value, str): | ||
moradology marked this conversation as resolved.
Show resolved
Hide resolved
|
||
fill_value = base64.standard_b64decode(fill_value) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. out of scope but does anyone know why we base64 encode scalars with dtype |
||
elif np.issubdtype(dtype, np.datetime64): | ||
if fill_value == "NaT": | ||
fill_value = np.array("NaT", dtype=dtype)[()] | ||
else: | ||
fill_value = np.array(fill_value, dtype=dtype)[()] | ||
elif dtype.kind == "c" and isinstance(fill_value, list) and len(fill_value) == 2: | ||
complex_val = complex(float(fill_value[0]), float(fill_value[1])) | ||
fill_value = np.array(complex_val, dtype=dtype)[()] | ||
elif dtype.kind in "f" and fill_value in {"NaN", "Infinity", "-Infinity"}: | ||
fill_value = np.array(fill_value, dtype=dtype)[()] | ||
else: | ||
try: | ||
if isinstance(fill_value, bytes) and dtype.kind == "V": | ||
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.