-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Add support for NETCDF4_CLASSIC to h5netcdf engine #10686
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…th h5netcdf engine
…ys with NETCDF4_CLASSIC
…cdf4 and h5netcdf
Xarray currently doesn't have any logic to build these metadata attributes. Currently this is all handled in h5netcdf. We should also make sure that trying to use NetCDF4-only features (e.g., groups) results in an error. |
The last commit uses h5dump to display differences between the expected and actual content of the HDF5 file. I was also able to add a I can try to raise an error if groups are used. The remaining differences are related to the SUPERBLOCK version , the
|
If you really want to get metadata attributes and precise HDF5 types right, that should all be handled in h5netcdf. I think that's also the right place for h5dump tests. In Xarray, all we should be doing for NETCDF4_CLASSIC is coercing some dtypes (using Xarray's encoders) to NetCDF3 compatible types. |
@huard Thanks for pushing this! For the superblock issue, please add kwarg For the |
@kmuehlbauer Thanks for the references, this is really helpful ! I'll remove the low-level stuff from this branch ( |
…o `netCDF4_.get_datatype` skips required conversions. Remove global attribute from create_test_data because it impacts other tests in other files.
This is ready for review. While I added a test doing a roundtrip between netCDF4 and h5netcdf CLASSIC format, it does check how files are actually written inside the HDF5 file, just that they can be written and read consistently by xarray. Non-standard reading rules can hide non-standard writing rules. I'm planning to add "binary compatibility" tests within |
if isinstance(value, bytes): | ||
value = np.bytes_(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this special logic only for converting bytes? This seems unrelated to what we need for NETCDF4_CLASSIC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make sure strings are written as NC_CHAR
, and not NC_STRING
. See https://engee.com/helpcenter/stable/en/julia/NetCDF/strings.html
This is in fact the detail that our third party software in C++ choked on. The netCDF C library has both nc_get_att_text
and nc_get_att_string
functions. Calling nc_get_att_text
on an NC_STRING
raises an error.
def test_string_attributes_stored_as_char(self, tmp_path): | ||
import h5netcdf | ||
|
||
original = Dataset(attrs={"foo": "bar"}) | ||
store_path = tmp_path / "tmp.nc" | ||
original.to_netcdf(store_path, engine=self.engine, format=self.file_format) | ||
with h5netcdf.File(store_path, "r") as ds: | ||
# Check that the attribute is stored as a char array | ||
assert ds._h5file.attrs["foo"].dtype == np.dtype("S3") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NumPy's S
dtype actually corresponds to bytes
, not str
. I don't think we want to use it for storing attributes in general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using fixed width chars replicates the behavior of the netCDF4 backend for the CLASSIC format. Again, this has to do with the NC_CHAR
vs NC_STRING
formats.
Sticking as close as possible to netCDF4 output increases my confidence that the h5netcdf outputs will be compatible with 3rd party software expecting the CLASSIC format.
xarray/tests/test_backends.py
Outdated
def test_group_fails(self): | ||
# Check writing group data fails with CLASSIC format | ||
original = create_test_data() | ||
with pytest.raises(ValueError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please test the full error message (or a significant fraction) using the match
argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
xarray/tests/test_backends.py
Outdated
@@ -4759,7 +4806,7 @@ def test_encoding_unlimited_dims(self) -> None: | |||
@requires_scipy | |||
def test_roundtrip_via_bytes(self) -> None: | |||
original = create_test_data() | |||
netcdf_bytes = original.to_netcdf() | |||
netcdf_bytes = original.to_netcdf(engine="scipy") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is indeed mistakenly using h5netcdf, not scipy (which will be fixed by #10624), but the fact that it appears you needed to change it to get tests passing is concerning here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed it.
The error was due to an earlier version of the PR badly managing the logic for the format=None
case. I fixed it but kept the bug fix in not knowing there was an upcoming fix elsewhere.
if format is not None and Version(h5netcdf.__version__) > Version("1.6.4"): | ||
kwargs["format"] = format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has this been in fact added to h5netcdf yet?
We should figure out the API h5netcdf will accept first before using it in xarray.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No change has been done in h5netcdf
.
This PR assumes that h5netcdf
knows nothing of the format argument for now, but that the next version will.
The plan is then to go inside h5netcdf and make a PR to support the CLASSIC format without breaking xarray tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But indeed, I'm assuming h5netcdf
will accept a PR adding the format
argument to their API, and that it will be merged before the next release. This might be over-optimistic and I'm happy to follow suggestions here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really think we should finish the API on the h5netcdf side first. That eliminates the non-zero risk that h5netcdf releases a new version before your PR to h5netcdf lands, or that h5netcdf settles on a different API for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@huard I agree with @shoyer and I'm gladly supporting a PR over at h5netcdf.
We should move fast, as a major version change is lurking around the corner. The integration of pyfive is almost ready. We could add in the NETCDF4_CLASSIC changes before or on top of the pyfive changes and release this together as h5netcdf v2.0.0. That way we could also prevent possible deprecation cycle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, see h5netcdf/h5netcdf#283
if format == "NETCDF4_CLASSIC" and group is not None: | ||
raise ValueError("Cannot create sub-groups in `NETCDF4_CLASSIC` format.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does h5netcdf give a suitable error message here already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
h5netcdf.File
does not even have a format
argument, so no.
@@ -392,7 +421,7 @@ def prepare_variable( | |||
nc4_var = self.ds[name] | |||
|
|||
for k, v in attrs.items(): | |||
nc4_var.attrs[k] = v | |||
nc4_var.attrs[k] = self.convert_string(v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is doing the variable attribute conversion twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This here is for variable attributes only. The set_attribute
above only operates on global attributes.
xarray/backends/h5netcdf_.py
Outdated
|
||
return _encode_nc4_variable(variable, name=name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put this inside an else
clause so the logic is clearer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, some linters complain about the extra else, but agree this is clearer.
def set_attribute(self, key, value): | ||
value = self.convert_string(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure you need a helper function here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convert_string
function is called from two different places (variable attributes and global attributes), hence the helper function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review. Made the suggested changes, but I'm afraid the string attributes need to be saved as fixed width char arrays to be compliant with the CLASSIC file format.
if format == "NETCDF4_CLASSIC" and group is not None: | ||
raise ValueError("Cannot create sub-groups in `NETCDF4_CLASSIC` format.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
h5netcdf.File
does not even have a format
argument, so no.
if format is not None and Version(h5netcdf.__version__) > Version("1.6.4"): | ||
kwargs["format"] = format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No change has been done in h5netcdf
.
This PR assumes that h5netcdf
knows nothing of the format argument for now, but that the next version will.
The plan is then to go inside h5netcdf and make a PR to support the CLASSIC format without breaking xarray tests.
if isinstance(value, bytes): | ||
value = np.bytes_(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make sure strings are written as NC_CHAR
, and not NC_STRING
. See https://engee.com/helpcenter/stable/en/julia/NetCDF/strings.html
This is in fact the detail that our third party software in C++ choked on. The netCDF C library has both nc_get_att_text
and nc_get_att_string
functions. Calling nc_get_att_text
on an NC_STRING
raises an error.
def set_attribute(self, key, value): | ||
value = self.convert_string(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convert_string
function is called from two different places (variable attributes and global attributes), hence the helper function.
xarray/backends/h5netcdf_.py
Outdated
|
||
return _encode_nc4_variable(variable, name=name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, some linters complain about the extra else, but agree this is clearer.
@@ -392,7 +421,7 @@ def prepare_variable( | |||
nc4_var = self.ds[name] | |||
|
|||
for k, v in attrs.items(): | |||
nc4_var.attrs[k] = v | |||
nc4_var.attrs[k] = self.convert_string(v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This here is for variable attributes only. The set_attribute
above only operates on global attributes.
def test_string_attributes_stored_as_char(self, tmp_path): | ||
import h5netcdf | ||
|
||
original = Dataset(attrs={"foo": "bar"}) | ||
store_path = tmp_path / "tmp.nc" | ||
original.to_netcdf(store_path, engine=self.engine, format=self.file_format) | ||
with h5netcdf.File(store_path, "r") as ds: | ||
# Check that the attribute is stored as a char array | ||
assert ds._h5file.attrs["foo"].dtype == np.dtype("S3") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using fixed width chars replicates the behavior of the netCDF4 backend for the CLASSIC format. Again, this has to do with the NC_CHAR
vs NC_STRING
formats.
Sticking as close as possible to netCDF4 output increases my confidence that the h5netcdf outputs will be compatible with 3rd party software expecting the CLASSIC format.
xarray/tests/test_backends.py
Outdated
def test_group_fails(self): | ||
# Check writing group data fails with CLASSIC format | ||
original = create_test_data() | ||
with pytest.raises(ValueError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
xarray/tests/test_backends.py
Outdated
@@ -4759,7 +4806,7 @@ def test_encoding_unlimited_dims(self) -> None: | |||
@requires_scipy | |||
def test_roundtrip_via_bytes(self) -> None: | |||
original = create_test_data() | |||
netcdf_bytes = original.to_netcdf() | |||
netcdf_bytes = original.to_netcdf(engine="scipy") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed it.
The error was due to an earlier version of the PR badly managing the logic for the format=None
case. I fixed it but kept the bug fix in not knowing there was an upcoming fix elsewhere.
Added logic in the
h5netcdf
engine to write pseudo NETCDF4_CLASSIC files, reusing encoding logic used by the netcdf4` engine.The files generated with the PR using the latest
h5netcdf
release (1.6.4) won't be recognized by third party software as genuine NETCDF4_CLASSIC files, in part because they have no_nc3_strict
hidden global attribute. There are other differences with netCDF4 generated files, including string attributes padding, how_FillValue
is stored, etc. Changes toh5netcdf
will be necessary to make netCDF files fully compliant with the CLASSIC format.whats-new.rst