Skip to content

Conversation

huard
Copy link
Contributor

@huard huard commented Sep 3, 2025

Added logic in the h5netcdf engine to write pseudo NETCDF4_CLASSIC files, reusing encoding logic used by the netcdf4` engine.

The files generated with the PR using the latest h5netcdf release (1.6.4) won't be recognized by third party software as genuine NETCDF4_CLASSIC files, in part because they have no _nc3_strict hidden global attribute. There are other differences with netCDF4 generated files, including string attributes padding, how _FillValue is stored, etc. Changes to h5netcdf will be necessary to make netCDF files fully compliant with the CLASSIC format.

@shoyer
Copy link
Member

shoyer commented Sep 3, 2025

Xarray currently doesn't have any logic to build these metadata attributes. Currently this is all handled in h5netcdf.

We should also make sure that trying to use NetCDF4-only features (e.g., groups) results in an error.

@huard
Copy link
Contributor Author

huard commented Sep 3, 2025

The last commit uses h5dump to display differences between the expected and actual content of the HDF5 file. I was also able to add a _nc3_strict global attribute.

I can try to raise an error if groups are used.

The remaining differences are related to the SUPERBLOCK version , the STRPAD character, and the _FillValue. Not sure I'll be able to resolve those.

    SUPER_BLOCK {
  -    SUPERBLOCK_VERSION 0
  ?                       ^
  +    SUPERBLOCK_VERSION 2
  ?                       ^
...
         ATTRIBUTE "foo" {
             DATATYPE  H5T_STRING {
                STRSIZE 8;
  -             STRPAD H5T_STR_NULLPAD;
  ?                                ^^^
  +             STRPAD H5T_STR_NULLTERM;
  ?                                ^^^^
                CSET H5T_CSET_ASCII;
                CTYPE H5T_C_S1;
             }
...
          ATTRIBUTE "_FillValue" {
             DATATYPE  H5T_IEEE_F64LE
  -          DATASPACE  SCALAR
  +          DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
             DATA {
             (0): nan
             }
          }

@shoyer
Copy link
Member

shoyer commented Sep 3, 2025

If you really want to get metadata attributes and precise HDF5 types right, that should all be handled in h5netcdf. I think that's also the right place for h5dump tests.

In Xarray, all we should be doing for NETCDF4_CLASSIC is coercing some dtypes (using Xarray's encoders) to NetCDF3 compatible types.

@kmuehlbauer
Copy link
Contributor

kmuehlbauer commented Sep 3, 2025

@huard Thanks for pushing this!

For the superblock issue, please add kwarg libver="earliest" when opening the file for writing. This will create the file with superblock version 0 for maximum backwards compatibility.

For the NULLPAD vs. NULLTERM there is some reading material here PyTables/PyTables#264 and here h5netcdf/h5netcdf#116. This one would need to be implemented in h5netcdf, if need be.

@huard
Copy link
Contributor Author

huard commented Sep 4, 2025

@kmuehlbauer Thanks for the references, this is really helpful !

I'll remove the low-level stuff from this branch (_nc3_strict) and bring it into h5netcdf.

@huard
Copy link
Contributor Author

huard commented Sep 9, 2025

This is ready for review.

While I added a test doing a roundtrip between netCDF4 and h5netcdf CLASSIC format, it does check how files are actually written inside the HDF5 file, just that they can be written and read consistently by xarray. Non-standard reading rules can hide non-standard writing rules. I'm planning to add "binary compatibility" tests within h5netcdf.

Comment on lines +339 to +340
if isinstance(value, bytes):
value = np.bytes_(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this special logic only for converting bytes? This seems unrelated to what we need for NETCDF4_CLASSIC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure strings are written as NC_CHAR, and not NC_STRING. See https://engee.com/helpcenter/stable/en/julia/NetCDF/strings.html

This is in fact the detail that our third party software in C++ choked on. The netCDF C library has both nc_get_att_text and nc_get_att_string functions. Calling nc_get_att_text on an NC_STRING raises an error.

Comment on lines +4679 to +4687
def test_string_attributes_stored_as_char(self, tmp_path):
import h5netcdf

original = Dataset(attrs={"foo": "bar"})
store_path = tmp_path / "tmp.nc"
original.to_netcdf(store_path, engine=self.engine, format=self.file_format)
with h5netcdf.File(store_path, "r") as ds:
# Check that the attribute is stored as a char array
assert ds._h5file.attrs["foo"].dtype == np.dtype("S3")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumPy's S dtype actually corresponds to bytes, not str. I don't think we want to use it for storing attributes in general.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using fixed width chars replicates the behavior of the netCDF4 backend for the CLASSIC format. Again, this has to do with the NC_CHAR vs NC_STRING formats.

Sticking as close as possible to netCDF4 output increases my confidence that the h5netcdf outputs will be compatible with 3rd party software expecting the CLASSIC format.

def test_group_fails(self):
# Check writing group data fails with CLASSIC format
original = create_test_data()
with pytest.raises(ValueError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please test the full error message (or a significant fraction) using the match argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -4759,7 +4806,7 @@ def test_encoding_unlimited_dims(self) -> None:
@requires_scipy
def test_roundtrip_via_bytes(self) -> None:
original = create_test_data()
netcdf_bytes = original.to_netcdf()
netcdf_bytes = original.to_netcdf(engine="scipy")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is indeed mistakenly using h5netcdf, not scipy (which will be fixed by #10624), but the fact that it appears you needed to change it to get tests passing is concerning here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed it.

The error was due to an earlier version of the PR badly managing the logic for the format=None case. I fixed it but kept the bug fix in not knowing there was an upcoming fix elsewhere.

Comment on lines +218 to +219
if format is not None and Version(h5netcdf.__version__) > Version("1.6.4"):
kwargs["format"] = format
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this been in fact added to h5netcdf yet?

We should figure out the API h5netcdf will accept first before using it in xarray.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No change has been done in h5netcdf.

This PR assumes that h5netcdf knows nothing of the format argument for now, but that the next version will.

The plan is then to go inside h5netcdf and make a PR to support the CLASSIC format without breaking xarray tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But indeed, I'm assuming h5netcdf will accept a PR adding the format argument to their API, and that it will be merged before the next release. This might be over-optimistic and I'm happy to follow suggestions here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think we should finish the API on the h5netcdf side first. That eliminates the non-zero risk that h5netcdf releases a new version before your PR to h5netcdf lands, or that h5netcdf settles on a different API for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@huard I agree with @shoyer and I'm gladly supporting a PR over at h5netcdf.

We should move fast, as a major version change is lurking around the corner. The integration of pyfive is almost ready. We could add in the NETCDF4_CLASSIC changes before or on top of the pyfive changes and release this together as h5netcdf v2.0.0. That way we could also prevent possible deprecation cycle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, see h5netcdf/h5netcdf#283

Comment on lines +145 to +146
if format == "NETCDF4_CLASSIC" and group is not None:
raise ValueError("Cannot create sub-groups in `NETCDF4_CLASSIC` format.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does h5netcdf give a suitable error message here already?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

h5netcdf.File does not even have a format argument, so no.

@@ -392,7 +421,7 @@ def prepare_variable(
nc4_var = self.ds[name]

for k, v in attrs.items():
nc4_var.attrs[k] = v
nc4_var.attrs[k] = self.convert_string(v)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is doing the variable attribute conversion twice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This here is for variable attributes only. The set_attribute above only operates on global attributes.

Comment on lines 350 to 351

return _encode_nc4_variable(variable, name=name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put this inside an else clause so the logic is clearer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, some linters complain about the extra else, but agree this is clearer.

def set_attribute(self, key, value):
value = self.convert_string(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure you need a helper function here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convert_string function is called from two different places (variable attributes and global attributes), hence the helper function.

Copy link
Contributor Author

@huard huard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. Made the suggested changes, but I'm afraid the string attributes need to be saved as fixed width char arrays to be compliant with the CLASSIC file format.

Comment on lines +145 to +146
if format == "NETCDF4_CLASSIC" and group is not None:
raise ValueError("Cannot create sub-groups in `NETCDF4_CLASSIC` format.")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

h5netcdf.File does not even have a format argument, so no.

Comment on lines +218 to +219
if format is not None and Version(h5netcdf.__version__) > Version("1.6.4"):
kwargs["format"] = format
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No change has been done in h5netcdf.

This PR assumes that h5netcdf knows nothing of the format argument for now, but that the next version will.

The plan is then to go inside h5netcdf and make a PR to support the CLASSIC format without breaking xarray tests.

Comment on lines +339 to +340
if isinstance(value, bytes):
value = np.bytes_(value)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure strings are written as NC_CHAR, and not NC_STRING. See https://engee.com/helpcenter/stable/en/julia/NetCDF/strings.html

This is in fact the detail that our third party software in C++ choked on. The netCDF C library has both nc_get_att_text and nc_get_att_string functions. Calling nc_get_att_text on an NC_STRING raises an error.

def set_attribute(self, key, value):
value = self.convert_string(value)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convert_string function is called from two different places (variable attributes and global attributes), hence the helper function.

Comment on lines 350 to 351

return _encode_nc4_variable(variable, name=name)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, some linters complain about the extra else, but agree this is clearer.

@@ -392,7 +421,7 @@ def prepare_variable(
nc4_var = self.ds[name]

for k, v in attrs.items():
nc4_var.attrs[k] = v
nc4_var.attrs[k] = self.convert_string(v)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This here is for variable attributes only. The set_attribute above only operates on global attributes.

Comment on lines +4679 to +4687
def test_string_attributes_stored_as_char(self, tmp_path):
import h5netcdf

original = Dataset(attrs={"foo": "bar"})
store_path = tmp_path / "tmp.nc"
original.to_netcdf(store_path, engine=self.engine, format=self.file_format)
with h5netcdf.File(store_path, "r") as ds:
# Check that the attribute is stored as a char array
assert ds._h5file.attrs["foo"].dtype == np.dtype("S3")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using fixed width chars replicates the behavior of the netCDF4 backend for the CLASSIC format. Again, this has to do with the NC_CHAR vs NC_STRING formats.

Sticking as close as possible to netCDF4 output increases my confidence that the h5netcdf outputs will be compatible with 3rd party software expecting the CLASSIC format.

def test_group_fails(self):
# Check writing group data fails with CLASSIC format
original = create_test_data()
with pytest.raises(ValueError):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -4759,7 +4806,7 @@ def test_encoding_unlimited_dims(self) -> None:
@requires_scipy
def test_roundtrip_via_bytes(self) -> None:
original = create_test_data()
netcdf_bytes = original.to_netcdf()
netcdf_bytes = original.to_netcdf(engine="scipy")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed it.

The error was due to an earlier version of the PR badly managing the logic for the format=None case. I fixed it but kept the bug fix in not knowing there was an upcoming fix elsewhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support "NETCDF4_CLASSIC" format with engine h5netcdf
3 participants