netCDF4-python writes string (unicode) attributes as 1-d arrays, not scalars #448

shoyer opened this issue Aug 17, 2015 · 3 comments

shoyer commented Aug 17, 2015

This code writes a single string attribute to an HDF5 file using netCDF4:

# Python 3.4.3
In [1]: import netCDF4

In [3]: ds = netCDF4.Dataset('/Users/shoyer/Downloads/global-attr.nc', 'w')

In [4]: ds.units = 'days since 1900'

In [5]: ds.close()

In [7]: !h5dump /Users/shoyer/Downloads/global-attr.nc
HDF5 "/Users/shoyer/Downloads/global-attr.nc" {
GROUP "/" {
   ATTRIBUTE "units" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): "days since 1900"
      }
   }
}
}

Here's code to do the same thing with h5py:

In [8]: import h5py

In [9]: f = h5py.File('/Users/shoyer/Downloads/global-attr-h5py.nc')

In [10]: f.attrs['units'] = 'days since 1900'

In [11]: f.close()

In [12]: !h5dump /Users/shoyer/Downloads/global-attr-h5py.nc
HDF5 "/Users/shoyer/Downloads/global-attr-h5py.nc" {
GROUP "/" {
   ATTRIBUTE "units" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "days since 1900"
      }
   }
}
}

As the h5dump output shows, netCDF4-python writes the attribute with a "simple dataspace", i.e. as an array (here a 1-d array with a single element), whereas h5py writes it with a scalar dataspace:
https://www.hdfgroup.org/HDF5/doc/UG/UG_frame12Dataspaces.html

In fact, a 1-element array is exactly what you get if you read the file created by netCDF4-python with h5py (to netCDF4-python and ncdump, the two files appear identical):

In [13]: f = h5py.File('/Users/shoyer/Downloads/global-attr.nc')

In [14]: f.attrs['units']
Out[14]: array([b'days since 1900'], dtype=object)
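
For readers who hit this from the h5py side, here is a minimal sketch of normalizing such a value after reading it. `unwrap_attr` is a hypothetical helper name, not part of h5py or netCDF4-python, and decoding as UTF-8 is an assumption (it is safe for ASCII content like the example here).

```python
def unwrap_attr(value):
    """Return a one-element 1-d attribute as a scalar, decoding bytes to str.

    Hypothetical helper: not part of h5py or netCDF4-python.
    """
    shape = getattr(value, "shape", None)
    if shape is not None:
        # NumPy-like array: only collapse the exact shape (1,) case.
        if tuple(shape) == (1,):
            value = value[0]
    elif isinstance(value, (list, tuple)) and len(value) == 1:
        # Plain Python sequence holding a single element.
        value = value[0]
    if isinstance(value, bytes):
        # Assumption: the stored bytes are valid UTF-8 (true for ASCII).
        value = value.decode("utf-8")
    return value
```

With the file above, `unwrap_attr(f.attrs['units'])` would be expected to give a plain `str`, and a value that is already a scalar string passes through unchanged. Arrays with more than one element are left alone on purpose.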

I believe netCDF4-python should be writing the attribute as a scalar, similar to what it does if you write bytes (or a string on Python 2):

# python 2.7
In [11]: ds = netCDF4.Dataset('/Users/shoyer/Downloads/global-attr-py27.nc', 'w')

In [12]: ds.bytes_str = 'days since 1900'

In [13]: ds.unicode_str = u'days since 1900'

In [14]: ds.close()

In [15]: !h5dump /Users/shoyer/Downloads/global-attr-py27.nc
HDF5 "/Users/shoyer/Downloads/global-attr-py27.nc" {
GROUP "/" {
   ATTRIBUTE "bytes_str" {
      DATATYPE  H5T_STRING {
         STRSIZE 15;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "days since 1900"
      }
   }
   ATTRIBUTE "unicode_str" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): "days since 1900"
      }
   }
}
}
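
To compare dataspaces across many attributes without reading the dump by eye, the text can be scanned with a couple of regular expressions. This is only a sketch: it assumes the `h5dump` layout shown in this thread (an `ATTRIBUTE "name"` line followed later by a `DATASPACE` line), and `attribute_dataspaces` is a hypothetical name.

```python
import re

# Patterns match the h5dump text layout shown in this thread (an assumption).
ATTR_RE = re.compile(r'ATTRIBUTE\s+"([^"]+)"')
SPACE_RE = re.compile(r'DATASPACE\s+(SCALAR|SIMPLE|NULL)')


def attribute_dataspaces(dump_text):
    """Map each attribute name in h5dump text to its dataspace kind."""
    result = {}
    current = None  # name of the attribute whose DATASPACE line we await
    for line in dump_text.splitlines():
        match = ATTR_RE.search(line)
        if match:
            current = match.group(1)
            continue
        match = SPACE_RE.search(line)
        if match and current is not None:
            result[current] = match.group(1)
            current = None
    return result
```

Fed the Python 2.7 dump above, this should report `bytes_str` as `SCALAR` and `unicode_str` as `SIMPLE`, which is the discrepancy this issue is about.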

Given that netCDF4-python is simply using the netCDF-C library's nc_put_att_string function, this may very well be a bug upstream in the netCDF-C library.

jswhit commented Aug 17, 2015

It seems that when nc_put_att_text is used, the result is stored as a scalar in the HDF5 file. If nc_put_att_string is used (which happens when the string is unicode), a simple dataspace is created. Here's the relevant code snippet in _netCDF4.pyx:

    if value_arr.dtype.char == 'U' and not is_netcdf3:
        # a unicode string, use put_att_string (if NETCDF4 file).
        ierr = nc_put_att_string(grp._grpid, varid, attname, 1, &datstring)
    else:
        ierr = nc_put_att_text(grp._grpid, varid, attname, lenarr, datstring)

I think you are right that this is due to how nc_put_att_string is implemented in the C library. It seems to be designed to write arrays of variable length strings.
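
As a rough illustration of that branch, here is a simplified pure-Python model of which C function ends up being called for a string attribute. It is not the library's code: `choose_put_att` is a hypothetical name, it assumes Python 3 (where `str` corresponds to the `'U'` dtype and `bytes` to a byte string), and it ignores non-string attributes entirely.

```python
def choose_put_att(value, is_netcdf3=False):
    """Name the C function the quoted branch would pick for a string value.

    Simplified model (assumes Python 3); not part of netCDF4-python.
    """
    if isinstance(value, str) and not is_netcdf3:
        # Unicode string in a NETCDF4 file: written as a 1-element
        # array of variable-length strings (simple dataspace).
        return "nc_put_att_string"
    if isinstance(value, (str, bytes)):
        # Byte strings, and any string in a netCDF3 file: written as text,
        # which shows up as a scalar in the HDF5 file.
        return "nc_put_att_text"
    raise TypeError("this sketch only models string attributes")
```

If this model is right, a possible workaround until the behavior changes would be to pass bytes, e.g. `ds.units = 'days since 1900'.encode('ascii')`, which should take the `nc_put_att_text` path like the `bytes_str` attribute in the Python 2.7 dump.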

shoyer commented Aug 19, 2015

Should I open a bug report for the C library, then?

jswhit commented Aug 20, 2015

Sure, wouldn't hurt. At the very least, maybe we will find out why they chose to do it that way.
