Conversation

@henriaidasso (Collaborator) commented Oct 8, 2025

Implementation of the RDPS metadata integration. The following metadata extensions have been added.

@@ -0,0 +1,326 @@
"""RPDS Extension Module."""
Collaborator

I don't understand why this is so much more complicated than the cordex extension.

Collaborator

Indeed.

This replicates a lot of things that other helpers or the main populator already handle, like extracting the THREDDS NCML details and defining the STAC core properties.

Unless RDPS/HRDPS have specific fields only applicable to them, there might not even be a need for a specific helper for them. Helpers are mostly for mapping STAC extensions to pydantic models. If all available helpers are sufficient, they should be used directly by the populator implementations rather than adding an intermediate helper.

@huard (Collaborator) commented Oct 9, 2025

Could you provide a bit more context on the choices you made?

  • I'm curious to understand why you decided not to reuse the base classes in extensions/base.py that are meant to avoid a lot of the boilerplate.
  • The THREDDSCatalogDataModel can be subclassed to get a lot done for free. If something is missing, I would prefer adding it to the THREDDSCatalogDataModel once rather than copy-pasting the same code into every extension (see the sketch after this list).
  • Also, in my mind, we create an extension when there is a controlled vocabulary to comply with. I don't think there is one for RDPS and HRDPS, so I see them as mere "implementations" relying on other extensions, such as cf, datacube and the new ones you're proposing.
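A minimal sketch of the subclassing idea above; the import path follows the extensions/base.py module mentioned in the first point, and RDPSDataModel plus its field are hypothetical names, not this PR's actual code:

# Hedged sketch: the import path and field names below are assumptions.
from STACpopulator.extensions.base import THREDDSCatalogDataModel


class RDPSDataModel(THREDDSCatalogDataModel):
    """RDPS items inherit NCML parsing and core STAC properties from the base."""

    # Declare only the fields genuinely specific to RDPS here.
    forecast_system: str = "RDPS"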

@mishaschwartz (Contributor) left a comment

I agree with @huard here. I'm not sure exactly how this best fits in with the rest of stac-populator, and I think some additional comments/docstrings/documentation would help explain your motivations.

Please see the comments below for specific questions as well.

pyproject.toml Outdated
"pyessv~=0.9",
"requests~=2.32",
"lxml~=6.0",
"dataclasses-json~=0.6.7"
Contributor

What is the advantage to using dataclasses-json as opposed to using pydantic to define and serialize data classes like we're already doing elsewhere?

Collaborator

Indeed, I don't see the advantage. Simply use pydantic with model_dump/model_dump_json as needed. We don't need yet another way to deal with these classes (the pydantic/pystac approaches already need a lot of workarounds).
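A minimal sketch of that approach (Contact here is a hypothetical model, not the actual contacts-extension class):

from pydantic import BaseModel


class Contact(BaseModel):
    name: str
    emails: list[str] = []


contact = Contact(name="Jane Doe")
contact.model_dump()       # {'name': 'Jane Doe', 'emails': []}
contact.model_dump_json()  # '{"name":"Jane Doe","emails":[]}'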

return {
    key: asset
    for key, asset in self.item.get_assets().items()
    if (service_type is ServiceType and service_type.value in asset.extra_fields)
Contributor

Won't this always be False since service_type is an enum value and ServiceType is the class?

Collaborator Author

You're right. I copied it from the cmip6.py extension module and didn't pay much attention. But yeah, isinstance() should be used here.

import pystac
from dataclasses_json import LetterCase, config, dataclass_json

SCHEMA_URI = "https://stac-extensions.github.io/contacts/v0.1.1/schema.json"
Contributor

Is this used anywhere?

Contributor

Also, if we already have the JSON schema for this extension, is there a reason why we're redefining all the various fields in the dataclasses as well?

I thought the dataclasses were meant to extend the json schemas, not to replicate the same information.

Collaborator Author

Ah! Didn't know that :) Makes sense.

Collaborator

The schema reference must be included in the Collection/Item stac_extensions, otherwise it won't be effective (nor validated during the POST API request).

Normally, the https://github.com/stac-utils/pystac/tree/main/pystac/extensions definitions auto-apply this reference when .ext() is invoked on a given pystac.STACObject.

pystac's approach is to extend the object with additional properties and apply any such references within the extension (all self-contained). This avoids having (or forgetting) to manually add item["stac_extensions"].append(SCHEMA_URI). Also, it avoids importing these references for every possible schema to apply. The drawback is that their "extension" classes are not using the latest Python patterns (dataclasses/pydantic models), hence why we have the "helpers" here to do the mapping between the two strategies.

As we can see in this PR, the classes below are explicitly imported individually in STACpopulator/populator_base.py and require various updates to it. This does not scale well to future extensions. They should all be inserted via the "helper" mechanism, and only this "helper" should deal with the mapping of any relevant properties coming from the config, their location in the collection/items, and the insertion of the schema URI.
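For illustration, a minimal sketch of the self-contained pystac behaviour described above, using the built-in projection extension (the item path is hypothetical):

import pystac
from pystac.extensions.projection import ProjectionExtension

item = pystac.Item.from_file("item.json")  # hypothetical item
proj = ProjectionExtension.ext(item, add_if_missing=True)  # appends the schema URI for us
proj.epsg = 4326
assert ProjectionExtension.get_schema_uri() in item.stac_extensions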

.gitignore Outdated
pyessv-archive/

# Temp files downloaded
temp_files/
Contributor

Is this a new location where we're storing temporary files on disk? Why not use the user's temp directory, or their cache directory if you want these files to persist longer?

self.href = href
self.download_dir = Path(download_dir)
self.download_dir.mkdir(exist_ok=True)
self.local_path = self.download_dir / Path(href).name
Contributor

Can you explain why downloading a file locally is useful here?

Is this for accessing netCDF files that are not being served by THREDDS?

If that's the case, can we not just download the header metadata instead of the whole file, since the file could be huge in some cases?

And is there any code that cleans up these files when done?

@henriaidasso (Collaborator, Author) commented Oct 9, 2025

I needed information on the file (e.g., size in bytes) that is not available in the NCML metadata. I mentioned this during our last meeting.
It would be more efficient to get the info directly from the metadata instead. But maybe I am not looking in the right location?

And yes, a cleanup is being implemented.

@mishaschwartz (Contributor) commented Oct 9, 2025

Ah ok that makes sense.

Just a suggestion... you could read the metadata into memory and then count the rest of the bytes in the download stream, calculate the checksum, check the byte order, etc. without actually saving the file to disk.

That way you'd still get all the same file information and the metadata and you wouldn't have to actually write the file and keep it on disk.
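A sketch of this suggestion using only requests and hashlib (not the project's actual helper; see also the scaling concern raised in the next comment):

import hashlib

import requests


def stream_file_info(url: str, chunk_size: int = 1 << 20) -> dict:
    """Derive size and checksum from the download stream without writing to disk."""
    size = 0
    digest = hashlib.sha256()
    with requests.get(url, stream=True) as res:
        res.raise_for_status()
        for chunk in res.iter_content(chunk_size=chunk_size):
            size += len(chunk)
            digest.update(chunk)
    return {"size": size, "sha256": digest.hexdigest()}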

Collaborator

Wait, is there a suggestion to stream the entire content of the server to create the catalog? If yes, this is a big no-no for me. We have hundreds of terabytes of data. Streaming those when creating or updating the catalog will take forever and gobble up bandwidth. The file size is available from the HTML (https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html), so certainly there is an easier way to get it.
I don't know about the checksum. I found this https://docs.unidata.ucar.edu/netcdf-java/current/javadocAll/dap4/core/util/ChecksumMode.html but I don't really understand what it does exactly.

Collaborator

So the ChecksumMode allows you to get the checksum for individual variables inside the file. For example:
https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dap4/birdhouse/testdata/xclim/CRCM5/tasmax_bby_198406_se.nc.dmr.xml?dap4.checksum=true#dap4

Collaborator Author

Alright, thanks! By the way, how do I get the byte order? It seems to be the endianness, but I only have it for each variable, not for the whole netCDF file.

Collaborator

No idea, sorry.

Collaborator

Definitely, using the metadata is much better. Downloads do not scale.
An HTTP HEAD request can return the relevant headers without the content.

❯ curl --head 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc'
HTTP/1.1 200
Server: nginx/1.23.4
Date: Fri, 10 Oct 2025 22:14:57 GMT
Content-Type: application/x-netcdf
Content-Length: 565270656
[...]
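The Python equivalent, as the populator could issue it (a sketch, not code from this PR):

import requests

res = requests.head(
    "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/"
    "ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc",
    allow_redirects=True,
)
res.raise_for_status()
size = int(res.headers["Content-Length"])  # 565270656, matching the curl output above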

@huard
Nice find ;)

Using https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dap4/birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc.dmr.xml?dap4.checksum=true#dap4 (same file as above HEAD request), we can find _DAP4_Checksum_CRC32 entries. However, they are for each distinct variable, not the file as a whole.

Since NetCDF-4 relies on HDF5, I wonder if its checksum could be retrieved from THREDDS or in another way? If not, maybe it could be encoded in an attribute somewhere?

I also found this amongst the response:

<Attribute name="_DAP4_Little_Endian" type="UInt8">
  <Value value="1" />
</Attribute>

It aligns with https://docs.unidata.ucar.edu/netcdf-c/4.8.1/netcdf_8h.html#ab9a4b12a73eeb0ed8ee5eada6e606132 value.

While playing around in there, I also found this link (UDDC):
https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/uddc/birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc
Notably, there are creator_..., contributor_..., publisher_..., acknowledgment, license and keywords attributes that are available but unfilled. All of these are what the config YAML was adding manually. Is there any chance those could be set so the populator can extract them directly from the source?

"""Initialize the FileHelper object."""
self.href = href
self.download_dir = Path(download_dir)
self.download_dir.mkdir(exist_ok=True)
Contributor

I would move this to the download function, since the parent directory only needs to exist when a file is actually downloaded.
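A sketch of the suggested structure, with the directory created lazily inside the download step (names follow the FileHelper attributes quoted above; the download body itself is assumed):

from pathlib import Path

import requests


def download(href: str, download_dir: Path) -> Path:
    # Create the directory only when a download actually happens.
    download_dir.mkdir(parents=True, exist_ok=True)
    local_path = download_dir / Path(href).name
    with requests.get(href, stream=True) as res:
        res.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in res.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    return local_path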

@property
def uid(self) -> str:
    """Return a unique ID for RDPS data item."""
    item_url = self.attrs["access_urls"]["HTTPServer"]
Contributor

The HTTPServer link is not guaranteed to exist. Please use the same pattern as we're using elsewhere to check for alternate access URLs in the case where one doesn't exist.
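Something along these lines, where the preference order is an assumption (the actual pattern used elsewhere in the project may differ):

def pick_access_url(access_urls: dict[str, str]) -> str:
    # Prefer HTTPServer, then OpenDAP, then whatever service is available.
    for key in ("HTTPServer", "OpenDAP"):
        if key in access_urls:
            return access_urls[key]
    return next(iter(access_urls.values()))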

from STACpopulator.extensions.cordex6 import Cordex6DataModel, Cordex6DataModelNcML


def get_first_item_attrs(url):
Contributor

Why remove these functions from the cordex code? This seems unrelated to the other changes here.

test.py Outdated
Contributor

This file is empty.

If this is supposed to contain tests, please put them in a separate file in the tests directory.

Collaborator

Indeed, some tests must be added. I am not sure what the result looks like currently with RDPS/HRDPS enabled.

@fmigneault (Collaborator) left a comment

Looks good. Just need some metadata validation with Ouranos to make sure.

Comment on lines +24 to +25
assert "cf:parameter" in item["properties"]
assert set(["time", "latitude", "longitude"]).issubset(set(parameter_names))
@fmigneault (Collaborator) commented Oct 31, 2025

@huard
I would like to validate the source metadata for this case.

For example, using:
https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncml/birdhouse/testdata/HRDPS/HRDPS_sample/HRDPS_P_PR_SFC/2024010100.nc

We can see both

<dimension name="rlon" length="2540" />
<dimension name="rlat" length="1290" />
<dimension name="time" length="13" isUnlimited="true" />

and

<variable name="lon" shape="rlat rlon" type="float">...</variable>
<variable name="lat" shape="rlat rlon" type="float">...</variable>
<variable name="rlon" shape="rlon" type="float">...</variable>
<variable name="rlat" shape="rlat" type="float">...</variable>
<variable name="time" shape="time" type="int">...</variable>

Why are rlon/rlat/time each both a variable and a dimension? Is that valid, and what does it imply for the NetCDF file? Would accessing these variables return themselves?

Loading it with xarray, I can indeed do some questionable recursive data["rlon"]["rlon"]["rlon"] calls.

>>> data
<xarray.Dataset> Size: 197MB
Dimensions:         (time: 13, rlat: 1290, rlon: 2540)
Coordinates:
  * time            (time) datetime64[ns] 104B 2024-01-01 ... 2024-01-01T12:0...
  * rlat            (rlat) float32 5kB -12.3 -12.28 -12.26 ... 16.66 16.68 16.7
  * rlon            (rlon) float32 10kB -14.82 -14.8 -14.78 ... 42.28 42.31
    lon             (rlat, rlon) float32 13MB ...
    lat             (rlat, rlon) float32 13MB ...
Data variables:
    rotated_pole    float32 4B ...
    HRDPS_P_PR_SFC  (time, rlat, rlon) float32 170MB ...
Attributes:
    Remarks:                         Variable names are following the convent...
    License:                         These data are provided by the Canadian ...
    product:                         HRDPS
    Conventions:                     CF-1.6
    DODS_EXTRA.Unlimited_Dimension:  time

>>> data["rlon"]["rlon"]["rlon"]
<xarray.DataArray 'rlon' (rlon: 2540)> Size: 10kB
array([-14.821219, -14.798713, -14.776221, ...,  42.261284,  42.283775,
        42.306282], shape=(2540,), dtype=float32)
Coordinates:
  * rlon     (rlon) float32 10kB -14.82 -14.8 -14.78 ... 42.26 42.28 42.31
Attributes:
    long_name:             longitude in rotated pole grid
    units:                 degrees
    eccc_grid_definition:  grtyp: E, ig1: 1430, ig2: 500, ig3: 56000, ig4: 44000
    standard_name:         grid_longitude
    axis:                  X
    _ChunkSizes:           2540

>>> data["rlon"]["rlon"]["rlon"][0]
<xarray.DataArray 'rlon' ()> Size: 4B
array(-14.821219, dtype=float32)
Coordinates:
    rlon     float32 4B -14.82
Attributes:
    long_name:             longitude in rotated pole grid
    units:                 degrees
    eccc_grid_definition:  grtyp: E, ig1: 1430, ig2: 500, ig3: 56000, ig4: 44000
    standard_name:         grid_longitude
    axis:                  X
    _ChunkSizes:           2540

I would expect that cf:parameter only contains the actual cube:variables, and those 3 (rlon, rlat, time) are only in cube:dimensions.

Collaborator

@henriaidasso
Can you post the JSON result of one STAC Item sample?
We would like to validate the results of the produced dimensions/variables.

Collaborator Author

Here you go

{
  "id": "birdhouse__testdata__HRDPS__RDPS_sample__2024010106_012.nc",
  "bbox": [0.000289999996311962, -7.74911022186279, 359.99951171875, 89.9657287597656],
  "type": "Feature",
  "links": [
    {
      "rel": "collection",
      "type": "application/json",
      "href": "http://localhost:8081/stac/collections/RDPS_CRIM"
    },
    {
      "rel": "parent",
      "type": "application/json",
      "href": "http://localhost:8081/stac/collections/RDPS_CRIM"
    },
    {
      "rel": "root",
      "type": "application/json",
      "href": "http://localhost:8081/stac/"
    },
    {
      "rel": "self",
      "type": "application/geo+json",
      "href": "http://localhost:8081/stac/collections/RDPS_CRIM/items/birdhouse__testdata__HRDPS__RDPS_sample__2024010106_012.nc"
    },
    {
      "rel": "source",
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
      "type": "application/x-netcdf",
      "title": "birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc"
    }
  ],
  "assets": {
    "ISO": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/iso/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
      "type": "",
      "roles": []
    },
    "WCS": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/wcs/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc?request=GetCapabilities",
      "type": "application/xml",
      "roles": [
        "data"
      ]
    },
    "WMS": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/wms/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc?request=GetCapabilities",
      "type": "application/xml",
      "roles": [
        "visual"
      ]
    },
    "NcML": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncml/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
      "type": "",
      "roles": []
    },
    "UDDC": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/uddc/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
      "type": "",
      "roles": []
    },
    "OpenDAP": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
      "type": "text/html",
      "roles": [
        "data"
      ]
    },
    "HTTPServer": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
      "type": "application/x-netcdf",
      "roles": [
        "data"
      ],
      "file:size": 137644165,
      "cf:parameter": [
        {
          "name": "time",
          "unit": "hours since 2024-01-01 18:00:00"
        },
        {
          "name": "air_pressure",
          "unit": "hPa"
        },
        {
          "name": "grid_latitude",
          "unit": "degrees"
        },
        {
          "name": "grid_longitude",
          "unit": "degrees"
        },
        {
          "name": "atmosphere_hybrid_sigma_ln_pressure_coordinate",
          "unit": ""
        },
        {
          "name": "forecast_period",
          "unit": "hours"
        },
        {
          "name": "forecast_reference_time",
          "unit": "hours since 2024-01-01 06:00:00"
        },
        {
          "name": "longitude",
          "unit": "degrees_east"
        },
        {
          "name": "latitude",
          "unit": "degrees_north"
        }
      ]
    },
    "NetcdfSubsetGrid": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncss/grid/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
      "type": "application/x-netcdf",
      "roles": [
        "data"
      ]
    },
    "NetcdfSubsetPoint": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncss/point/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
      "type": "application/x-netcdf",
      "roles": [
        "data"
      ]
    }
  },
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [
        [0.000289999996311962, -7.74911022186279],
        [0.000289999996311962, 89.9657287597656],
        [359.99951171875, 89.9657287597656],
        [359.99951171875, -7.74911022186279],
        [0.000289999996311962, -7.74911022186279]
      ]
    ]
  },
  "collection": "RDPS_CRIM",
  "properties": {
    "cf:parameter": [
      {
        "name": "time",
        "unit": "hours since 2024-01-01 18:00:00"
      },
      {
        "name": "air_pressure",
        "unit": "hPa"
      },
      {
        "name": "grid_latitude",
        "unit": "degrees"
      },
      {
        "name": "grid_longitude",
        "unit": "degrees"
      },
      {
        "name": "atmosphere_hybrid_sigma_ln_pressure_coordinate",
        "unit": ""
      },
      {
        "name": "forecast_period",
        "unit": "hours"
      },
      {
        "name": "forecast_reference_time",
        "unit": "hours since 2024-01-01 06:00:00"
      },
      {
        "name": "longitude",
        "unit": "degrees_east"
      },
      {
        "name": "latitude",
        "unit": "degrees_north"
      }
    ],
    "end_datetime": "2024-01-01T18:00:00Z",
    "cube:variables": {
      "a": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "level"
        ],
        "description": ""
      },
      "b": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "level"
        ],
        "description": ""
      },
      "GZ": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "pres",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "HU": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "pres",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "PC": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "PN": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "PR": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "TD": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "level",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "lat": {
        "type": "auxiliary",
        "unit": "degrees_north",
        "dimensions": [
          "rlat",
          "rlon"
        ],
        "description": "latitude"
      },
      "lon": {
        "type": "auxiliary",
        "unit": "degrees_east",
        "dimensions": [
          "rlat",
          "rlon"
        ],
        "description": "longitude"
      },
      "pref": {
        "type": "data",
        "unit": "hPa",
        "dimensions": [
          ""
        ],
        "description": ""
      },
      "time": {
        "type": "auxiliary",
        "unit": "hours since 2024-01-01 18:00:00",
        "dimensions": [
          "time"
        ],
        "description": "Validity time"
      },
      "level": {
        "type": "auxiliary",
        "unit": "",
        "dimensions": [
          "level"
        ],
        "description": ""
      },
      "reftime": {
        "type": "data",
        "unit": "hours since 2024-01-01 06:00:00",
        "dimensions": [
          ""
        ],
        "description": ""
      },
      "leadtime": {
        "type": "data",
        "unit": "hours",
        "dimensions": [
          "time"
        ],
        "description": "Lead time (since forecast_reference_time)"
      },
      "rotated_pole": {
        "type": "data",
        "unit": "",
        "dimensions": [
          ""
        ],
        "description": ""
      },
      "TT_model_levels": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "level",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "UU_model_levels": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "level",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "VV_model_levels": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "level",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "TT_pressure_levels": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "pres",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "UU_pressure_levels": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "pres",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "VV_pressure_levels": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "pres",
          "rlat",
          "rlon"
        ],
        "description": ""
      }
    },
    "start_datetime": "2024-01-01T18:00:00Z",
    "cube:dimensions": {
      "pres": {
        "axis": "z",
        "type": "spatial",
        "extent": [null, null],
        "description": "air_pressure"
      },
      "rlat": {
        "axis": "y",
        "type": "spatial",
        "extent": [-7.74911022186279, 89.9657287597656],
        "description": "projection_y_coordinate"
      },
      "rlon": {
        "axis": "x",
        "type": "spatial",
        "extent": [0.000289999996311962, 359.99951171875],
        "description": "projection_x_coordinate"
      }
    }
  },
  "stac_version": "1.0.0",
  "stac_extensions": [
    "https://stac-extensions.github.io/datacube/v2.2.0/schema.json",
    "https://stac-extensions.github.io/cf/v0.2.0/schema.json",
    "https://stac-extensions.github.io/file/v2.1.0/schema.json"
  ]
}

Collaborator

@fmigneault It looks like it already does the right thing, no?

rlon and rlat are not in variables, but lon/lat appear as auxiliary variables.

Collaborator

There is still this one:

      "time": {
        "type": "auxiliary",
        "unit": "hours since 2024-01-01 18:00:00",
        "dimensions": [
          "time"
        ],
        "description": "Validity time"
      },

RDPS seems to use names other than rlon/rlat, so maybe it doesn't cause an actual "duplicate" match.

        {
          "name": "longitude",
          "unit": "degrees_east"
        },
        {
          "name": "latitude",
          "unit": "degrees_north"
        }

@henriaidasso
Can you provide an HRDPS sample as well?
I think they might be slightly different.

I also notice the units are not filled. Maybe something to look into.

The extent/values/step properties are not filled, so it seems an xarray call might be needed to provide the values when dimension.name == variable.name.
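A sketch of what such an xarray call could look like (the URL is the sample item's OpenDAP asset; this is not code from the PR):

import xarray as xr

ds = xr.open_dataset(
    "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/testdata/"
    "HRDPS/RDPS_sample/2024010106_012.nc"
)
for name in ds.dims:
    if name in ds.variables:  # dimension backed by a same-named coordinate variable
        values = ds[name].values.tolist()  # candidate for cube:dimensions[name]["values"]
        unit = ds[name].attrs.get("units", "")  # may require decode_cf=False for raw time units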

Collaborator

Indeed. I misread about extent.

Possible bugs in RDPS case:

(screenshots of the problematic entries omitted)

There is also a "level" dimension referenced in certain variables, but there does not seem to be a cube:dimensions entry for it (there is even a level variable referencing this nonexistent level dimension).

Collaborator Author

@fmigneault is there anything I can do on my end to expedite merging this PR?

Collaborator

For the auxiliary variables that map directly to a dimension, I am waiting for feedback on
stac-extensions/datacube#33. The issue was raised during the STAC Community meeting; feedback is expected eventually, but it is not a blocker otherwise (the current definitions are not inherently wrong).

At the very least, the errors highlighted in #114 (comment) should be fixed (i.e., handling of empty values / not injecting them).

The empty-string dimensions look to be single-value constants (reference pressure, reference time). Therefore, their reference value should be explicitly indicated with values: [<the-value>] (xarray may be needed to read them if they are not already available from the NCML).
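For instance, the reftime constant from the sample item could be emitted along these lines (a sketch; the value is taken from the sample's forecast_reference_time):

# A single-value constant expressed as an additional datacube dimension
# with an explicit values list, per the suggestion above.
reftime_dimension = {
    "type": "temporal",
    "values": ["2024-01-01T06:00:00Z"],
}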

Collaborator Author

Are the mentioned errors within my scope to fix, or should I hold off for now?

Collaborator

Within this PR, since we could flag them through these new datasets.

variables[name] = Variable(
    properties=dict(
-       dimensions=meta["shape"],
+       dimensions=[] if dimensions == [""] else dimensions,
@henriaidasso (Collaborator, Author) commented Nov 11, 2025

@fmigneault this change fixes the empty-string dimension issue in variables. Variables without dimensions now show up as below.

 "rotated_pole": {
    "type": "data",
    "unit": "",
    "dimensions": [], # instead of [""] previously 
    "description": ""
  },

For the extent attribute in dimensions, when I omit setting it for the Z axis (i.e., not putting it in the properties dict), extent still appears with the value [null, null] in the item's JSON. When I use the static method Dimension.from_dict(...) to cast the dimension object to a VerticalSpatialDimension, a missing extent throws a RequiredPropertyMissing error, which does not align with the DataCube specification, where extent is optional for a VerticalSpatialDimension. I suggest we keep the current default value [null, null] and update later.

Collaborator

Nice find. 👍
Yes, let's move forward with [null, null] in the meantime.

I logged the issue: stac-utils/pystac#1593

    for key, asset in self.item.get_assets().items()
-   if (service_type is ServiceType and service_type.value in asset.extra_fields)
+   if (isinstance(service_type, ServiceType) and service_type.value in asset.extra_fields)
+   or any(ServiceType.from_value(field, default=None) is ServiceType for field in asset.extra_fields)
Contributor

Shouldn't this be changed to isinstance as well, since from_value returns an enum instance?

Or I think you can just simplify this to:

any(ServiceType.from_value(field, default=False) for field in asset.extra_fields)

Contributor

This also applies to a few other places. For example: ItemCFExtension.get_assets, ItemCMIP6Extension.get_assets

Could all of these actually just inherit this method from the base parent class?

"""Extracts cf:parameter-like information from item_data."""
parameters = []

for _, var in self.variables.items():
Contributor

Just use .values() if you're not going to use the key. It makes the intention of the code clearer.
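A generic illustration of the difference:

variables = {"lat": {"units": "degrees_north"}, "lon": {"units": "degrees_east"}}

for _, var in variables.items():  # key unpacked but never used
    print(var["units"])

for var in variables.values():  # clearer: only the values are needed
    print(var["units"])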

name = attrs.get("standard_name") # Get the required standard name
if not name:
continue # Skip if no valid name
unit = attrs.get("units") or ""
Contributor

The CFParameter.unit is annotated as Optional[str] but here we're not accepting a None value. Do we want to make a distinction between None and the empty string? If not, then let's choose to represent nulls as either None or the empty string and make sure the values and annotations are consistent.

@henriaidasso (Collaborator, Author) commented Nov 17, 2025

Yes, there is a mismatch between the CF extension specification, where unit appears not to be required, and its validation schema, which requires unit to be a string. I will update the annotation to str and raise an issue on the CF extension.
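The adjusted annotation would look like this (a sketch of the CFParameter model as discussed, not its full definition):

from pydantic import BaseModel


class CFParameter(BaseModel):
    name: str
    unit: str = ""  # the validation schema requires a string; "" when unitless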

return {
    key: asset
    for key, asset in self.item.get_assets().items()
    if (service_type in ServiceType and service_type.value in asset.extra_fields)
Contributor

This works, but in other get_assets methods for other classes we're doing isinstance(service_type, ServiceType).

I actually like this one better because it supports both "OpenDAP" in ServiceType and ServiceType.opendap in ServiceType.

But either way, we should be consistent with how we're doing this code. Either use service_type in ServiceType or isinstance(service_type, ServiceType). But not both.

return None
res = self.session.head(self.access_urls[self.asset_key])
res.raise_for_status()
return int(res.headers.get("Content-Length", None))
Contributor

int(None) will raise a TypeError.

Contributor

Also, the default for .get is already None, so you don't have to specify it again.
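A sketch of the last two lines with both fixes applied:

size = res.headers.get("Content-Length")  # default is already None
return int(size) if size is not None else None  # int(None) would raise TypeError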

@@ -0,0 +1,11 @@
from __future__ import annotations
Contributor

not necessary

@@ -0,0 +1,13 @@
from __future__ import annotations
Contributor

not necessary


@classmethod
- def from_data(cls, data: dict) -> None:
+ def from_data(cls, data: dict, **kwargs) -> "THREDDSCatalogDataModel":
Contributor

Here's a good reason to include from __future__ import annotations

Comment on lines +146 to +147
# Cast providers to pystac objects
self._collection_info["providers"] = self.__make_collection_providers()
Contributor

Hi @henriaidasso, I'm just circling back to this. Please let me know if you'd still like to meet about this question. Since this was last discussed, #106 has been accepted.

@henriaidasso (Collaborator, Author)

@mishaschwartz Yes, still on track. This PR has already grown too large in my opinion. I propose closing it and opening a new issue to improve the providers section.

@fmigneault (Collaborator) left a comment

Just a minor fix left. I think we can merge once the last few comments are resolved, and do another PR if anything is identified later.

@fmigneault fmigneault merged commit 82ff48d into crim-ca:master Nov 18, 2025
6 checks passed