Conversation

@henriaidasso (Collaborator) commented Oct 8, 2025

Implementation of the RDPS metadata integration. The following metadata extensions have been added.

@@ -0,0 +1,326 @@
"""RPDS Extension Module."""
Collaborator

I don't understand why this is so much more complicated than the cordex extension.

Collaborator

Indeed.

This replicates a lot of things that other helpers or the main populator already handle, like extracting the THREDDS NCML details and defining the STAC core properties.

Unless RDPS/HRDPS have specific fields only applicable to them, there might not even be a need for a specific helper for them. Helpers are mostly for mapping STAC extensions to pydantic models. If all available helpers are sufficient, they should be used directly by the populator implementations rather than adding an intermediate helper.

@huard (Collaborator) commented Oct 9, 2025

Could you provide a bit more context on the choices you made?

  • I'm curious to understand why you decided not to reuse the base classes in extensions/base.py that are meant to avoid a lot of the boilerplate.
  • The THREDDSCatalogDataModel can be subclassed to get a lot done for free. If something is missing, I would prefer adding it to the THREDDSCatalogDataModel once rather than copy-pasting the same code into every extension (see the sketch after this list).
  • Also, in my mind, we create an extension when there is a controlled vocabulary to comply with. I don't think there is one for RDPS and HRDPS, so I see them as mere "implementations" relying on other extensions, such as cf, datacube and the new ones you're proposing.
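A minimal sketch of the subclassing idea above; the import path follows the extensions/base.py module mentioned in the first point, and RDPSDataModel plus its field are hypothetical names, not this PR's actual code:

# Hedged sketch: the import path and field names below are assumptions.
from STACpopulator.extensions.base import THREDDSCatalogDataModel


class RDPSDataModel(THREDDSCatalogDataModel):
    """RDPS items inherit NCML parsing and core STAC properties from the base."""

    # Declare only the fields genuinely specific to RDPS here.
    forecast_system: str = "RDPS"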

@mishaschwartz (Contributor) left a comment

I agree with @huard here. I'm not sure exactly how this best fits in with the rest of stac-populator, and I think some additional comments/docstrings/documentation would help explain your motivations.

Please see the comments below for specific questions as well.

pyproject.toml Outdated
"pyessv~=0.9",
"requests~=2.32",
"lxml~=6.0",
"dataclasses-json~=0.6.7"
Contributor

What is the advantage to using dataclasses-json as opposed to using pydantic to define and serialize data classes like we're already doing elsewhere?

Collaborator

Indeed, I don't see the advantage. Simply use pydantic with model_dump/model_dump_json as needed. We don't need yet another way to deal with these classes (the pydantic/pystac approaches already need a lot of workarounds).
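A minimal sketch of that approach (Contact here is a hypothetical model, not the actual contacts-extension class):

from pydantic import BaseModel


class Contact(BaseModel):
    name: str
    emails: list[str] = []


contact = Contact(name="Jane Doe")
contact.model_dump()       # {'name': 'Jane Doe', 'emails': []}
contact.model_dump_json()  # '{"name":"Jane Doe","emails":[]}'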

return {
    key: asset
    for key, asset in self.item.get_assets().items()
    if (service_type is ServiceType and service_type.value in asset.extra_fields)
Contributor

Won't this always be False since service_type is an enum value and ServiceType is the class?

Collaborator Author

You're right. I copied it from the cmip6.py extension module and didn't pay much attention. But yeah, isinstance() should be used here.

import pystac
from dataclasses_json import LetterCase, config, dataclass_json

SCHEMA_URI = "https://stac-extensions.github.io/contacts/v0.1.1/schema.json"
Contributor

Is this used anywhere?

Contributor

Also, if we already have the JSON schema for this extension, is there a reason why we're redefining all the various fields in the dataclasses as well?

I thought the dataclasses were meant to extend the json schemas, not to replicate the same information.

Collaborator Author

Ah! Didn't know that :) Makes sense.

Collaborator

The schema reference must be included in the Collection/Item stac_extensions, otherwise it won't be effective (nor validated during the POST API request).

Normally, the https://github.com/stac-utils/pystac/tree/main/pystac/extensions definitions auto-apply this reference when .ext() is invoked on a given pystac.STACObject.

pystac's approach is to extend the object with additional properties and apply any such references within the extension (all self-contained). This avoids having (or forgetting) to manually add item["stac_extensions"].append(SCHEMA_URI). Also, it avoids importing these references for every possible schema to apply. The drawback is that their "extension" classes are not using the latest Python patterns (dataclasses/pydantic models), hence why we have the "helpers" here to do the mapping between the two strategies.

As we can see in this PR, the classes below are explicitly imported individually in STACpopulator/populator_base.py and require various updates to it. This does not scale well to future extensions. They should all be inserted via the "helper" mechanism, and only this "helper" should deal with the mapping of any relevant properties coming from the config, their location in the collection/items, and the insertion of the schema URI.
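For illustration, a minimal sketch of the self-contained pystac behaviour described above, using the built-in projection extension (the item path is hypothetical):

import pystac
from pystac.extensions.projection import ProjectionExtension

item = pystac.Item.from_file("item.json")  # hypothetical item
proj = ProjectionExtension.ext(item, add_if_missing=True)  # appends the schema URI for us
proj.epsg = 4326
assert ProjectionExtension.get_schema_uri() in item.stac_extensions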

.gitignore Outdated
pyessv-archive/

# Temp files downloaded
temp_files/
Contributor

Is this a new location where we're storing temporary files on disk? Why not use the user's temp directory, or their cache directory if you want these files to persist longer?

self.href = href
self.download_dir = Path(download_dir)
self.download_dir.mkdir(exist_ok=True)
self.local_path = self.download_dir / Path(href).name
Contributor

Can you explain why downloading a file locally is useful here?

Is this for accessing netCDF files that are not being served by THREDDS?

If that's the case, can we not just download the header metadata instead of the whole file, since the file could be huge in some cases?

And is there any code that cleans up these files when done?

@henriaidasso (Collaborator, Author) commented Oct 9, 2025

I needed information on the file (e.g., size in bytes) that is not available in the NCML metadata. I mentioned this during our last meeting.
It would be more efficient to get the info directly from the metadata instead. But maybe I am not looking in the right location?

And yes, a cleanup is being implemented.

@mishaschwartz (Contributor) commented Oct 9, 2025

Ah ok that makes sense.

Just a suggestion... you could read the metadata into memory and then count the rest of the bytes in the download stream, calculate the checksum, check the byte order, etc. without actually saving the file to disk.

That way you'd still get all the same file information and the metadata and you wouldn't have to actually write the file and keep it on disk.
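A sketch of this suggestion using only requests and hashlib (not the project's actual helper; see also the scaling concern raised in the next comment):

import hashlib

import requests


def stream_file_info(url: str, chunk_size: int = 1 << 20) -> dict:
    """Derive size and checksum from the download stream without writing to disk."""
    size = 0
    digest = hashlib.sha256()
    with requests.get(url, stream=True) as res:
        res.raise_for_status()
        for chunk in res.iter_content(chunk_size=chunk_size):
            size += len(chunk)
            digest.update(chunk)
    return {"size": size, "sha256": digest.hexdigest()}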

Collaborator

Wait, is there a suggestion to stream the entire content of the server to create the catalog? If yes, this is a big no-no for me. We have hundreds of terabytes of data. Streaming those when creating or updating the catalog will take forever and gobble up bandwidth. The file size is available from the HTML (https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html), so certainly there is an easier way to get it.
I don't know about the checksum. I found this https://docs.unidata.ucar.edu/netcdf-java/current/javadocAll/dap4/core/util/ChecksumMode.html but I don't really understand what it does exactly.

Collaborator

So the ChecksumMode allows you to get the checksum for individual variables inside the file. For example:
https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dap4/birdhouse/testdata/xclim/CRCM5/tasmax_bby_198406_se.nc.dmr.xml?dap4.checksum=true#dap4

Collaborator Author

Alright, thanks! By the way, how do I get the byte order? It seems to be the endianness, but I only have it for each variable, not for the whole netCDF file.

Collaborator

No idea, sorry.

Collaborator

Definitely, using the metadata is much better. Downloads do not scale.
An HTTP HEAD request can return the relevant headers without the content.

❯ curl --head 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc'
HTTP/1.1 200
Server: nginx/1.23.4
Date: Fri, 10 Oct 2025 22:14:57 GMT
Content-Type: application/x-netcdf
Content-Length: 565270656
[...]
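The Python equivalent, as the populator could issue it (a sketch, not code from this PR):

import requests

res = requests.head(
    "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/"
    "ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc",
    allow_redirects=True,
)
res.raise_for_status()
size = int(res.headers["Content-Length"])  # 565270656, matching the curl output above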

@huard
Nice find ;)

Using https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dap4/birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc.dmr.xml?dap4.checksum=true#dap4 (same file as above HEAD request), we can find _DAP4_Checksum_CRC32 entries. However, they are for each distinct variable, not the file as a whole.

Since NetCDF-4 relies on HDF5, I wonder if its checksum could be retrieved from THREDDS or in another way? If not, maybe it could be encoded in an attribute somewhere?

I also found this amongst the response:

<Attribute name="_DAP4_Little_Endian" type="UInt8">
  <Value value="1" />
</Attribute>

It aligns with https://docs.unidata.ucar.edu/netcdf-c/4.8.1/netcdf_8h.html#ab9a4b12a73eeb0ed8ee5eada6e606132 value.

While playing around in there, I also found this link (UDDC):
https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/uddc/birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc
Notably, there are creator_..., contributor_..., publisher_..., acknowledgment, license and keywords attributes that are available but unfilled. All of these are what the config YAML was adding manually. Is there any chance those could be set so the populator can extract them directly from the source?

"""Initialize the FileHelper object."""
self.href = href
self.download_dir = Path(download_dir)
self.download_dir.mkdir(exist_ok=True)
Contributor

I would move this to the download function, since the parent directory only needs to exist when a file is actually downloaded.
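A sketch of the suggested structure, with the directory created lazily inside the download step (names follow the FileHelper attributes quoted above; the download body itself is assumed):

from pathlib import Path

import requests


def download(href: str, download_dir: Path) -> Path:
    # Create the directory only when a download actually happens.
    download_dir.mkdir(parents=True, exist_ok=True)
    local_path = download_dir / Path(href).name
    with requests.get(href, stream=True) as res:
        res.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in res.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    return local_path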

@property
def uid(self) -> str:
    """Return a unique ID for RDPS data item."""
    item_url = self.attrs["access_urls"]["HTTPServer"]
Contributor

The HTTPServer link is not guaranteed to exist. Please use the same pattern as we're using elsewhere to check for alternate access URLs in the case where one doesn't exist.
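Something along these lines, where the preference order is an assumption (the actual pattern used elsewhere in the project may differ):

def pick_access_url(access_urls: dict[str, str]) -> str:
    # Prefer HTTPServer, then OpenDAP, then whatever service is available.
    for key in ("HTTPServer", "OpenDAP"):
        if key in access_urls:
            return access_urls[key]
    return next(iter(access_urls.values()))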

from STACpopulator.extensions.cordex6 import Cordex6DataModel, Cordex6DataModelNcML


def get_first_item_attrs(url):
Contributor

Why remove these functions from the cordex code? This seems unrelated to the other changes here.

test.py Outdated
Contributor

This file is empty.

If this is supposed to contain tests, please put them in a separate file in the tests directory.

Collaborator

Indeed, some tests must be added. I am not sure what the result looks like currently with RDPS/HRDPS enabled.

@fmigneault (Collaborator) left a comment

Looks good. Just need some metadata validation with Ouranos to make sure.

Comment on lines +24 to +25
assert "cf:parameter" in item["properties"]
assert set(["time", "latitude", "longitude"]).issubset(set(parameter_names))
@fmigneault (Collaborator) commented Oct 31, 2025

@huard
I would like to validate the source metadata for this case.

For example, using:
https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncml/birdhouse/testdata/HRDPS/HRDPS_sample/HRDPS_P_PR_SFC/2024010100.nc

We can see both

<dimension name="rlon" length="2540" />
<dimension name="rlat" length="1290" />
<dimension name="time" length="13" isUnlimited="true" />

and

<variable name="lon" shape="rlat rlon" type="float">...</variable>
<variable name="lat" shape="rlat rlon" type="float">...</variable>
<variable name="rlon" shape="rlon" type="float">...</variable>
<variable name="rlat" shape="rlat" type="float">...</variable>
<variable name="time" shape="time" type="int">...</variable>

Why are rlon/rlat/time each both a variable and a dimension? Is that valid, and what does it imply for the NetCDF file? Would accessing these variables return themselves?

Loading it with xarray, I can indeed do some questionable recursive data["rlon"]["rlon"]["rlon"] calls.

>>> data
<xarray.Dataset> Size: 197MB
Dimensions:         (time: 13, rlat: 1290, rlon: 2540)
Coordinates:
  * time            (time) datetime64[ns] 104B 2024-01-01 ... 2024-01-01T12:0...
  * rlat            (rlat) float32 5kB -12.3 -12.28 -12.26 ... 16.66 16.68 16.7
  * rlon            (rlon) float32 10kB -14.82 -14.8 -14.78 ... 42.28 42.31
    lon             (rlat, rlon) float32 13MB ...
    lat             (rlat, rlon) float32 13MB ...
Data variables:
    rotated_pole    float32 4B ...
    HRDPS_P_PR_SFC  (time, rlat, rlon) float32 170MB ...
Attributes:
    Remarks:                         Variable names are following the convent...
    License:                         These data are provided by the Canadian ...
    product:                         HRDPS
    Conventions:                     CF-1.6
    DODS_EXTRA.Unlimited_Dimension:  time

>>> data["rlon"]["rlon"]["rlon"]
<xarray.DataArray 'rlon' (rlon: 2540)> Size: 10kB
array([-14.821219, -14.798713, -14.776221, ...,  42.261284,  42.283775,
        42.306282], shape=(2540,), dtype=float32)
Coordinates:
  * rlon     (rlon) float32 10kB -14.82 -14.8 -14.78 ... 42.26 42.28 42.31
Attributes:
    long_name:             longitude in rotated pole grid
    units:                 degrees
    eccc_grid_definition:  grtyp: E, ig1: 1430, ig2: 500, ig3: 56000, ig4: 44000
    standard_name:         grid_longitude
    axis:                  X
    _ChunkSizes:           2540

>>> data["rlon"]["rlon"]["rlon"][0]
<xarray.DataArray 'rlon' ()> Size: 4B
array(-14.821219, dtype=float32)
Coordinates:
    rlon     float32 4B -14.82
Attributes:
    long_name:             longitude in rotated pole grid
    units:                 degrees
    eccc_grid_definition:  grtyp: E, ig1: 1430, ig2: 500, ig3: 56000, ig4: 44000
    standard_name:         grid_longitude
    axis:                  X
    _ChunkSizes:           2540

I would expect that cf:parameter only contains the actual cube:variables, and those 3 (rlon, rlat, time) are only in cube:dimensions.

Collaborator

@henriaidasso
Can you post the JSON result of one STAC Item sample?
We would like to validate the results of the produced dimensions/variables.

Collaborator Author

Here you go

{
  "id": "birdhouse__testdata__HRDPS__RDPS_sample__2024010106_012.nc",
  "bbox": [0.000289999996311962, -7.74911022186279, 359.99951171875, 89.9657287597656],
  "type": "Feature",
  "links": [
    {
      "rel": "collection",
      "type": "application/json",
      "href": "http://localhost:8081/stac/collections/RDPS_CRIM"
    },
    {
      "rel": "parent",
      "type": "application/json",
      "href": "http://localhost:8081/stac/collections/RDPS_CRIM"
    },
    {
      "rel": "root",
      "type": "application/json",
      "href": "http://localhost:8081/stac/"
    },
    {
      "rel": "self",
      "type": "application/geo+json",
      "href": "http://localhost:8081/stac/collections/RDPS_CRIM/items/birdhouse__testdata__HRDPS__RDPS_sample__2024010106_012.nc"
    },
    {
      "rel": "source",
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
      "type": "application/x-netcdf",
      "title": "birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc"
    }
  ],
  "assets": {
    "ISO": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/iso/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
      "type": "",
      "roles": []
    },
    "WCS": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/wcs/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc?request=GetCapabilities",
      "type": "application/xml",
      "roles": [
        "data"
      ]
    },
    "WMS": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/wms/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc?request=GetCapabilities",
      "type": "application/xml",
      "roles": [
        "visual"
      ]
    },
    "NcML": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncml/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
      "type": "",
      "roles": []
    },
    "UDDC": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/uddc/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
      "type": "",
      "roles": []
    },
    "OpenDAP": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
      "type": "text/html",
      "roles": [
        "data"
      ]
    },
    "HTTPServer": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
      "type": "application/x-netcdf",
      "roles": [
        "data"
      ],
      "file:size": 137644165,
      "cf:parameter": [
        {
          "name": "time",
          "unit": "hours since 2024-01-01 18:00:00"
        },
        {
          "name": "air_pressure",
          "unit": "hPa"
        },
        {
          "name": "grid_latitude",
          "unit": "degrees"
        },
        {
          "name": "grid_longitude",
          "unit": "degrees"
        },
        {
          "name": "atmosphere_hybrid_sigma_ln_pressure_coordinate",
          "unit": ""
        },
        {
          "name": "forecast_period",
          "unit": "hours"
        },
        {
          "name": "forecast_reference_time",
          "unit": "hours since 2024-01-01 06:00:00"
        },
        {
          "name": "longitude",
          "unit": "degrees_east"
        },
        {
          "name": "latitude",
          "unit": "degrees_north"
        }
      ]
    },
    "NetcdfSubsetGrid": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncss/grid/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
      "type": "application/x-netcdf",
      "roles": [
        "data"
      ]
    },
    "NetcdfSubsetPoint": {
      "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncss/point/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
      "type": "application/x-netcdf",
      "roles": [
        "data"
      ]
    }
  },
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [
        [0.000289999996311962, -7.74911022186279],
        [0.000289999996311962, 89.9657287597656],
        [359.99951171875, 89.9657287597656],
        [359.99951171875, -7.74911022186279],
        [0.000289999996311962, -7.74911022186279]
      ]
    ]
  },
  "collection": "RDPS_CRIM",
  "properties": {
    "cf:parameter": [
      {
        "name": "time",
        "unit": "hours since 2024-01-01 18:00:00"
      },
      {
        "name": "air_pressure",
        "unit": "hPa"
      },
      {
        "name": "grid_latitude",
        "unit": "degrees"
      },
      {
        "name": "grid_longitude",
        "unit": "degrees"
      },
      {
        "name": "atmosphere_hybrid_sigma_ln_pressure_coordinate",
        "unit": ""
      },
      {
        "name": "forecast_period",
        "unit": "hours"
      },
      {
        "name": "forecast_reference_time",
        "unit": "hours since 2024-01-01 06:00:00"
      },
      {
        "name": "longitude",
        "unit": "degrees_east"
      },
      {
        "name": "latitude",
        "unit": "degrees_north"
      }
    ],
    "end_datetime": "2024-01-01T18:00:00Z",
    "cube:variables": {
      "a": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "level"
        ],
        "description": ""
      },
      "b": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "level"
        ],
        "description": ""
      },
      "GZ": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "pres",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "HU": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "pres",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "PC": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "PN": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "PR": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "TD": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "level",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "lat": {
        "type": "auxiliary",
        "unit": "degrees_north",
        "dimensions": [
          "rlat",
          "rlon"
        ],
        "description": "latitude"
      },
      "lon": {
        "type": "auxiliary",
        "unit": "degrees_east",
        "dimensions": [
          "rlat",
          "rlon"
        ],
        "description": "longitude"
      },
      "pref": {
        "type": "data",
        "unit": "hPa",
        "dimensions": [
          ""
        ],
        "description": ""
      },
      "time": {
        "type": "auxiliary",
        "unit": "hours since 2024-01-01 18:00:00",
        "dimensions": [
          "time"
        ],
        "description": "Validity time"
      },
      "level": {
        "type": "auxiliary",
        "unit": "",
        "dimensions": [
          "level"
        ],
        "description": ""
      },
      "reftime": {
        "type": "data",
        "unit": "hours since 2024-01-01 06:00:00",
        "dimensions": [
          ""
        ],
        "description": ""
      },
      "leadtime": {
        "type": "data",
        "unit": "hours",
        "dimensions": [
          "time"
        ],
        "description": "Lead time (since forecast_reference_time)"
      },
      "rotated_pole": {
        "type": "data",
        "unit": "",
        "dimensions": [
          ""
        ],
        "description": ""
      },
      "TT_model_levels": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "level",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "UU_model_levels": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "level",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "VV_model_levels": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "level",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "TT_pressure_levels": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "pres",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "UU_pressure_levels": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "pres",
          "rlat",
          "rlon"
        ],
        "description": ""
      },
      "VV_pressure_levels": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "time",
          "pres",
          "rlat",
          "rlon"
        ],
        "description": ""
      }
    },
    "start_datetime": "2024-01-01T18:00:00Z",
    "cube:dimensions": {
      "pres": {
        "axis": "z",
        "type": "spatial",
        "extent": [null, null],
        "description": "air_pressure"
      },
      "rlat": {
        "axis": "y",
        "type": "spatial",
        "extent": [-7.74911022186279, 89.9657287597656],
        "description": "projection_y_coordinate"
      },
      "rlon": {
        "axis": "x",
        "type": "spatial",
        "extent": [0.000289999996311962, 359.99951171875],
        "description": "projection_x_coordinate"
      }
    }
  },
  "stac_version": "1.0.0",
  "stac_extensions": [
    "https://stac-extensions.github.io/datacube/v2.2.0/schema.json",
    "https://stac-extensions.github.io/cf/v0.2.0/schema.json",
    "https://stac-extensions.github.io/file/v2.1.0/schema.json"
  ]
}

Collaborator

@fmigneault It looks like it already does the right thing, no?

rlon and rlat are not in variables, but lon/lat appear as auxiliary variables.

Collaborator

There is still this one:

      "time": {
        "type": "auxiliary",
        "unit": "hours since 2024-01-01 18:00:00",
        "dimensions": [
          "time"
        ],
        "description": "Validity time"
      },

RDPS seems to use names other than rlon/rlat, so maybe it doesn't cause an actual "duplicate" match.

        {
          "name": "longitude",
          "unit": "degrees_east"
        },
        {
          "name": "latitude",
          "unit": "degrees_north"
        }

@henriaidasso
Can you provide an HRDPS sample as well?
I think they might be slightly different.

I also notice the units are not filled. Maybe something to look into.

The extent/values/step properties are not filled, so it seems an xarray call might be needed to provide the values when dimension.name == variable.name.
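A sketch of what such an xarray call could look like (the URL is the sample item's OpenDAP asset; this is not code from the PR):

import xarray as xr

ds = xr.open_dataset(
    "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/testdata/"
    "HRDPS/RDPS_sample/2024010106_012.nc"
)
for name in ds.dims:
    if name in ds.variables:  # dimension backed by a same-named coordinate variable
        values = ds[name].values.tolist()  # candidate for cube:dimensions[name]["values"]
        unit = ds[name].attrs.get("units", "")  # may require decode_cf=False for raw time units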

Collaborator

Indeed. I misread about extent.

Possible bugs in RDPS case:

(screenshots of the problematic entries omitted)

There is also a "level" dimension referenced in certain variables, but there does not seem to be a cube:dimensions entry for it (there is even a level variable referencing this nonexistent level dimension).

Collaborator Author

@fmigneault is there anything I can do on my end to expedite merging this PR?

Collaborator

For the auxiliary variables that map directly to a dimension, I am waiting for feedback on
stac-extensions/datacube#33. The issue was raised during the STAC Community meeting; feedback is expected eventually, but it is not a blocker otherwise (the current definitions are not inherently wrong).

At the very least, the errors highlighted in #114 (comment) should be fixed (i.e., handling of empty values / not injecting them).

The empty-string dimensions look to be single-value constants (reference pressure, reference time). Therefore, their reference value should be explicitly indicated with values: [<the-value>] (xarray may be needed to read them if they are not already available from the NCML).
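For instance, the reftime constant from the sample item could be emitted along these lines (a sketch; the value is taken from the sample's forecast_reference_time):

# A single-value constant expressed as an additional datacube dimension
# with an explicit values list, per the suggestion above.
reftime_dimension = {
    "type": "temporal",
    "values": ["2024-01-01T06:00:00Z"],
}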

Collaborator Author

Are the mentioned errors within my scope to fix, or should I hold off for now?

Collaborator

Within this PR, since we could flag them through these new datasets.

variables[name] = Variable(
    properties=dict(
-       dimensions=meta["shape"],
+       dimensions=[] if dimensions == [""] else dimensions,
@henriaidasso (Collaborator, Author) commented Nov 11, 2025

@fmigneault this change fixes the empty-string dimension issue in variables. Variables without dimensions now show up as below.

 "rotated_pole": {
    "type": "data",
    "unit": "",
    "dimensions": [], # instead of [""] previously 
    "description": ""
  },

For the extent attribute in dimensions, when I omit setting it for the Z axis (i.e., not putting it in the properties dict), extent still appears with the value [null, null] in the item's JSON. When I use the static method Dimension.from_dict(...) to cast the dimension object to a VerticalSpatialDimension, a missing extent throws a RequiredPropertyMissing error, which does not align with the DataCube specification, where extent is optional for a VerticalSpatialDimension. I suggest we keep the current default value [null, null] and update later.

Collaborator

Nice find. 👍
Yes, let's move forward with [null, null] in the meantime.

I logged the issue: stac-utils/pystac#1593

    for key, asset in self.item.get_assets().items()
-   if (service_type is ServiceType and service_type.value in asset.extra_fields)
+   if (isinstance(service_type, ServiceType) and service_type.value in asset.extra_fields)
+   or any(ServiceType.from_value(field, default=None) is ServiceType for field in asset.extra_fields)
Contributor

Shouldn't this be changed to isinstance as well, since from_value returns an enum instance?

Or I think you can just simplify this to:

any(ServiceType.from_value(field, default=False) for field in asset.extra_fields)

Contributor

This also applies to a few other places. For example: ItemCFExtension.get_assets, ItemCMIP6Extension.get_assets

Could all of these actually just inherit this method from the base parent class?

"""Extracts cf:parameter-like information from item_data."""
parameters = []

for _, var in self.variables.items():
Contributor

Just use .values() if you're not going to use the key. It makes the intention of the code clearer.
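A generic illustration of the difference:

variables = {"lat": {"units": "degrees_north"}, "lon": {"units": "degrees_east"}}

for _, var in variables.items():  # key unpacked but never used
    print(var["units"])

for var in variables.values():  # clearer: only the values are needed
    print(var["units"])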

name = attrs.get("standard_name") # Get the required standard name
if not name:
continue # Skip if no valid name
unit = attrs.get("units") or ""
Contributor

The CFParameter.unit is annotated as Optional[str] but here we're not accepting a None value. Do we want to make a distinction between None and the empty string? If not, then let's choose to represent nulls as either None or the empty string and make sure the values and annotations are consistent.

@henriaidasso (Collaborator, Author) commented Nov 17, 2025

Yes, there is a mismatch between the CF extension specification, where unit appears not to be required, and its validation schema, which requires unit to be a string. I will update the annotation to str and raise an issue on the CF extension.
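The adjusted annotation would look like this (a sketch of the CFParameter model as discussed, not its full definition):

from pydantic import BaseModel


class CFParameter(BaseModel):
    name: str
    unit: str = ""  # the validation schema requires a string; "" when unitless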

return {
    key: asset
    for key, asset in self.item.get_assets().items()
    if (service_type in ServiceType and service_type.value in asset.extra_fields)
Contributor

This works, but in other get_assets methods for other classes we're doing isinstance(service_type, ServiceType).

I actually like this one better because it supports both "OpenDAP" in ServiceType and ServiceType.opendap in ServiceType.

But either way, we should be consistent with how we're doing this code. Either use service_type in ServiceType or isinstance(service_type, ServiceType). But not both.

return None
res = self.session.head(self.access_urls[self.asset_key])
res.raise_for_status()
return int(res.headers.get("Content-Length", None))
Contributor

int(None) will raise a TypeError.

Contributor

Also, the default for .get is already None, so you don't have to specify it again.
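A sketch of the last two lines with both fixes applied:

size = res.headers.get("Content-Length")  # default is already None
return int(size) if size is not None else None  # int(None) would raise TypeError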

@@ -0,0 +1,11 @@
from __future__ import annotations
Contributor

not necessary

@@ -0,0 +1,13 @@
from __future__ import annotations
Contributor

not necessary


@classmethod
- def from_data(cls, data: dict) -> None:
+ def from_data(cls, data: dict, **kwargs) -> "THREDDSCatalogDataModel":
Contributor

Here's a good reason to include from __future__ import annotations

Comment on lines +146 to +147
# Cast providers to pystac objects
self._collection_info["providers"] = self.__make_collection_providers()
Contributor

Hi @henriaidasso, I'm just circling back to this. Please let me know if you'd still like to meet about this question. Since this was last discussed, #106 has been accepted.

@henriaidasso (Collaborator, Author)

@mishaschwartz Yes, still on track. This PR has already grown too large in my opinion. I propose closing it and opening a new issue to improve the providers section.

@fmigneault (Collaborator) left a comment

Just a minor fix left. I think we can merge once the last few comments are resolved, and do another PR if anything is identified later.

@fmigneault fmigneault merged commit 82ff48d into crim-ca:master Nov 18, 2025
6 checks passed