RDPS implementation with additional extensions #114
STACpopulator/extensions/rdps.py
Outdated
@@ -0,0 +1,326 @@
"""RPDS Extension Module."""
I don't understand why this is so much more complicated than the cordex extension.
Indeed.
This replicates a lot of things that other helpers or the main populator already handle, like extracting the THREDDS NCML details and defining the STAC core properties.
Unless RDPS/HRDPS have specific fields only applicable to them, there might not even be a need for a specific helper for them. Helpers are mostly for mapping STAC extensions to pydantic models. If all available helpers are sufficient, they should be used directly by the populator implementations rather than adding an intermediate helper.
Could you provide a bit more context on the choices you made?
mishaschwartz
left a comment
I agree with @huard here. I'm not sure exactly how this best fits in with the rest of stac-populator and I think some additional comments/doc-strings/documentation would help explain your motivations.
Please see the comments below for specific questions as well.
pyproject.toml
Outdated
"pyessv~=0.9",
"requests~=2.32",
"lxml~=6.0",
"dataclasses-json~=0.6.7"
What is the advantage to using dataclasses-json as opposed to using pydantic to define and serialize data classes like we're already doing elsewhere?
Indeed, I don't see the advantage. Simply use pydantic with model_dump/model_dump_json as needed. We don't need yet another method to deal with the classes (the pydantic/pystac approaches already need a lot of workarounds).
STACpopulator/extensions/cf.py
Outdated
return {
    key: asset
    for key, asset in self.item.get_assets().items()
    if (service_type is ServiceType and service_type.value in asset.extra_fields)
Won't this always be False since service_type is an enum value and ServiceType is the class?
You're right. Copied it from the cmip6.py extension module and didn't pay much attention. But yeah, isinstance() should be used here.
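To make the agreed fix concrete, here is a minimal sketch of the corrected asset filter, assuming a ServiceType enum and assets exposing extra_fields (the enum members and Asset class below are simplified stand-ins for the real ones in the PR):

```python
from enum import Enum


class ServiceType(Enum):
    # Hypothetical subset of the real enum, for illustration only.
    httpserver = "HTTPServer"
    opendap = "OpenDAP"


class Asset:
    def __init__(self, extra_fields):
        self.extra_fields = extra_fields


def filter_assets(assets, service_type):
    # `service_type is ServiceType` compares an enum member to the class
    # itself and is always False; isinstance() performs the intended check.
    return {
        key: asset
        for key, asset in assets.items()
        if isinstance(service_type, ServiceType)
        and service_type.value in asset.extra_fields
    }


assets = {
    "a": Asset({"HTTPServer": "..."}),
    "b": Asset({"OpenDAP": "..."}),
}
filtered = filter_assets(assets, ServiceType.httpserver)
```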
STACpopulator/extensions/contact.py
Outdated
import pystac
from dataclasses_json import LetterCase, config, dataclass_json

SCHEMA_URI = "https://stac-extensions.github.io/contacts/v0.1.1/schema.json"
Is this used anywhere?
Also if we already have the json schema for this extension is there a reason why we're redefining all the various fields in the dataclasses as well?
I thought the dataclasses were meant to extend the json schemas, not to replicate the same information.
Ah! Didn't know that :) Makes sense.
The schema reference must be included in the Collection/Item stac_extensions, otherwise it won't be effective (nor validated during the POST API request).
Normally, the extension definitions in https://github.com/stac-utils/pystac/tree/main/pystac/extensions auto-apply this reference when .ext() is invoked on a given pystac.STACObject.
pystac's approach is to extend the object with additional properties and apply any such references within the extension (all self-contained). This avoids having (or forgetting) to manually add item["stac_extensions"].append(SCHEMA_URI). Also, it avoids importing these references for every possible schema to apply. The drawback is that their "extension" classes are not using latest python methods (dataclasses/pydantic models), hence why we have the "helpers" here to do the mapping between the 2 strategies.
As we can see in this PR, the classes below are explicitly imported individually in STACpopulator/populator_base.py and require various updates to it. This does not scale well to future extensions. They should all be inserted via the "helper" mechanism, and only this "helper" should deal with the mapping of any relevant properties coming from the config, their location in the collection/items, and the insertion of the schema URI.
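For reference, the manual bookkeeping that pystac's .ext() automates amounts to inserting the schema URI into stac_extensions exactly once. A minimal sketch (a plain dict stands in for a pystac Item; the URI is the one quoted in the diff above):

```python
SCHEMA_URI = "https://stac-extensions.github.io/contacts/v0.1.1/schema.json"


def apply_extension(item: dict, schema_uri: str = SCHEMA_URI) -> dict:
    # Append the schema URI only if absent, so repeated application is
    # idempotent (similar in spirit to pystac's add_if_missing=True).
    exts = item.setdefault("stac_extensions", [])
    if schema_uri not in exts:
        exts.append(schema_uri)
    return item


item = apply_extension(apply_extension({"properties": {}}))
```

Applying it twice leaves a single entry, which is exactly the forgetting-to-append hazard the self-contained extension approach avoids.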
.gitignore
Outdated
pyessv-archive/

# Temp files downloaded
temp_files/
Is this a new location where we're storing temporary files on disk? Why not use the user's temp directory, or their cache directory if you want these files to persist longer?
STACpopulator/extensions/file.py
Outdated
self.href = href
self.download_dir = Path(download_dir)
self.download_dir.mkdir(exist_ok=True)
self.local_path = self.download_dir / Path(href).name
Can you explain why downloading a file locally is useful here?
Is this for accessing netcdf files that are not being served by thredds?
If that's the case can we not just download the metadata in the header instead of the whole file since that file could be huge in some cases.
And is there any code that cleans up these files when done?
I needed information on the file (e.g., size in bytes) that is not available in the NCML metadata. I mentioned this during our last meeting.
It would be more efficient to have the info directly from the metadata instead. But maybe I am not looking in the right location?
And yes a cleanup is being implemented.
Ah ok that makes sense.
Just a suggestion... you could read the metadata into memory and then count the rest of the bytes in the download stream, calculate the checksum, check the byte order, etc. without actually saving the file to disk.
That way you'd still get all the same file information and the metadata and you wouldn't have to actually write the file and keep it on disk.
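As an illustration of this in-memory tally (not an endorsement for very large files), the stream can be consumed in chunks while accumulating the size and a checksum, without ever writing to disk. io.BytesIO stands in for an HTTP response stream here, and the chunk size is arbitrary:

```python
import hashlib
import io


def stream_stats(stream, chunk_size=8192):
    # Consume the stream chunk by chunk, tallying total size and a digest,
    # so nothing is persisted to disk.
    size = 0
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        size += len(chunk)
        digest.update(chunk)
    return size, digest.hexdigest()


size, checksum = stream_stats(io.BytesIO(b"\x00" * 10_000))
```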
Wait, is there a suggestion to stream the entire content of the server to create the catalog? If yes, this is a big no-no for me. We have hundreds of terabytes of data. Streaming those when creating or updating the catalog would take forever and gobble up bandwidth. The file size is available from the HTML (https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html), so certainly there is an easier way to get it.
I don't know about the checksum. I found this https://docs.unidata.ucar.edu/netcdf-java/current/javadocAll/dap4/core/util/ChecksumMode.html but I don't really understand what it does exactly.
So the ChecksumMode allows you to get the checksum for individual variables inside the file. For example:
https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dap4/birdhouse/testdata/xclim/CRCM5/tasmax_bby_198406_se.nc.dmr.xml?dap4.checksum=true#dap4
Alright, thanks! By the way, how do you get the byte order? It seems to be the endianness, but I have that only for each variable and not for the whole netCDF file.
No idea, sorry.
Definitely, using the metadata is much better. Downloads do not scale.
An HTTP HEAD request can be done to return them without the content.
❯ curl --head 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc'
HTTP/1.1 200
Server: nginx/1.23.4
Date: Fri, 10 Oct 2025 22:14:57 GMT
Content-Type: application/x-netcdf
Content-Length: 565270656
[...]
@huard
Nice find ;)
Using https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dap4/birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc.dmr.xml?dap4.checksum=true#dap4 (same file as above HEAD request), we can find _DAP4_Checksum_CRC32 entries. However, they are for each distinct variable, not the file as a whole.
Since NetCDF-4 relies on HDF5, I wonder if its checksum could be retrieved from THREDDS or another way? If not, maybe it could be encoded in an attribute somewhere?
I also found this amongst the response:
<Attribute name="_DAP4_Little_Endian" type="UInt8">
  <Value value="1" />
</Attribute>
It aligns with the https://docs.unidata.ucar.edu/netcdf-c/4.8.1/netcdf_8h.html#ab9a4b12a73eeb0ed8ee5eada6e606132 value.
While playing around in there, I also found this link (UDDC):
https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/uddc/birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc
There is notably creator_..., contributor_..., publisher_..., acknowledgment, license and keywords attributes that are available but unfilled. All of these are what the config YAML was adding manually. Is there any chance those could be set so the populator can extract them directly from the source?
STACpopulator/extensions/file.py
Outdated
"""Initialize the FileHelper object."""
self.href = href
self.download_dir = Path(download_dir)
self.download_dir.mkdir(exist_ok=True)
I would move this to the download function since the existence of the parent directory is needed to download the file.
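A sketch of that refactor, with the directory created lazily inside download() where it is actually needed (the class shape mirrors the PR's FileHelper, but the download body is elided and all names are illustrative):

```python
from pathlib import Path


class FileHelper:
    def __init__(self, href: str, download_dir: str = "temp_files"):
        self.href = href
        self.download_dir = Path(download_dir)
        # No mkdir here: constructing the helper has no filesystem effect.
        self.local_path = self.download_dir / Path(href).name

    def download(self) -> Path:
        # Create the parent directory only when a download is requested.
        self.download_dir.mkdir(parents=True, exist_ok=True)
        # ... fetch self.href into self.local_path here ...
        return self.local_path


helper = FileHelper("https://example.com/data/a.nc", "dl_tmp")
```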
STACpopulator/extensions/rdps.py
Outdated
@property
def uid(self) -> str:
    """Return a unique ID for RDPS data item."""
    item_url = self.attrs["access_urls"]["HTTPServer"]
The HTTPServer link is not guaranteed to exist. Please use the same pattern as we're using elsewhere to check for alternate access URLs in the case where one doesn't exist.
from STACpopulator.extensions.cordex6 import Cordex6DataModel, Cordex6DataModelNcML


def get_first_item_attrs(url):
Why remove these functions from the cordex code? This seems unrelated to the other changes here.
test.py
Outdated
This file is empty.
If this is supposed to contain tests, please put those tests in a separate file in the tests directory
Indeed, some tests must be added. I am not sure what the result looks like currently with RDPS/HRDPS enabled.
fmigneault
left a comment
Looks good. Just need some metadata validation with Ouranos to make sure.
assert "cf:parameter" in item["properties"]
assert set(["time", "latitude", "longitude"]).issubset(set(parameter_names))
@huard
I would like to validate the source metadata for this case.
For example, using:
https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncml/birdhouse/testdata/HRDPS/HRDPS_sample/HRDPS_P_PR_SFC/2024010100.nc
We can see both
<dimension name="rlon" length="2540" />
<dimension name="rlat" length="1290" />
<dimension name="time" length="13" isUnlimited="true" />
and
<variable name="lon" shape="rlat rlon" type="float">...</variable>
<variable name="lat" shape="rlat rlon" type="float">...</variable>
<variable name="rlon" shape="rlon" type="float">...</variable>
<variable name="rlat" shape="rlat" type="float">...</variable>
<variable name="time" shape="time" type="int">...</variable>
Why are rlon/rlat/time both a variable and a dimension? Is that valid, and what does it imply for the NetCDF? Would accessing these variables contain themselves?
Loading it with xarray, I can indeed do some questionable recursive data["rlon"]["rlon"]["rlon"] calls.
>>> data
<xarray.Dataset> Size: 197MB
Dimensions: (time: 13, rlat: 1290, rlon: 2540)
Coordinates:
* time (time) datetime64[ns] 104B 2024-01-01 ... 2024-01-01T12:0...
* rlat (rlat) float32 5kB -12.3 -12.28 -12.26 ... 16.66 16.68 16.7
* rlon (rlon) float32 10kB -14.82 -14.8 -14.78 ... 42.28 42.31
lon (rlat, rlon) float32 13MB ...
lat (rlat, rlon) float32 13MB ...
Data variables:
rotated_pole float32 4B ...
HRDPS_P_PR_SFC (time, rlat, rlon) float32 170MB ...
Attributes:
Remarks: Variable names are following the convent...
License: These data are provided by the Canadian ...
product: HRDPS
Conventions: CF-1.6
DODS_EXTRA.Unlimited_Dimension: time
>>> data["rlon"]["rlon"]["rlon"]
<xarray.DataArray 'rlon' (rlon: 2540)> Size: 10kB
array([-14.821219, -14.798713, -14.776221, ..., 42.261284, 42.283775,
42.306282], shape=(2540,), dtype=float32)
Coordinates:
* rlon (rlon) float32 10kB -14.82 -14.8 -14.78 ... 42.26 42.28 42.31
Attributes:
long_name: longitude in rotated pole grid
units: degrees
eccc_grid_definition: grtyp: E, ig1: 1430, ig2: 500, ig3: 56000, ig4: 44000
standard_name: grid_longitude
axis: X
_ChunkSizes: 2540
>>> data["rlon"]["rlon"]["rlon"][0]
<xarray.DataArray 'rlon' ()> Size: 4B
array(-14.821219, dtype=float32)
Coordinates:
rlon float32 4B -14.82
Attributes:
long_name: longitude in rotated pole grid
units: degrees
eccc_grid_definition: grtyp: E, ig1: 1430, ig2: 500, ig3: 56000, ig4: 44000
standard_name: grid_longitude
axis: X
_ChunkSizes: 2540
I would expect that cf:parameter only contains the actual cube:variables, and those 3 (rlon, rlat, time) are only in cube:dimensions.
@henriaidasso
Can you post the JSON result of one STAC Item sample?
We would like to validate the results of the produced dimensions/variables.
Here you go
{
"id": "birdhouse__testdata__HRDPS__RDPS_sample__2024010106_012.nc",
"bbox": [0.000289999996311962, -7.74911022186279, 359.99951171875, 89.9657287597656],
"type": "Feature",
"links": [
{
"rel": "collection",
"type": "application/json",
"href": "http://localhost:8081/stac/collections/RDPS_CRIM"
},
{
"rel": "parent",
"type": "application/json",
"href": "http://localhost:8081/stac/collections/RDPS_CRIM"
},
{
"rel": "root",
"type": "application/json",
"href": "http://localhost:8081/stac/"
},
{
"rel": "self",
"type": "application/geo+json",
"href": "http://localhost:8081/stac/collections/RDPS_CRIM/items/birdhouse__testdata__HRDPS__RDPS_sample__2024010106_012.nc"
},
{
"rel": "source",
"href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
"type": "application/x-netcdf",
"title": "birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc"
}
],
"assets": {
"ISO": {
"href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/iso/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
"type": "",
"roles": []
},
"WCS": {
"href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/wcs/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc?request=GetCapabilities",
"type": "application/xml",
"roles": [
"data"
]
},
"WMS": {
"href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/wms/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc?request=GetCapabilities",
"type": "application/xml",
"roles": [
"visual"
]
},
"NcML": {
"href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncml/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
"type": "",
"roles": []
},
"UDDC": {
"href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/uddc/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
"type": "",
"roles": []
},
"OpenDAP": {
"href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
"type": "text/html",
"roles": [
"data"
]
},
"HTTPServer": {
"href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
"type": "application/x-netcdf",
"roles": [
"data"
],
"file:size": 137644165,
"cf:parameter": [
{
"name": "time",
"unit": "hours since 2024-01-01 18:00:00"
},
{
"name": "air_pressure",
"unit": "hPa"
},
{
"name": "grid_latitude",
"unit": "degrees"
},
{
"name": "grid_longitude",
"unit": "degrees"
},
{
"name": "atmosphere_hybrid_sigma_ln_pressure_coordinate",
"unit": ""
},
{
"name": "forecast_period",
"unit": "hours"
},
{
"name": "forecast_reference_time",
"unit": "hours since 2024-01-01 06:00:00"
},
{
"name": "longitude",
"unit": "degrees_east"
},
{
"name": "latitude",
"unit": "degrees_north"
}
]
},
"NetcdfSubsetGrid": {
"href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncss/grid/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
"type": "application/x-netcdf",
"roles": [
"data"
]
},
"NetcdfSubsetPoint": {
"href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncss/point/birdhouse/testdata/HRDPS/RDPS_sample/2024010106_012.nc",
"type": "application/x-netcdf",
"roles": [
"data"
]
}
},
"geometry": {
"type": "Polygon",
"coordinates": [
[
[0.000289999996311962, -7.74911022186279],
[0.000289999996311962, 89.9657287597656],
[359.99951171875, 89.9657287597656],
[359.99951171875, -7.74911022186279],
[0.000289999996311962, -7.74911022186279]
]
]
},
"collection": "RDPS_CRIM",
"properties": {
"cf:parameter": [
{
"name": "time",
"unit": "hours since 2024-01-01 18:00:00"
},
{
"name": "air_pressure",
"unit": "hPa"
},
{
"name": "grid_latitude",
"unit": "degrees"
},
{
"name": "grid_longitude",
"unit": "degrees"
},
{
"name": "atmosphere_hybrid_sigma_ln_pressure_coordinate",
"unit": ""
},
{
"name": "forecast_period",
"unit": "hours"
},
{
"name": "forecast_reference_time",
"unit": "hours since 2024-01-01 06:00:00"
},
{
"name": "longitude",
"unit": "degrees_east"
},
{
"name": "latitude",
"unit": "degrees_north"
}
],
"end_datetime": "2024-01-01T18:00:00Z",
"cube:variables": {
"a": {
"type": "data",
"unit": "",
"dimensions": [
"level"
],
"description": ""
},
"b": {
"type": "data",
"unit": "",
"dimensions": [
"level"
],
"description": ""
},
"GZ": {
"type": "data",
"unit": "",
"dimensions": [
"time",
"pres",
"rlat",
"rlon"
],
"description": ""
},
"HU": {
"type": "data",
"unit": "",
"dimensions": [
"time",
"pres",
"rlat",
"rlon"
],
"description": ""
},
"PC": {
"type": "data",
"unit": "",
"dimensions": [
"time",
"rlat",
"rlon"
],
"description": ""
},
"PN": {
"type": "data",
"unit": "",
"dimensions": [
"time",
"rlat",
"rlon"
],
"description": ""
},
"PR": {
"type": "data",
"unit": "",
"dimensions": [
"time",
"rlat",
"rlon"
],
"description": ""
},
"TD": {
"type": "data",
"unit": "",
"dimensions": [
"time",
"level",
"rlat",
"rlon"
],
"description": ""
},
"lat": {
"type": "auxiliary",
"unit": "degrees_north",
"dimensions": [
"rlat",
"rlon"
],
"description": "latitude"
},
"lon": {
"type": "auxiliary",
"unit": "degrees_east",
"dimensions": [
"rlat",
"rlon"
],
"description": "longitude"
},
"pref": {
"type": "data",
"unit": "hPa",
"dimensions": [
""
],
"description": ""
},
"time": {
"type": "auxiliary",
"unit": "hours since 2024-01-01 18:00:00",
"dimensions": [
"time"
],
"description": "Validity time"
},
"level": {
"type": "auxiliary",
"unit": "",
"dimensions": [
"level"
],
"description": ""
},
"reftime": {
"type": "data",
"unit": "hours since 2024-01-01 06:00:00",
"dimensions": [
""
],
"description": ""
},
"leadtime": {
"type": "data",
"unit": "hours",
"dimensions": [
"time"
],
"description": "Lead time (since forecast_reference_time)"
},
"rotated_pole": {
"type": "data",
"unit": "",
"dimensions": [
""
],
"description": ""
},
"TT_model_levels": {
"type": "data",
"unit": "",
"dimensions": [
"time",
"level",
"rlat",
"rlon"
],
"description": ""
},
"UU_model_levels": {
"type": "data",
"unit": "",
"dimensions": [
"time",
"level",
"rlat",
"rlon"
],
"description": ""
},
"VV_model_levels": {
"type": "data",
"unit": "",
"dimensions": [
"time",
"level",
"rlat",
"rlon"
],
"description": ""
},
"TT_pressure_levels": {
"type": "data",
"unit": "",
"dimensions": [
"time",
"pres",
"rlat",
"rlon"
],
"description": ""
},
"UU_pressure_levels": {
"type": "data",
"unit": "",
"dimensions": [
"time",
"pres",
"rlat",
"rlon"
],
"description": ""
},
"VV_pressure_levels": {
"type": "data",
"unit": "",
"dimensions": [
"time",
"pres",
"rlat",
"rlon"
],
"description": ""
}
},
"start_datetime": "2024-01-01T18:00:00Z",
"cube:dimensions": {
"pres": {
"axis": "z",
"type": "spatial",
"extent": [null, null],
"description": "air_pressure"
},
"rlat": {
"axis": "y",
"type": "spatial",
"extent": [-7.74911022186279, 89.9657287597656],
"description": "projection_y_coordinate"
},
"rlon": {
"axis": "x",
"type": "spatial",
"extent": [0.000289999996311962, 359.99951171875],
"description": "projection_x_coordinate"
}
}
},
"stac_version": "1.0.0",
"stac_extensions": [
"https://stac-extensions.github.io/datacube/v2.2.0/schema.json",
"https://stac-extensions.github.io/cf/v0.2.0/schema.json",
"https://stac-extensions.github.io/file/v2.1.0/schema.json"
]
}
@fmigneault It looks like it already does the right thing, no ?
rlon and rlat are not in variables, but lon lat appear as auxiliary variables.
There is still this one:
"time": {
  "type": "auxiliary",
  "unit": "hours since 2024-01-01 18:00:00",
  "dimensions": [
    "time"
  ],
  "description": "Validity time"
},
RDPS seems to use other names than rlon/rlat, so maybe it doesn't cause an actual "duplicate" match.
{
  "name": "longitude",
  "unit": "degrees_east"
},
{
  "name": "latitude",
  "unit": "degrees_north"
}
@henriaidasso
Can you provide a HRDPS sample as well?
I think they might be slightly different.
I also notice the units are not filled. Maybe something to look into.
The extent/values/step properties are not filled, so it seems an xarray call might be needed to provide the values when dimension.name == variable.name.
@fmigneault is there anything I can do on my end to expedite merging this PR?
For the auxiliary variables that map directly to a dimension, I am waiting for feedback on
stac-extensions/datacube#33. The issue was raised during the STAC Community meeting; feedback is expected eventually, but it is not a blocker otherwise (the current definitions are not inherently wrong).
At the very least, the errors highlighted in #114 (comment) should be fixed (i.e., handling of empty values / not injecting them).
For the empty string dimensions, they look to be single-value constants (reference pressure, reference time). Therefore, their reference value should be explicitly indicated with values: [<the-value>] (maybe need xarray to read them if not already available from NCML).
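A hedged sketch of what that could look like: a single-value "dimension" (e.g. reference pressure or reference time) encoded with its constant listed explicitly under values, following the datacube extension's dimension object shape (the helper name and defaults are illustrative):

```python
def scalar_dimension(value, description="", dim_type="temporal"):
    # Build a dimension object whose single reference value is made explicit
    # via "values", instead of leaving an empty-string dimension name.
    return {
        "type": dim_type,
        "values": [value],  # the constant, e.g. a reference time
        "description": description,
    }


dim = scalar_dimension("2024-01-01T06:00:00Z", "forecast_reference_time")
```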
Are the mentioned errors to be fixed within my scope, or should I hold off for now?
Within this PR since we could flag them through these new datasets.
variables[name] = Variable(
    properties=dict(
        dimensions=meta["shape"],
        dimensions=[] if dimensions == [""] else dimensions,
@fmigneault this change fixes the empty string dimension issue in variables. Variables without dimensions now show up as below.
"rotated_pole": {
  "type": "data",
  "unit": "",
  "dimensions": [],  # instead of [""] previously
  "description": ""
},
For the extent attribute in dimensions, when I omit setting it for the Z axis (i.e., not putting it in the properties dict), extent still appears with the value [null, null] in the item's JSON. When I use the static method Dimension.from_dict(...) to cast the dimension object to a VerticalSpatialDimension, a missing extent throws a RequiredPropertyMissing error, which does not align with the DataCube specification where extent is optional for a VerticalSpatialDimension. I suggest we leave the current default value [null, null] and update later.
Nice finding. 👍
Yes, let's move with [null, null] in the meantime.
I logged the issue: stac-utils/pystac#1593
STACpopulator/extensions/base.py
Outdated
for key, asset in self.item.get_assets().items()
if (service_type is ServiceType and service_type.value in asset.extra_fields)
if (isinstance(service_type, ServiceType) and service_type.value in asset.extra_fields)
or any(ServiceType.from_value(field, default=None) is ServiceType for field in asset.extra_fields)
shouldn't this be changed to isinstance as well since from_value returns an enum instance?
Or I think you can just simplify this to:
any(ServiceType.from_value(field, default=False) for field in asset.extra_fields)
This also applies to a few other places. For example: ItemCFExtension.get_assets, ItemCMIP6Extension.get_assets
Could all of these actually just inherit this method from the base parent class?
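The simplified check suggested above can be sketched as follows, assuming ServiceType defines a from_value() lookup that returns a default when no member matches (as in the PR; the enum members here are a hypothetical subset):

```python
from enum import Enum


class ServiceType(Enum):
    httpserver = "HTTPServer"
    opendap = "OpenDAP"

    @classmethod
    def from_value(cls, value, default=None):
        try:
            return cls(value)  # lookup by member value
        except ValueError:
            return default


def has_service_field(extra_fields):
    # A matched member is truthy and the default (False) is not, so any()
    # alone replaces the broken `is ServiceType` comparison.
    return any(
        ServiceType.from_value(field, default=False) for field in extra_fields
    )
```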
STACpopulator/extensions/cf.py
Outdated
"""Extracts cf:parameter-like information from item_data."""
parameters = []

for _, var in self.variables.items():
Just use .values() if you're not going to use the key. It makes the intention of the code clearer.
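The suggestion in one line, with illustrative data:

```python
# Iterating .values() directly when the keys are unused reads clearer
# than unpacking `for _, var in d.items()`.
variables = {
    "lat": {"units": "degrees_north"},
    "lon": {"units": "degrees_east"},
}
units = [var["units"] for var in variables.values()]
```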
STACpopulator/extensions/cf.py
Outdated
name = attrs.get("standard_name")  # Get the required standard name
if not name:
    continue  # Skip if no valid name
unit = attrs.get("units") or ""
The CFParameter.unit is annotated as Optional[str] but here we're not accepting a None value. Do we want to make a distinction between None and the empty string? If not, then let's choose to represent nulls as either None or the empty string and make sure the values and annotations are consistent.
Yes, there is a mismatch between the CFExtension specification where unit appears to be not required and its validation schema requiring unit to be a string. I will update the annotation to str and raise an issue on CFExtension.
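Under that resolution (unit annotated as a plain str), the normalization is simply:

```python
def get_unit(attrs: dict) -> str:
    # The CF extension's validation schema requires unit to be a string,
    # so a missing or null "units" attribute maps to the empty string.
    return attrs.get("units") or ""
```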
STACpopulator/extensions/cf.py
Outdated
return {
    key: asset
    for key, asset in self.item.get_assets().items()
    if (service_type in ServiceType and service_type.value in asset.extra_fields)
This works, but in other get_assets methods for other classes we're doing isinstance(service_type, ServiceType).
I actually like this one better because it supports both "OpenDAP" in ServiceType and ServiceType.opendap in ServiceType.
But either way, we should be consistent with how we're doing this code. Either use service_type in ServiceType or isinstance(service_type, ServiceType). But not both.
STACpopulator/extensions/file.py
Outdated
return None
res = self.session.head(self.access_urls[self.asset_key])
res.raise_for_status()
return int(res.headers.get("Content-Length", None))
int(None) will raise an error
Also the default for .get is already None so you don't have to specify it again.
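A minimal sketch of the guarded parse, with a plain headers dict standing in for the HEAD response:

```python
def content_length(headers):
    # int(None) raises TypeError, so convert only when the header exists.
    length = headers.get("Content-Length")  # .get already defaults to None
    return int(length) if length is not None else None
```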
STACpopulator/extensions/hrdps.py
Outdated
@@ -0,0 +1,11 @@
from __future__ import annotations
not necessary
STACpopulator/extensions/rdps.py
Outdated
@@ -0,0 +1,13 @@
from __future__ import annotations
not necessary
STACpopulator/extensions/thredds.py
Outdated
@classmethod
def from_data(cls, data: dict) -> None:
def from_data(cls, data: dict, **kwargs) -> "THREDDSCatalogDataModel":
Here's a good reason to include from __future__ import annotations
# Cast providers to pystac objects
self._collection_info["providers"] = self.__make_collection_providers()
Hi @henriaidasso I'm just circling back to this. Please let me know if you'd still like to meet about this question. Since this was last discussed, #106 has been accepted.
@mishaschwartz Yes, still on track. This PR has already grown too large in my opinion. I propose closing it and opening a new issue to improve the providers section.
fmigneault
left a comment
Just minor fix left. I think we can merge it once the last few comments are resolved, and do another PR if anything is identified later.



Implementation for RDPS metadata integration. The following metadata extensions have been added.