Skip to content

Allow passing region to GMTBackendEntrypoint.open_dataset #3932

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
May 16, 2025

Conversation

weiji14
Copy link
Member

@weiji14 weiji14 commented Apr 29, 2025

Description of proposed changes

Support passing in a region as a Sequence [xmin, xmax, ymin, ymax] or ISO country code to xarray.open_dataset when using engine="gmt".

Usage:

import numpy.testing as npt
import xarray as xr

da = xr.open_dataarray("@static_earth_relief.nc", engine="gmt", raster_kind="grid", region=[-52, -48, -18, -12])
assert da.sizes == {"lat": 6, "lon": 4}
npt.assert_allclose(da.lat, [-17.5, -16.5, -15.5, -14.5, -13.5, -12.5])
npt.assert_allclose(da.lon, [-51.5, -50.5, -49.5, -48.5])

This PR also refactors the internals of _load_remote_dataset to use xr.load_dataarray(engine="gmt", ...) instead of low-level calls to clib functions.

Extends #3919, adapted from #3673

Preview:

Reminders

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst.
  • Write detailed docstrings for all functions/methods.
  • If wrapping a new module, open a 'Wrap new GMT module' issue and submit reasonably-sized PRs.
  • If adding new functionality, add an example to docstrings or tutorials.

Slash Commands

You can write slash commands (/command) in the first line of a comment to perform
specific operations. Supported slash command is:

  • /format: automatically format and lint the code

Support passing in a region as a Sequence [xmin, xmax, ymin, ymax] or ISO country code to `xarray.open_dataset` when using `engine="gmt"`.
@weiji14 weiji14 added the enhancement Improving an existing feature label Apr 29, 2025
@weiji14 weiji14 added this to the 0.16.0 milestone Apr 29, 2025
@weiji14 weiji14 self-assigned this Apr 29, 2025
Remove duplicated code calling GMT read, since `xr.load_dataarray(engine="gmt")` now works with region argument.
Ensure that passing engine='gmt' to xarray.open_dataarray works for opening GeoTIFF
images.
Ensure that passing engine='gmt' to xarray.open_dataarray works to open a GeoTIFF
image.
"""
with xr.open_dataarray("@earth_day_01d", engine="gmt", raster_kind="image") as da:
assert da.sizes == {"band": 3, "y": 180, "x": 360}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coordinate names are y/x when region=None, but lat/lon when region is not None at L90 below. Need to fix this inconsistency.

Comment on lines 147 to 148
# The source grid file is undefined for tiled grids.
assert grid.encoding.get("source") is None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep grid.encoding["source"] as undefined/None for tiled grids (xref #3673 (comment))? Or select the first tile (e.g. S90E000.earth_relief_05m_p.nc)? May need to update this test depending on what we decide.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or select the first tile (e.g. S90E000.earth_relief_05m_p.nc)?

Sounds good.

Comment on lines 120 to 123
source: str | list = which(fname=filename_or_obj)
raster.encoding["source"] = (
source[0] if isinstance(source, list) else source
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need the _ = raster.gmt line at L124 to load GMTDataArray accessor info, since lib.virtualfile_to_raster already calls self.read_virtualfile(vfname, kind=kind).contents.to_xarray() which sets the registration and gtype based on the header?

# Set GMT accessors.
# Must put at the end, otherwise info gets lost after certain grid operations.
grid.gmt.registration = header.registration
grid.gmt.gtype = header.gtype
return grid

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a good point. please try remove it and see if everything works fine.

Copy link
Member Author

@weiji14 weiji14 Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one extra test failure at https://github.com/GenericMappingTools/pygmt/actions/runs/14743641047/job/41386753150#step:10:1049:

_____________ [doctest] pygmt.xarray.accessor.GMTDataArrayAccessor _____________
026     Examples
027     --------
028     For GMT's built-in remote datasets, these GMT-specific properties are automatically
029     determined and you can access them as follows:
030 
031     >>> from pygmt.datasets import load_earth_relief
032     >>> # Use the global Earth relief grid with 1 degree spacing
033     >>> grid = load_earth_relief(resolution="01d", registration="pixel")
034     >>> # See if grid uses Gridline or Pixel registration
035     >>> grid.gmt.registration
Expected:
    <GridRegistration.PIXEL: 1>
Got:
    1

I think this might be because the _GMT_GRID_HEADER.registration property returns an int instead of an enum?

# Grid registration, 0 for gridline and 1 for pixel
("registration", ctp.c_uint32),

and we overrode grid.gmt.registration with 1 instead of <GridRegistration.PIXEL: 1>. Should be a quick fix we can do in a separate PR. Edit: no, the GMTDataArrayAccessor registration property should always return an enum, not an int, something else in this PR seems to be affecting this doctest...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making the following changes should fix the issue:

with contextlib.suppress(ValueError):
self._registration, self._gtype = map( # type: ignore[assignment]
int, grdinfo(_source, per_column="n").split()[-2:]
)

        if (_source := self._obj.encoding.get("source")) and Path(_source).exists():
            with contextlib.suppress(ValueError):                                      
                registration, gtype = map(  # type: ignore[assignment]
                    int, grdinfo(_source, per_column="n").split()[-2:]
                )                                                                     
            self._registration = GridRegistration(registration)           
            self._gtype = GridType(gtype)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in commit 07b2802, and also we can remove the # type: ignore[assignment] mypy skip 🎉

Copy link
Member

@seisman seisman May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should declare it an uncaught bug of the previous implementation. In our tests, we have checks like

assert grid.gmt.registration == GridRegistration.GRIDLINE

It's not enough, since the assertion is true when grid.gmt.registration is GridRegistration.GRIDLINE or 0. I think we should improve the existing tests and ensure that .registration and .gtype are in enums, not int, and this should be done in a separate PR.

The test below shows the current, inconsistent behavior:

>>> from pygmt.datasets import load_earth_relief
>>> grid = load_earth_relief(resolution="01d", registration="pixel")
>>> grid.gmt.registration
<GridRegistration.PIXEL: 1>
>>> type(grid.gmt.registration)
<enum 'GridRegistration'>

>>> grid2 = grid[0:10, 0:10]
>>> grid2.gmt.registration
1
>>> type(grid2.gmt.registration)
int

Copy link
Member Author

@weiji14 weiji14 May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like enum comparison should be done using is instead of == according to https://docs.python.org/3/howto/enum.html#comparisons (see also https://stackoverflow.com/questions/25858497/should-enum-instances-be-compared-by-identity-or-equality).

assert grid.gmt.registration is enums.GridRegistration.PIXEL # OK

assert 1 is enums.GridRegistration.PIXEL # AssertionError<>:1: SyntaxWarning: "is" with a literal. Did you mean "=="?

Wish there was a ruff lint for this (xref astral-sh/ruff#11617), but until then, will need to fix it manually.

Edit: PR at #3942

GMTDataArrayAccessor info should already be loaded by calling`virtualfile_to_raster` which calls `self.read_virtualfile(vfname, kind=kind).contents.to_xarray()` that sets registration and gtype from the header.
url = "https://pygmt.org/dev/api/generated/pygmt.GMTBackendEntrypoint.html"

@kwargs_to_strings(region="sequence")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been think if we should avoid using the @kwargs_to_strings decorator in new functions/methods, and instead write a new function like seqjoin which does exactly the same thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably best to open a separate issue/PR for this.

raster.encoding["source"] = (
source[0] if isinstance(source, list) else source
)
_ = raster.gmt # Load GMTDataArray accessor information
return raster.to_dataset()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it's likely that the accessor information will be lost when converting via to_dataset.

@@ -581,22 +579,9 @@ def _load_remote_dataset(
raise GMTInvalidInput(msg)

fname = f"@{prefix}_{resolution}_{reg}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a lot of error messages like:

Error: h [ERROR]: Tile @S90W180.earth_age_01m_g.nc not found!
Error: h [ERROR]: Tile @S90W150.earth_age_01m_g.nc not found!
Error: h [ERROR]: Tile @S90W120.earth_age_01m_g.nc not found!
Error: h [ERROR]: Tile @S90W090.earth_age_01m_g.nc not found!
Error: h [ERROR]: Tile @S90W060.earth_age_01m_g.nc not found!
Error: h [ERROR]: Tile @S90W030.earth_age_01m_g.nc not found!
Error: h [ERROR]: Tile @S90E000.earth_age_01m_g.nc not found!
Error: h [ERROR]: Tile @S90E030.earth_age_01m_g.nc not found!

This is because, in the GMT backend, we use something like which("@earth_age_01m_g") to get the file path, which doesn't work well for tiled grids.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we used to do this:

    # Full path to the grid if not tiled grids.
    source = which(fname, download="a") if not resinfo.tiled else None
    # Manually add source to xarray.DataArray encoding to make the GMT accessors work.
    if source:
        grid.encoding["source"] = source

i.e. only add the source for non-tiled grids, so that the accessor's which call doesn't report this error. I'm thinking if it's possible to either 1) silence the which call (does verbose="q" work?), or 2) add some heuristic/logic to determine whether the source is a tiled grid before calling which in GMTBackendEntrypoint

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking if it's possible to either 1) silence the which call (does verbose="q" work?), or 2) add some heuristic/logic to determine whether the source is a tiled grid before calling which in GMTBackendEntrypoint

I think either works. Perhaps verbose="q" is easier?

Copy link
Member Author

@weiji14 weiji14 May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in commit 5557b33.

Edit: Also just realized that verbose="q" was suggested before in #524 (comment).

weiji14 added 3 commits May 9, 2025 15:32
Slicing a tiled grid retains the original source still, but doing math operations like addition cause source to be lost and fallback to default registration/gtype.
Comment on lines 153 to 158
# For a sliced grid, ensure we don't fallback to the default registration (gridline)
# and gtype (cartesian), because the source grid file should still exist.
sliced_grid = grid[1:3, 1:3]
assert sliced_grid.gmt.registration is GridRegistration.GRIDLINE
assert sliced_grid.gmt.gtype is GridType.CARTESIAN

# Still possible to manually set registration and gtype.
sliced_grid.gmt.registration = GridRegistration.PIXEL
sliced_grid.gmt.gtype = GridType.GEOGRAPHIC
assert sliced_grid.encoding["source"].endswith("S90E000.earth_relief_05m_p.nc")
assert sliced_grid.gmt.registration is GridRegistration.PIXEL
assert sliced_grid.gmt.gtype is GridType.GEOGRAPHIC
Copy link
Member Author

@weiji14 weiji14 May 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to triple check this part. Do we preserve the source encoding now even after slicing?!! I.e. is #524 fixed? Or is there just some caching going on.

However, doing math operations (e.g. addition (+)) still removes the source.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we preserve the source encoding now even after slicing?!!

I've tried both xarray v2025.03 and v0.15 (in 2020). It seems the source encoding is always kept.

In [1]: import xarray as xr

In [2]: grid = xr.load_dataarray("earth_relief_01d_g.grd")

In [3]: grid.encoding
Out[3]: 
{'zlib': True,
 'szip': False,
 'zstd': False,
 'bzip2': False,
 'blosc': False,
 'shuffle': True,
 'complevel': 9,
 'fletcher32': False,
 'contiguous': False,
 'chunksizes': (181, 181),
 'source': '/home/seisman/.gmt/server/earth/earth_relief/earth_relief_01d_g.grd',
 'original_shape': (181, 361),
 'dtype': dtype('int16'),
 '_FillValue': -32768,
 'scale_factor': 0.5}

In [4]: grid2 = grid[0:10, 0:10]

In [5]: grid2.encoding
Out[5]: 
{'zlib': True,
 'szip': False,
 'zstd': False,
 'bzip2': False,
 'blosc': False,
 'shuffle': True,
 'complevel': 9,
 'fletcher32': False,
 'contiguous': False,
 'chunksizes': (181, 181),
 'source': '/home/seisman/.gmt/server/earth/earth_relief/earth_relief_01d_g.grd',
 'original_shape': (181, 361),
 'dtype': dtype('int16'),
 '_FillValue': -32768,
 'scale_factor': 0.5}

I.e. is #524 fixed?

I think #524 is a different issue. The main point in #524 is that, for slice operations, the accessor information is lost, so we need to call grdinfo on the source grid to determine gtype and registration. In previous versions (before GMT_GRID was implemented), we use temporary files a lot, so the source file may not exists. I guess we need to revisit #524 and see if it can be closed.

Try a different tile to see if it passes on CI
Comment on lines 153 to 158
# For a sliced grid, ensure we don't fallback to the default registration (gridline)
# and gtype (cartesian), because the source grid file should still exist.
sliced_grid = grid[1:3, 1:3]
assert sliced_grid.gmt.registration is GridRegistration.GRIDLINE
assert sliced_grid.gmt.gtype is GridType.CARTESIAN

# Still possible to manually set registration and gtype.
sliced_grid.gmt.registration = GridRegistration.PIXEL
sliced_grid.gmt.gtype = GridType.GEOGRAPHIC
assert sliced_grid.encoding["source"].endswith("S90E000.earth_relief_05m_p.nc")
assert sliced_grid.gmt.registration is GridRegistration.PIXEL
assert sliced_grid.gmt.gtype is GridType.GEOGRAPHIC
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we preserve the source encoding now even after slicing?!!

I've tried both xarray v2025.03 and v0.15 (in 2020). It seems the source encoding is always kept.

In [1]: import xarray as xr

In [2]: grid = xr.load_dataarray("earth_relief_01d_g.grd")

In [3]: grid.encoding
Out[3]: 
{'zlib': True,
 'szip': False,
 'zstd': False,
 'bzip2': False,
 'blosc': False,
 'shuffle': True,
 'complevel': 9,
 'fletcher32': False,
 'contiguous': False,
 'chunksizes': (181, 181),
 'source': '/home/seisman/.gmt/server/earth/earth_relief/earth_relief_01d_g.grd',
 'original_shape': (181, 361),
 'dtype': dtype('int16'),
 '_FillValue': -32768,
 'scale_factor': 0.5}

In [4]: grid2 = grid[0:10, 0:10]

In [5]: grid2.encoding
Out[5]: 
{'zlib': True,
 'szip': False,
 'zstd': False,
 'bzip2': False,
 'blosc': False,
 'shuffle': True,
 'complevel': 9,
 'fletcher32': False,
 'contiguous': False,
 'chunksizes': (181, 181),
 'source': '/home/seisman/.gmt/server/earth/earth_relief/earth_relief_01d_g.grd',
 'original_shape': (181, 361),
 'dtype': dtype('int16'),
 '_FillValue': -32768,
 'scale_factor': 0.5}

I.e. is #524 fixed?

I think #524 is a different issue. The main point in #524 is that, for slice operations, the accessor information is lost, so we need to call grdinfo on the source grid to determine gtype and registration. In previous versions (before GMT_GRID was implemented), we use temporary files a lot, so the source file may not exists. I guess we need to revisit #524 and see if it can be closed.

@seisman
Copy link
Member

seisman commented May 10, 2025

Should we keep grid.encoding["source"] as undefined/None for tiled grids (xref #3673 (comment))? Or select the first tile (e.g. S90E000.earth_relief_05m_p.nc)? May need to update this test depending on what we decide.

@weiji14 Perhaps we should cherry-pick commits related to these changes into a separate PR, with a title like "Store the first tile as source encoding for tiled grids" and linking to issue #524.

@seisman seisman added the needs review This PR has higher priority and needs review. label May 12, 2025
weiji14 added a commit that referenced this pull request May 14, 2025
So that GMT accessor info works with tiled grids too. Adapted from #3932.
@weiji14 weiji14 force-pushed the gmtbackendentrypoint/region branch from 573909c to f8af963 Compare May 14, 2025 20:47
Copy link
Member

@seisman seisman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@weiji14 weiji14 added final review call This PR requires final review and approval from a second reviewer and removed needs review This PR has higher priority and needs review. labels May 15, 2025
@weiji14 weiji14 marked this pull request as ready for review May 15, 2025 01:24
@weiji14 weiji14 removed the final review call This PR requires final review and approval from a second reviewer label May 16, 2025
@weiji14 weiji14 merged commit a917dad into main May 16, 2025
29 of 30 checks passed
@weiji14 weiji14 deleted the gmtbackendentrypoint/region branch May 16, 2025 02:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improving an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants