
Implementing dask.array.coarsen in xarrays #1192


Closed

laliberte opened this issue Jan 4, 2017 · 19 comments

@laliberte
Contributor

Would it make sense to implement the dask.array.coarsen method on xarrays?

In some ways, coarsen is a generalization of reduce.

Any thoughts?

@laliberte laliberte changed the title "Implementing dask.array.coarsen on" to "Implementing dask.array.coarsen in xarrays" Jan 4, 2017
@shoyer
Member

shoyer commented Jan 4, 2017

This has the feel of a multi-dimensional resampling operation. I could potentially see this as part of that interface (e.g., array.resample(x=3, y=3).mean()), but that may be a slightly different operation, because resample's window size is in label rather than integer coordinates.

That said, this seems useful and I wouldn't get too hung up about the optimal interface. I would be happy with a coarsen method, though I'd prefer to have a non-dask implementation, too. Potentially we could simply copy in skimage's block_reduce function for the non-dask implementation, which, if I recall correctly, does some clever tricks with reshaping to do the calculation efficiently.
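
For concreteness, here is a minimal numpy sketch of the reshape trick that block_reduce relies on (illustrative only, not skimage's actual code):

import numpy as np

def block_reduce_sketch(arr, factors, func=np.mean):
    # Trim so each axis divides evenly by its factor, then fold each
    # axis into an (n_blocks, block_size) pair and reduce the block axes.
    trimmed = arr[tuple(slice(0, (s // f) * f)
                        for s, f in zip(arr.shape, factors))]
    new_shape = []
    for s, f in zip(trimmed.shape, factors):
        new_shape.extend([s // f, f])
    blocked = trimmed.reshape(new_shape)
    # the block axes are the odd-numbered ones: 1, 3, 5, ...
    return func(blocked, axis=tuple(range(1, blocked.ndim, 2)))

x = np.arange(24).reshape(4, 6)
print(block_reduce_sketch(x, (2, 3)))  # 2x2 array of block means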

cc @jhamman who has been thinking about regridding/resampling.

@laliberte
Contributor Author

The dask implementation has the following API: dask.array.coarsen(reduction, x, axes, trim_excess=False) so a proposed xarray API could look like: xarray.coarsen(reduction, x, axes, chunks=None, trim_excess=False), resulting in the following implementation:

  1. If the underlying data of x is a dask.array, rechunk it if chunks is given and call dask.array.coarsen(reduction, x.data, axes, trim_excess).
  2. Else, copy the block_reduce function.
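
A minimal sketch of that dispatch (the function name and behavior here are illustrative, not a settled xarray API):

import dask.array
from dask.array import chunk

def coarsen(reduction, x, axes, chunks=None, trim_excess=False):
    # Hypothetical dispatch: x is an xarray object, axes maps axis -> factor.
    data = x.data
    if isinstance(data, dask.array.Array):
        # parallel path: rechunk if requested, then use dask's coarsen
        if chunks is not None:
            data = data.rechunk(chunks)
        return dask.array.coarsen(reduction, data, axes,
                                  trim_excess=trim_excess)
    # serial path: dask's pure-numpy coarsen (or a vendored block_reduce)
    return chunk.coarsen(reduction, data, axes, trim_excess=trim_excess)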

Does that fit with the xarray API?

@jhamman
Member

jhamman commented Jan 11, 2017

I think this would be a nice feature and something that would fit nicely within xarray. The spatial resampling that I'm working towards is 1) a ways off and 2) quite a bit more domain-specific than this. I'm +1!

@PeterDSteinberg

Hello @laliberte @shoyer @jhamman. I'm with Continuum, working on NASA-funded Earth science ML (see the ensemble learning models repo on GitHub and its documentation, as well as earthio, an experimental repo we have discussed simplifying and transitioning to Xarray; see earthio issue 12 and earthio issue 13). We (Continuum NASA, dask, and datashader team members) met with @rabernat this morning and discussed ideas for collaborating better with the Xarray team. I'll comment on more issues more regularly and make some experimental PRs over the next month. I'd like to keep most of the discussion on GitHub issues so it is of general utility, but I'm happy to chat anytime if you want to talk in more detail about longer-term goals with Xarray.

We can submit a PR on this issue implementing dask's coarsen, following the specs above, with block_reduce used in some situations. We have a variety of tasks we are covering now and are in a planning/architecture phase for NASA, so if we are too slow to respond to this issue, feel free to ping me.

@mrocklin
Contributor

I would be happy with a coarsen method, though I'd prefer to have a non-dask implementation, too.

Dask has this actually. We had to build it before we could build the parallel version. See dask/array/chunk.py::coarsen
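
For example, assuming chunk.coarsen keeps its current signature (it mirrors dask.array.coarsen but operates eagerly on numpy arrays):

import numpy as np
from dask.array import chunk

x = np.arange(24).reshape(4, 6)
# mean over 2x3 blocks; axes maps axis number -> coarsening factor
result = chunk.coarsen(np.mean, x, {0: 2, 1: 3})
print(result.shape)  # (2, 2)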

@laliberte
Contributor Author

If it's part of dask, then it would be almost trivial to implement in xarray. @mrocklin, can we assume that dask/array/chunk.py::coarsen is part of the public API?

@mrocklin
Contributor

My guess is that if you want to avoid a strong dependence on Dask then you'll want to copy the code over regardless.

Historically, chunk.py hasn't been considered public (we don't publish its docstrings in the docs, for example). That being said, it hasn't moved in a long while and I don't see any reason for it to move. I'm certainly willing to commit to a lengthy deprecation cycle if it does need to move.

@laliberte
Contributor Author

The reason I ask is that, ideally, coarsen would work exactly the same with dask.array and np.ndarray data. By using both the serial and parallel coarsen methods from dask, we add a dependency but ensure forward compatibility. @shoyer, what's your preference: (1) replicate the serial coarsen in xarray, or (2) point to dask's coarsen methods?

@darothen

Not to hijack the thread, but @PeterDSteinberg, this is the first I've heard of earthio, and I think there would be a lot of interest from the broader atmospheric/oceanic sciences community in hearing what your plans are. Could your team write a blog post on the Continuum site sometime outlining the goals of the project?

@PeterDSteinberg

PeterDSteinberg commented May 31, 2017

Hi @darothen. earthio is a recent experimental refactor of what was the elm.readers subpackage. elm (Ensemble Learning Models) was developed under a Phase I NASA SBIR in 2016 and in part reflects our thinking in late 2015, when xarray was newer and we were planning the proposal. In roughly the last month we have started Phase II development of multi-model dask/xarray ML algorithms based on xarray, dask, scikit-learn, and a Bokeh maps UI for tasks like land cover classification. I'll add you to elm; feel free to contact me at psteinberg [at] continuum [dot] io. We will do more promotion/blogging in the near term, and in about 12 months we will release a free and open collection of notebooks that form a three-day "Machine Learning with Environmental Data" course.

Back to the subject matter of the thread: you can assign the issue to me (can you also add me to the xarray repo so I can assign myself things?). I'll wait to get started until @shoyer comments on @laliberte's question:

(1) replicate serial coarsen into xarray or (2) point to dask coarsen methods?

@shoyer
Member

shoyer commented Jun 1, 2017

The dask implementation is short enough that I would certainly reimplement/vendor the pure-numpy version for xarray. It might also be worth considering the related utility skimage.util.view_as_blocks, which does the necessary reshaping using views (which should be much faster).
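
A quick sketch of that approach (note that view_as_blocks requires each block dimension to divide the corresponding array dimension exactly, so any excess must be trimmed first):

import numpy as np
from skimage.util import view_as_blocks

x = np.arange(24).reshape(4, 6).astype(float)
# returns a (2, 2, 2, 3) view: a grid of blocks plus the per-block window
blocks = view_as_blocks(x, block_shape=(2, 3))
coarse = blocks.mean(axis=(-2, -1))  # reduce over the window axes
print(coarse.shape)  # (2, 2)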

@jbusecke
Contributor

Is this feature still being considered?
A big +1 from me.

I wrote my own function to achieve this (using dask.array.coarsen), but I was planning to implement similar functionality in xgcm, and it would be ideal if we could use an upstream implementation from xarray.

@rabernat
Contributor

Just to be clear, my comment above was a joke... @jbusecke and I are good friends! 🤣

@jbusecke
Contributor

I should add that I would be happy to work on an implementation, but I would probably need a good amount of pointers.

Here is the implementation I have been using (it only works with dask arrays at this point).

I should have posted it earlier to avoid @rabernat's zingers over here.

import warnings

import numpy as np
import xarray as xr
from dask.array import Array, coarsen


def aggregate(da, blocks, func=np.nanmean, debug=False):
    """
    Performs efficient block averaging in one or multiple dimensions.
    Only works on regular grid dimensions.
    Parameters
    ----------
    da : xarray DataArray (must be backed by a dask array!)
    blocks : list
        List of tuples containing the dimension and interval to aggregate over
    func : function
        Aggregation function. Defaults to numpy.nanmean
    Returns
    -------
    da_coarse : xarray DataArray
        Aggregated array
    Examples
    --------
    >>> from xarrayutils import aggregate
    >>> import numpy as np
    >>> import xarray as xr
    >>> import dask.array as da
    >>> x = np.arange(-10, 10)
    >>> y = np.arange(-10, 10)
    >>> xx, yy = np.meshgrid(x, y)
    >>> z = xx**2 - yy**2
    >>> a = xr.DataArray(da.from_array(z, chunks=(20, 20)),
    ...                  coords={'x': x, 'y': y}, dims=['y', 'x'])
    >>> print(a)
    <xarray.DataArray 'array-7e422c91624f207a5f7ebac426c01769' (y: 20, x: 20)>
    dask.array<array-7..., shape=(20, 20), dtype=int64, chunksize=(20, 20)>
    Coordinates:
      * y        (y) int64 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9
      * x        (x) int64 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9
    >>> blocks = [('x', 2), ('y', 10)]
    >>> a_coarse = aggregate(a, blocks, func=np.mean)
    >>> print(a_coarse)
    <xarray.DataArray 'array-7e422c91624f207a5f7ebac426c01769' (y: 2, x: 10)>
    dask.array<coarsen..., shape=(2, 10), dtype=float64, chunksize=(2, 10)>
    Coordinates:
      * y        (y) int64 -10 0
      * x        (x) int64 -10 -8 -6 -4 -2 0 2 4 6 8
    Attributes:
        Coarsened with: <function mean at 0x111754230>
        Coarsenblocks: [('x', 2, 0), ('y', 10, 0)]
    """
    # Check if the input is a dask array (this could be converted
    # automatically in the future)
    if not isinstance(da.data, Array):
        raise RuntimeError('DataArray data must be a dask array')
    # Check the data type of blocks
    # TODO: write a test
    if (not all(isinstance(n[0], str) for n in blocks) or
            not all(isinstance(n[1], int) for n in blocks)):

        print('blocks input', str(blocks))
        raise RuntimeError("blocks must be a list of (str, int) tuples, "
                           "e.g. [('lon', 4)]")

    # Check that the given array has the dimensions specified in blocks
    try:
        block_dict = dict((da.get_axis_num(x), y) for x, y in blocks)
    except ValueError:
        raise RuntimeError("'blocks' contains a dimension not found in 'da'")

    # Record the size of the excess in each aggregated axis
    blocks = [(a[0], a[1], da.shape[da.get_axis_num(a[0])] % a[1])
              for a in blocks]

    # for now, default to trimming the excess
    da_coarse = coarsen(func, da.data, block_dict, trim_excess=True)

    # for now, only the dimension coordinates are carried over
    new_coords = dict([])
    warnings.warn("WARNING: only dimensions are carried over as coordinates")
    for cc in list(da.dims):
        new_coords[cc] = da.coords[cc]
        for dd in blocks:
            if dd[0] in list(da.coords[cc].dims):
                # keep the first coordinate of each block, trimming the
                # excess (a slice stop of None when there is no excess)
                new_coords[cc] = new_coords[cc].isel(
                    **{dd[0]: slice(0, -dd[2] if dd[2] else None, dd[1])})

    attrs = {'Coarsened with': str(func), 'Coarsenblocks': str(blocks)}
    da_coarse = xr.DataArray(da_coarse, dims=da.dims, coords=new_coords,
                             name=da.name, attrs=attrs)
    return da_coarse

@shoyer
Member

shoyer commented Oct 30, 2018

see also #2525

@jhamman
Member

jhamman commented Feb 1, 2019

Should this have been closed by #2612?

@dcherian
Contributor

dcherian commented Feb 1, 2019

Looks like it
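
For reference, the coarsen API added in #2612 looks roughly like this (check the current xarray docs for the exact signature and defaults):

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(24).reshape(4, 6), dims=['y', 'x'])
# block-mean over 2x3 windows; boundary='trim' drops any excess
coarse = da.coarsen(y=2, x=3, boundary='trim').mean()
print(coarse.shape)  # (2, 2)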

@dcherian dcherian closed this as completed Feb 1, 2019