
Multi-dimensional binning/resampling/coarsening #2525

Closed · saviokay opened this issue Oct 29, 2018 · 16 comments

saviokay commented Oct 29, 2018

Code Sample, a copy-pastable example if possible

The StackOverflow question: https://stackoverflow.com/questions/52886703/xarray-multidimensional-binning-array-reduction-on-sample-dataset-of-4-x4-to/52981916#52981916

import numpy as np
import xarray as xr

# random 4x4 latitude/longitude coordinates and cloud-fraction values
a = np.array(np.random.randint(1, 90 + 1, (4, 4)), dtype=np.float64)
b = np.array(np.random.randint(1, 360 + 1, (4, 4)), dtype=np.float64)
c = np.random.random_sample((4, 4))

# 2-D (x, y) data variable with 2-D latitude/longitude coordinates
dsa = xr.Dataset()
dsa['CloudFraction'] = (('x', 'y'), c)
dsa.coords['latitude'] = (('x', 'y'), a)
dsa.coords['longitude'] = (('x', 'y'), b)
dsa


<xarray.Dataset>
Dimensions:        (x: 4, y: 4)
Coordinates:
    latitude       (x, y) float64 23.0 16.0 53.0 1.0 ... 82.0 65.0 45.0 88.0
    longitude      (x, y) float64 219.0 13.0 276.0 69.0 ... 156.0 277.0 16.0
Dimensions without coordinates: x, y
Data variables:
    CloudFraction  (x, y) float64 0.1599 0.05671 0.8624 ... 0.7757 0.7572

Problem description

I am trying to reduce an xarray Dataset from 4 x 4 to 2 x 2 along both dimensions, and I haven't had any luck with the current Dataset. These are the steps I followed. I want to bin or group on latitude and longitude simultaneously to reduce the number of steps; currently I can only use the GroupBy method, which doesn't group on both coordinates at once.

To elaborate on the idea I want to achieve:
1. Considering a 4x4 matrix, first group the element at index (0,0) with index (0,1) as A, index (0,2) with index (0,3) as B, index (1,0) with index (1,1) as C, index (1,2) with index (1,3) as D, and so on, the last combination being index (3,2) with index (3,3) as H.
2. This turns the 4x4 matrix into 4x2; now combine A with C, B with D, and so forth. The final matrix should be 2x2.
3. The elements can be combined with any aggregation function such as mean() or std(), applied over the 'latitude' and 'longitude' coordinates with respect to the data variable 'CloudFraction'.
4. Is there a way to obtain this with xarray functions and automate it for any input matrix size?

Expected Output

A 2x2 Dataset in which each element aggregates a 2x2 block of the original, following the steps above, using any aggregation function such as mean() or std().

NumPy version: v1.15.1 · xarray version: v0.10.9 · Python version: v3.7.0 · Jupyter Notebook: locally hosted
@saviokay saviokay changed the title Multidimensional Binning & Array Reduction Through Both Dimension On Sample Dataset of 4 x4 to 2 x 2 Multidimensional Binning & Array Reduction Through Both Coordinates On Sample Dataset of 4 x4 to 2 x 2 Oct 29, 2018
@fujiisoup fujiisoup changed the title Multidimensional Binning & Array Reduction Through Both Coordinates On Sample Dataset of 4 x4 to 2 x 2 Support binning? Oct 30, 2018
@fujiisoup (Member)

This is from a thread on SO.

Does anyone have an opinion on adding a bin (or rolling_bin) method to compute the binning? For the above example, we currently need to do

dsa.rolling(x=2).construct('tmp').isel(x=slice(1, None, 2)).mean('tmp')

which is a little complex.
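The same trick can be chained over both dimensions to get the full 4x4 -> 2x2 reduction. A sketch building on the one-liner above ('xw' and 'yw' are just placeholder window names; note the non-dimension latitude/longitude coordinates get subsampled rather than averaged here):

coarse = (
    dsa.rolling(x=2).construct('xw').isel(x=slice(1, None, 2)).mean('xw')
       .rolling(y=2).construct('yw').isel(y=slice(1, None, 2)).mean('yw')
)

Since a mean of means over equal-sized blocks equals the block mean, this reproduces the 2x2 block average.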

@rabernat (Contributor)

This is being discussed in #1192 under a different name.

Yes, we need this feature.

@rabernat (Contributor) commented Oct 30, 2018

FYI, I do this often in my work with this sort of function:

import numpy as np
import xarray as xr
from skimage.measure import block_reduce

def aggregate_da(da, agg_dims, suf='_agg'):
    input_core_dims = list(agg_dims)
    n_agg = len(input_core_dims)
    core_block_size = tuple([agg_dims[k] for k in input_core_dims])
    # block_reduce expects a block size for every axis;
    # the non-aggregated (leading) axes get a block size of 1
    block_size = (da.ndim - n_agg)*(1,) + core_block_size
    output_core_dims = [dim + suf for dim in input_core_dims]
    output_sizes = {(dim + suf): da.shape[da.get_axis_num(dim)]//agg_dims[dim]
                    for dim in input_core_dims}
    output_dtypes = da.dtype
    da_out = xr.apply_ufunc(block_reduce, da, kwargs={'block_size': block_size},
                            input_core_dims=[input_core_dims],
                            output_core_dims=[output_core_dims],
                            output_sizes=output_sizes,
                            output_dtypes=[output_dtypes],
                            dask='parallelized')
    # reduce the dimension coordinates with a block mean
    for dim in input_core_dims:
        new_coord = block_reduce(da[dim].data, (agg_dims[dim],), func=np.mean)
        da_out.coords[dim + suf] = (dim + suf, new_coord)
    return da_out
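A hypothetical call, assuming a DataArray with 1-D 'lat'/'lon' dimension coordinates (which the coordinate-averaging loop requires):

da = xr.DataArray(np.random.rand(4, 4), dims=('lat', 'lon'),
                  coords={'lat': np.arange(4.0), 'lon': np.arange(4.0)})
da2 = aggregate_da(da, {'lat': 2, 'lon': 2})
# da2 has dims ('lat_agg', 'lon_agg') and shape (2, 2); values are block
# sums (skimage's block_reduce defaults to func=np.sum), coords block means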

@shoyer (Member) commented Oct 30, 2018

I'm +1 for adding this feature in some form as well.

From an API perspective, should the window size be specified in terms of integers or coordinates?

  • rolling is integer based
  • resample is coordinate based

I would lean towards a coordinate-based representation since it's a little more usable/certain to be correct. It might even make sense to still call this resample, though obviously the time options would no longer apply. Also, we would almost certainly want a faster underlying implementation than what we currently use for resample().

The API for resampling to a 2x2 degree latitude/longitude grid could look something like: da.resample(lat=2, lon=2).mean()

@shoyer shoyer changed the title Support binning? Multi-dimensional binning/resampling/coarsening Oct 30, 2018
@rabernat (Contributor) commented Oct 30, 2018

I would lean towards a coordinate-based representation since it's a little more usable/certain to be correct.

I feel that this could become too complex in the case of irregularly spaced coordinates. I slightly favor the index-based approach (as in my function above), which one calls like

aggregate_da(da, {'lat': 2, 'lon': 2})

If we do that, we can just use scikit-image's block_reduce function, which is vectorized and works great with apply_ufunc.

@jbusecke (Contributor)

I agree with @rabernat and favor the index-based approach.
For regular lon-lat grids it's easy enough to implement a weighted mean (a sketch follows below), and for irregularly spaced grids and other exotic grids the coordinate-based approach might lead to errors. To me, the resample API above might suggest to some users that proper regridding (à la xESMF) onto a regular lat/lon grid is performed.
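(A minimal sketch of such a weighted mean, using a hypothetical helper weighted_block_mean and assuming 2-D data and latitude arrays of the same shape, evenly divisible by the block size:)

import numpy as np

def weighted_block_mean(data, lat, block=2):
    # cos(latitude) weights for a regular lat/lon grid
    w = np.cos(np.deg2rad(lat))
    ny, nx = data.shape
    shape = (ny // block, block, nx // block, block)
    # weighted sum per block divided by the total weight per block
    num = (data * w).reshape(shape).sum(axis=(1, 3))
    den = w.reshape(shape).sum(axis=(1, 3))
    return num / den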

block_reduce sounds good to me and seems appropriate for non-dask arrays. Does anyone have experience with how dask.coarsen compares performance-wise?

@fujiisoup (Member)

block_reduce sounds nice, but I am a little hesitant to add a soft dependency on scikit-image only for this function...
It uses the stride trick, as we do in rolling.construct. Maybe we can implement it ourselves.

@shoyer (Member) commented Oct 31, 2018 via email

@shoyer (Member) commented Nov 1, 2018

OK, so maybe da.block({'lat': 2, 'lon': 2}).mean() would be a good way to spell this, if that's not too confusing with .chunk()? Other possible method names: groupby_block, blocked?

We could call this something like coarsen() or block_reduce() with a how='mean' or maybe func=mean argument, but I like the consistency with resample/rolling/groupby.

We can save the full coordinate-based version for a later addition to .resample().

@jbusecke (Contributor) commented Nov 1, 2018

My favorite would be da.coarsen({'lat': 2, 'lon': 2}).mean(), but all the others sound reasonable to me.
Also +1 for consistency with resample/rolling/groupby.

@shoyer (Member) commented Nov 1, 2018

skimage implements block_reduce via the view_as_blocks utility function: https://github.com/scikit-image/scikit-image/blob/62e29cd89dc858d8fb9d3578034a2f456f298ed3/skimage/util/shape.py#L9-L103

But given that it doesn't actually duplicate any elements and needs a C-order array to work, I think it's actually just equivalent to using reshape + transpose, e.g., B = A.reshape(4, 1, 2, 2, 3, 2).transpose([0, 2, 4, 1, 3, 5]) reproduces skimage.util.view_as_blocks(A, (1, 2, 2)) from the docstring example.

So the super-simple version of block-reduce looks like:

import numpy as np

def block_reduce(image, block_size, func=np.sum):
    # TODO: input validation
    # TODO: consider copying padding from skimage
    # interleave (n_blocks, block) pairs: e.g. shape (4, 4) with
    # block_size (2, 2) is reshaped to (2, 2, 2, 2)
    blocked_shape = []
    for existing_size, bsize in zip(image.shape, block_size):
        blocked_shape.extend([existing_size // bsize, bsize])
    blocked = np.reshape(image, tuple(blocked_shape))
    # reduce over the odd (within-block) axes
    return func(blocked, axis=tuple(range(1, blocked.ndim, 2)))

This would work on dask arrays out of the box but it's probably worth benchmarking whether you'd get better performance doing the operation chunk-wise (e.g., with map_blocks).
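A quick sanity check of the sketch above on a 4x4 array (names purely illustrative):

a = np.arange(16.0).reshape(4, 4)
block_reduce(a, (2, 2), func=np.mean)
# array([[ 2.5,  4.5],
#        [10.5, 12.5]])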

@fujiisoup (Member)

+1 for block

What would the coordinates look like?

  1. apply func also to the coordinates
  2. always apply mean to the coordinates

@dcherian (Contributor) commented Nov 2, 2018

I like coarsen because it's a verb like resample, groupby.

@rabernat (Contributor)

What would the coordinates look like?

  1. apply func also to the coordinates
  2. always apply mean to the coordinates

If I think about my applications, I would probably always want to apply mean to dimension coordinates, but would like to be able to choose for non-dimension coordinates.

@jbusecke (Contributor) commented Nov 19, 2018 via email

@fujiisoup (Member)

Thinking about its API.
I like a rolling-like API. One option in my mind is

ds.coarsen(x=2, y=2, side='left', trim_excess=True).mean()

To apply a customized callable other than np.mean to a particular coordinate, it would probably be

ds.coarsen(x=2, y=2, side='left', trim_excess=True).mean(coordinate_apply={'surface_area': np.sum})
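Under this proposed signature (hypothetical, not a released API), the original 4x4 example would reduce as:

coarse = dsa.coarsen(x=2, y=2, side='left', trim_excess=True).mean()
# -> 2x2 CloudFraction; the latitude/longitude coordinates would be
#    block-averaged by default, per the discussion above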

@fujiisoup mentioned this issue Dec 16, 2018