Multi-dimensional binning/resampling/coarsening #2525
This is from a thread at SO. Does anyone have an opinion on adding this as a feature? Currently it can be done with `dsa.rolling(x=2).construct('tmp').isel(x=slice(1, None, 2)).mean('tmp')`, which is a little complex.
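To make that workaround concrete, here is a minimal sketch on a toy 1-D array (the data here is made up for illustration): coarsening by 2 via non-overlapping rolling windows.

```python
import numpy as np
import xarray as xr

# toy data: four points along x
dsa = xr.DataArray(np.arange(4.0), dims='x', coords={'x': np.arange(4)})

# build length-2 windows along a temporary dimension, keep every second
# window so they don't overlap, then reduce within each window
coarse = dsa.rolling(x=2).construct('tmp').isel(x=slice(1, None, 2)).mean('tmp')
print(coarse.values)  # [0.5 2.5] -- the mean of each adjacent pair
```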
This is being discussed in #1192 under a different name. Yes, we need this feature.
FYI, I do this often in my work with this sort of function:

```python
import numpy as np
import xarray as xr
from skimage.measure import block_reduce

def aggregate_da(da, agg_dims, suf='_agg'):
    input_core_dims = list(agg_dims)
    n_agg = len(input_core_dims)
    core_block_size = tuple([agg_dims[k] for k in input_core_dims])
    # non-aggregated dimensions keep a block size of 1
    block_size = (da.ndim - n_agg) * (1,) + core_block_size
    output_core_dims = [dim + suf for dim in input_core_dims]
    output_sizes = {(dim + suf): da.shape[da.get_axis_num(dim)] // agg_dims[dim]
                    for dim in input_core_dims}
    output_dtypes = da.dtype
    da_out = xr.apply_ufunc(block_reduce, da, kwargs={'block_size': block_size},
                            input_core_dims=[input_core_dims],
                            output_core_dims=[output_core_dims],
                            output_sizes=output_sizes,
                            output_dtypes=[output_dtypes],
                            dask='parallelized')
    # aggregate the dimension coordinates with a mean over each block
    for dim in input_core_dims:
        new_coord = block_reduce(da[dim].data, (agg_dims[dim],), func=np.mean)
        da_out.coords[dim + suf] = (dim + suf, new_coord)
    return da_out
```
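A usage sketch, assuming the function above and a small made-up 4x4 array (skimage's `block_reduce` defaults to `func=np.sum`, so each output cell is a block sum):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(16.0).reshape(4, 4),
                  dims=('lat', 'lon'),
                  coords={'lat': np.arange(4.0), 'lon': np.arange(4.0)})

coarse = aggregate_da(da, {'lat': 2, 'lon': 2})
print(coarse.values)                    # sums of each 2x2 block
print(coarse.coords['lat_agg'].values)  # [0.5 2.5] -- block-mean coordinates
```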
I'm +1 for adding this feature in some form as well. From an API perspective, should the window size be specified in terms of integers or coordinates?
I would lean towards a coordinate-based representation, since it's a little more usable/certain to be correct. It might even make sense to still call this `resample`. The API for resampling to a 2x2 degree latitude/longitude grid could look something like the sketch below.
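The concrete example was lost in this copy of the thread; a plausible reconstruction of the kind of call being described (the keyword meanings here are assumptions, with window widths given in coordinate units rather than element counts):

```python
# hypothetical coordinate-based API: 2-degree windows in each direction
ds_2x2 = ds.resample(lat=2.0, lon=2.0).mean()
```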
I feel that this could become too complex in the case of irregularly spaced coordinates. I slightly favor the index-based approach (as in my function above), which one calls like `aggregate_da(da, {'lat': 2, 'lon': 2})`. If we do that, we can just use scikit-image's `block_reduce`.
I agree with @rabernat and favor the index-based approach. `block_reduce` sounds good to me and seems appropriate for non-dask arrays. Does anyone have experience with how `dask.coarsen` compares performance-wise?
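For reference, the dask counterpart is `dask.array.coarsen`, which takes the reduction function plus a mapping from axis number to block size; a minimal sketch on made-up data:

```python
import dask.array as da
import numpy as np

x = da.from_array(np.arange(16.0).reshape(4, 4), chunks=2)
# reduce each 2x2 block with np.mean; chunk sizes must be divisible
# by the block size along each axis
y = da.coarsen(np.mean, x, {0: 2, 1: 2})
print(y.compute())  # [[ 2.5  4.5] [10.5 12.5]]
```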
`block_reduce` from skimage is indeed a small function using strides/reshape, if I remember correctly. We should certainly copy or implement it ourselves rather than adding a skimage dependency.

On Wed, Oct 31, 2018 at 12:36 AM Keisuke Fujii wrote:

> `block_reduce` sounds nice, but I am a little hesitant to add a soft dependency on scikit-image only for this function... It is using the stride trick, as we are doing in `rolling.construct`. Maybe we can implement it ourselves.
OK, so maybe we implement it ourselves. We could call this something like `coarsen()`. We can save the full coordinate-based version for a later addition to `resample()`.
My favorite would be `coarsen`.
skimage implements `block_reduce` via the `view_as_blocks` utility. But given that it doesn't actually duplicate any elements and needs a C-order array to work, I think it's actually just equivalent to using `reshape`. So the super-simple version of block_reduce looks like:

```python
import numpy as np

def block_reduce(image, block_size, func=np.sum):
    # TODO: input validation
    # TODO: consider copying padding from skimage
    blocked_shape = []
    for existing_size, bsize in zip(image.shape, block_size):
        # split each axis of length n into a pair of axes (n // bsize, bsize)
        blocked_shape.extend([existing_size // bsize, bsize])
    blocked = np.reshape(image, tuple(blocked_shape))
    # reduce over the within-block axes (every second axis)
    return func(blocked, axis=tuple(range(1, blocked.ndim, 2)))
```

This would work on dask arrays out of the box, but it's probably worth benchmarking whether you'd get better performance doing the operation chunk-wise (e.g., with `dask.array.map_blocks`).
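A quick usage sketch of this simple `block_reduce` on made-up data:

```python
import numpy as np

x = np.arange(16.0).reshape(4, 4)
print(block_reduce(x, (2, 2), func=np.mean))
# [[ 2.5  4.5]
#  [10.5 12.5]]
```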
+1 for `coarsen`! What would the coordinates look like?
- apply func also for coordinate
- always apply mean to coordinate
I like `coarsen`.
If I think about my applications, I would probably always want to apply `mean` to dimension coordinates, but would like to be able to choose for non-dimension coordinates.
I think `mean` would be a good default (thinking about cell-center dimensions like longitude and latitude), but I would very much like it if other functions could be specified, e.g. for grid-face dimensions (where `min` and `max` would be more appropriate) and other coordinates like surface area (where `sum` would be the most appropriate function).
Thinking about its API:

```python
ds.coarsen(x=2, y=2, side='left', trim_excess=True).mean()
```

To apply a customized callable other than `mean` to particular coordinates, something like:

```python
ds.coarsen(x=2, y=2, side='left', trim_excess=True).mean(
    coordinate_apply={'surface_area': np.sum})
```
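As a runnable counterpart to the proposal above: the `coarsen` method that eventually shipped in xarray spells the trimming option `boundary='trim'` and per-coordinate functions `coord_func`, so a working sketch against current releases looks like this (data made up for illustration):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {'cloud_fraction': (('y', 'x'), np.arange(16.0).reshape(4, 4))},
    coords={'x': np.arange(4.0), 'y': np.arange(4.0)},
)
# average 2x2 blocks; dimension coordinates are averaged by default
coarse = ds.coarsen(x=2, y=2, boundary='trim').mean()
print(coarse['cloud_fraction'].shape)  # (2, 2)
```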
Code Sample, a copy-pastable example if possible
The StackOverflow question: https://stackoverflow.com/questions/52886703/xarray-multidimensional-binning-array-reduction-on-sample-dataset-of-4-x4-to/52981916#52981916
Problem description
I am trying to reduce an xarray Dataset from 4x4 to 2x2 along both dimensions; I haven't had any luck with the current API. These are the steps I followed. I want to bin or group by latitude and longitude simultaneously to reduce the number of steps; currently I can achieve this only with the GroupBy method, which doesn't seem to group over both coordinates at once.
To elaborate the idea I want to achieve:
1. Considering a 4x4 matrix, first group the element at index (0,0) with (0,1) as A, (0,2) with (0,3) as B, (1,0) with (1,1) as C, (1,2) with (1,3) as D, and so on; the last pair is (3,2) with (3,3), as H.
2. This turns the 4x4 matrix into 4x2; now combine A with C, B with D, and so on, so that the final matrix is 2x2.
3. The elements can be combined with any aggregation function, such as mean() or std(), applied over the 'Latitude' and 'Longitude' coordinates of the data variable 'Cloud Fraction'.
4. Is there a way to obtain this with xarray functions and automate it for any input matrix size? (One possible approach is sketched below.)
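One way to express steps 1-2 with plain xarray operations is to bin each coordinate in turn with `groupby_bins`; a sketch on made-up data (coordinate names chosen to match the description above):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(16.0).reshape(4, 4),
    dims=('Latitude', 'Longitude'),
    coords={'Latitude': np.arange(4.0), 'Longitude': np.arange(4.0)},
    name='Cloud Fraction',
)

# two equal-width bins per coordinate pair up adjacent rows/columns
da_2x2 = (da.groupby_bins('Longitude', bins=2).mean(dim='Longitude')
            .groupby_bins('Latitude', bins=2).mean(dim='Latitude'))
print(da_2x2.values)
```

Note this bins by coordinate values rather than by position, so it matches the blockwise grouping only for evenly spaced coordinates.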
Expected Output
The 2x2 matrix obtained by applying the aggregation described above to the 4x4 input.