Skip to content

apply_ufunc(dask='parallelized') output_dtypes for datasets #1699

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
crusaderky opened this issue Nov 7, 2017 · 8 comments
Open

apply_ufunc(dask='parallelized') output_dtypes for datasets #1699

crusaderky opened this issue Nov 7, 2017 · 8 comments

Comments

@crusaderky
Copy link
Contributor

crusaderky commented Nov 7, 2017

When a Dataset has variables with different dtypes, there's no way to tell apply_ufunc that the same function applied to different variables will produce different dtypes:

ds1 = xarray.Dataset(data_vars={'a': ('x', [1, 2]), 'b': ('x', [3.0, 4.5])}).chunk()
ds2 = xarray.apply_ufunc(lambda x: x + 1, ds1, dask='parallelized', output_dtypes=[float])
ds2

<xarray.Dataset>
Dimensions:  (x: 2)
Dimensions without coordinates: x
Data variables:
    a        (x) float64 dask.array<shape=(2,), chunksize=(2,)>
    b        (x) float64 dask.array<shape=(2,), chunksize=(2,)>

ds2.compute()

<xarray.Dataset>
Dimensions:  (x: 2)
Dimensions without coordinates: x
Data variables:
    a        (x) int64 2 3
    b        (x) float64 4.0 5.5

Proposed solution

When the output is a dataset, apply_ufunc could accept either output_dtypes=[t] (if all output variables will have the same dtype) or output_dtypes=[{var1: t1, var2: t2, ...}]. In the example above, it would be output_dtypes=[{'a': int, 'b': float}].

@shoyer
Copy link
Member

shoyer commented Nov 8, 2017

Yes, I like this. Though it's worth considering whether the syntax should reverse the list/dict nesting, e.g., output_dtypes={var1: [t1, ...], var2: [t2, ...], ...}.

@crusaderky
Copy link
Contributor Author

@shoyer that seems counter-intuitive for me - you are returning two datasets after all.
If we go with the list(dict) notation, we could also add a Dataset.dtype property, which (coherently with dims and chunks) would return a dict. This would be very handy as, in 99% of the times, people will want to write:

def myfunc(x):
    return apply_ufunc(numpy_kernel, x, dask='parallelized', output_dtypes=[x.dtype])

which would magically work both when x is a DataArray and when it's a Dataset

@shoyer
Copy link
Member

shoyer commented Apr 25, 2018

I'm not sure about adding Dataset.dtype. Certainly Dataset.dtypes returning a dict would make sense -- that would match how pandas defines DataFrame.dtypes.

Anyways, I agree that output_dtypes=[{var1: t1, var2: t2, ...}, ...] is most natural, because it matches the structure of the outputs.

@crusaderky
Copy link
Contributor Author

crusaderky commented May 4, 2018

The key thing is that for most people it would be extremely elegant and practical to be able to duck-type wrappers around numpy, scipy, and numba kernels that automagically work with Variable, DataArray, and Dataset (see my example above).
You'll agree on how ugly my 1-liner above would become:

def myfunc(x):
    if isinstance(x, xarray.Dataset):
        dtype = x.dtypes
    else:  # DataArray and Variable
        dtype = x.dtype
    return apply_ufunc(numpy_kernel, x, dask='parallelized', output_dtypes=[dtype])

If you don't like Dataset.dtype, then maybe we could add both Dataset.dtypes and DataArray.dtypes (which would be just an alias to DataArray.dtype)? I still like the former more though - I find it less confusing.

@shoyer
Copy link
Member

shoyer commented May 5, 2018

dtype = [getattr(x, 'dtype', getattr(x, 'dtypes'))] would be another alternative, but I agree it's ugly. The ternary expression dtype = x.dtypes if isinstance(x, xarray.Dataset) else x.dtype would also work.

I agree with the concern about duck typing, but my concern with Dataset.dtype is that there is strong convention for a dtype attribute to be an actual NumPy dtype.

Another option would be accept either objects with a dtype or dtypes in output_dtypes, like np.result_type(). Then you could write output_dtype=[x].

@stale
Copy link

stale bot commented Apr 4, 2020

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

@stale stale bot added the stale label Apr 4, 2020
@crusaderky
Copy link
Contributor Author

still relevant

@stale stale bot removed the stale label Apr 6, 2020
@jhamman
Copy link
Member

jhamman commented Apr 6, 2020

also pinging @dcherian who has been working on a similar problem set with map_blocks in #3816.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants