disallow boolean coordinates? #4892

Open
mathause opened this issue Feb 11, 2021 · 3 comments
Comments

@mathause
Collaborator

mathause commented Feb 11, 2021

Today I stumbled over a small pitfall, which I think could be avoided:

I am working with arrays that have axes labeled with categorical values and I ended up using True/False as labels for some binary categories:

import numpy
import xarray

test = xarray.DataArray(
     numpy.ones((3,2)),
     dims=["ternary","binary"],  # order must match the (3,2) shape
     coords={"ternary":[3,7,9],"binary":[False,True]}
)

Now came the big surprise when I wanted to reduce over selections of the data:

test.sel(ternary=[9,3,7]) # does exactly what I expect and gives me the correctly permuted 3x2 array
test.sel(binary=[True,False]) # does not do what I expect

Instead of using the coordinate values, as it does for the ternary category, it uses the list as a boolean mask, and hence I get a 3x1 array at the binary=False coordinate.

I assume that this behavior is reasonable in most cases, and I will for sure stop using bools as binary category labels.
That said, in the above case two conceptually identical calls produce completely different outcomes.
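To make the two readings of `[True, False]` concrete, here is a minimal sketch that computes both sets of positions by hand and passes them to `.isel`. It deliberately avoids `.sel` on the boolean coordinate, because that behavior may differ between xarray versions. The names `mask_positions` and `label_positions` are illustrative only.

```python
import numpy as np
import xarray as xr

test = xr.DataArray(
    np.ones((3, 2)),
    dims=["ternary", "binary"],
    coords={"ternary": [3, 7, 9], "binary": [False, True]},
)

key = np.array([True, False])
labels = test["binary"].values  # array([False, True])

# Reading 1: `key` is a boolean mask over positions, so only position 0 is kept.
mask_positions = np.flatnonzero(key)

# Reading 2: `key` is a list of labels, so each label is looked up in the coordinate.
label_positions = np.array([np.flatnonzero(labels == k)[0] for k in key])

as_mask = test.isel(binary=mask_positions)
as_labels = test.isel(binary=label_positions)

print(as_mask["binary"].values)    # [False]
print(as_labels["binary"].values)  # [ True False]
```

The mask reading gives the 3x1 result described above, while the label reading gives the permuted 3x2 array that matches the `ternary` case.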

My (radical) proposal would be: forbid boolean coordinates in general to avoid such confusion.

Curious about your thoughts! Hth,

Marti

Originally posted by @martinitus in #4861

@max-sixty
Collaborator

It's a good issue!

I would have thought .sel should default to using labels. I recognize that boolean indexing is really helpful — and think we should try and support it more (e.g. #1887).

And I recognize .sel delegates to .isel in some cases — e.g. where there are no indexes on a dimension.

But here, I propose we prioritize the labels above the boolean indexing in .sel. People can still use .isel if they want bool indexing in the method call syntax:

In [4]: test = xr.DataArray(
   ...:      np.ones((3,2)),
   ...:      dims=["ternary","binary"],
   ...:      coords={"ternary":[3,7,9],"binary":[False,True]}
   ...: )

In [5]: test.sel(binary=[True,False])
Out[5]:
<xarray.DataArray (ternary: 3, binary: 1)>
array([[1.],
       [1.],
       [1.]])
Coordinates:
  * ternary  (ternary) int64 3 7 9
  * binary   (binary) bool False

In [6]: test.isel(binary=[True,False])
Out[6]:
<xarray.DataArray (ternary: 3, binary: 1)>
array([[1.],
       [1.],
       [1.]])
Coordinates:
  * ternary  (ternary) int64 3 7 9
  * binary   (binary) bool False

I would also be OK removing the .sel delegation to .isel if it causes complications. IIUC the main use case is a single method call to index labels and dimensions without labels. But that's easy to replace with two method calls, I think?
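As a rough illustration of the "two method calls" idea, the sketch below uses a hypothetical array in which `x` has labels and `y` does not. Labels go through `.sel` and positions go through `.isel`, so neither call relies on any delegation between the two.

```python
import numpy as np
import xarray as xr

# Hypothetical example array: "x" carries labels, "y" has no coordinate.
da = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=["x", "y"],
    coords={"x": [10, 20, 30]},
)

# Step 1: label-based selection on the labeled dimension.
by_label = da.sel(x=[30, 10])

# Step 2: position-based selection on the unlabeled dimension.
result = by_label.isel(y=[0, 2])

print(result.values)
# [[ 8 10]
#  [ 0  2]]
```

Chaining the calls costs one extra method call but makes it explicit which dimension is addressed by label and which by position.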

@martinitus

I don't know the internals of the delegation between .sel and .isel. But from the user side, I would expect boolean indexing to naturally require .isel. I mean, I have to provide a boolean mask that fits the shape of the array, i.e. it is inherently index based and should only be used with .isel, irrespective of the coordinate types.

While that would probably be a breaking change for some people, I think it would make a quite complicated topic slightly easier to document and make the intent of written code easier to figure out.
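A small sketch of that convention, assuming the same `test` array as earlier in the thread: the mask is built from coordinate values but applied with `.isel`, and it must have exactly one entry per position along the dimension.

```python
import numpy as np
import xarray as xr

test = xr.DataArray(
    np.ones((3, 2)),
    dims=["ternary", "binary"],
    coords={"ternary": [3, 7, 9], "binary": [False, True]},
)

# One boolean per position along "ternary": array([False, True, True]).
mask = test["ternary"].values > 3

# Positional (mask) indexing is spelled explicitly with .isel.
subset = test.isel(ternary=mask)

print(subset["ternary"].values)  # [7 9]
```

Written this way, the call site shows that the mask is positional, regardless of the dtype of the coordinate it was derived from.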

@doronbehar

Hey all. I opened #9917, which reports an error (rather than undefined behavior) when trying to use .sel with a boolean dtype dimension. Since the topic is still pretty much the same, and as requested by @dcherian, I'll note here my MWE, some other relevant information, and a workaround I'm using with isel.

Here's the MWE:

#!/usr/bin/env python

import xarray as xr
import numpy as np

# Define coordinates
float_coords = np.linspace(0.0, 1.0, 5)  # Float coordinates from 0.0 to 1.0
bool_coords = np.array([True, False])    # Boolean coordinates
# Create the Dataset
example_dataset = xr.Dataset(
    coords={
        "float_dim": float_coords,
        "bool_dim": bool_coords,
    }
)
loc = {
    "float_dim": 0,
    "bool_dim": True,
}
print(example_dataset.sel({
    k: [v]
    for k,v in loc.items()
}))

It fails with:

Traceback (most recent call last):
  File "/home/doron/repos/lab-ion-trap-simulations/./t.py", line 22, in <module>
    print(example_dataset.sel({"bool_dim": [ True ]}))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/r2g9s5hcmndxk6lwxwavpg3ga33sf18c-python3-3.12.7-env/lib/python3.12/site-packages/xarray/core/dataset.py", line 3237, in sel
    result = self.isel(indexers=query_results.dim_indexers, drop=drop)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/r2g9s5hcmndxk6lwxwavpg3ga33sf18c-python3-3.12.7-env/lib/python3.12/site-packages/xarray/core/dataset.py", line 3080, in isel
    indexes, index_variables = isel_indexes(self.xindexes, indexers)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/r2g9s5hcmndxk6lwxwavpg3ga33sf18c-python3-3.12.7-env/lib/python3.12/site-packages/xarray/core/indexes.py", line 1872, in isel_indexes
    return _apply_indexes_fast(indexes, indexers, "isel")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/r2g9s5hcmndxk6lwxwavpg3ga33sf18c-python3-3.12.7-env/lib/python3.12/site-packages/xarray/core/indexes.py", line 1827, in _apply_indexes_fast
    new_index = getattr(index, func)(index_args)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/r2g9s5hcmndxk6lwxwavpg3ga33sf18c-python3-3.12.7-env/lib/python3.12/site-packages/xarray/core/indexes.py", line 743, in isel
    return self._replace(self.index[indxr])  # type: ignore[index]
                         ~~~~~~~~~~^^^^^^^
  File "/nix/store/r2g9s5hcmndxk6lwxwavpg3ga33sf18c-python3-3.12.7-env/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 5416, in __getitem__
    result = getitem(key)
             ^^^^^^^^^^^^
IndexError: boolean index did not match indexed array along dimension 0; dimension is 2 but corresponding boolean dimension is 1

However, using:

print(example_dataset.sel({"bool_dim":  True }))

doesn't fail, but it squeezes bool_dim down to a 0-dimensional (scalar) coordinate. I don't want that to happen, since I want to end up with a bool_dim of size 1, and I don't want to use .expand_dims because I have a non-dimension coordinate attached to bool_dim and I lose that attachment information due to #4501.

Lastly, the workaround I'm using with isel is:

print(example_dataset.isel({
    k: (example_dataset[k].values == v).nonzero()[0]
    for k,v in loc.items()
}))
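The workaround above can be wrapped in a small helper. This is only a sketch: `isel_by_labels` is a hypothetical name and not part of xarray, and it assumes exact equality is the right way to match labels (which can be fragile for float coordinates).

```python
import numpy as np
import xarray as xr


def isel_by_labels(ds, loc):
    """Select one label per dimension while keeping each dimension at size 1."""
    indexers = {}
    for dim, label in loc.items():
        # Integer positions where the coordinate equals the requested label.
        positions = np.flatnonzero(ds[dim].values == label)
        if positions.size == 0:
            raise KeyError(f"{label!r} not found along {dim!r}")
        indexers[dim] = positions
    return ds.isel(indexers)


example_dataset = xr.Dataset(
    coords={
        "float_dim": np.linspace(0.0, 1.0, 5),
        "bool_dim": np.array([True, False]),
    }
)

selected = isel_by_labels(example_dataset, {"float_dim": 0, "bool_dim": True})
print(dict(selected.sizes))  # {'float_dim': 1, 'bool_dim': 1}
```

Because the indexers are integer position arrays rather than boolean lists, nothing here can be mistaken for a mask, and both dimensions keep a length of 1.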
