IndexError when using multi-variable BinGrouper #9630

Closed
5 tasks done
phil-blain opened this issue Oct 15, 2024 · 2 comments · Fixed by #9650

phil-blain commented Oct 15, 2024

What happened?

I tried the new grouping by multiple variables added in #9372, with one BinGrouper per variable. I'm using xarray 2024.09.0. If I construct the BinGroupers so that some bins end up empty, I get an IndexError:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 ds.groupby(x=BinGrouper(np.arange(0,13,4)), y=BinGrouper(bins=np.arange(0,16,2)))

File /home/me/.conda/envs/xarray_2024.09/lib/python3.12/site-packages/xarray/util/deprecation_helpers.py:118, in _deprecate_positional_args.<locals>._decorator.<locals>.inner(*args, **kwargs)
    114     kwargs.update({name: arg for name, arg in zip_args})
    116     return func(*args[:-n_extra_args], **kwargs)
--> 118 return func(*args, **kwargs)

File /home/me/.conda/envs/xarray_2024.09/lib/python3.12/site-packages/xarray/core/dataset.py:10444, in Dataset.groupby(self, group, squeeze, restore_coord_dims, **groupers)
  10441 _validate_groupby_squeeze(squeeze)
  10442 rgroupers = _parse_group_and_groupers(self, group, groupers)
> 10444 return DatasetGroupBy(self, rgroupers, restore_coord_dims=restore_coord_dims)

File /home/me/.conda/envs/xarray_2024.09/lib/python3.12/site-packages/xarray/core/groupby.py:581, in GroupBy.__init__(self, obj, groupers, restore_coord_dims)
    573     if any(
    574         isinstance(obj._indexes.get(grouper.name, None), PandasMultiIndex)
    575         for grouper in groupers
    576     ):
    577         raise NotImplementedError(
    578             "Grouping by multiple variables, one of which "
    579             "wraps a Pandas MultiIndex, is not supported yet."
    580         )
--> 581     self.encoded = ComposedGrouper(groupers).factorize()
    583 # specification for the groupby operation
    584 # TODO: handle obj having variables that are not present on any of the groupers
    585 #       simple broadcasting fails for ExtensionArrays.
    586 (self.group1d, self._obj, self._stacked_dim, self._inserted_dims) = _ensure_1d(
    587     group=self.encoded.codes, obj=obj
    588 )

File /home/me/.conda/envs/xarray_2024.09/lib/python3.12/site-packages/xarray/core/groupby.py:470, in ComposedGrouper.factorize(self)
    464 midx = pd.MultiIndex.from_product(
    465     (grouper.unique_coord.data for grouper in groupers),
    466     names=tuple(grouper.name for grouper in groupers),
    467 )
    468 # Constructing an index from the product is wrong when there are missing groups
    469 # (e.g. binning, resampling). Account for that now.
--> 470 midx = midx[np.sort(pd.unique(_flatcodes[~mask]))]
    472 full_index = pd.MultiIndex.from_product(
    473     (grouper.full_index.values for grouper in groupers),
    474     names=tuple(grouper.name for grouper in groupers),
    475 )
    476 dim_name = "stacked_" + "_".join(str(grouper.name) for grouper in groupers)

File /home/me/.conda/envs/xarray_2024.09/lib/python3.12/site-packages/pandas/core/indexes/multi.py:2207, in MultiIndex.__getitem__(self, key)
   2204 elif isinstance(key, Index):
   2205     key = np.asarray(key)
-> 2207 new_codes = [level_codes[key] for level_codes in self.codes]
   2209 return MultiIndex(
   2210     levels=self.levels,
   2211     codes=new_codes,
   (...)
   2214     verify_integrity=False,
   2215 )

IndexError: index 18 is out of bounds for axis 0 with size 18

What did you expect to happen?

Grouping should work even if some bins are empty, just as it already does when binning a single variable.

Minimal Complete Verifiable Example

In [1]: import numpy as np
In [2]: import xarray as xr
In [3]: from xarray.groupers import BinGrouper
In [4]: ds = xr.Dataset(
   ...:         {"foo": (("z"), np.random.random_sample(12))},
   ...:         coords={"x": ("z", np.arange(12)), "y": ("z", np.arange(12))},
   ...:     )
In [5]: ds.groupby(x=BinGrouper(np.arange(0,13,4)), y=BinGrouper(bins=np.arange(0,16,2)))

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

If we make sure that no bins are empty, it works, e.g.

ds.groupby(x=BinGrouper(np.arange(0,13,4)), y=BinGrouper(bins=np.arange(0,16,4)))
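
A quick way to check whether a given set of edges leaves any bin empty is to count values per bin with `pd.cut`. This is a minimal sketch using only NumPy and pandas; `empty_bins` is a hypothetical helper name, and it assumes the default right-closed intervals of `pd.cut`, which I believe matches the `BinGrouper` default.

```python
import numpy as np
import pandas as pd


def empty_bins(values, edges):
    """Return the pd.cut intervals (right-closed) that contain no values.

    Hypothetical helper for illustration; not part of xarray.
    """
    binned = pd.cut(values, edges)
    # Integer bin position per value; -1 marks values outside every bin.
    codes = np.asarray(binned.codes)
    counts = np.bincount(codes[codes >= 0], minlength=len(binned.categories))
    return [binned.categories[i] for i in np.flatnonzero(counts == 0)]


y = np.arange(12)
print(empty_bins(y, np.arange(0, 16, 2)))  # edges that trigger the error
print(empty_bins(y, np.arange(0, 16, 4)))  # edges that work
```

With the data from this report, the first call reports the single interval (12, 14] and the second reports no empty bins.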

If we use the bins that fail in the example (np.arange(0,16,2)) but group by a single variable only, it also works:

ds.groupby(y=BinGrouper(bins=np.arange(0,16,2)))
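
The size mismatch in the traceback can be reproduced with NumPy and pandas alone. The sketch below is only my reading of the `ComposedGrouper.factorize` frames in the traceback, not the actual xarray code: it assumes the flat codes are raveled against the full number of bins per variable (3 by 7), while the product index is built from the observed groups only (3 by 6, so 18 entries). Names such as `observed_product` and `full_product` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Coordinates and bin edges from the example in this report.
x = y = np.arange(12)
x_binned = pd.cut(x, np.arange(0, 13, 4))  # 3 bins, all populated
y_binned = pd.cut(y, np.arange(0, 16, 2))  # 7 bins, (12, 14] is empty

# Integer bin position per value; -1 marks values outside every bin.
x_codes = np.asarray(x_binned.codes, dtype=np.intp)
y_codes = np.asarray(y_binned.codes, dtype=np.intp)
mask = (x_codes == -1) | (y_codes == -1)

# Assumption: flat codes are computed against the FULL bin grid (3, 7).
full_shape = (len(x_binned.categories), len(y_binned.categories))
flat_codes = np.ravel_multi_index((x_codes, y_codes), full_shape, mode="wrap")
valid_flat = np.unique(flat_codes[~mask])

# Product of the groups that actually occur in the data.
x_observed = x_binned.categories[np.unique(x_codes[x_codes >= 0])]
y_observed = y_binned.categories[np.unique(y_codes[y_codes >= 0])]
observed_product = pd.MultiIndex.from_product(
    [x_observed, y_observed], names=["x", "y"]
)

# Product of all bins, whether empty or not.
full_product = pd.MultiIndex.from_product(
    [x_binned.categories, y_binned.categories], names=["x", "y"]
)

print(len(observed_product), len(full_product), valid_flat.max())

# Codes based on the full grid overrun the smaller, observed-only product.
try:
    observed_product[valid_flat]
    raised = False
except IndexError as err:
    raised = True
    print("IndexError:", err)

# Indexing the full product with the same codes stays in bounds.
present_groups = full_product[valid_flat]
print(len(present_groups))
```

Under these assumptions the largest flat code is 19 while the observed-only product has 18 entries, which is consistent with the "index 18 is out of bounds for axis 0 with size 18" message. When no bin is empty, both products have the same size and the lookup succeeds. This sketch makes no claim about how #9650 actually fixes the problem.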

Environment

INSTALLED VERSIONS

commit: None
python: 3.12.7 | packaged by conda-forge | (main, Oct 4 2024, 16:05:46) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-372.9.1.el8.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.4
libnetcdf: 4.9.2

xarray: 2024.9.0
pandas: 2.2.3
numpy: 2.1.2
scipy: 1.14.1
netCDF4: 1.7.1
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.9.1
distributed: 2024.9.1
matplotlib: 3.9.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.9.0
cupy: None
pint: None
sparse: None
flox: 0.9.12
numpy_groupies: 0.11.2
setuptools: 75.1.0
pip: 24.2
conda: None
pytest: None
mypy: None
IPython: 8.28.0
sphinx: None

@phil-blain phil-blain added bug needs triage Issue that has not been reviewed by xarray team member labels Oct 15, 2024
@phil-blain phil-blain changed the title IndexError when suing multi-variable BinGrouper IndexError when using multi-variable BinGrouper Oct 15, 2024
@headtr1ck headtr1ck added topic-groupby and removed needs triage Issue that has not been reviewed by xarray team member labels Oct 15, 2024
phil-blain commented

@dcherian

phil-blain commented

Thank you!


2 participants