Multi-index with categorical values #3674

Closed
mancellin opened this issue Jan 8, 2020 · 7 comments · Fixed by #3860

@mancellin
Contributor

Building a dataset from pandas with a multi-index with categorical values:

import pandas as pd

cat = pd.CategoricalDtype(categories=['foo', 'bar', 'baz'])
i1 = pd.Series(['foo', 'bar'], dtype=cat)
i2 = pd.Series(['bar', 'bar'], dtype=cat)

df = pd.DataFrame({'i1': i1, 'i2': i2, 'values': [1, 2]})
ds = df.set_index(['i1', 'i2']).to_xarray()

print(ds)

Expected output:

<xarray.Dataset>
Dimensions:  (i1: 2, i2: 1)
Coordinates:
  * i1       (i1) object 'foo' 'bar'
  * i2       (i2) object 'bar'
Data variables:
    values   (i1, i2) int64 1 2

Actual output:

<xarray.Dataset>
Dimensions:  (i1: 3, i2: 3)
Coordinates:
  * i1       (i1) object 'foo' 'bar' 'baz'
  * i2       (i2) object 'foo' 'bar' 'baz'
Data variables:
    values   (i1, i2) float64 nan 1.0 nan nan 2.0 nan nan nan nan

It is not wrong, but it is inconsistent with the non-categorical case (which gives the expected output above) and the single-index case (no filling with NaNs for single index).
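
The size mismatch can be inspected with pandas alone, without going through xarray. The following is a minimal sketch, assuming the same data as above; the names `level_sizes` and `used_sizes` are introduced here for illustration only. It compares the number of categories stored in each index level with the number of values that actually occur in that level.

```python
import pandas as pd

cat = pd.CategoricalDtype(categories=['foo', 'bar', 'baz'])
i1 = pd.Series(['foo', 'bar'], dtype=cat)
i2 = pd.Series(['bar', 'bar'], dtype=cat)

df = pd.DataFrame({'i1': i1, 'i2': i2, 'values': [1, 2]})
index = df.set_index(['i1', 'i2']).index

# Number of entries stored in each level (all declared categories).
level_sizes = [len(level) for level in index.levels]

# Number of distinct values that actually appear in each level.
used_sizes = [index.get_level_values(i).nunique() for i in range(index.nlevels)]

print(level_sizes)
print(used_sizes)
```

If the stored level sizes match the dimensions of the actual output (3 and 3) while the used sizes match the expected output (2 and 1), the extra NaN rows and columns most likely come from the unused categories kept in the levels.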

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.0 (default, Nov 6 2019, 21:49:08) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.91-1-MANJARO
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.14.1
pandas: 0.25.3
numpy: 1.17.4
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 44.0.0.post20200106
pip: 19.3.1
conda: None
pytest: None
IPython: None
sphinx: None

@fujiisoup
Member

Thanks for reporting again.
It looks like there are several places that need to be fixed.

Please comment here if you find another case that does not work.

@fujiisoup
Member

xref: #3670

@mancellin
Contributor Author

Thank you for your work on this! I haven't found any other issues so far, so I guess we can close this one.

@mancellin
Contributor Author

Actually, after updating to version 0.15 I've found another issue in the same context.
More precisely, when using a multi-index with a non-categorical repeated coordinate and a categorical coordinate:

import pandas as pd

i1 = pd.Series([0, 0])
cat = pd.CategoricalDtype(categories=['foo', 'bar', 'baz'])
i2 = pd.Series(['foo', 'bar'], dtype=cat)

df = pd.DataFrame({'i1': i1, 'i2': i2, 'values': [1, 2]})
ds = df.set_index(['i1', 'i2']).to_xarray()

print(ds)

raises the following error:

Traceback (most recent call last):
  File "/home/matthieu/test.py", line 8, in <module>
    ds = df.set_index(['i1', 'i2']).to_xarray()
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 2867, in to_xarray
    return xarray.Dataset.from_dataframe(self)
  File "/opt/anaconda3/lib/python3.7/site-packages/xarray/core/dataset.py", line 4555, in from_dataframe
    idx = remove_unused_levels_categories(dataframe.index)
  File "/opt/anaconda3/lib/python3.7/site-packages/xarray/core/indexes.py", line 26, in remove_unused_levels_categories
    index = pd.MultiIndex.from_arrays(levels, names=index.names)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 425, in from_arrays
    raise ValueError("all arrays must be same length")
ValueError: all arrays must be same length

but works fine when i2 is not categorical.
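
One possible workaround until a fix lands is to strip the unused categories from the index before calling `to_xarray()`. The following is a minimal sketch using only pandas; `drop_unused_categories` is a hypothetical helper written for this example, not a function from pandas or xarray, and it is not necessarily how the library fixes the problem. It takes the full-length values of every level, so all arrays passed to `pd.MultiIndex.from_arrays` have the same length.

```python
import pandas as pd


def drop_unused_categories(index: pd.MultiIndex) -> pd.MultiIndex:
    """Hypothetical helper (not part of pandas or xarray).

    Rebuild a MultiIndex so that categorical levels keep only the
    categories that actually occur in the index.
    """
    arrays = []
    for i in range(index.nlevels):
        # Full-length values of level i, one entry per row.
        values = index.get_level_values(i)
        if isinstance(values, pd.CategoricalIndex):
            values = values.remove_unused_categories()
        arrays.append(values)
    # Every array has len(index) entries, so the lengths always agree.
    return pd.MultiIndex.from_arrays(arrays, names=index.names)


i1 = pd.Series([0, 0])
cat = pd.CategoricalDtype(categories=['foo', 'bar', 'baz'])
i2 = pd.Series(['foo', 'bar'], dtype=cat)

df = pd.DataFrame({'i1': i1, 'i2': i2, 'values': [1, 2]})
indexed = df.set_index(['i1', 'i2'])
indexed.index = drop_unused_categories(indexed.index)

print(list(indexed.index.levels[1]))
```

After this step, `indexed.to_xarray()` can be tried on the cleaned frame. Whether that avoids the error in a given xarray version is an assumption to verify, not something this sketch guarantees.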

mancellin reopened this Feb 6, 2020
dcherian added the bug label Feb 6, 2020

@mancellin
Contributor Author

@fujiisoup Since you implemented remove_unused_levels_categories, do you have any clue how to fix this?

@fujiisoup
Member

@mancellin
Sorry for the late response.
Yes, there may be some possible workarounds, but I have less spare time these days...
Would you be interested in sending a PR?

@mancellin
Contributor Author

Yes, I'll give it a try.

fujiisoup pushed a commit that referenced this issue Mar 13, 2020
* Fix bug for multi-index with categorical values.

See issue #3674.

* Blacked.

* Add line in whats-new.rst.

* Remove forgotten print.

Co-authored-by: Matthieu Ancellin <[email protected]>