Wrong hue assignment in scatter plot #4641

Closed
astoeriko opened this issue Dec 2, 2020 · 7 comments · Fixed by #4723

Comments

@astoeriko

What happened:
When using the hue keyword in a scatter plot to color the points based on a string variable, the color assignment in the plot is wrong (whereas the legend is correct).

What you expected to happen:
In the example, data of category "A" ranges between 0 and 2 in u-direction and 0 and 0.5 in v-direction. Points in that square should be orange (the color for "A") but currently are blue.

Minimal Complete Verifiable Example:

import xarray as xr
import numpy as np

u = np.random.rand(50, 2) * np.array([1, 2])
v = np.random.rand(50, 2) * np.array([1, 0.5])

ds = xr.Dataset(
    {
        "u": (("x", "category"), u),
        "v": (("x", "category"), v),
    },
    coords={"category": ["B", "A"],}
)

g = ds.plot.scatter(
    y="u",
    x="v",
    hue="category",
);

Anything else we need to know?:
I think that this might be related to sorting at some point. If the variable by which I color is sorted alphabetically (["A", "B"] instead of ["B", "A"]), the color assignment is correct.

Not sure if this issue is related to #4126, but it looks different to me (the problem is not the legend, but the colors in the plot itself).
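
A possible interim workaround, based only on the observation above that an alphabetically sorted hue coordinate gives correct colors, is to sort the dataset along that coordinate before plotting. This is a minimal sketch, not a confirmed fix; the seeded generator and the name `ds_sorted` are illustrative assumptions.

```python
import numpy as np
import xarray as xr

# Seeded generator so the example is reproducible (an assumption, not in the report).
rng = np.random.default_rng(0)
u = rng.random((50, 2)) * np.array([1, 2])
v = rng.random((50, 2)) * np.array([1, 0.5])

ds = xr.Dataset(
    {
        "u": (("x", "category"), u),
        "v": (("x", "category"), v),
    },
    coords={"category": ["B", "A"]},
)

# Reorder the data along "category" so the coordinate is alphabetical.
# The values move together with their labels, so "A" still refers to the same data.
ds_sorted = ds.sortby("category")

# Plotting ds_sorted instead of ds should then avoid the mismatch described above:
# ds_sorted.plot.scatter(x="v", y="u", hue="category")
```

The cost is that the legend order becomes alphabetical rather than the order given in the dataset.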

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.15.0-122-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.0
pandas: 1.1.2
numpy: 1.17.5
scipy: 1.5.2
netCDF4: 1.5.4
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.26.0
distributed: 2.26.0
matplotlib: 3.3.2
cartopy: None
seaborn: 0.11.0
numbagg: None
pint: None
setuptools: 49.6.0.post20200814
pip: 20.2.3
conda: 4.8.3
pytest: 6.0.1
IPython: 7.18.1
sphinx: None

@astoeriko
Author

After updating xarray to 0.16.2, the colors in the plot agree with the colors in the legend, so the error indicated above does not persist. We can probably close this issue.
However, this seems to be achieved not by changing the colors in the plot but by sorting the legend as well. That is, the order of the category variable in the legend is ["A", "B"], although I specified it to be ["B", "A"] in the dataset. I am not sure if this is an intended behaviour?

@astoeriko
Author

As my original plot still was wrong after updating I investigated a bit further: The problem persists when also faceting.
Here is my new example where, again, data of category "A" get colored as "B" and vice versa.

import xarray as xr
import numpy as np

u = np.random.rand(50, 2, 2) * np.array([1, 2])
v = np.random.rand(50, 2) * np.array([1, 0.5])

ds = xr.Dataset(
    {
        "u": (("x", "foo", "category"), u),
        "v": (("x", "category"), v),
    },
    coords={"category": ["B", "A"], "foo": [1, 2]}
)

g = ds.plot.scatter(
    y="u",
    x="v",
    hue="category",
    col="foo"
);

I am sorry for the confusion.

Output of `xr.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.8 | packaged by conda-forge | (default, Nov 27 2020, 19:24:58)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.15.0-122-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.2
pandas: 1.1.2
numpy: 1.17.5
scipy: 1.5.3
netCDF4: 1.5.4
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.3.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.26.0
distributed: 2.30.1
matplotlib: 3.3.2
cartopy: None
seaborn: 0.11.0
numbagg: None
pint: None
setuptools: 49.6.0.post20201009
pip: 20.3
conda: 4.8.3
pytest: 6.0.1
IPython: 7.19.0
sphinx: None

@shoyer
Member

shoyer commented Dec 3, 2020

could you share an image showing what the incorrect plot(s) looks like? you should be able to "paste" into the comment field in GitHub

@astoeriko
Author

astoeriko commented Dec 3, 2020

Here are the plots demonstrating what I mean.

The “upright” rectangle (in the intervals [0, 0.5] and [0, 2]) of points represents the data corresponding to category "A". However, it is colored in blue, which corresponds to category "B". The order of labels in the legend is correct in the sense that it conserves the order in the Dataset.
[Image: wrong_color_assignment]

In the second image, the color assignment in the plot is correct – data corresponding to category "A" is still colored in blue but that now corresponds to category "A". The legend is now alphabetically ordered instead of conserving the order of the category coordinate in the Dataset.
[Image: correct_color_assignment]

@shoyer
Member

shoyer commented Dec 16, 2020

Ugh, this is unfortunate! Thanks for the clear example code. Coincidentally, one of my collaborators ran into this same bug this morning. This sort of "corrupted data" bug is one of the nastiest types, so we should definitely try to prioritize a fix.

@keewis
Collaborator

keewis commented Dec 21, 2020

this is caused by the use of np.unique here:

for label in np.unique(data["hue"].values):

to fix that, I think we either need to find a way to preserve the order of data["hue"] (the output of np.unique is sorted), or we have to use sorted/np.unique here:
labels=list(self._hue_var.values),
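
To illustrate the mismatch described here, the following sketch uses plain NumPy on a small stand-in array (`hue_values` is a hypothetical name, not xarray's internal variable). It shows that `np.unique` returns sorted labels, and one way to recover the order of first appearance using `return_index`. It is not necessarily how the eventual fix was implemented.

```python
import numpy as np

# Stand-in for the hue coordinate values, in the order given in the dataset.
hue_values = np.array(["B", "A", "B", "A"])

# np.unique sorts its output, so the original order is lost.
sorted_labels = np.unique(hue_values)

# return_index gives the position of each label's first occurrence;
# sorting those positions restores the order of appearance.
_, first_idx = np.unique(hue_values, return_index=True)
ordered_labels = hue_values[np.sort(first_idx)]

print(list(sorted_labels))   # sorted alphabetically
print(list(ordered_labels))  # order of first appearance
```

If one code path iterates over the sorted labels while another uses the coordinate order, the i-th color ends up attached to different categories in the plot and in the legend.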

@ahuang11
Contributor

Maybe a simple fix would be to replace np.unique with pd.unique since it's ordered?

Hash table-based unique. Uniques are returned in order of appearance. This does NOT sort.

Significantly faster than numpy.unique. Includes NA values.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html
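
A small sketch of the difference, using a hypothetical stand-in array (`hue_values`) rather than xarray's internals. An object-dtype array is used here as an assumption to keep the input type simple.

```python
import numpy as np
import pandas as pd

# Stand-in for the hue values, deliberately not in alphabetical order.
hue_values = np.array(["B", "A", "B", "A"], dtype=object)

# pd.unique keeps the order of first appearance.
pandas_labels = list(pd.unique(hue_values))

# np.unique returns the labels sorted.
numpy_labels = list(np.unique(hue_values))

print(pandas_labels)
print(numpy_labels)
```

With `pd.unique`, the labels would follow the order of the coordinate in the dataset, which matches what the legend code quoted above uses.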
