Skip to content

BUG: union_categoricals w/Series & CategoricalIndex #14173

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jreback opened this issue Sep 7, 2016 · 8 comments
Closed

BUG: union_categoricals w/Series & CategoricalIndex #14173

jreback opened this issue Sep 7, 2016 · 8 comments
Labels
Bug Categorical Categorical Data Type
Milestone

Comments

@jreback
Copy link
Contributor

jreback commented Sep 7, 2016

In [25]: from pandas.types.concat import union_categoricals

In [26]: s1 = pd.Series(['a', 'b'], dtype='category')
    ...: s3 = pd.Series(['a', 'b', 'c'], dtype='category')

In [27]: s1
Out[27]: 
0    a
1    b
dtype: category
Categories (2, object): [a, b]

In [28]: s3
Out[28]: 
0    a
1    b
2    c
dtype: category
Categories (3, object): [a, b, c]

I don't see why this shouldn't work. This should return a Categorical.

In [29]: union_categoricals([s1,s3])
AttributeError: 'Series' object has no attribute 'categories'

actual categoricals are combinable

In [30]: union_categoricals([s1.values,s3.values])
Out[30]: 
[a, b, a, b, c]
Categories (3, object): [a, b, c]

This works fine

In [31]: union_categoricals([s1.values,pd.CategoricalIndex(s3.values)])
Out[31]: 
[a, b, a, b, c]
Categories (3, object): [a, b, c]

But this is broken

In [32]: union_categoricals([pd.CategoricalIndex(s1.values),pd.CategoricalIndex(s3.values)])
AttributeError: 'CategoricalIndex' object has no attribute 'is_dtype_equal'
@jreback jreback added this to the 0.19.0 milestone Sep 7, 2016
@jreback
Copy link
Contributor Author

jreback commented Sep 7, 2016

@jreback
Copy link
Contributor Author

jreback commented Sep 7, 2016

xref to #13767 though this is independent

@jreback
Copy link
Contributor Author

jreback commented Sep 7, 2016

further I think we need a doc-note at the end of the unionining section in categorical.rst. Showing that when unioned categoricals are re-coded. Even though this is technically a dtype implementation detail that the categoricals hide, unioning re-codes, so the user should be aware.

In [16]: c1 = pd.Categorical(['a', 'b'])
    ...: c2 = pd.Categorical(['b', 'c'])

In [17]: c1
Out[17]: 
[a, b]
Categories (2, object): [a, b]

# in c1, b is coded to 1
In [20]: c1.codes
Out[20]: array([0, 1], dtype=int8)

In [21]: c2
Out[21]: 
[b, c]
Categories (2, object): [b, c]

# in c2, b is coded to 0
In [22]: c2.codes
Out[22]: array([0, 1], dtype=int8)

In [23]: from pandas.types.concat import union_categoricals

# b is now re-coded to 1 (same as c2, but not the same as c1)
In [24]: union_categoricals([c1,c2])
Out[24]: 
[a, b, b, c]
Categories (3, object): [a, b, c]

In [25]: union_categoricals([c1,c2]).codes
Out[25]: array([0, 1, 1, 2], dtype=int8)

note that some users, @mrocklin (dask), actually like/want this!

@jorisvandenbossche
Copy link
Member

Yep, I agree that union_categoricals should handle categorical serieses

@chris-b1
Copy link
Contributor

chris-b1 commented Sep 8, 2016

How should the index be handled in the Series case - thrown away, or add an ignore_index kw to support what concat does?

In [1]: pd.concat([pd.Series([1,2,3]), pd.Series([3,4,5])])
Out[1]: 
0    1
1    2
2    3
0    3
1    4
2    5
dtype: int64

In [2]: pd.concat([pd.Series([1,2,3]), pd.Series([3,4,5])], ignore_index=True)
Out[2]: 
0    1
1    2
2    3
3    3
4    4
5    5
dtype: int64

@jorisvandenbossche
Copy link
Member

I would do it similar as concat does (although that in many cases the ignore_index=True case would be a better default ...)

@jreback
Copy link
Contributor Author

jreback commented Sep 8, 2016

I disagree, I think this should always return a Categorical. pd.concat is designed to return Series/DataFrame, but union_categoricals specifically is meant for returning a Categorical.

Until we add this functionaility to pd.concat, concat of Series is a big tricky, but I don't think we should change the api of this function (otherwise it just becomes another concat)

@chris-b1
Copy link
Contributor

chris-b1 commented Sep 8, 2016

Oh, you wouldn't even box it back in the original type? I guess that makes sense as that usecase should really be handled by concat once the api is figured out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants