ENH: Support ExtensionArray in Groupby #20502

Merged: 3 commits, Mar 28, 2018

@TomAugspurger (Contributor) commented Mar 27, 2018

```python
In [1]: import pandas as pd

In [2]: from cyberpandas import IPArray

In [3]: df = pd.DataFrame({"A": IPArray([0, 0, 1, 2, 2]), "B": [1, 5, 1, 1, 3]})

In [4]: df
Out[4]:
         A  B
0  0.0.0.0  1
1  0.0.0.0  5
2  0.0.0.1  1
3  0.0.0.2  1
4  0.0.0.2  3

In [5]: df.groupby("A").B.mean()
Out[5]:
A
0.0.0.1    1
0.0.0.2    2
Name: B, dtype: int64
```

Note that right now `Out[5].index` is just an Index with object dtype. In the future, we could tie an Index type to an ExtensionArray type and ensure that the extension type propagates through.
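
For readers without cyberpandas installed, the grouping shown above can be approximated with one of pandas' built-in extension arrays. The sketch below uses the nullable `Int64` dtype as a stand-in for `IPArray`; that substitution is an assumption for illustration, is not part of this PR, and needs a pandas release that provides `pd.array`. Rows with a missing key are dropped by default, which would be consistent with `0.0.0.0` not appearing in `Out[5]` if cyberpandas treats that address as its missing value (also an assumption here).

```python
import pandas as pd

# Stand-in for IPArray (assumption): a built-in nullable integer extension array.
keys = pd.array([None, None, 1, 2, 2], dtype="Int64")
df = pd.DataFrame({"A": keys, "B": [1, 5, 1, 1, 3]})

# Group on the extension-typed column and aggregate the plain integer column.
result = df.groupby("A")["B"].mean()

# Rows whose key is missing are dropped by default, so only keys 1 and 2 remain.
print(result)

# The dtype of the resulting index depends on the pandas version:
# object at the time of this PR, the extension dtype in releases that propagate it.
print(result.index.dtype)
```

Printing `result.index.dtype` rather than asserting it keeps the sketch honest across versions, since dtype propagation to the index is exactly the part this PR leaves for later.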
@TomAugspurger added the Groupby and ExtensionArray (Extending pandas with custom dtypes or arrays) labels Mar 27, 2018
@TomAugspurger added this to the 0.23.0 milestone Mar 27, 2018
@TomAugspurger (Contributor, Author) commented

What I have so far is relatively straightforward (surprisingly). But I'm probably missing things. Are there edge cases or other operations we should test?

@codecov (bot) commented Mar 27, 2018

Codecov Report

Merging #20502 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #20502      +/-   ##
==========================================
+ Coverage   91.82%   91.84%   +0.02%     
==========================================
  Files         152      152              
  Lines       49249    49249              
==========================================
+ Hits        45225    45235      +10     
+ Misses       4024     4014      -10

| Flag | Coverage Δ |
| --- | --- |
| #multiple | 90.23% <100%> (+0.02%) ⬆️ |
| #single | 41.89% <100%> (ø) ⬆️ |

| Impacted Files | Coverage Δ |
| --- | --- |
| pandas/core/groupby.py | 92.55% <100%> (ø) ⬆️ |
| pandas/util/testing.py | 84.52% <0%> (-0.21%) ⬇️ |
| pandas/plotting/_converter.py | 66.81% <0%> (+1.73%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 766a480...98a3a85.

@jorisvandenbossche (Member) commented

If you use as_index=False, can you in that way ensure it keeps the correct extension dtype?

@TomAugspurger (Contributor, Author) commented

Not easily. By the time we're wrapping up the output, we've long since converted the input to an Index.

That said, once we have ExtensionIndexes, it should be a one-line change:

uniques = Index(uniques, name=self.name)
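
The `as_index=False` question can be checked empirically. The following is a minimal sketch, again assuming the built-in nullable `Int64` dtype as a stand-in for a third-party extension array (it is not the type used in this PR). It only inspects the dtype of the returned key column; whether that dtype matches the input depends on the installed pandas version.

```python
import pandas as pd

# Hypothetical example data: an extension-typed key column "A" and plain values "B".
df = pd.DataFrame(
    {"A": pd.array([1, 1, 2], dtype="Int64"), "B": [1, 5, 3]}
)

# With as_index=False the group keys come back as a regular column, not the index.
out = df.groupby("A", as_index=False)["B"].mean()
print(out)

# Compare the key column's dtype with the input; the result is version dependent.
print("output:", out["A"].dtype, "| input:", df["A"].dtype)
```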

@jreback (Contributor) commented Mar 28, 2018

lgtm. and another reason to make EA a base class for Index.

@jreback merged commit 9b4d0f1 into pandas-dev:master Mar 28, 2018
jreback added a commit to jreback/pandas that referenced this pull request Mar 29, 2018
javadnoorb pushed a commit to javadnoorb/pandas that referenced this pull request Mar 29, 2018
jreback added a commit to jreback/pandas that referenced this pull request Mar 30, 2018
jreback added a commit that referenced this pull request Mar 30, 2018
dworvos pushed a commit to dworvos/pandas that referenced this pull request Apr 2, 2018
kornilova203 pushed a commit to kornilova203/pandas that referenced this pull request Apr 23, 2018
@TomAugspurger deleted the ea-groupby-3 branch May 2, 2018 13:10