BUG: preserve categorical & sparse types when grouping / pivot & preserve dtypes on ufuncs #26550
Conversation
Codecov Report
@@ Coverage Diff @@
## master #26550 +/- ##
==========================================
- Coverage 91.77% 91.74% -0.03%
==========================================
Files 174 174
Lines 50642 50666 +24
==========================================
+ Hits 46476 46484 +8
- Misses 4166 4182 +16
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #26550 +/- ##
==========================================
- Coverage 91.97% 91.95% -0.03%
==========================================
Files 180 180
Lines 50756 50789 +33
==========================================
+ Hits 46685 46705 +20
- Misses 4071 4084 +13
Continue to review full report at Codecov.
@@ -104,12 +104,19 @@ def _cython_agg_blocks(self, how, alt=None, numeric_only=True,
obj = self.obj[data.items[locs]]
s = groupby(obj, self.grouper)
result = s.aggregate(lambda x: alt(x, axis=self.axis))
try:
What's this for? Not immediately obvious the link between this and the overall PR
this handles blocks that raise NotImplementedError and then can't be aggregated, e.g. Categoricals with string categories when aggregating with mean, for example (and numeric_only=False is passed)
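For illustration, a minimal reproduction of the kind of block this guards against (a sketch assuming a recent pandas; the exact exception type has varied across versions):

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["x", "x", "y"],
    "val": [1, 2, 3],
    # string categories: "mean" has no meaning for this block
    "cat": pd.Categorical(["a", "b", "a"]),
})

# Aggregating the categorical block with mean cannot succeed; the groupby
# machinery has to catch the failure and exclude (or re-route) that block.
try:
    df.groupby("key")["cat"].mean()
    raised = False
except (TypeError, NotImplementedError):
    raised = True
```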
Can you give an example? I'm also having trouble seeing this.
If those return NotImplementedError can we limit the scope of the catching to just that?
@jreback rough sketch of the idea in #26550 (comment)

diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index ce1cb37ab..898d03e6c 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -1945,6 +1945,10 @@ class NDFrame(PandasObject, SelectionMixin):
# GH#23114 Ensure ndarray.__op__(DataFrame) returns NotImplemented
__array_priority__ = 1000
+ def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
+ result = self._data.map_blocks(ufunc, **kwargs)
+ return type(self)(result)
+
def __array__(self, dtype=None):
return com.values_from_object(self)
diff --git a/pandas/core/internals/managers.py b/pandas/core/internals/managers.py
index 0b63588c9..9c7cc4bd5 100644
--- a/pandas/core/internals/managers.py
+++ b/pandas/core/internals/managers.py
@@ -117,6 +117,10 @@ class BlockManager(PandasObject):
self._rebuild_blknos_and_blklocs()
+ def map_blocks(self, func, **kwargs):
+ newbs = [make_block(func(block.values, **kwargs), block.mgr_locs) for block in self.blocks]
+ return type(self)(newbs, self.axes)
+
def make_empty(self, axes=None):
""" return an empty BlockManager with the items axis of len 0 """
if axes is None:

usage:

In [1]: import numpy as np; import pandas as pd
In [2]: df = pd.DataFrame({"A": [1, 0]}, dtype=pd.SparseDtype('int'))
In [3]: np.sin(df)
Out[3]:
A
0 0.841471
1 0.000000
In [4]: np.sin(df).dtypes
Out[4]:
A Sparse[float64, 0.0]
dtype: object

On this branch, the casting seems problematic:

In [10]: df = pd.DataFrame({"A": [1, 2, 3]}, dtype=pd.SparseDtype('int'))
In [11]: np.sqrt(df)
Out[11]:
A
0 1
1 1
2 1
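The intended behavior can be checked directly (a sketch assuming a pandas version where sparse arrays participate in the ufunc protocol): the result should stay sparse while the subtype upcasts to float, and the values should match the dense computation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 0]}, dtype=pd.SparseDtype("int"))
result = np.sin(df)

# values should match the plain dense computation ...
expected = np.sin(np.array([1.0, 0.0]))

# ... while the column keeps a sparse dtype (upcast to float)
sparse_preserved = isinstance(result["A"].dtype, pd.SparseDtype)
```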
yeah i can do via blocks; __array_ufunc__ is orthogonal here - i don't think we have any implementation yet (should though)
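For context, the NumPy protocol being referred to is `__array_ufunc__`; a toy container (not pandas code) shows how a class can intercept ufuncs and rebuild itself, which is the mechanism that would let a block manager preserve its own types:

```python
import numpy as np

class Wrapped:
    """Toy container demonstrating the __array_ufunc__ protocol."""

    def __init__(self, values):
        self.values = np.asarray(values)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # unwrap any Wrapped inputs, apply the ufunc, re-wrap the result
        arrays = [x.values if isinstance(x, Wrapped) else x for x in inputs]
        return type(self)(getattr(ufunc, method)(*arrays, **kwargs))

out = np.sin(Wrapped([0.0, np.pi / 2]))
# out is a Wrapped instance, not a bare ndarray
```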
fixed up the casting, a bit tricky.
@@ -5755,6 +5736,11 @@ def astype(self, dtype, copy=True, errors='raise', **kwargs):
**kwargs)
return self._constructor(new_data).__finalize__(self)

if not results:
What case hits this? I'm not immediately seeing it.
empty frames :->
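The empty-frame path can be exercised directly (a sketch; the column layout is illustrative):

```python
import pandas as pd

# An empty frame: no blocks produce any aggregation results,
# which is what the `if not results:` branch guards against.
empty = pd.DataFrame({
    "key": pd.Series(dtype=object),
    "val": pd.Series(dtype="int64"),
})
out = empty.groupby("key").sum()
```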
""" | ||
try: | ||
result = self._holder._from_sequence( | ||
np.asarray(result).ravel(), dtype=dtype) |
Why is the asarray and ravel needed? _from_sequence should take any sequence, so the trip through NumPy shouldn't be needed. Can result be 2-D here? I can't imagine us wanting to ravel a 2-D array, since the length will change.
I added some comments (short answer is this could be a 2-D numpy array or an EA which is already 1-D)
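To make the short answer concrete, a hypothetical standalone sketch of the two shapes involved (using public APIs rather than the private `_from_sequence`):

```python
import numpy as np
import pandas as pd

# Case 1: the cython aggregation hands back a 2-D (1, n) ndarray even
# though the original block was logically 1-D -- hence asarray + ravel.
result_2d = np.array([[1, 2, 1]])        # shape (1, 3)
flat = np.asarray(result_2d).ravel()     # shape (3,)

# Case 2: the result is already a 1-D extension array; asarray + ravel
# is then a no-op on shape (though it drops the extension dtype).
result_ea = pd.array([1, 2, 1], dtype="Int64")
flat_ea = np.asarray(result_ea).ravel()

# reconstruct an extension array of the target dtype from the flat values
reconstructed = pd.array(flat, dtype="category")
```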
'std',
'var',
'sem',
pytest.param('median', marks=pytest.mark.xfail(
Is this a known issue we have an issue open for already?
no was discovered here, but I can't repro locally
pandas/core/groupby/groupby.py (outdated)
@@ -1319,6 +1328,16 @@ def f(self, **kwargs):
except Exception:
result = self.aggregate(
lambda x: npfunc(x, axis=self.axis))

# coerce the columns if we can
if isinstance(result, DataFrame):
Shouldn't this logic be in _try_cast? OK as a follow-up, just want to confirm.
no, _try_cast is for 1d results, maybe I can make this cleaner
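A rough sketch of the per-column coercion being discussed (a hypothetical helper, not the actual pandas code path): each result column is cast back toward its pre-aggregation dtype, falling back to the aggregated dtype when the cast is not possible.

```python
import pandas as pd

def coerce_columns(result: pd.DataFrame, original_dtypes: dict) -> pd.DataFrame:
    # hypothetical: try to restore each column's original dtype after aggregation
    for col, dtype in original_dtypes.items():
        if col in result.columns:
            try:
                result[col] = result[col].astype(dtype)
            except (TypeError, ValueError):
                pass  # keep the aggregated dtype (e.g. float from mean)
    return result

# "a" casts back cleanly; "b" contains NaN, so the int cast fails and
# the float result is kept.
agg = pd.DataFrame({"a": [1.0, 2.0], "b": [0.5, float("nan")]})
out = coerce_columns(agg, {"a": "int64", "b": "int64"})
```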
I slightly refactored
push should make this pass, will address comments soon.
preserve dtypes when applying a ufunc to a sparse dtype

closes pandas-dev#18502
closes pandas-dev#23743
replacing with a PR just with the categorical grouping changes; the sparse ufuncs change conflicts with what @TomAugspurger is doing and is messy.
closes #18502
closes #23743