
BUG: preserve categorical & sparse types when grouping / pivot & preserve dtypes on ufuncs #26550


Closed
wants to merge 10 commits

Conversation

@jreback (Contributor) commented May 29, 2019

closes #18502
closes #23743

@jreback added the Bug, Groupby, Sparse, Categorical, and ExtensionArray labels May 29, 2019
@jreback added this to the 0.25.0 milestone May 29, 2019
@jreback changed the title from "BUG: preserve categorical & sparse types when grouping / pivot" to "BUG: preserve categorical & sparse types when grouping / pivot & preserve dtypes on ufuncs" May 29, 2019
codecov bot commented May 29, 2019

Codecov Report

Merging #26550 into master will decrease coverage by 0.02%.
The diff coverage is 88.88%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26550      +/-   ##
==========================================
- Coverage   91.77%   91.74%   -0.03%     
==========================================
  Files         174      174              
  Lines       50642    50666      +24     
==========================================
+ Hits        46476    46484       +8     
- Misses       4166     4182      +16
| Flag      | Coverage Δ                |    |
|-----------|---------------------------|----|
| #multiple | 90.28% <88.88%> (-0.02%)  | ⬇️ |
| #single   | 41.66% <44.44%> (-0.12%)  | ⬇️ |

| Impacted Files                   | Coverage Δ                |    |
|----------------------------------|---------------------------|----|
| pandas/core/groupby/generic.py   | 88.07% <100%> (-0.89%)    | ⬇️ |
| pandas/core/series.py            | 93.63% <100%> (+0.01%)    | ⬆️ |
| pandas/core/frame.py             | 97.02% <100%> (-0.11%)    | ⬇️ |
| pandas/core/groupby/ops.py       | 96% <100%> (ø)            | ⬆️ |
| pandas/core/nanops.py            | 94.11% <100%> (ø)         | ⬆️ |
| pandas/core/groupby/groupby.py   | 97.39% <100%> (+0.16%)    | ⬆️ |
| pandas/core/dtypes/cast.py       | 91.54% <100%> (ø)         | ⬆️ |
| pandas/core/generic.py           | 93.42% <25%> (-0.15%)     | ⬇️ |
| pandas/core/internals/blocks.py  | 93.94% <71.42%> (-0.19%)  | ⬇️ |
| pandas/io/gbq.py                 | 78.94% <0%> (-10.53%)     | ⬇️ |
| ... and 6 more                   |                           |    |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7629a18...f397908.

codecov bot commented May 29, 2019

Codecov Report

Merging #26550 into master will decrease coverage by 0.02%.
The diff coverage is 92.3%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26550      +/-   ##
==========================================
- Coverage   91.97%   91.95%   -0.03%     
==========================================
  Files         180      180              
  Lines       50756    50789      +33     
==========================================
+ Hits        46685    46705      +20     
- Misses       4071     4084      +13
| Flag      | Coverage Δ               |    |
|-----------|--------------------------|----|
| #multiple | 90.55% <92.3%> (-0.02%)  | ⬇️ |
| #single   | 41.81% <41.53%> (-0.1%)  | ⬇️ |

| Impacted Files                          | Coverage Δ              |    |
|-----------------------------------------|-------------------------|----|
| pandas/core/sparse/frame.py             | 95.67% <100%> (+0.03%)  | ⬆️ |
| pandas/core/groupby/generic.py          | 88.48% <100%> (-0.86%)  | ⬇️ |
| pandas/core/series.py                   | 93.69% <100%> (+0.02%)  | ⬆️ |
| pandas/core/dtypes/cast.py              | 90.53% <100%> (ø)       | ⬆️ |
| pandas/core/nanops.py                   | 94.76% <100%> (ø)       | ⬆️ |
| pandas/core/frame.py                    | 96.9% <100%> (-0.11%)   | ⬇️ |
| pandas/core/groupby/ops.py              | 96% <100%> (ø)          | ⬆️ |
| pandas/core/groupby/groupby.py          | 97.46% <100%> (+0.29%)  | ⬆️ |
| pandas/core/arrays/sparse.py            | 93.88% <100%> (+0.15%)  | ⬆️ |
| pandas/core/internals/construction.py   | 96.19% <100%> (+0.25%)  | ⬆️ |
| ... and 9 more                          |                         |    |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b9b081d...4bd486e.

@@ -104,12 +104,19 @@ def _cython_agg_blocks(self, how, alt=None, numeric_only=True,

obj = self.obj[data.items[locs]]
s = groupby(obj, self.grouper)
result = s.aggregate(lambda x: alt(x, axis=self.axis))
try:

Member:

What's this for? It's not immediately obvious how this relates to the overall PR.

Contributor Author (@jreback):

This handles blocks that raise NotImplementedError and so can't be aggregated via the cython path, e.g. Categoricals with string categories when aggregating with mean (and numeric_only=False is passed).
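
A minimal sketch of the scenario being described (the column names are hypothetical, and the exact exception type and message vary by pandas version): a Categorical with string categories has no meaningful mean, so the cython aggregation of that block fails and falls through to the alternate path.

import pandas as pd

# hypothetical frame: one grouping key, one string-categorical column
df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "cat": pd.Categorical(["x", "y", "x"]),
})

# a mean over string categories is not defined; inside groupby the cython
# aggregation of the categorical block raises and falls through to the
# alternate path described above (and ultimately surfaces an error here)
df.groupby("key")["cat"].mean()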


Contributor:

Can you give an example? I'm also having trouble seeing this.

Member:

If those raise NotImplementedError, can we limit the scope of the catching to just that?

@TomAugspurger (Contributor) commented May 30, 2019

@jreback rough sketch of the idea in #26550 (comment)

diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index ce1cb37ab..898d03e6c 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -1945,6 +1945,10 @@ class NDFrame(PandasObject, SelectionMixin):
     # GH#23114 Ensure ndarray.__op__(DataFrame) returns NotImplemented
     __array_priority__ = 1000
 
+    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
+        result = self._data.map_blocks(ufunc, **kwargs)
+        return type(self)(result)
+
     def __array__(self, dtype=None):
         return com.values_from_object(self)
 
diff --git a/pandas/core/internals/managers.py b/pandas/core/internals/managers.py
index 0b63588c9..9c7cc4bd5 100644
--- a/pandas/core/internals/managers.py
+++ b/pandas/core/internals/managers.py
@@ -117,6 +117,10 @@ class BlockManager(PandasObject):
 
         self._rebuild_blknos_and_blklocs()
 
+    def map_blocks(self, func, **kwargs):
+        newbs = [make_block(func(block.values, **kwargs), block.mgr_locs) for block in self.blocks]
+        return type(self)(newbs, self.axes)
+
     def make_empty(self, axes=None):
         """ return an empty BlockManager with the items axis of len 0 """
         if axes is None:

usage:

In [1]: import numpy as np; import pandas as pd

In [2]: df = pd.DataFrame({"A": [1, 0]}, dtype=pd.SparseDtype('int'))

In [3]: np.sin(df)
Out[3]:
          A
0  0.841471
1  0.000000

In [4]: np.sin(df).dtypes
Out[4]:
A    Sparse[float64, 0.0]
dtype: object

The __array_ufunc__ isn't really correct yet, but does it make sense? I think that's much cleaner than doing the op and trying to cast back.

On this branch, the casting seems problematic:

In [10]: df = pd.DataFrame({"A": [1, 2, 3]}, dtype=pd.SparseDtype('int'))

In [11]: np.sqrt(df)
Out[11]:
   A
0  1
1  1
2  1

@jreback (Contributor Author) commented May 30, 2019

Yeah, I can do this via blocks.

__array_ufunc__ is orthogonal here - I don't think we have any implementation yet (we should, though).

@jreback (Contributor Author) commented Jun 2, 2019

fixed up the casting, a bit tricky.

@@ -5755,6 +5736,11 @@ def astype(self, dtype, copy=True, errors='raise', **kwargs):
**kwargs)
return self._constructor(new_data).__finalize__(self)

if not results:

Contributor:

What case hits this? I'm not immediately seeing it.

Contributor Author (@jreback):

empty frames :->
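
A minimal sketch of that case, assuming the `if not results:` guard covers a frame with no columns, where the column-wise astype loop collects nothing to concatenate:

import pandas as pd

# a frame with no columns yields no per-column results, so without the
# guard there would be nothing to concatenate back together
pd.DataFrame().astype(pd.SparseDtype("int"))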


"""
try:
result = self._holder._from_sequence(
np.asarray(result).ravel(), dtype=dtype)

Contributor:

Why is the asarray and ravel needed? _from_sequence should take any sequence, so the trip through NumPy shouldn't be needed. Can result be 2-D here? I can't imagine us wanting to ravel a 2-D array, since the length will change.

Contributor Author (@jreback):

I added some comments (short answer: this could be a 2-D numpy array or an EA which is already 1-D).
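
A small sketch of the two shapes being described (values are hypothetical): a block-wise cython result can be a 2-D ndarray with a single row, which ravel flattens to 1-D before _from_sequence, while an ExtensionArray result is already 1-D.

import numpy as np
import pandas as pd

# case 1: a 2-D block result of shape (1, ngroups); ravel gives the 1-D values
block_result = np.array([[1.0, 2.0, 3.0]])
np.asarray(block_result).ravel()        # array([1., 2., 3.])

# case 2: an ExtensionArray result is already 1-D; asarray just materializes it
ea_result = pd.array([1.0, 2.0, 3.0])
np.asarray(ea_result).ravel()           # array([1., 2., 3.]), same length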

'std',
'var',
'sem',
pytest.param('median', marks=pytest.mark.xfail(

Contributor:

Is this a known issue we have an issue open for already?

Contributor Author (@jreback):

No, it was discovered here, but I can't reproduce it locally.


@@ -1319,6 +1328,16 @@ def f(self, **kwargs):
except Exception:
result = self.aggregate(
lambda x: npfunc(x, axis=self.axis))

# coerce the columns if we can
if isinstance(result, DataFrame):

Member:

Shouldn't this logic be in _try_cast? OK as a follow-up, just want to confirm.

Contributor Author (@jreback):

No, _try_cast is for 1-D results; maybe I can make this cleaner.
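
For context, a rough, hypothetical sketch of the kind of column-wise coercion the quoted block performs (the helper name and structure are illustrative, not the PR's actual code):

import pandas as pd

def coerce_columns(result: pd.DataFrame, original: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: try to cast each result column back to the dtype
    of the corresponding original column, leaving it alone if the cast fails."""
    out = result.copy()
    for col in out.columns.intersection(original.columns):
        try:
            out[col] = out[col].astype(original[col].dtype)
        except (TypeError, ValueError):
            pass
    return out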

Contributor Author (@jreback):

I slightly refactored.

@jreback (Contributor Author) commented Jun 8, 2019

The push should make this pass; I will address comments soon.

@jreback (Contributor Author) commented Jun 27, 2019

Replacing this with a PR containing just the categorical grouping changes; the sparse ufunc work conflicts with what @TomAugspurger is doing and is messy.

@jreback closed this Jun 27, 2019
Labels: Bug, Categorical, ExtensionArray, Groupby, Sparse
Projects: None yet
3 participants