
BUG: preserve categorical & sparse types when grouping / pivot & preserve dtypes on ufuncs #26550


Closed
wants to merge 10 commits

Conversation

@jreback (Contributor) commented May 29, 2019

closes #18502
closes #23743

@jreback added the Bug, Groupby, Sparse, Categorical, and ExtensionArray labels May 29, 2019
@jreback added this to the 0.25.0 milestone May 29, 2019
@jreback changed the title from "BUG: preserve categorical & sparse types when grouping / pivot" to "BUG: preserve categorical & sparse types when grouping / pivot & preserve dtypes on ufuncs" May 29, 2019
codecov bot commented May 29, 2019

Codecov Report

Merging #26550 into master will decrease coverage by 0.02%.
The diff coverage is 88.88%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26550      +/-   ##
==========================================
- Coverage   91.77%   91.74%   -0.03%     
==========================================
  Files         174      174              
  Lines       50642    50666      +24     
==========================================
+ Hits        46476    46484       +8     
- Misses       4166     4182      +16
| Flag      | Coverage Δ                |    |
|-----------|---------------------------|----|
| #multiple | 90.28% <88.88%> (-0.02%)  | ⬇️ |
| #single   | 41.66% <44.44%> (-0.12%)  | ⬇️ |

| Impacted Files                   | Coverage Δ                |    |
|----------------------------------|---------------------------|----|
| pandas/core/groupby/generic.py   | 88.07% <100%> (-0.89%)    | ⬇️ |
| pandas/core/series.py            | 93.63% <100%> (+0.01%)    | ⬆️ |
| pandas/core/frame.py             | 97.02% <100%> (-0.11%)    | ⬇️ |
| pandas/core/groupby/ops.py       | 96% <100%> (ø)            | ⬆️ |
| pandas/core/nanops.py            | 94.11% <100%> (ø)         | ⬆️ |
| pandas/core/groupby/groupby.py   | 97.39% <100%> (+0.16%)    | ⬆️ |
| pandas/core/dtypes/cast.py       | 91.54% <100%> (ø)         | ⬆️ |
| pandas/core/generic.py           | 93.42% <25%> (-0.15%)     | ⬇️ |
| pandas/core/internals/blocks.py  | 93.94% <71.42%> (-0.19%)  | ⬇️ |
| pandas/io/gbq.py                 | 78.94% <0%> (-10.53%)     | ⬇️ |
| ... and 6 more                   |                           |    |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7629a18...f397908.

codecov bot commented May 29, 2019

Codecov Report

Merging #26550 into master will decrease coverage by 0.02%.
The diff coverage is 92.3%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26550      +/-   ##
==========================================
- Coverage   91.97%   91.95%   -0.03%     
==========================================
  Files         180      180              
  Lines       50756    50789      +33     
==========================================
+ Hits        46685    46705      +20     
- Misses       4071     4084      +13
| Flag      | Coverage Δ               |    |
|-----------|--------------------------|----|
| #multiple | 90.55% <92.3%> (-0.02%)  | ⬇️ |
| #single   | 41.81% <41.53%> (-0.1%)  | ⬇️ |

| Impacted Files                          | Coverage Δ              |    |
|-----------------------------------------|-------------------------|----|
| pandas/core/sparse/frame.py             | 95.67% <100%> (+0.03%)  | ⬆️ |
| pandas/core/groupby/generic.py          | 88.48% <100%> (-0.86%)  | ⬇️ |
| pandas/core/series.py                   | 93.69% <100%> (+0.02%)  | ⬆️ |
| pandas/core/dtypes/cast.py              | 90.53% <100%> (ø)       | ⬆️ |
| pandas/core/nanops.py                   | 94.76% <100%> (ø)       | ⬆️ |
| pandas/core/frame.py                    | 96.9% <100%> (-0.11%)   | ⬇️ |
| pandas/core/groupby/ops.py              | 96% <100%> (ø)          | ⬆️ |
| pandas/core/groupby/groupby.py          | 97.46% <100%> (+0.29%)  | ⬆️ |
| pandas/core/arrays/sparse.py            | 93.88% <100%> (+0.15%)  | ⬆️ |
| pandas/core/internals/construction.py   | 96.19% <100%> (+0.25%)  | ⬆️ |
| ... and 9 more                          |                         |    |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b9b081d...4bd486e.

@@ -104,12 +104,19 @@ def _cython_agg_blocks(self, how, alt=None, numeric_only=True,

obj = self.obj[data.items[locs]]
s = groupby(obj, self.grouper)
result = s.aggregate(lambda x: alt(x, axis=self.axis))
try:

Member:

What's this for? It's not immediately obvious how this relates to the overall PR.

Contributor Author (@jreback):

This handles blocks that raise NotImplementedError and so can't be aggregated via the cython path, e.g. Categoricals with string categories when aggregating with mean (and numeric_only=False is passed).
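
A minimal sketch of the scenario being described (the column names are hypothetical, and the exact exception type and message vary by pandas version): a Categorical with string categories has no meaningful mean, so the cython aggregation of that block fails and falls through to the alternate path.

import pandas as pd

# hypothetical frame: one grouping key, one string-categorical column
df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "cat": pd.Categorical(["x", "y", "x"]),
})

# a mean over string categories is not defined; inside groupby the cython
# aggregation of the categorical block raises and falls through to the
# alternate path described above (and ultimately surfaces an error here)
df.groupby("key")["cat"].mean()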


Contributor:

Can you give an example? I'm also having trouble seeing this.

Member:

If those raise NotImplementedError, can we limit the scope of the catching to just that?

@TomAugspurger (Contributor) commented May 30, 2019

@jreback rough sketch of the idea in #26550 (comment)

diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index ce1cb37ab..898d03e6c 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -1945,6 +1945,10 @@ class NDFrame(PandasObject, SelectionMixin):
     # GH#23114 Ensure ndarray.__op__(DataFrame) returns NotImplemented
     __array_priority__ = 1000
 
+    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
+        result = self._data.map_blocks(ufunc, **kwargs)
+        return type(self)(result)
+
     def __array__(self, dtype=None):
         return com.values_from_object(self)
 
diff --git a/pandas/core/internals/managers.py b/pandas/core/internals/managers.py
index 0b63588c9..9c7cc4bd5 100644
--- a/pandas/core/internals/managers.py
+++ b/pandas/core/internals/managers.py
@@ -117,6 +117,10 @@ class BlockManager(PandasObject):
 
         self._rebuild_blknos_and_blklocs()
 
+    def map_blocks(self, func, **kwargs):
+        newbs = [make_block(func(block.values, **kwargs), block.mgr_locs) for block in self.blocks]
+        return type(self)(newbs, self.axes)
+
     def make_empty(self, axes=None):
         """ return an empty BlockManager with the items axis of len 0 """
         if axes is None:

usage:

In [1]: import numpy as np; import pandas as pd

In [2]: df = pd.DataFrame({"A": [1, 0]}, dtype=pd.SparseDtype('int'))

In [3]: np.sin(df)
Out[3]:
          A
0  0.841471
1  0.000000

In [4]: np.sin(df).dtypes
Out[4]:
A    Sparse[float64, 0.0]
dtype: object

The __array_ufunc__ isn't really correct yet, but does it make sense? I think that's much cleaner than doing the op and trying to cast back.

On this branch, the casting seems problematic:

In [10]: df = pd.DataFrame({"A": [1, 2, 3]}, dtype=pd.SparseDtype('int'))

In [11]: np.sqrt(df)
Out[11]:
   A
0  1
1  1
2  1

@jreback (Contributor Author) commented May 30, 2019

Yeah, I can do this via blocks.

__array_ufunc__ is orthogonal here - I don't think we have any implementation yet (we should, though).

@jreback (Contributor Author) commented Jun 2, 2019

fixed up the casting, a bit tricky.

@@ -5755,6 +5736,11 @@ def astype(self, dtype, copy=True, errors='raise', **kwargs):
**kwargs)
return self._constructor(new_data).__finalize__(self)

if not results:

Contributor:

What case hits this? I'm not immediately seeing it.

Contributor Author (@jreback):

empty frames :->
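
A minimal sketch of that case, assuming the `if not results:` guard covers a frame with no columns, where the column-wise astype loop collects nothing to concatenate:

import pandas as pd

# a frame with no columns yields no per-column results, so without the
# guard there would be nothing to concatenate back together
pd.DataFrame().astype(pd.SparseDtype("int"))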


"""
try:
result = self._holder._from_sequence(
np.asarray(result).ravel(), dtype=dtype)

Contributor:

Why is the asarray and ravel needed? _from_sequence should take any sequence, so the trip through NumPy shouldn't be needed. Can result be 2-D here? I can't imagine us wanting to ravel a 2-D array, since the length will change.

Contributor Author (@jreback):

I added some comments (short answer: this could be a 2-D numpy array or an EA which is already 1-D).
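
A small sketch of the two shapes being described (values are hypothetical): a block-wise cython result can be a 2-D ndarray with a single row, which ravel flattens to 1-D before _from_sequence, while an ExtensionArray result is already 1-D.

import numpy as np
import pandas as pd

# case 1: a 2-D block result of shape (1, ngroups); ravel gives the 1-D values
block_result = np.array([[1.0, 2.0, 3.0]])
np.asarray(block_result).ravel()        # array([1., 2., 3.])

# case 2: an ExtensionArray result is already 1-D; asarray just materializes it
ea_result = pd.array([1.0, 2.0, 3.0])
np.asarray(ea_result).ravel()           # array([1., 2., 3.]), same length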

'std',
'var',
'sem',
pytest.param('median', marks=pytest.mark.xfail(

Contributor:

Is this a known issue we have an issue open for already?

Contributor Author (@jreback):

No, it was discovered here, but I can't reproduce it locally.


@@ -1319,6 +1328,16 @@ def f(self, **kwargs):
except Exception:
result = self.aggregate(
lambda x: npfunc(x, axis=self.axis))

# coerce the columns if we can
if isinstance(result, DataFrame):

Member:

Shouldn't this logic be in _try_cast? OK as a follow-up, just want to confirm.

Contributor Author (@jreback):

No, _try_cast is for 1-D results; maybe I can make this cleaner.
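
For context, a rough, hypothetical sketch of the kind of column-wise coercion the quoted block performs (the helper name and structure are illustrative, not the PR's actual code):

import pandas as pd

def coerce_columns(result: pd.DataFrame, original: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: try to cast each result column back to the dtype
    of the corresponding original column, leaving it alone if the cast fails."""
    out = result.copy()
    for col in out.columns.intersection(original.columns):
        try:
            out[col] = out[col].astype(original[col].dtype)
        except (TypeError, ValueError):
            pass
    return out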

Contributor Author (@jreback):

I slightly refactored.

@jreback (Contributor Author) commented Jun 8, 2019

The push should make this pass; I will address comments soon.

@jreback (Contributor Author) commented Jun 27, 2019

Replacing this with a PR containing just the categorical grouping changes; the sparse ufunc work conflicts with what @TomAugspurger is doing and is messy.

@jreback closed this Jun 27, 2019
Labels: Bug, Categorical, ExtensionArray, Groupby, Sparse
Projects: None yet
3 participants