Inconsistencies between python/cython groupby.agg behavior #35928

jbrockmendel · 2020-08-27T16:08:57Z

This is pretty ugly, but tentatively is sufficient to make #34997 pass.

The upshot is that we have two problems:

- In libreduction, calling setattr(cached_ityp, '_index_data', islider.buf) silently does the wrong thing for EA-backed (ExtensionArray-backed) indexes.
- When we go through the non-libreduction path, we do things slightly differently, which requires more patches to get tests passing.

cc @WillAyd

jbrockmendel · 2020-09-02T01:21:34Z

pandas/core/groupby/generic.py


            index = Index(sorted(result), name=self.grouper.names[0])
+            if isinstance(index, (DatetimeIndex, TimedeltaIndex)):
+                # TODO: do we _always_ want to do this?
+                #  shouldnt this be done later in eg _wrap_aggregated_output?


@WillAyd it looks like this whole block (L288-L307) can be replaced by setting index = self.grouper.result_index without breaking any tests. That de facto works for existing tests, but I'd like to confirm that it works in general. Do you have a good read on whether it does?

WillAyd · 2020-09-02

I'd say go for it if it passes the tests - it's certainly a lot simpler
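
As a rough illustration of why the replacement is plausible, the sketch below uses only public pandas API (it does not touch the internal grouper) to check that, for a sorted single-key groupby, the aggregated result's index matches an index built by sorting the group keys. The data and the names `val`, `key` and `when` are made up for illustration, and this checks a couple of cases rather than proving the general claim.

```python
import pandas as pd

# Made-up data: four values grouped by a string key.
values = pd.Series([1, 2, 3, 4], name="val")
keys = pd.Series(["b", "a", "b", "a"], name="key")

# Aggregate with a user-defined function.
result = values.groupby(keys).agg(lambda x: x.sum())

# Roughly what the hand-written block built: sorted group labels, named after the key.
manual_index = pd.Index(sorted(keys.unique()), name="key")
same_index = result.index.equals(manual_index) and result.index.name == "key"

# Datetime keys: the result index comes back as a DatetimeIndex.
dt_keys = pd.Series(
    pd.to_datetime(["2020-01-02", "2020-01-01", "2020-01-02", "2020-01-01"]),
    name="when",
)
dt_result = values.groupby(dt_keys).agg(lambda x: x.sum())
is_datetime_index = isinstance(dt_result.index, pd.DatetimeIndex)
```

Cases this does not cover (multiple keys, sort=False, categorical keys) would need their own checks before treating the two approaches as interchangeable.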

jbrockmendel · 2020-09-02T22:38:56Z

pandas/core/groupby/generic.py

+                    result = self._aggregate_maybe_named(func, *args, **kwargs)
+
+            index = self.grouper.result_index
+            assert index.name == self.grouper.names[0]


i guess this would fail if we had a name that was NA
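
The concern can be seen without pandas: NaN never compares equal to itself, so an equality assert on a NaN name fails even when both sides are "the same" name. A minimal sketch, assuming "NA" here means a float NaN; `names_match` is a hypothetical helper, not something from this PR.

```python
import math


def names_match(left, right):
    """Compare two index names, treating two float NaNs as equal."""
    both_nan = (
        isinstance(left, float)
        and isinstance(right, float)
        and math.isnan(left)
        and math.isnan(right)
    )
    return both_nan or left == right


nan_name = float("nan")

# Plain equality is False for NaN, so `assert index.name == names[0]` would fail.
plain_equal = nan_name == nan_name

# The NaN-aware comparison accepts it, and still behaves normally otherwise.
nan_aware_equal = names_match(nan_name, float("nan"))
```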

jreback

you are right, this is not pretty.

does this have a test that is added that doesn't work now?

jbrockmendel · 2020-09-05T03:37:29Z

> you are right, this is not pretty.

it started off even worse.

> does this have a test that is added that doesn't work now?

No. The motivation is to get #34997 working (which is needed in order to upgrade to cython3 when it's eventually released).

jbrockmendel · 2020-09-06T03:09:07Z

This is also a prerequisite to killing off the _index_data kludge.

jreback · 2020-09-13T11:19:33Z

if u can rebase will have a look

jbrockmendel · 2020-09-14T01:45:29Z

rebased per request

jreback

do we have tests that show the inconsistency now? and how does this fix it?

jreback · 2020-09-15T02:13:46Z

pandas/core/groupby/generic.py

            ret = create_series_with_explicit_dtype(
                result, index=index, dtype_if_empty=object
            )
+            ret.name = self._selected_obj.name  # test_metadata_propagation_indiv


can you pass the name in L284
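
For a plain Series, passing the name at construction and assigning it afterwards give the same result, which is what would make the suggestion a pure simplification. A small sketch using the public `pd.Series` constructor as a stand-in for the internal helper in the diff; the data and the name `"val"` are made up.

```python
import pandas as pd

data = {"a": 6, "b": 4}
index = pd.Index(["a", "b"], name="key")

# Name assigned after construction, as in the diff.
assigned_later = pd.Series(data, index=index)
assigned_later.name = "val"

# Name passed up front, as the review suggests.
passed_up_front = pd.Series(data, index=index, name="val")

names_agree = assigned_later.name == passed_up_front.name
series_agree = assigned_later.equals(passed_up_front)
```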

jreback · 2020-09-15T02:14:55Z

pandas/core/groupby/generic.py

+        what libreduction does.
+        """
+        try:
+            return self._aggregate_named(func, *args, named=True, **kwargs)


what is this actually doing?

jbrockmendel · 2020-09-15

trying to track down an answer to why i did this, may take a bit

OK now i recall. in e.g. test_apply_columns_multilevel we do:

    cols = pd.MultiIndex.from_tuples([("A", "a", "", "one"), ("B", "b", "i", "two")])
    ind = date_range(start="2017-01-01", freq="15Min", periods=8)
    df = DataFrame(np.array([0] * 16).reshape(8, 2), index=ind, columns=cols)
    agg_dict = {col: (np.sum if col[3] == "one" else np.mean) for col in df.columns}
    result = df.resample("H").apply(lambda x: agg_dict[x.name](x))

so the function we are passing to apply, lambda x: agg_dict[x.name](x), depends on the Series name x.name. But what we're actually doing ATM (and since this is tested, i guess it means we support it on purpose?) is patching the Series name to match the group name, so that we end up applying a different function to each group.

But the user could also pass a function that depends on the non-patched Series name, so we end up guessing which regime we're in, which is what this method is doing.

If it were up to me I'd shoot this name-patching thing into the sun.
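
The guessing described above can be sketched in plain Python. This is a toy model, not the PR's code: `NamedValues` is a hypothetical stand-in for a Series, and since the diff only shows the `try` branch, falling back on `KeyError` is an assumption made for illustration.

```python
class NamedValues:
    """Hypothetical stand-in for a Series: some values plus a name."""

    def __init__(self, values, name):
        self.values = list(values)
        self.name = name


def aggregate(column, groups, func, patch_name):
    """Apply func to each group's chunk of column.

    With patch_name=True each chunk is renamed to its group key;
    otherwise it keeps the column's own name.
    """
    result = {}
    for key, positions in groups.items():
        name = key if patch_name else column.name
        chunk = NamedValues((column.values[i] for i in positions), name)
        result[key] = func(chunk)
    return result


def aggregate_maybe_named(column, groups, func):
    """Try the name-patched regime first, then fall back to the unpatched one."""
    try:
        return aggregate(column, groups, func, patch_name=True)
    except KeyError:
        # Assumption: a failed lookup means func expected the original name.
        return aggregate(column, groups, func, patch_name=False)


column = NamedValues([1, 2, 3, 4], name="price")
groups = {"a": [0, 1], "b": [2, 3]}

# Regime 1: func is keyed on the group name, so patching is what it wants.
by_group = {"a": sum, "b": max}
per_group = aggregate_maybe_named(column, groups, lambda x: by_group[x.name](x.values))

# Regime 2: func is keyed on the column name, so patching raises and we fall back.
by_column = {"price": sum}
per_column = aggregate_maybe_named(column, groups, lambda x: by_column[x.name](x.values))
```

The weakness of this kind of guess shows up when a group key happens to equal the column name: the first attempt succeeds and the fallback never runs, whichever regime the caller intended.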

jreback · 2020-09-15

grr, this is really complex.

jbrockmendel · 2020-09-15T02:32:09Z

> do we have tests that show the inconsistency now? and how does this fix it?

This does not fix anything that is broken in master. But under #34997 (which is needed for cy3 compat), some paths that currently go through cython will instead go through python, and would then fail.

jreback · 2020-09-15T22:37:29Z

> > do we have tests that show the inconsistency now? and how does this fix it?
>
> This does not fix anything that is broken in master. But under #34997 (which is needed for cy3 compat), some paths that currently go through cython will instead go through python, and would then fail.

maybe i am not following everything, but do we need to use cy3 in 3.9?

jbrockmendel · 2020-09-15T23:44:07Z

> maybe i am not following everything, but do we need to use cy3 in 3.9?

i hope not; cy3 isn't even out yet

jreback · 2020-09-15T23:48:04Z

> > maybe i am not following everything, but do we need to use cy3 in 3.9?
>
> i hope not; cy3 isn't even out yet

so we don't need this rn then?

jbrockmendel · 2020-09-15T23:48:57Z

> so we don't need this rn then?

correct AFAIK

jbrockmendel · 2020-09-17T20:56:40Z

closing in favor of #34997

jbrockmendel and others added 20 commits August 20, 2020 21:19

REF: remove unnecesary try/except

4c5eddd

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

c632c9f

…f-blockwise-3

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

9e64be3

…f-blockwise-3

TST: add test for agg on ordered categorical cols (pandas-dev#35630)

42649fb

TST: resample does not yield empty groups (pandas-dev#10603) (pandas-…

47121dd

…dev#35799)

revert accidental rebase

1decb3e

Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…

57c5dd3

…ster

Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…

a358463

…ster

Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…

ffa7ad7

…ster

Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…

e5e98d4

…ster

Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…

408db5a

…ster

Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…

d3493cf

…ster

Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…

75a805a

…ster

Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…

9f61070

…ster

Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…

2d10f6e

…ster

Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…

3e20187

…ster

REF/BUG: don't go through cython for EA indexes

51205a5

Implement _aggregate_maybe_named

f453c5b

de-duplicate

2ae2124

avoid passing RangeIndex to libreduction

98a91a3

jbrockmendel mentioned this pull request Aug 27, 2020

REF: dont set ndarray.data in libreduction #34997

Closed

jbrockmendel added 2 commits August 30, 2020 10:58

Merge branch 'master' of https://github.com/pandas-dev/pandas into ca…

5f73b03

…tch-less

Merge branch 'master' of https://github.com/pandas-dev/pandas into ca…

065fc69

…tch-less

simplify

c230f72

jbrockmendel changed the title ~~WIP: inconsistencies between python/cython groupby.agg behavior~~ Inconsistencies between python/cython groupby.agg behavior Sep 3, 2020

Merge branch 'master' of https://github.com/pandas-dev/pandas into ca…

bf2e171

…tch-less

jreback requested changes Sep 5, 2020

jreback added the Groupby label Sep 5, 2020

Merge branch 'master' of https://github.com/pandas-dev/pandas into ca…

a4bcf43

…tch-less

jbrockmendel mentioned this pull request Sep 12, 2020

BLD/CI: 3.9 support #36296

Closed

Merge branch 'master' of https://github.com/pandas-dev/pandas into ca…

607aea8

…tch-less

jreback requested changes Sep 15, 2020

jbrockmendel closed this Sep 17, 2020

jbrockmendel deleted the catch-less branch September 17, 2020 20:56
