-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Inconsistencies between python/cython groupby.agg behavior #35928
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/groupby/generic.py
Outdated
|
||
index = Index(sorted(result), name=self.grouper.names[0]) | ||
if isinstance(index, (DatetimeIndex, TimedeltaIndex)): | ||
# TODO: do we _always_ want to do this? | ||
# shouldnt this be done later in eg _wrap_aggregated_output? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd it looks like this whole block L288-L307 can be replaced by setting index = self.grouper.result_index
without breaking any tests. That de-facto works for existing tests, but I'd like to confirm that it works in general. do you have a good read on if it does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say go for it if it passes the tests - it's certainly a lot simpler
result = self._aggregate_maybe_named(func, *args, **kwargs) | ||
|
||
index = self.grouper.result_index | ||
assert index.name == self.grouper.names[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess this would fail if we had a name that was NA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right, this is not pretty.
does this have a test that is added that doesn't work now?
it started off even worse.
No. The motivation is to get #34997 working (which is needed in order to upgrade to cython3 when its eventually released) |
This is also prerequisite to killing off the _index_data kludge |
if u can rebase will have a look |
rebased per request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have tests that show the inconsistency now? and how does this fix it?
ret = create_series_with_explicit_dtype( | ||
result, index=index, dtype_if_empty=object | ||
) | ||
ret.name = self._selected_obj.name # test_metadata_propagation_indiv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you pass the name in L284
what libreduction does. | ||
""" | ||
try: | ||
return self._aggregate_named(func, *args, named=True, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this actually doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trying to track down an answer to why i did this, may take a bit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK now i recall. in e.g. test_apply_columns_multilevel we do:
cols = pd.MultiIndex.from_tuples([("A", "a", "", "one"), ("B", "b", "i", "two")])
ind = date_range(start="2017-01-01", freq="15Min", periods=8)
df = DataFrame(np.array([0] * 16).reshape(8, 2), index=ind, columns=cols)
agg_dict = {col: (np.sum if col[3] == "one" else np.mean) for col in df.columns}
result = df.resample("H").apply(lambda x: agg_dict[x.name](x))
so the function we are passing to apply is lambda x: agg_dict[x.name](x)
depends on the Series name x.name
but what we're actually doing ATM (and since this is tested, i guess it means we support it on purpose?) is patching the Series name to match the group name so that we end up applying a different function to each group.
But the user could also pass a function that depends on the non-patched Series name, so we end up guessing which regime we're in, which this is doing.
If it were up to me I'd shoot this name-patching thing into the sun.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grr, this is really complex.
This does not fix anything that is broken in master. But under #34997 (which is needed for cy3 compat), some paths that currently go through cython will instead go through python, and would then fail. |
maybe i am not following everything, but do we need to use cy3 in 3.9? |
i hope not; cy3 isnt even out yet |
so we don't need this rn then? |
correct AFAIK |
closing in favor of #34997 |
This is pretty ugly, but tentatively is sufficient to make #34997 pass.
The upshot is that we have two problems:
setattr(cached_ityp, '_index_data', islider.buf)
silently does the wrong thing for EA-backed indexescc @WillAyd