Skip to content

REGR: properly update DataFrame cache in Series.__setitem__ #48215

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Aug 23, 2022

Potential fix for #47172

#43406 added an inplace keyword to _maybe_update_cacher, but made the default False. While before that PR, the effective behaviour would rather match True I think (that PR was meant to make some cases not inplace).
However, if we are in the case of Series.__setitem__ needing a cache update, so meaning that this series is a column of a DataFrame and we might need to update that DataFrame, I think we always should try update that DataFrame inplace (we are not in a DataFrame.__setitem__ case, but rather setting into a column we got with DataFrame.__getitem__, so chained assignment, which currently still is expected to work in this case).

@jorisvandenbossche jorisvandenbossche added this to the 1.4.4 milestone Aug 23, 2022
@mroeschke mroeschke added the Indexing Related to indexing on series/frames, not to indexes themselves label Aug 23, 2022
@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review August 23, 2022 22:15
@jorisvandenbossche
Copy link
Member Author

I am not fully sure how to exactly test #47172 (but I verified manually that it fixes the memory increase), but I did add a test that fails on main and 1.4 (and worked on pandas 1.3) and passes with this fix.

@simonjayhawkins
Copy link
Member

Potential fix for #47172

#43406 added an inplace keyword to _maybe_update_cacher, but made the default False. While before that PR, the effective behaviour would rather match True I think (that PR was meant to make some cases not inplace). However, if we are in the case of Series.__setitem__ needing a cache update, so meaning that this series is a column of a DataFrame and we might need to update that DataFrame, I think we always should try update that DataFrame inplace (we are not in a DataFrame.__setitem__ case, but rather setting into a column we got with DataFrame.__getitem__, so chained assignment, which currently still is expected to work in this case).

Thanks for the explanation. makes sense.

I am not fully sure how to exactly test #47172 (but I verified manually that it fixes the memory increase), but I did add a test that fails on main and 1.4 (and worked on pandas 1.3) and passes with this fix.

looks fine.

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jorisvandenbossche lgtm pending green

@@ -26,6 +26,7 @@ Fixed regressions
- Fixed regression in :meth:`DataFrame.loc` setting a length-1 array like value to a single value in the DataFrame (:issue:`46268`)
- Fixed regression when slicing with :meth:`DataFrame.loc` with :class:`DateOffset`-index (:issue:`46671`)
- Fixed regression in setting ``None`` or non-string value into a ``string``-dtype Series using a mask (:issue:`47628`)
- Fixed regression in updating a DataFrame column through :meth:`Series.__setitem__` (:issue:`47172`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Series.__setitem__ doesn't render? so perhaps reword and maybe should also mention memory leak or that the values were not updated inplace using chained indexing with assignment?

@@ -1235,3 +1235,21 @@ def test_setitem_not_operating_inplace(self, value, set_value, indexer):
view = df[:]
df[indexer] = set_value
tm.assert_frame_equal(view, expected)

@td.skip_array_manager_invalid_test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this skips whereas skip_array_manager_not_yet_implemented xfails.

Now this test fails with AttributeError: 'ArrayManager' object has no attribute 'blocks' since we are explicitly testing the BlockManager values.

so it does make sense to just skip.

but are we likely to want a similar test for array manger?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it indeed makes sense to skip this test instead of xfail, since the way is written is not intended to ever work for ArrayManager. But so we could indeed write a separate test that checks for the the ArrayManager case (but I would say it's maybe not a super high priority, I can only take a look in the weekend).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I would say it's maybe not a super high priority

no problem. not a blocker.

@simonjayhawkins simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Aug 25, 2022
@simonjayhawkins
Copy link
Member

@meeseeksdev backport 1.5.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants