-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
COMPAT: remove SettingWithCopy warning, and use copy-on-write, #10954 #10973
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
It looks like this does deal with @nickeubank's most recent issue, where changes on the original dataframe don't change the intermediate, even if it was created with a view? How do you do that?
It looks like currently accessing a column as a series returns a view, which is good. But is also seems like accessing a row as a series can sometimes also return a view, which is not what we want (that should always be a copy or copy-on-write):
|
|
||
# we have a chained assignment | ||
# assign back to the original | ||
self.is_copy().loc[self.index,key] = value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better name for is_copy()
if it returns the parent, then parent()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this is a confusing name. What is the value of this variable? Maybe get_parent()
would be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeh I can prob just change that. I left for now. Its an attribute that is None
or a weakref
also deprecated the implementation is this: every type of operation that returns something which is not a 'view', IOW, not a column or a slice of a column (for a DataFrame). sets
Then [3] will do a check (so all actual set operations, including inplace) will see that we have a ref. So do this funky Now we don't actually see the the hard part was actually putting in place all of the checks on when to record the |
This solution is pretty dark magic. Even though it is (sort of) technically feasible, I'm not sure it's a good idea. People certainly do put DataFrames in built-in data structures like lists, so @JanSchulz's example where this breaks will assuredly come up:
Maybe another approach would be to try to make views/copies more predictable, by, for example, disabling automatic block consolidation? Another possibility would be to make creating a view turn both the original object AND the new selection into copy-on-write. This would eliminate the need to look at the garbage collector, though it could make repeated modification/indexing of a dataframe very inefficient, e.g., df = pd.DataFrame({'A': np.random.randn(10000)})
for i in df.index:
# this assignment now does a complete copy every time!
df.loc[i, 'B'] = df.loc[i:(i + 1), 'A'].sum() > 0 On the other hand, R users seem to deal with copy-on-write like this just fine.... |
@shoyer consolidation are on the agenda, but certainly not in the next week. they are a whole can of worms. |
@jreback wow, this is amazing. I can't believe you implemented this in a day! I'll leave the issues of internal implementation to you more experienced developers, but will be sure to help with the doc updates when we settle on final behaviors! |
def _get_is_copy(self): | ||
warnings.warn("is_copy is deprecated will be removed in a future release", | ||
FutureWarning) | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, but if this really internal property is basically now meaningless, then IMO it would be better to cleanly break code which depends on this (=remove the property without warning) than just change the API contract of this property...
Here is a bit more magic (as in "I don't really understand all what's happening here, but it seems to work...") which gets even references to df slices in lists/dicts or attributes in objects:
Examples:
The performance penalty of The three additional lines in each method in |
@jreback is this planned for 0.17? My first-take reaction to skimming the implementation is worry that it's fragile and will be hard to maintain (not that I'll really be the on maintaining it 😄). |
@TomAugspurger I didn't tag it. Still needs some work. Its not much more fragile that the current Let me see how much can fix for next week. |
I would think it's not yet for 0.17? Certainly if we want to get a release candidate out soon to have a release the coming month. For such a change, I think we should take some more time. |
yes 0.17 is already quite full and has lots of depreciations and such not sure of philosophy for changes in releases |
As I noted before, as much as I want this to work, I have serious reservations about the approach here. As frustrating as the status quo is, if we can't do copy-on-write right, maybe we shouldn't do it at all... |
Explicitly calling It seems like a bad idea to bake this into every |
thrn kindly shoe another way to do this correctness is the most important thing here |
Would it be possible to simplify the implementation of we didn't try to
|
maybe, put this aside for bit. as going to do the release candidate for 0.17.0 shortly. @nickeubank fee free to give a go! |
I think that makes sense -- not something to rush! @jreback will definitely take a look! This will definitely be the farthest I've gone into the pandas internals, but worth a shot. :) |
@shoyer If we dropped the Re: Also relatedly: weren't we using |
My inclination is that a better approach would be to come up with a simpler, more consistent set of rules that determine whether we make copies or views. This is closer to the approach of NumPy and xray. We'll hurt performance when we turn some cases that currently use views into copies, but it will be a net win for predictability. |
How about adding two (additional) entries: one for setting, one for getting? Setting fails if a copy must be made, getting always returns a copy? This is kind of like the |
@shoyer Ah, I see -- thank you for clarifying! Helpful to know your specific concerns. I agree we don't want to do anything that will introduce new fragilities, but I'm not quite ready to abandon copy-on-write just because our first attempt at implementation failed. As @jreback, there are some pretty huge performance gains from views, and if we can find a way to keep them when possible it'd be nice. Now that I understand the source of the concern, let me give a little more thought to alternatives -- if our only major concern is the reliance on Out of curiosity, as we're getting pretty deep into python internals, is there a forum where we might be able to seek guidance from people who do work on the language itself, rather than higher-level tools like Similarly, any idea if this gets better in Python 3? I recognize Python 2 is likely to hang around for a while, so even if the @JanSchulz I think different indexers is a reasonable solution down the road, btw! Not quite ready to go there yet, but a good thought! |
Brainstorming a little: the primary role of Rather than trying to accomplish the redirect from Relatedly: this seems perfectly analogous to the behavior provided by |
CMIIW, I don't have a better solution to make all that work, and there probably isn't one. @nickeubank Assignment in Python has no modification hooks, so I think even if the C API was exposing more functionalities, this would still not be easy to pull off (assignment still happens in Python). AFAIK P3 is no different, and we shouldn't be relying on language internals anyways. If we want |
@kawochen Thanks for the followup! So would we be ok (i.e. not need This has come up before, and I think most people are ok with EDIT for clarity: By "ok", I mean can we do copy-on-write safely without replying on the |
OK lemme read all the discussion in the original issue before polluting this PR with my nonsense--I don't seem to fully understand what the end goal is. |
:) @kawochen haha - ok! The issue that gave rise to this (#10954) may be more helpful than this PR. This PR is amazing, but also did more than I think was required in that it would support chain indexing, but if we can support copy-on-write if we drop that, that'd be great! (Do not take confusion on my part as a sign you don't know what's going on -- I'm very interested in this from a design perspective, but am a little out of my depth with the machinery) |
@kawochen did you ever have a chance to look this over and figure out if you'd found a solution? |
@shoyer @jreback Reading this over more carefully, I'm pretty sure that we only use gc to make it possible to do chained indexing on a single line (i.e. In light of that, might I suggest a way forward?
By the latter, I mean preventing the following:
I've been trying to do a little implementation myself, but I'm not sure I'm quite capable yet. Nevertheless, here are some thoughts I've been wrestling with:
|
@nickeubank I'm still not sure how to reconcile the various 'desired' features with how Python works. Forbidding chained indexing sounds like a huge change. |
@kawochen I'm not sure I'd characterize it as a huge change since we don't really support it now -- it sometimes works, but not consistently. In fact, that's part of what started this discussion (see #10954 ) I think the consensus from the prior discussion (@jreback @shoyer can weigh in if they want) is:
@jreback made this implementation that preserved chained-indexing, but it seems to rely on |
It's simply not possible to make chained indexing always work if indexing can ever return a copy instead of a view. This is the limitation of how Python works. Nor is it really possible or desirable to make chained indexing always fail, unless we remove all use of views (which is a non-starter). Copy on write isn't really feasible with mainstream methods in Python. The most important thing is that chained indexing should fail or succeed predictably. It doesn't need to always fail, but the few cases where it will succeed should be obviously identifiable. |
Hi! I understand this is a very fundamental discussion, but even semi-chaining triggers the warning at the moment:
The results are still good though (it did save the result). Maybe there can be some way where a user can enforce a copy (as at least a temporary solution?) BR Carst |
@CarstVaartjes Yes, that is exactly the sort of useless warning we would like to remove. If you type |
pass | ||
p = self._parent[0]() | ||
if p is not None: | ||
names = get_names_for_obj(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback If the fragile get_names_for_obj
is merely for prohibiting chained indexing, then perhaps we can get rid of it altogether? With this change, chained indexing will always fail (due to copy-on-write), so I don't think we actually need this anymore.
Is there a summary somewhere of exactly what the semantics of the proposed new behavior are? Currently I think there are some situations where you get SettingWithCopy warning, but the value is in fact set on the "original" object. What worries me is the possibility that with this change, those will now work by copying the original data. Is the idea that this fix will get smarter and not copy anything in those false-alarm cases, and only copy stuff when it really needs to? Also, when working with large DataFrames in memory, a silent copy-on-write could bring things to a screeching halt. There should be some way for people to say "don't copy anything unless I tell you to". |
Hi @BrenBarn The discussions are currently spread across one issue #10954 and two PRs (#10973 and another implementation underway #11207). The basic consensus is that we'd love to preserve views wherever possible, but only in situation where we can consistently offer access to views. The problem at the moment is that pandas is sometimes offering views and sometimes not (as you say, there are some false alarms -- full discussion at top of #10954 ), but it's effectively unpredictable when those occur. Preserving full column slices as views, for example, we can always do, and we're working to preserve. |
closes #10954
mode.chained_assignment
as its now unused (it shows the deprecation if you are explicity using it, more friendly this way).NDFrame.is_copy
propertySettingWithCopyWarning
entirelyTODO:
iloc/loc/[]
, chain, and set various values including dtype changessemantics:
Intermediate assignment trigger copy-on-write and do not propogate.
Chained assignments always work!
Even true with cross-sections that change dtype
except for a really egregious case, which will raise a
SettingWithCopyError
(maybe I could even fix this....)This is an invalid assignment. I suppose we could make it work, but we currently havent' allowed the
.dt
to be used as a setitem accessorcc @nickeubank
cc @JanSchulz
cc @ellisonbg
cc @CarstVaartjes
@shoyer
special thanks to @JanSchulz for some reference checking code :)