BUG: Possibly invalidate the item_cache when numpy implicitly converts a v... #3977
@wesm this was pretty tricky... not 100% sure I am right, what do you think?
Where precisely was the type conversion happening before that was leading to a stale cached Series in
I don't think there was any type conversion. As near as I can tell it's somehow the even though I know we are not touching the memory on FYI - there is a method on BlockManager to do this in this PR,
@wesm any thoughts on this?
@wesm
This is effectively copy-on-write. To actually implement it, a Series would have to know that it is being cached (trivial), but the invalidation needs to occur whenever you change something in the Series. Without storing a reference to the cacher (the DataFrame), this is very hard, I think, so we just need to think about the cases where the cache in the frame has to be invalidated. This PR solves the current issues, but I am not sure whether other related issues are lurking.
@wesm ok... this solves the problem: keeping a weak ref to the cacher in the Series (and invalidating when writing). But take a look in any event.
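The weak-reference idea in the comment above can be sketched roughly as follows. This is a toy illustration, not the code in this PR: `CachingFrame`, `CachedColumn`, `set_value`, and `invalidate` are hypothetical names, and only the `_item_cache` and "cacher" terminology comes from the discussion.

```python
import weakref


class CachedColumn:
    """Hypothetical stand-in for a Series handed out by a frame's item cache."""

    def __init__(self, name, values):
        self.name = name
        self.values = values
        self._cacher = None  # weak reference to the owning frame, if cached

    def set_value(self, pos, value):
        self.values[pos] = value
        # On every write, ask the owner (if it is still alive) to drop
        # its cached entry so the next lookup is rebuilt from the data.
        owner = self._cacher() if self._cacher is not None else None
        if owner is not None:
            owner.invalidate(self.name)


class CachingFrame:
    """Hypothetical stand-in for a DataFrame with an item cache."""

    def __init__(self, data):
        self._data = data        # column name -> list of values
        self._item_cache = {}

    def __getitem__(self, name):
        if name not in self._item_cache:
            column = CachedColumn(name, self._data[name])
            # A weak reference avoids a strong frame -> column -> frame cycle.
            column._cacher = weakref.ref(self)
            self._item_cache[name] = column
        return self._item_cache[name]

    def invalidate(self, name):
        self._item_cache.pop(name, None)


frame = CachingFrame({"bb": [0.0, 1.0]})
column = frame["bb"]
was_cached = "bb" in frame._item_cache      # cached after the first lookup
column.set_value(0, 0.13)
still_cached = "bb" in frame._item_cache    # dropped by the write
reread = frame["bb"].values[0]              # fresh lookup sees the new value
```

The column never holds a strong reference to its frame, so caching it does not keep the frame alive; the cost is one weak-reference dereference per write.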
…a view to a copy (GH3970)
PERF: testing perf
BUG: implement cache tracking with a weak reference
Can you please wait to merge this until I have time to look a bit closer? Will try to look today or tomorrow.
np
This is more pernicious than I thought: on master:
going to dig in and figure out what in the hell is going on for my own sake
Block consolidation is not invalidating the cache:
Fix coming shortly
yep looks right....
closing in favor of #4077
closes #3970
1. `df['bb'].iloc[0] = 0.13`

   Here `bb` is put into the `_item_cache` and its first element is assigned 0.13; `bb` is still a view onto the frame's block.

2. `df_tmp = df.iloc[[True]*len(df)]`

   `df_tmp` is now a copy of `df`, but built up as a take, block by block, from `df`. I believe numpy then decides to invalidate the views to the memory in the float block (why, I have no idea); `bb` in `df` is still holding the view in the `_item_cache` to the old memory location.

3. `df['bb'].iloc[0] = 0.15`

   This grabs `bb` from the cache and assigns 0.15 to it, BUT it is no longer a view onto the float block in `df`, so nothing appears to get updated.

I tried 2 fixes:

- In `_item_cache`, check if it's a view of the underlying data. This works, but makes lookups way slow.
- Invalidate the cache when `taking` ON THE ORIGINAL frame, so future lookups will cache-miss and get the correct data from the block manager.

Fundamentally we COULD do this in the first statement, `df['bb'].iloc[0] = 0.13`, where we don't reset the cache, except that we don't have a reference to `df` (well, we do, but by the time the operation is carried out we have an operation on the Series, so we have lost the frame reference).

So this is fixed in this case, but I am not sure what other numpy ops implicitly convert view -> copy (while we are holding onto the cached view).
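The stale-view effect described above can be mimicked with plain NumPy. The sketch below is only an analogy and does not use pandas internals: `ToyFrame`, `get_column`, and `rebuild_block` are hypothetical names, and the block swap is simulated with boolean-mask indexing, which always returns a copy.

```python
import numpy as np


class ToyFrame:
    """Hypothetical stand-in for a frame: one 2-D float block plus an item cache."""

    def __init__(self, values):
        self.block = values
        self.item_cache = {}

    def get_column(self, i):
        # Basic integer indexing returns a view onto the block.
        if i not in self.item_cache:
            self.item_cache[i] = self.block[i]
        return self.item_cache[i]

    def rebuild_block(self):
        # Boolean-mask indexing copies, so the block now lives in new memory
        # while the cached view still points at the old array.
        mask = np.ones(self.block.shape[1], dtype=bool)
        self.block = self.block[:, mask]


frame = ToyFrame(np.zeros((2, 3)))

frame.get_column(0)[0] = 0.13
first = frame.block[0, 0]        # the write went through the view

frame.rebuild_block()
frame.get_column(0)[0] = 0.15    # writes into the stale cached view
stale = frame.block[0, 0]        # the new block never sees 0.15
shares = np.shares_memory(frame.get_column(0), frame.block)

frame.item_cache.clear()         # invalidate, as the second fix does
frame.get_column(0)[0] = 0.15
fixed = frame.block[0, 0]        # the fresh view writes into the live block
```

Clearing the cache at the point where the block is replaced keeps ordinary lookups cheap, whereas a check such as `np.shares_memory` on every lookup corresponds to the first, slower fix.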