Skip to content

BUG: Possibly invalidate the item_cache when numpy implicty converts a v... #3977

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Jun 21, 2013

closes #3970

df['bb'].iloc[0] = 0.13
here bb is put into the _item_cache and the first element assigned 0.13; bb is still a view onto the frame block

df_tmp = df.iloc[[True]*len(df)]
df_tmp is now a copy of df, but built up as a take block-by-block from df. I believe numpy then decides to invalidate the views to the memory in the float block (why I have no idea); bb in df is still holding the view in the _item_cache to the old memory location.

df['bb'].iloc[0] = 0.15
grabs bb from the cache and assigns 0.15 to it, BUT it is not longer a view onto the float block in df so nothing appears to get updated

I tried 2 fixes:

  1. when getting an item from the _item_cache check if its a view of the underlying data - this works, but makes lookups way slow
  2. clear the cache when taking ON THE ORIGINAL frame, so future lookups will cache-miss and get the correct data from the block manager

Fundamentally we COULD do this in the first statement df['bb'].iloc[0] = 0.13, where we don't reset the cache, except we don't have a reference to df (well we do, but by the time the operation is carried out we have an operation on the series, so have lost the frame reference)

So this is fixed in this case, but not sure what other numpy ops implicity convert view-> copy (and we are holding onto the cached view).

@jreback
Copy link
Contributor Author

jreback commented Jun 21, 2013

@wesm this was pretty tricky....not 100% sure I am right, what do you think?

@wesm
Copy link
Member

wesm commented Jun 21, 2013

Where precisely was the type conversion happening before that was leading to a stale cached Series in _item_cache?

@jreback
Copy link
Contributor Author

jreback commented Jun 21, 2013

I don't think was any type conversion. As near as I can tell its somhow the take, doesn't make sense
If I test for a view e.g. ((id(df['bb'].values.base) == id(df._data.blocks[0].values)) before and after this line:
df_tmp = df.iloc[[True]*len(df)], its True before, and False afterwards...

even though I know we are not touching the memory on df at all (though it IS passed to a cython function for taking), so maybe it assumes we are touching it somehow?

FYI - there is a method on BlockManager to do this in this PR, is_view

@jreback
Copy link
Contributor Author

jreback commented Jun 23, 2013

@wesm any thoughts on this?

@jreback
Copy link
Contributor Author

jreback commented Jun 26, 2013

@wesm
another thought here....

df['bb'].iloc[0] = 0.15 is the fundamental issue here
the Frame has no idea that the object is being altered (which if it knew could invalidate its cache and solve the problem)

This is effectively copy-on-write, but in order to actually implement this a series would have to know that its being cached (trivial) - but the invalidation needs to occur whenever you are changing something in a series, but without storing a reference to the cacher (the dataframe), this is very hard I think

so just need to think about cases when you need to invalidate the cache in the frames

This PR solves the current issues, but not sure if there are other related issues lurking

@jreback
Copy link
Contributor Author

jreback commented Jun 27, 2013

@wesm ok..this solves the problem...keeping a weak ref to the cacher in the series (and invalidting when writing)...but take a took in any event

…a view to a copy (GH3970)

PERF: testing perf

BUG: implement cache tracking with a weak reference
@wesm
Copy link
Member

wesm commented Jun 27, 2013

Can you please wait to merge this until I have time to look a bit closer? Will try to look today or tomorrow

@jreback
Copy link
Contributor Author

jreback commented Jun 27, 2013

np

@wesm
Copy link
Member

wesm commented Jun 28, 2013

This is more pernicious than I thought:

on master:

In [1]: paste
df = DataFrame({ "aa":range(5), "bb":[2.2]*5})
df["cc"] = 0.0
ck = [True]*len(df)
df["bb"].loc[0] = .13 # works
## -- End pasted text --

In [2]: df
Out[2]:
   aa    bb  cc
0   0  0.13   0
1   1  2.20   0
2   2  2.20   0
3   3  2.20   0
4   4  2.20   0

In [3]: df['bb'].iloc[0] = 0.17

In [4]: df
Out[4]:
   aa    bb  cc
0   0  0.13   0
1   1  2.20   0
2   2  2.20   0
3   3  2.20   0
4   4  2.20   0

going to dig in and figure out what in the hell is going on for my own sake

@wesm
Copy link
Member

wesm commented Jun 28, 2013

Block consolidation is not invalidating the cache:

In [19]: paste
df = DataFrame({ "aa":range(5), "bb":[2.2]*5})
df["cc"] = 0.0
ck = [True]*len(df)
print df._data
df["bb"].loc[0] = .13 # works
## -- End pasted text --
BlockManager
Items: Index([u'aa', u'bb', u'cc'], dtype=object)
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype=int64)
FloatBlock: [bb], 1 x 5, dtype float64
IntBlock: [aa], 1 x 5, dtype int64
FloatBlock: [cc], 1 x 5, dtype float64

In [20]: df
Out[20]:
   aa    bb  cc
0   0  0.13   0
1   1  2.20   0
2   2  2.20   0
3   3  2.20   0
4   4  2.20   0

In [21]: df["bb"].iloc[0] = .17

In [22]: df
Out[22]:
   aa    bb  cc
0   0  0.13   0
1   1  2.20   0
2   2  2.20   0
3   3  2.20   0
4   4  2.20   0

In [23]: df._data
Out[23]:
BlockManager
Items: Index([u'aa', u'bb', u'cc'], dtype=object)
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype=int64)
FloatBlock: [bb, cc], 2 x 5, dtype float64
IntBlock: [aa], 1 x 5, dtype int64

Fix coming shortly

@jreback
Copy link
Contributor Author

jreback commented Jun 28, 2013

yep looks right....

@wesm
Copy link
Member

wesm commented Jun 28, 2013

closing in favor of #4077

@wesm wesm closed this Jun 28, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Strange behavior assigning values to elements
2 participants