-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Improve performance for df.__setitem__ with list-like indexers #38148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow, simplification and pref boost
pandas/core/indexing.py
Outdated
self.obj[k] = value[i] | ||
else: | ||
self.obj[k] = value | ||
keys = self.obj.columns.tolist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you create the keys in the same order whether they are in the obj or not (e.g. combine L666 and 667)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have a test for this ordering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Union should do the trick.
At least the test added with #37964 covers this
lgtm can merge on green. |
@jreback greenish. Failure unrelated |
thanks @phofl |
@phofl did you determine that this isn't making a copy? |
Not with a test, but I tested this in code.
This also changes x, so that is not a copy, isn't it? Should we add a test for this? |
Assuming homogeneous dtype for now, what im asking for is:
|
i am not sure this is possible w/o a copy |
This does make a copy. Did not know, that we have to compare values instead of the object itself. |
Actually this raised previously too, so at least not a regression |
Note: the example above I assumed homogeneous dtypes. More generally, we need to have
A few things need to happen for it to be possible:
|
@meeseeksdev backport 1.1.x |
This comment has been minimized.
This comment has been minimized.
…em__ with list-like indexers
As @jbrockmendel noted, this can change the copy/view semantics in certain cases. One example: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 4, 6]})
# get one column as a view of df
s = df['a']
# add columns with list-like indexer
df[['c', 'd']] = np.array([[.1, .2], [.3, .4], [.4, .5]])
# edit in place the first column to check view semantics
df.iloc[0, 0] = 100 on master this gives:
where "a" and "b" columns were now copied in the While on pandas 1.1.4, this copy didn't happen, and the series was actually updated:
This of course closely relates to the recent discussions about improving the copy/view semantics (as currently those semantics are both not clear and largely untested). (now the above is certainly a specific example to trigger the issue. In many cases we would also do a copy, eg if there would already have been float columns present in the example |
what if instead of calling reindex_axis we called reindex_indexer and passed |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Reindexing the Block Manager improves the performance significantly. I hope I have not missed anything, concerning the reindexing of the blocks.
Time spent in
_ensure_listlike_indexer
is pretty low now.timeit result for the ops methods:
Should we add tests here? I have added an asv to capture this case.
cc @jbrockmendel