PERF: remove large-array-creating path in fast_xs #33032

Merged (4 commits) on Mar 29, 2020

jbrockmendel
Member

When frame.columns is non-unique, frame.iloc[n] goes through an unnecessary path that effectively creates frame.values and then looks up [n] on it. That's a lot of casting just to access one row.

Luckily, that case is obsolete, so this rips it right out.
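
To illustrate why that path is expensive, here is a minimal sketch (not the code touched by this PR) of what "create frame.values, then index it" amounts to on a mixed-dtype frame. The frame size and the modulo used to keep values small are assumptions made for the example; it uses only the public `DataFrame.to_numpy` and `iloc` APIs.

```python
import numpy as np
import pandas as pd

dtypes = ["u1", "u2", "u4", "u8", "i1", "i2", "i4", "i8", "f8", "f4"]
n_rows = 1_000  # assumed size, kept small for the example

# Values stay below 100 so every cast is lossless (an assumption of this sketch).
arr = np.arange(n_rows * len(dtypes)).reshape(-1, len(dtypes)) % 100
df = pd.DataFrame({i: arr[:, i].astype(d) for i, d in enumerate(dtypes)})

# Materialising the whole frame casts every column to one common dtype
# and allocates an array covering all rows ...
full = df.to_numpy()
row_via_values = full[10]

# ... whereas a single-row lookup only needs len(dtypes) values.
row_via_iloc = df.iloc[10]

print(full.dtype, full.nbytes)
print(np.array_equal(row_via_values, row_via_iloc.to_numpy()))
```

Both lookups return the same values; the difference is that the first one allocates and casts an array proportional to the whole frame before discarding all but one row.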

@jreback added the "Performance" label (Memory or execution speed performance) Mar 26, 2020
@jreback
Contributor

jreback commented Mar 27, 2020

this only hits the non-unique case I think. do we have any benchmarks?

@jbrockmendel
Member Author

Just added a benchmark:

In [3]: arr = np.arange(10**7).reshape(-1, 10)
In [4]: df = pd.DataFrame(arr)
In [5]: dtypes = ['u1', 'u2', 'u4', 'u8', 'i1', 'i2', 'i4', 'i8', 'f8', 'f4']
In [6]: for i, d in enumerate(dtypes):
   ...:     df[i] = df[i].astype(d)

In [8]: %timeit df.iloc[10000]
126 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- both

In [9]: df.columns = ["A", "A"] + list(df.columns[2:])

In [11]: %timeit df.iloc[10000]
17.5 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   # <-- master
124 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)   # <-- PR
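
For anyone who wants to rerun the comparison outside IPython, the session above can be sketched as a standalone script. The helper names `build_frame` and `time_row_lookup`, the smaller default size, and the modulo that avoids integer wrap-around are assumptions of this sketch, not part of the PR; absolute timings will depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

DTYPES = ["u1", "u2", "u4", "u8", "i1", "i2", "i4", "i8", "f8", "f4"]


def build_frame(n_rows, duplicate_columns):
    """Build a frame with one column per dtype (hypothetical helper)."""
    # Modulo keeps values small so the narrow integer casts do not wrap.
    arr = np.arange(n_rows * len(DTYPES)).reshape(-1, len(DTYPES)) % 100
    df = pd.DataFrame(arr)
    for i, d in enumerate(DTYPES):
        df[i] = df[i].astype(d)
    if duplicate_columns:
        # Same relabelling as in the session: the first two columns share a name.
        df.columns = ["A", "A"] + list(df.columns[2:])
    return df


def time_row_lookup(df, row, number=100):
    """Return the best per-call time in seconds for df.iloc[row] (hypothetical helper)."""
    timings = timeit.repeat(lambda: df.iloc[row], number=number, repeat=3)
    return min(timings) / number


if __name__ == "__main__":
    for dup in (False, True):
        frame = build_frame(10**5, duplicate_columns=dup)
        print(f"duplicate columns={dup}: {time_row_lookup(frame, 10_000):.2e} s per lookup")
```

On a pandas build that includes this change, the two printed timings should be of the same order of magnitude.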

@jreback added this to the 1.1 milestone Mar 29, 2020
@jreback added the "Indexing" label (Related to indexing on series/frames, not to indexes themselves) Mar 29, 2020
@jreback merged commit 99f2ccb into pandas-dev:master Mar 29, 2020
@jreback
Contributor

jreback commented Mar 29, 2020

thanks

@jbrockmendel deleted the perf-interleave branch March 29, 2020 15:56