Performance of sum vs mean on Bool arrays is 10x different #19133

Closed
stakas opened this issue Jan 8, 2018 · 5 comments
Labels
Closing Candidate May be closeable, needs more eyeballs Performance Memory or execution speed performance Reduction Operations sum, mean, min, max, etc.

Comments

@stakas
stakas commented Jan 8, 2018

Code Sample

import pandas as pd

K = 100000000
df = pd.DataFrame(list(range(K)))
mask = df[0] > K/2  # boolean Series: True where the value exceeds K/2

%timeit mask.mean()

%timeit mask.sum()

Problem description

Calling "sum" and "mean" on a boolean pandas mask differs in runtime by about 10x! This clearly should not be the case, given that the mean is just the sum divided by the number of elements.
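To make the claim concrete, here is a minimal sketch showing that the mean of a boolean mask is the count of True values divided by the length. The helper name `mask_stats` and the small array size are assumptions chosen for illustration; they are not part of the original report.

```python
import math

import numpy as np
import pandas as pd


def mask_stats(n):
    """Return (number of True values, fraction of True values) for a boolean mask."""
    mask = pd.Series(np.arange(n)) > n / 2
    return mask.sum(), mask.mean()


if __name__ == "__main__":
    n = 1_000
    total, fraction = mask_stats(n)
    # The mean should match sum / length up to floating-point rounding.
    print(total, fraction, math.isclose(fraction, total / n))
```

Since the two reductions differ only by one division, any large timing gap has to come from how each one is dispatched internally rather than from the arithmetic itself.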

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 33.1.1
Cython: 0.25.2
numpy: 1.12.0
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.7.3 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

Could you post the timings you get? And can you try both with / without bottleneck? mean should use it, but sum won't IIRC.
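One way to run the comparison requested above is to toggle the pandas option `compute.use_bottleneck` and time both reductions under each setting. This is a rough sketch: the helper name `time_reductions`, the smaller array size, and the repeat count are assumptions made to keep the run short, and the option only has an effect when bottleneck is installed.

```python
import timeit

import numpy as np
import pandas as pd


def time_reductions(n=1_000_000, repeats=3):
    """Time mask.mean() and mask.sum() with bottleneck enabled and disabled."""
    mask = pd.Series(np.arange(n)) > n / 2
    results = {}
    for use_bn in (True, False):
        # option_context restores the previous setting when the block exits.
        with pd.option_context("compute.use_bottleneck", use_bn):
            for name in ("mean", "sum"):
                reduction = getattr(mask, name)
                best = min(timeit.repeat(reduction, number=1, repeat=repeats))
                results[(use_bn, name)] = best
    return results


if __name__ == "__main__":
    for (use_bn, name), seconds in time_reductions().items():
        print(f"use_bottleneck={use_bn!s:5} {name:4} {seconds * 1e3:.2f} ms")
```

Taking the minimum over several repeats mirrors the "best of 3" figure that `%timeit` reports.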

If that doesn't explain the difference, then when you get a chance, could you upgrade to pandas master and maybe profile https://github.com/pandas-dev/pandas/blob/master/pandas/core/nanops.py a bit?
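For the profiling step, a minimal sketch using the standard library's `cProfile` might look like the following. The helper name `profile_sum` and the array size are assumptions; the report lists the most expensive calls by cumulative time, which should show how much of `mask.sum()` is spent inside `nanops.py`.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def profile_sum(n=1_000_000, top=10):
    """Profile mask.sum() and return the top entries sorted by cumulative time."""
    mask = pd.Series(np.arange(n)) > n / 2
    profiler = cProfile.Profile()
    profiler.enable()
    mask.sum()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_sum())
```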

@TomAugspurger TomAugspurger added Performance Memory or execution speed performance Numeric Operations Arithmetic, Comparison, and Logical operations labels Jan 8, 2018
@stakas
Author

stakas commented Jan 8, 2018

Spot on!
For

K = 100000000
df = pd.DataFrame(list(range(K)))
mask = df[0] > K/2

Without bottleneck, %timeit mask.mean() gives 10 loops, best of 3: 141 ms per loop, and %timeit mask.sum() gives 1 loop, best of 3: 1.06 s per loop.

With bottleneck, the timings are much more in line: 10 loops, best of 3: 137 ms per loop vs 1 loop, best of 3: 161 ms per loop.

So the solution would be to use bottleneck? Thanks

@TomAugspurger
Contributor

Thanks.

IIRC, bottleneck.nansum had some overflow issues. See #15507. This may be easier to solve now that #9422 has been settled.
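As a general illustration of how a sum can overflow when its accumulator is not widened, here is a sketch that uses NumPy as a stand-in. It does not reproduce bottleneck's behavior or the specifics of #15507; the helper name `sum_with_accumulator` and the sample values are assumptions.

```python
import numpy as np


def sum_with_accumulator(values, acc_dtype):
    """Sum `values` using an accumulator of the given dtype."""
    return np.asarray(values).sum(dtype=acc_dtype)


if __name__ == "__main__":
    # Two int32 values whose true sum does not fit in int32.
    values = np.array([np.iinfo(np.int32).max, 1], dtype=np.int32)
    narrow = sum_with_accumulator(values, np.int32)  # accumulator too small
    wide = sum_with_accumulator(values, np.int64)    # accumulator wide enough
    print("int32 accumulator:", narrow)
    print("int64 accumulator:", wide)
```

A reduction that keeps the input's narrow dtype can silently return a wrong result, which is why a fast path that skips upcasting may need to be disabled for some dtypes.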

@jreback jreback added this to the Next Major Release milestone Jan 9, 2018
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel
Member

I'm seeing indistinguishable timings both with and without bottleneck. This is on a pre-M1 Mac. Could use confirmation.

@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Mar 2, 2023
@jbrockmendel jbrockmendel added Reduction Operations sum, mean, min, max, etc. and removed Numeric Operations Arithmetic, Comparison, and Logical operations labels Mar 30, 2023
@mroeschke
Member

Same here, closing
