Don't use numpy broadcasting in guvectorize inner loop #348
👍 good to know.
🤔 that's a big enough jump for a basic enough pattern that it may be worth adding to our documentation someplace?
Wow, yeah, that's really good to know.
How did you find that btw @tomwhite? Did something tip you off to that specific code being a performance problem?
Added to the contributors guide.
It wasn't obvious to me why that code should be an order of magnitude slower than
Note that this doesn't mean we shouldn't ever use array operations within
@tomwhite possibly related: I've always found a significant speed-up when using numba for integer arithmetic by iterating over a range rather than over the elements of an array.

    @njit
    def sum1(array):
        total = 0
        for i in array:
            total += i
        return total

is approximately 4-5x slower for an integer array than the following:

    @njit
    def sum2(array):
        total = 0
        for i in range(len(array)):
            total += array[i]
        return total

My assumption here was that (numba 0.51.2)

    >>> ints = np.random.randint(0, 100, 10_000)
    >>> floats = ints.astype(np.float64)
    >>>
    >>> %%timeit
    ... sum1(ints)
    4.95 µs ± 90.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    >>> %%timeit
    ... sum2(ints)
    1.11 µs ± 4.08 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    >>> %%timeit
    ... sum1(floats)
    11.5 µs ± 8.07 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    >>> %%timeit
    ... sum2(floats)
    11.6 µs ± 15.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
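
For anyone wanting to re-run this comparison outside IPython, a minimal self-contained sketch might look like the following. The names `sum_by_element` and `sum_by_index` are stand-ins for `sum1` and `sum2` above, the iteration count is arbitrary, and the `njit` fallback is only an assumption to let the script run (slowly, and without showing the effect) when numba is not installed. Timings will differ by machine and numba version, so none are quoted here.

```python
import timeit

import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for illustration: without numba, run as plain Python.
    def njit(func):
        return func


@njit
def sum_by_element(array):
    # Iterate directly over the array's elements (like sum1).
    total = 0
    for value in array:
        total += value
    return total


@njit
def sum_by_index(array):
    # Iterate over a range and index into the array (like sum2).
    total = 0
    for i in range(len(array)):
        total += array[i]
    return total


rng = np.random.default_rng(0)
ints = rng.integers(0, 100, 10_000)
floats = ints.astype(np.float64)

for label, data in (("ints", ints), ("floats", floats)):
    # The first call compiles each function, so it is kept out of the timing.
    assert sum_by_element(data) == sum_by_index(data)
    for func in (sum_by_element, sum_by_index):
        per_call = timeit.timeit(lambda: func(data), number=100) / 100
        print(f"{func.__name__:>15} on {label:<6}: {per_call * 1e6:8.2f} µs per call")
```

Keeping the compilation call outside the timed region matters here, since JIT compilation would otherwise dominate a measurement of a few microseconds.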
Small but critical change to count_cohort_alleles that takes the runtime down from ~100 min (projected - not run to completion) on MalariaGEN data to ~8 min. Having explicit loops rather than relying on broadcasting in the inner loop seems to be the key here.
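
To make the pattern concrete, here is a hedged sketch of the two styles side by side. It is not the actual `count_cohort_alleles` code from this PR: the kernel names, the type signature, the layout string, and the dummy third argument (used only to give the output its cohort dimension) are all assumptions chosen for illustration, as is the fallback decorator that lets the sketch run without numba.

```python
import numpy as np

try:
    from numba import guvectorize
except ImportError:
    # Assumption for illustration: without numba, leave the kernels as
    # plain Python functions that are called with an explicit output array.
    def guvectorize(*args, **kwargs):
        return lambda func: func

# Hypothetical signature and layout: (samples, alleles), (samples), (cohorts)
# inputs and a (cohorts, alleles) output.
SIGNATURES = ["void(int64[:, :], int64[:], int64[:], int64[:, :])"]
LAYOUT = "(n,a),(n),(c)->(c,a)"


@guvectorize(SIGNATURES, LAYOUT, nopython=True)
def count_broadcast(ac, cohorts, _, out):
    # Array-style inner loop: each iteration adds a whole row at once.
    out[:, :] = 0
    for i in range(ac.shape[0]):
        out[cohorts[i]] += ac[i]


@guvectorize(SIGNATURES, LAYOUT, nopython=True)
def count_explicit(ac, cohorts, _, out):
    # Same computation written as explicit scalar loops.
    out[:, :] = 0
    n_samples, n_alleles = ac.shape
    for i in range(n_samples):
        for j in range(n_alleles):
            out[cohorts[i], j] += ac[i, j]


# Tiny made-up input: 4 samples, 2 alleles, 2 cohorts.
ac = np.array([[1, 1], [2, 0], [0, 2], [1, 1]], dtype=np.int64)
cohorts = np.array([0, 1, 0, 1], dtype=np.int64)
dummy = np.zeros(2, dtype=np.int64)  # only its length (number of cohorts) matters

out_broadcast = np.empty((2, 2), dtype=np.int64)
out_explicit = np.empty((2, 2), dtype=np.int64)
count_broadcast(ac, cohorts, dummy, out_broadcast)
count_explicit(ac, cohorts, dummy, out_explicit)

# Both kernels should agree; only their speed under numba is expected to differ.
assert np.array_equal(out_broadcast, out_explicit)
```

Both kernels give the same counts on this input, so any difference between them is purely one of runtime, which would need to be measured on realistically sized data.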