
Don't use numpy broadcasting in guvectorize inner loop #348


Merged 3 commits into sgkit-dev:master on Oct 26, 2020

Conversation

tomwhite
Collaborator

Small but critical change to count_cohort_alleles that takes the runtime down from ~100 min (projected - not run to completion) on MalariaGEN data to ~8 min. Having explicit loops rather than relying on broadcasting in the inner loop seems to be the key here.
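
As a minimal sketch of the pattern (not the actual count_cohort_alleles kernel - the names, dtypes, and per-sample allele-count layout below are illustrative assumptions):

from numba import guvectorize

# The dummy third argument only carries the number of cohorts so that
# guvectorize can size the output; its values are never read.
@guvectorize(
    ["void(int32[:, :], int32[:], int8[:], int32[:, :])"],
    "(n, k),(n),(c)->(c, k)",
    nopython=True,
)
def cohort_sum_broadcast(sample_ac, cohorts, _, out):
    out[:, :] = 0
    for i in range(sample_ac.shape[0]):
        # Array broadcasting buried in the hot loop: concise, but slow.
        out[cohorts[i]] += sample_ac[i]

@guvectorize(
    ["void(int32[:, :], int32[:], int8[:], int32[:, :])"],
    "(n, k),(n),(c)->(c, k)",
    nopython=True,
)
def cohort_sum_loops(sample_ac, cohorts, _, out):
    out[:, :] = 0
    for i in range(sample_ac.shape[0]):
        for j in range(sample_ac.shape[1]):
            # Explicit scalar arithmetic in the hot loop - the kind of
            # change this PR makes.
            out[cohorts[i], j] += sample_ac[i, j]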

@eric-czech
Collaborator

👍 good to know.

@hammer
Contributor

hammer commented Oct 23, 2020

🤔 that's a big enough jump for a basic enough pattern that it may be worth adding to our documentation some place?

@jeromekelleher
Collaborator

Wow, yeah, that's really good to know.

@eric-czech
Collaborator

How did you find that btw @tomwhite? Did something tip you off to that specific code being a performance problem?

@tomwhite
Collaborator Author

that's a big enough jump for a basic enough pattern that it may be worth adding to our documentation some place?

Added to the contributors guide.

Did something tip you off to that specific code being a performance problem?

It wasn't obvious to me why that code should be an order of magnitude slower than count_alleles, so I compared the two to see if there was anything obviously different. I noticed that _count_cohort_alleles had a single loop, rather than a double loop as might be expected. I found this post, which has some good advice: "Instead of array operations, we are very explicit within the function and do everything with loops." So I tried that, and saw the speed increase.

Note that this doesn't mean we should never use array operations within guvectorize - indeed, these functions fill the output array at the beginning with e.g. out[:, :] = 0 - but array operations are generally not a good idea when buried in the inner loop.
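
For anyone curious, a rough way to reproduce the effect with the illustrative kernels sketched above (the sizes here are made up, and absolute timings will vary with machine and numba version):

from timeit import timeit

import numpy as np

rng = np.random.default_rng(0)
n_variants, n_samples, n_alleles, n_cohorts = 10_000, 1_000, 4, 3
ac = rng.integers(0, 3, size=(n_variants, n_samples, n_alleles), dtype=np.int32)
cohorts = rng.integers(0, n_cohorts, size=n_samples, dtype=np.int32)
dummy = np.empty(n_cohorts, dtype=np.int8)

# Call each kernel once outside the timed region so JIT compilation
# is not included in the measurement.
cohort_sum_broadcast(ac, cohorts, dummy)
cohort_sum_loops(ac, cohorts, dummy)

print("broadcast:", timeit(lambda: cohort_sum_broadcast(ac, cohorts, dummy), number=10))
print("loops:    ", timeit(lambda: cohort_sum_loops(ac, cohorts, dummy), number=10))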

@tomwhite added the auto-merge label (Auto merge label for mergify test flight) on Oct 26, 2020
@mergify bot merged commit 9427171 into sgkit-dev:master on Oct 26, 2020
@timothymillar
Collaborator

@tomwhite possibly related: I've always found a significant speedup when using numba for integer arithmetic by iterating over a range rather than over the elements of an array.
For example:

from numba import njit

@njit
def sum1(array):
    total = 0
    for i in array:  # iterate directly over the array's elements
        total += i
    return total

is approximately 4-5x slower for an integer array than the following:

@njit
def sum2(array):
    total = 0
    for i in range(len(array)):  # iterate over indices instead
        total += array[i]
    return total

My assumption here was that sum1 has more runtime overhead because i could be an integer or an ndarray, whereas in sum2 i is always an integer.
However, the speedup seems to disappear for float addition, so I'm not sure:

(numba 0.51.2)

>>> import numpy as np
>>> ints = np.random.randint(0, 100, 10_000)
>>> floats = ints.astype(np.float64)
>>>
>>> %%timeit
... sum1(ints) 
4.95 µs ± 90.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %%timeit
... sum2(ints) 
1.11 µs ± 4.08 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> %%timeit
... sum1(floats) 
11.5 µs ± 8.07 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %%timeit
... sum2(floats) 
11.6 µs ± 15.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
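
If you want to check that typing assumption, numba can show the types it inferred for each line of the compiled functions (assuming the sum1/sum2 definitions above have already been called, so they are compiled):

# Prints the annotated source of each compiled specialization, including
# the type inferred for the loop variable i.
sum1.inspect_types()
sum2.inspect_types()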
