-
Notifications
You must be signed in to change notification settings - Fork 230
There exists significantly faster division algorithms for certain CPUs #265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
This sounds pretty slick! We're always up for taking better algorithms here :) |
Is there a quick way to determine the performance difference between having or not having indexing checks (so that I do not have to replace all the |
Currently there's no build-time configuration for that, it needs to be changed in the code itself |
I just published version 0.0.4 of my crate, which I think will be the last one I will ever publish until someone else finds an issue with it. I inspected the assembly output and perf with and without unchecked division, and the performance difference was at most a few percent. I do not think I can improve it anymore except with SIMD, but even then the perf difference would only be at most a few percent for certain blocks. Additionally, in practice LLVM appears to inline my function properly whenever only the quotient or only the remainder is used, so I do not plan on adding any extra functions. |
I discovered that on x86_64 there is a |
What steps do I need to take if I want to see the Rust u/i128 divisions on x86_64 use my algorithm? |
The intrinsics called for u128 divisions (such as |
fixed by #332 |
Over the past few months I have been looking at different division algorithms, and I believe that this one is the fastest for 128 bit division on CPUs with a fast 64 bit hardware divider. I made a crate for it, specialized-div-rem. I initially thought that the crate would have more than one algorithm in it, but it turns out that the compiler likes inlining the appropriate branches in cases where its important.
The only thing I might add to it is a simple binary division algorithm for when the most significant bits of the dividend and divisor are <6 bits from each other and located in the higher 64 bits. I don't know if it is worth it to have more conditional branches for that use case, I think this algorithm has the fewest branches encountered by program flow of any algorithm out there.
I also expect that the 64 bit division algorithm in that crate is useful for 32 bit computers but I don't have one to test it out.
Sidenote that I also made a more general version for the
apint
crate but I need to put much more recursive work into that one before it becomes competitive.The text was updated successfully, but these errors were encountered: