Vectorize base32 decoding #28
Wojciech Muła and Daniel Lemire wrote the paper "Faster Base64 Encoding and Decoding Using AVX2 Instructions". Some of the logic can be used to decode Base32 too. A first stab may look something like this:
However, with Base32, after packing up to epi32 we have 20-bit words. As such, the shuffle operation cannot be done right away, because 20 bits do not end on a byte boundary (we are 4 bits short of 3 whole bytes). Of course, this can be resolved by shifting one epi32 and adding it to its neighbor, which gives two 40-bit words per 128-bit register, and 40 bits (5 bytes) falls on a byte boundary again.
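To make the packing steps concrete, here is a scalar C sketch of the arithmetic a SIMD version would perform on one 8-character group, assuming the characters have already been mapped to 5-bit values. The function name `pack_eight_quintets` is hypothetical. The first two steps mirror what multiply-add instructions such as `_mm_maddubs_epi16` (5+5 to 10 bits) and `_mm_madd_epi16` (10+10 to 20 bits) do lane by lane; the third is the shift-and-add that joins two 20-bit words into one 40-bit word.

```c
#include <stdint.h>

/* Hypothetical helper: pack eight 5-bit values (already decoded from
   base32hex) into five output bytes, most significant bits first. */
static void pack_eight_quintets(const uint8_t v[8], uint8_t out[5])
{
  /* Step 1: pairs of 5-bit values -> 10-bit words (weights 32 and 1). */
  uint16_t w10[4];
  for (int i = 0; i < 4; i++)
    w10[i] = (uint16_t)(v[2 * i] * 32 + v[2 * i + 1]);

  /* Step 2: pairs of 10-bit words -> 20-bit words (weights 1024 and 1). */
  uint32_t w20[2];
  for (int j = 0; j < 2; j++)
    w20[j] = (uint32_t)w10[2 * j] * 1024 + w10[2 * j + 1];

  /* Step 3: 20 bits is not a whole number of bytes, so shift one 20-bit
     word and add the other, giving a 40-bit word: exactly 5 bytes. */
  uint64_t w40 = ((uint64_t)w20[0] << 20) + w20[1];

  /* Step 4: emit the 5 bytes, most significant first (the job of the
     byte shuffle in a SIMD version). */
  for (int k = 0; k < 5; k++)
    out[k] = (uint8_t)(w40 >> (32 - 8 * k));
}
```

For example, the base32hex group "CPNMUOJ1" has the values 12, 25, 23, 22, 30, 24, 19, 1, and packing them yields the five bytes "fooba", which matches the RFC 4648 base32hex test vector.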
Same question here: are spaces allowed within the codes (e.g., line returns)?
No, Base32 is only used in NSEC3 (presentation format). The data there is not the last field and hence must be presented as one contiguous set of characters. Example from Appendix A in RFC5155:
The input range can be hashed/sliced into spans of 8 instead of spans of 16.
Edit: it looks like "base32hex" is required, but the same approach would still apply.
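One possible reading of the spans-of-8 idea, sketched in scalar C under our own assumptions: use `c >> 3` as the span index (16 spans cover 7-bit ASCII, so each table would fit in a single 16-byte shuffle lookup), test the low 3 bits against a per-span bitmask of valid characters, and subtract a per-span offset. The tables and the name `base32hex_span_value` are hypothetical, accepting lowercase `a`-`v` is an assumption, and this is not necessarily how @aqrit's range check actually works.

```c
#include <stdint.h>

/* Per span of 8 characters (index c >> 3): bit k set means that the
   character (span * 8 + k) is a valid base32hex digit. */
static const uint8_t span_valid_mask[16] = {
  0, 0, 0, 0, 0, 0,
  0xFF, /* 0x30-0x37: '0'-'7' */
  0x03, /* 0x38-0x39: '8'-'9' */
  0xFE, /* 0x41-0x47: 'A'-'G' ('@' excluded) */
  0xFF, /* 0x48-0x4F: 'H'-'O' */
  0x7F, /* 0x50-0x56: 'P'-'V' ('W' excluded) */
  0,
  0xFE, /* 0x61-0x67: 'a'-'g' */
  0xFF, /* 0x68-0x6F: 'h'-'o' */
  0x7F, /* 0x70-0x76: 'p'-'v' */
  0
};

/* Per span: value to subtract from the character to get its 5-bit value. */
static const uint8_t span_offset[16] = {
  0, 0, 0, 0, 0, 0, 0x30, 0x30, 0x37, 0x37, 0x37, 0, 0x57, 0x57, 0x57, 0
};

/* Returns the 5-bit value of c, or -1 if c is not a base32hex digit. */
static int base32hex_span_value(unsigned char c)
{
  if (c >= 0x80)
    return -1;              /* outside 7-bit ASCII: no span */
  unsigned span = c >> 3;   /* which span of 8 */
  unsigned bit = c & 7;     /* position within the span */
  if (!(span_valid_mask[span] & (1u << bit)))
    return -1;
  return c - span_offset[span];
}
```

Because base32hex digits do not start and end on multiples of 8 (for instance '8'-'9' and 'P'-'V' only partially fill their spans), the per-span bitmask is what makes the range check exact.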
I will have a look this week.
Do we know how many base32hex digits to expect before starting decoding? Anyway, decoding seems fairly uninteresting.
Yes. It is on my to-do list as well.
@aqrit, sorry, I totally forgot to answer your question. We can know the number of base32hex digits beforehand. Currently, Base32 is only used for NSEC3 (RFC5155), which means the encoded string is a fixed-size hash. However, new algorithms may be added, so I don't know if it's wise to do anything with it beforehand. I think it's better to just decode and require the proper amount of padding, then verify that it's the correct length for known algorithms afterwards? Starting to look at your code now, by the way.
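The decode-then-verify approach could be sketched like this. Assumptions: both helper names are hypothetical, and only NSEC3 hash algorithm 1 (SHA-1, whose digest is 20 bytes per RFC 5155) is treated as known; any other algorithm number is accepted, so that new algorithms do not break parsing.

```c
#include <stddef.h>
#include <stdint.h>

/* Characters needed to encode n bytes without padding: ceil(8n / 5). */
static size_t base32_chars_for_bytes(size_t n)
{
  return (n * 8 + 4) / 5;
}

/* Hypothetical post-decoding check of an NSEC3 next hashed owner name.
   Only algorithm 1 (SHA-1, 20-byte digest) is known here; unknown
   algorithms are accepted as-is. */
static int nsec3_hash_length_ok(uint8_t algorithm, size_t decoded_len)
{
  switch (algorithm) {
  case 1: /* SHA-1 */
    return decoded_len == 20;
  default:
    return 1;
  }
}
```

Since 20 bytes is 160 bits, a SHA-1 hash always encodes to exactly 32 base32hex characters, a multiple of 8, so it never needs padding.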
I have only looked at the SIMD variant so far, but nice work @aqrit! It should make for a nice speedup (the .com zone file contains it). The only thing we have to add is checking for correct padding, which is easy to add (especially because we don't have to account for whitespace). Thanks! I especially like your range check; it's pretty clever. (Of course, I'll look at the SWAR version for the fallback parser too.)
Ok. Let me build this up a bit and see what we can make out of it... :-)
I hope to have something later today.
Ok. So in my tests, SWAR is effectively useless: the scalar approach is faster. I even have a slightly faster one (base32hex_fast) that can reach 2.5 GB/s (about 10% faster than the regular scalar approach). The SIMD approach is about 2.5 times faster in my tests. You can gain a bit of speed by using 256-bit vectors (AVX2), but not that much. All the functions may write slightly past the end of the output buffer and read past the end of the input buffer. For my benchmark, I use short inputs ("F1S6QOJADHQMKS3GCLIN4RB9F1Q6UT37") and there is no inlining, which means that each call has to pay for register initialization and all that, which explains why the wider SIMD approach does not gain more.
As before, I have made my code available... They are all maxing out the "instructions per cycle", so this is an instance where lowering the number of instructions is critical. Sadly, this might be a tad difficult for short inputs. If you know the size of the input (@aqrit raised this point), you can gain a bit of speed. I suspect it is not that much...
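For reference, a plain scalar decoder can be sketched with a bit accumulator. This is not the benchmarked code, which is not reproduced here: the name `base32hex_decode_scalar` is hypothetical, accepting lowercase is an assumption, and the function is lenient in the sense discussed in this thread, stopping at the first character that is not a base32hex digit and returning how many characters it consumed. Unlike the benchmarked functions, it never reads or writes past the ends of its buffers.

```c
#include <stddef.h>
#include <stdint.h>

/* 5-bit value of a base32hex digit ("0-9A-V", RFC 4648), or -1.
   Accepting lowercase "a-v" is an assumption. */
static int base32hex_value(unsigned char c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'A' && c <= 'V') return c - 'A' + 10;
  if (c >= 'a' && c <= 'v') return c - 'a' + 10;
  return -1;
}

/* Hypothetical scalar decoder. Decodes src[0..len) up to the first
   non-base32hex character, stores the bytes in dst (which needs room for
   len * 5 / 8 bytes) and their count in *dst_len, and returns the number
   of characters consumed. Leftover bits that do not form a byte are dropped. */
static size_t base32hex_decode_scalar(const char *src, size_t len,
                                      uint8_t *dst, size_t *dst_len)
{
  uint32_t acc = 0; /* pending bits, right-aligned */
  int bits = 0;     /* number of pending bits (below 8 between characters) */
  size_t i = 0, out = 0;
  for (; i < len; i++) {
    int v = base32hex_value((unsigned char)src[i]);
    if (v < 0)
      break; /* lenient: the first invalid character ends the sequence */
    acc = (acc << 5) | (uint32_t)v;
    bits += 5;
    if (bits >= 8) {
      bits -= 8;
      dst[out++] = (uint8_t)(acc >> bits); /* top 8 pending bits */
      acc &= (1u << bits) - 1;             /* keep only the remainder */
    }
  }
  *dst_len = out;
  return i;
}
```

Decoding "CPNMUOJ1E8======" consumes 10 characters and yields the 6 bytes "foobar"; the caller can then decide what to do with whatever follows (here, the padding).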
Good stuff @lemire. It's good to have this documented; I don't know if this is used in many other places, but people can at least find it on the internet now. Can you include this version too? It doesn't make sense from a performance perspective, but it nicely shows where we started from(?) We also have to check for padding(?) We can use the zero_mask trick, but with equals signs instead: take the length modulo 8, pick the right mask and compare?
I followed @aqrit's idea, and we don't really validate the padding. Basically, you can stop the stream and put any garbage (not just '='). But sure, we can try to check that we have proper '=' padding.
Is support for padding needed?
IMO, it is not worth pulling that much data into the cache.
I agree. This is research.
I started implementing the padding check, and almost got it done, but after reading @aqrit's comment, I am pulling it out. Because the functions return the number of bytes read, it is always easy to check the padding if you want to: all you need is a single loop when the number of bytes read is not a multiple of eight.
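The single-loop padding check could look like the following sketch, run after decoding. Assumptions: the name `base32_padding_ok` is hypothetical, `consumed` is the number of characters the decoder accepted, and `src`/`len` delimit exactly the encoded field, so nothing may follow the padding. It also rejects remainders of 1, 3 or 6 characters, which cannot end a valid RFC 4648 encoding (a final group has 2, 4, 5 or 7 data characters).

```c
#include <stddef.h>

/* Hypothetical check run after decoding: src[0..len) is the whole field
   and `consumed` (<= len) is the number of base32 characters accepted.
   The last group must be completed with '=' up to a multiple of 8, and
   nothing may follow the padding. */
static int base32_padding_ok(const char *src, size_t len, size_t consumed)
{
  size_t rem = consumed % 8;
  if (rem == 1 || rem == 3 || rem == 6)
    return 0; /* no valid encoding ends with these remainders */
  size_t end = consumed + (rem ? 8 - rem : 0); /* next multiple of 8 */
  if (end > len)
    return 0; /* padding cut short */
  for (size_t i = consumed; i < end; i++) /* the single loop */
    if (src[i] != '=')
      return 0;
  return end == len;
}
```

When `consumed` is already a multiple of 8, the loop body never runs and the check reduces to `consumed == len`, so unpadded input such as an NSEC3 SHA-1 hash costs essentially nothing.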
@aqrit I have implemented your optimizations, but they hardly make a difference performance-wise. :-) However, the code is prettier.
Ok. Here is what I get...
I am moving the URL to Code-used-on-Daniel-Lemire-s-blog/2023/07/20 from Code-used-on-Daniel-Lemire-s-blog/2023/07/20 and pushing a blog post.
Blog post:
Apparently, we don't even want to support padding 😅. RFC5155 states:
And indeed, the
Those are some great numbers, @lemire! Thank you both!
The current code would work even if there is padding. It would simply be lenient about it (e.g., it would accept any character as padding, not just '='). However, I should note that adding such a check (that the padding is done with '=') would be cheap... I considered adding it as an option (
The code in NSD doesn't account for anything but valid characters, and that code has been in use for a long time. I think it's better not to allow padding and to strictly follow the RFC. If there's a valid use case, we can always add it afterwards(?)
@k0ekk0ek I think that's what the code we have written does. What I mean by lenient is that as soon as an invalid character is found, you consider the base32 sequence terminated. Of course, you could have something like...
What it would do is stop at the
Sounds good to me 👍
@lemire, are you working on a PR? No pressure, I just want to know if you want to do the honors or if you want me to integrate the changes 🙂
I can prepare a PR, yes.
PR available.
Base32 encoding is used in NSEC3 records (RFC5155) to hash the next owner name and may appear quite a lot in DNSSEC-signed zones (not the case for .se, but it is the case for, e.g., .com). Like base16 and base64, this encoding may be eligible for vectorization too.