Skip to content

Vectorize base32 decoding #28

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
k0ekk0ek opened this issue Feb 20, 2023 · 30 comments · Fixed by #82
Closed

Vectorize base32 decoding #28

k0ekk0ek opened this issue Feb 20, 2023 · 30 comments · Fixed by #82
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed
Milestone

Comments

@k0ekk0ek
Copy link
Contributor

Base32 encoding is used in NSEC3 records (RFC5155) to hash the next owner name and may appear quite a lot in DNSSEC signed zones (not the case for .se, but is the case for e.g. .com). Like base16 and base64, this encoding may be eligible for vectorization too.

@k0ekk0ek k0ekk0ek added enhancement New feature or request help wanted Extra attention is needed labels Feb 20, 2023
@k0ekk0ek k0ekk0ek added this to the Release 0.2.0 milestone Feb 20, 2023
@k0ekk0ek k0ekk0ek added the good first issue Good for newcomers label Jun 20, 2023
@k0ekk0ek
Copy link
Contributor Author

Wojciech Muła and Daniel Lemire wrote the paper "Faster Base64 Encoding and Decoding Using AVX2 Instructions". Some of the logic can be used to decode Base32 too. A first stab may look something like this:

#define s0to9 (-128 + '0')
#define sAtoV (-128 + 'A' + 9)
#define satov (-128 + 'a' + 9)

#define r0to9 (-128 + ('9' - '0' + 1))
#define rAtoV (-128 + ('V' - 'A' + (1 + 9)))
#define ratov (-128 + ('v' - 'a' + (1 + 9)))

static void decode(const char *encoded)
{
  __m256i input = _mm256_loadu_si256((const __m256i*)encoded);

  __m256i subtract = setr_epi8(
    0, 0, 0, s0to9, sAtoV, sAtoV, satov, satov,
    0, 0, 0, 0, 0, 0, 0, 0, 0);

  __m256i compare = setr_epi8(
    -128, -128, -128, r0to9, sAtoV, sAtoV, satov, satov,
    -128, -128, -128, -128, -128, -128, -128, -128);

  // determine shuffle mask based on hi nibbles
  __m256i nibbles = _mm256_srli_epi32(input, 4);
  nibbles = _mm256_and_si256(nibbles, _mm256_set1_epi8(0x0f));

  subtract = _mm256_shuffle_epi8(subtract, nibbles);
  compare = _mm256_shuffle_epi8(compare, nibbles);

  __m256i d = _mm256_sub_epi8(input, subtract); 
  const __m256i mask = _mm_cmplt_epi8(digits, compare); 

  d = _mm256_and_si256(d, mask);

  // (pretend the above works flawlessly)

  // left shift can be expressed as multiply by power of two
  //  
  // SHL    MUL(16)    MUL(10)
  //  0  :  0x0001  :     1
  //  1  :  0x0002  :     2
  //  2  :  0x0004  :     4
  //  3  :  0x0008  :     8
  //  4  :  0x0010  :    16
  //  5  :  0x0020  :    32
  //  6  :  0x0040  :    64
  //  7  :  0x0080  :   128
  //  8  :  0x0100  :   256
  //  9  :  0x0200  :   512
  // 10  :  0x0400  :  1024

  // pack into 10-bit words
  const __m128i d16 = _mm_addubs_epi16(d, _mm_set1_epi32(0x01080108));

  // pack into 20-bit words
  const __m128i d32 = _mm_madd_epi16(d16, _mm_set1_epi32(0x00010400));

  //  
  // base64 decoder now uses a shuffle, but that is not possible for base32 as
  // byte boundaries span epi32 (only 20-bit words)
  //  
  // shifting individual epi32 into location (_mm_slli_epi64) and adding them
  // afterward will get two 40-bit words, which means proper byte boundaries.
  //  

  // implement
}

However, with Base32 after packing up to epi32, we have 20-bit words. As such, the shuffle operation cannot be done right away because of the byte boundary (we miss 4 bits). Of course, this can be resolved by shifting and then adding two epi32 together which gets us 2 40-bit words, where the byte boundary is convenient again.

@lemire
Copy link
Collaborator

lemire commented Jun 20, 2023

Same question here: are spaces allowed within the codes (e.g., line returns).

@k0ekk0ek
Copy link
Contributor Author

No, Base32 is only used in NSEC3 (presentation format). The data there is not the last field and hence must be presented as one contiguous set of characters.

Example from Appendix A in RFC5155:

  NSEC3   1 1 12 aabbccdd (
                            2vptu5timamqttgl4luu9kg21e0aor3s A RRSIG )

@aqrit
Copy link

aqrit commented Jun 24, 2023

pretend the above works flawlessly

The input range can be hashed/sliced into spans of 8 instead spans of 16.
However, that necessitates a fixup for signed chars.

const __m128i delta_check = _mm_setr_epi8(-128, -128, -128, -128, -128, -128, -50, -128, -65, 0, 0, 0x7F - 0x5A, -97, 0, 0, 0x7F - 0x7A);
const __m128i delta_rebase = _mm_setr_epi8(0, 0, 0, 0, 0, 0, -24, 0, -65, -65, -65, -65, -97, -97, -97, -97);

__m128i v = _mm_loadu_si128((__m128i*)src);
__m128i hash_key = _mm_and_si128(_mm_srli_epi32(v, 3), _mm_set1_epi8(0x0F));

// unpacked binary-coded-duotrigesimal
__m128i data = _mm_add_epi8(_mm_shuffle_epi8(delta_rebase, hash_key), v);

// bad char check
__m128i check = _mm_add_epi8(_mm_shuffle_epi8(delta_check, hash_key), v);
check = _mm_or_si128(check, v); // fixup signed chars
if(_mm_movemask_epi8(check)) {
	; // bad char detected
}

// todo: pack bits

edit: it looks like "base32hex" is required but the same approach would still apply.
Though, maybe a better job can be done with the different characters.

@lemire
Copy link
Collaborator

lemire commented Jun 25, 2023

I will have a look this week.

@aqrit
Copy link

aqrit commented Jul 13, 2023

Should we know how many base32hex digits to expect before starting decoding?

Anyways, decoding seems fairly uninteresting.
Here is a bare kernel:
https://gist.github.com/aqrit/ee41f2fe222b3cb68aaf7facd5cd08a6

@k0ekk0ek
Copy link
Contributor Author

Thank you @aqrit! Looks pretty straightforward indeed. I'll play with it after the weekend and go from there. Much appreciated! @lemire, are you interested in reviewing and benchmarking too? The current code is terrible, so that might make for a fun write-up as well? 🙂

@lemire
Copy link
Collaborator

lemire commented Jul 14, 2023

Yes. It is on my todo as well.

@k0ekk0ek
Copy link
Contributor Author

@aqrit, sorry, totally forgot to answer your question. We can know the base32hex digits beforehand. Currently, it's only used for NSEC3 (RFC5155) which means the encoded string is a fixed size hash. However, new algorithms may be added, so I don't know if it's wise to do anything with it beforehand. I think it's better to just decode and require the proper amount of padding. Then verify it's the correct length for known algorithms afterwards?

Starting to look at your code now btw.

@k0ekk0ek
Copy link
Contributor Author

Only looked at the simd variant (yet), but nice work @aqrit! Should make for a nice speedup (.com zone file contains it). Only thing we have to add is checking for correct padding, which is easy to add (especially because we don't have to account for whitespace). Thanks!

I especially like your range check, pretty clever.

(of course, I'll look at the swar version for the fallback parser too)

@lemire
Copy link
Collaborator

lemire commented Jul 19, 2023

Ok. Let me build this up a big and see what we can make out of it... :-)

@lemire
Copy link
Collaborator

lemire commented Jul 19, 2023

I hope to have something later today.

@lemire
Copy link
Collaborator

lemire commented Jul 20, 2023

Ok. So in my tests, SWAR is effectively useless. The scalar approach is faster. I have even have a slightly faster one (base32hex_fast) that can reach 2.5 GB/s (about 10% faster than the routine/scalar approach). The SIMD approach is about 2.5 times faster in my tests. You can gain a bit of speed by using 256-bit vectors (AVX) but not that much. All the functions can slightly overwrite to the output buffer and overread in the input buffer.

For my benchmark, I use short inputs ("F1S6QOJADHQMKS3GCLIN4RB9F1Q6UT37") and there is no inlining, which means that each call has to pay for register initialization and all that, which explains the lack of power of the wide SIMD approach.

base32hex_avx                  :   7.55 GB/s  235.9 Ma/s   4.24 ns/d   3.23 GHz  13.70 c/d  61.21 i/d    0.4 c/b   1.91 i/b   4.47 i/c 
base32hex_simd                 :   7.02 GB/s  219.3 Ma/s   4.56 ns/d   3.23 GHz  14.72 c/d  70.21 i/d    0.5 c/b   2.19 i/b   4.77 i/c 
base32hex_fast                 :   2.53 GB/s   79.0 Ma/s  12.65 ns/d   3.21 GHz  40.57 c/d  194.21 i/d    1.3 c/b   6.07 i/b   4.79 i/c 
base32hex_simple               :   2.35 GB/s   73.4 Ma/s  13.62 ns/d   3.21 GHz  43.69 c/d  231.21 i/d    1.4 c/b   7.23 i/b   5.29 i/c 
base32hex_swar                 :   1.98 GB/s   61.7 Ma/s  16.20 ns/d   3.20 GHz  51.85 c/d  230.21 i/d    1.6 c/b   7.19 i/b   4.44 i/c 

As before, I have made my code available...
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/07/19

They are all maxing out the "instructions per cycle". So it is an instance where lowering the number of instructions is critical. Sadly, this might be a tad difficult for short inputs.

If you know that the size of the input (@aqrit raised this point), you can gain a bit of speed. I suspect it is not that much...

@k0ekk0ek
Copy link
Contributor Author

Good stuff @lemire. Good to have this documented, don't know if this is used in many other places, but people can at least find it on the internet now.

Can you include this version too? It doesn't make sense from a performance perspective, but it nicely shows where we started from(?)

We also have to check for padding(?) We can use the zero_mask trick but with equals signs instead. Modulo with 8, pick the right mask and compare?

@lemire
Copy link
Collaborator

lemire commented Jul 20, 2023

We also have to check for padding(?)

I followed @aqrit's idea and we don't really validate the padding. Basically, you can stop the stream and put any garbage (not just =) and it will work. (It should work.)

But sure, we can try to check that we have =, as you describe.

@aqrit
Copy link

aqrit commented Jul 20, 2023

Base32: the "Base 32 Encoding with Extended Hex Alphabet" as
specified in [RFC4648]. Note that trailing padding characters
("=") are not used in the NSEC3 specification.

The Next Hashed Owner Name field is represented as an unpadded
sequence of case-insensitive base32 digits, without whitespace.

Is support for padding needed?

base32hex_fast

IMO, it is not worth pulling that much data into the cache.
However, the data in the table could be bswapped to eliminate the bswap in the code.
Bad_char entries could be a single set bit in the hi_byte, then only one branch for bad chars would be required..
though we'd still need to zero the bytes after the first bad char... because some bytes are made from 3 input chars.

@lemire
Copy link
Collaborator

lemire commented Jul 20, 2023

@aqrit

IMO, it is not worth pulling that much data into the cache.

I agree. This is research.

@lemire
Copy link
Collaborator

lemire commented Jul 20, 2023

I started implementing the padding check, and almost got it done, but after reading @aqrit 's comment, I am pulling out.

Because the functions return the number of bytes read, it is always easy to check the padding if you want to. All you need is a single loop when the number of bytes read is not a multiple of eight.

@lemire
Copy link
Collaborator

lemire commented Jul 20, 2023

@aqrit I have implemented your optimizations but it hardly makes a difference performance-wise. :-)

However the code is prettier.

@lemire
Copy link
Collaborator

lemire commented Jul 20, 2023

Ok. Here is what I get...

base32hex_avx                  :   8.02 GB/s  250.6 Ma/s   3.99 ns/d   3.24 GHz  12.91 c/d  61.21 i/d    0.4 c/b   1.91 i/b   4.74 i/c 
base32hex_simd                 :   6.95 GB/s  217.2 Ma/s   4.60 ns/d   3.23 GHz  14.86 c/d  70.21 i/d    0.5 c/b   2.19 i/b   4.72 i/c 
base32hex_fast                 :   2.54 GB/s   79.4 Ma/s  12.60 ns/d   3.21 GHz  40.39 c/d  119.21 i/d    1.3 c/b   3.73 i/b   2.95 i/c 
base32hex_simple               :   2.34 GB/s   73.0 Ma/s  13.70 ns/d   3.20 GHz  43.90 c/d  231.21 i/d    1.4 c/b   7.23 i/b   5.27 i/c 
base32hex_swar                 :   1.99 GB/s   62.2 Ma/s  16.08 ns/d   3.20 GHz  51.51 c/d  230.21 i/d    1.6 c/b   7.19 i/b   4.47 i/c 
b32_pton                       :   0.21 GB/s    6.5 Ma/s  153.07 ns/d   3.19 GHz  488.84 c/d  1429.90 i/d   15.3 c/b  44.68 i/b   2.93 i/c 

I am moving the URL to Code-used-on-Daniel-Lemire-s-blog/2023/07/20 from Code-used-on-Daniel-Lemire-s-blog/2023/07/20 and pushing a blog post.

@lemire
Copy link
Collaborator

lemire commented Jul 21, 2023

Blog post:

Fast decoding of base32 strings

@k0ekk0ek
Copy link
Contributor Author

I started implementing the padding check, and almost got it done, but after reading @aqrit 's comment, I am pulling out.

Because the functions return the number of bytes read, it is always easy to check the padding if you want to. All you need is a single loop when the number of bytes read is not a multiple of eight.

Apparently, we don't even want to support padding 😅. RFC5155 states:

The Next Hashed Owner Name field is represented as an unpadded sequence of case-insensitive base32 digits, without whitespace.

And indeed, the .com zones I have laying around contain no padding. So in the PR, checking if the field is followed by a character not in the contiguous set is sufficient.

@k0ekk0ek
Copy link
Contributor Author

That's some great numbers @lemire! Thank you both!

@lemire
Copy link
Collaborator

lemire commented Jul 21, 2023

The current code would work even if there is padding. It would simply be lenient about it (e.g., it would accept any character as padding, not just '=').

However, I should note that adding such a check (that the padding is done with '=') would be cheap... I considered adding it as an option (check_padding=true/false), but why make the function more complicated than it needs to be?

@k0ekk0ek
Copy link
Contributor Author

The code in NSD doesn't account for anything but valid characters and that code has been in use for a long time. I think it's better not to allow for padding and strictly follow the RFC. If there's a valid use case we can always add it afterwards(?)

@lemire
Copy link
Collaborator

lemire commented Jul 23, 2023

@k0ekk0ek I think that's what the code we have written does. What I mean by lenient is that as soon as invalid characters are found, you consider that you have terminated the base32 sequence.

Of course, you could have something like....

F1S6QOJADHQMKS3GCLIN4RB9F1Q6UT37+8***fds

What is would do is that it would stop at the +.

@k0ekk0ek
Copy link
Contributor Author

Sounds good to me 👍

@k0ekk0ek
Copy link
Contributor Author

@lemire, are you working on a PR? No pressure, just want to know if you want to do the honors or if you want me to integrate the changes 🙂

@lemire
Copy link
Collaborator

lemire commented Jul 25, 2023

I can prepare a PR, yes.

@lemire
Copy link
Collaborator

lemire commented Jul 26, 2023

PR available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants