Collaborator

@lemire lemire commented Jun 9, 2023

This adds the IPv4 parser from https://lemire.me/blog/2023/06/08/parsing-ip-addresses-crazily-fast/ to the project.

This was neither benchmarked nor tested in simdzone, though I have my own tests, some of them contributed by @aqrit.

Fixes #23

@lemire lemire requested a review from k0ekk0ek June 9, 2023 16:23

aqrit commented Jun 11, 2023

A quick test seems to show it not rejecting a number of invalid inputs (e.g. "111.1111.1.111").
https://gist.github.com/aqrit/8c5b25f079508d5417080d048a097b9e#file-test_sse_inet_aton-c
Note: Some of those errors are duplicates. Not all errors are caught (e.g. when a shuffle would grab bytes beyond strlen).

The problem is: The hash entry expects a certain dotmask but never actually verifies that it has it.

Possible solution:
To avoid another table...
The shuffle control mask has three unused bits in each byte (that are ignored by pshufb).
The expected positions of the dots_char could be marked in the pattern by setting bit6 of that byte.
Then the positions could be extracted using paddb x,x and pmovmskb x,
which could be compared against the actual dotmask that was fed into the perfect hash function.
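
A minimal scalar sketch of the idea above, for illustration only. It assumes a 16-byte pattern in which bit 6 of byte i is set when input position i is expected to hold a dot; the names DOT_MARK, expected_dotmask and dots_match are hypothetical and the actual table layout in the PR may differ. The loop models what paddb x,x followed by pmovmskb x would compute in two instructions.

```c
#include <stdint.h>

/* Bit 6 is ignored by pshufb, so it is free to carry a "dot expected" flag.
 * (Hypothetical name, not from the PR.) */
#define DOT_MARK 0x40u

/* Scalar model of `paddb x,x` followed by `pmovmskb x` on a 16-byte pattern:
 * doubling each byte moves bit 6 into bit 7, and the movemask gathers bit 7
 * of every byte into one 16-bit integer (bit i comes from byte i). */
static uint16_t expected_dotmask(const uint8_t pattern[16])
{
  uint16_t mask = 0;
  for (int i = 0; i < 16; i++) {
    const uint8_t doubled = (uint8_t)(pattern[i] + pattern[i]); /* paddb */
    mask |= (uint16_t)((doubled >> 7) << i);                    /* pmovmskb */
  }
  return mask;
}

/* Returns 1 if the dots found in the input sit exactly where the pattern
 * selected by the hash expects them, 0 otherwise. */
static int dots_match(const uint8_t pattern[16], uint16_t actual_dotmask)
{
  return expected_dotmask(pattern) == actual_dotmask;
}
```

For an input shaped like "1.2.3.4", bytes 1, 3 and 5 of the pattern would carry DOT_MARK, so the extracted mask is 0x2A and any other dotmask is rejected.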

Contributor

@k0ekk0ek k0ekk0ek left a comment

Very nice @lemire. Thank you! I prefer we get the build errors out of the way though. Tests will not be required as I haven't added proper ones for my code, but I really prefer check marks are green before merging. Also, we should have a look at @aqrit's comment (thank you, @aqrit). I'm definitely willing to write a couple of tests for that afterwards to avoid regressions, but probably best to tackle the issue in the algorithm before merging? As for myself, I prefer we drop the use of token->length. I'll accept the code without that change and revisit it as part of my own PR, but this is as good a time as any to at least discuss it?

SEMANTIC_ERROR(parser, "Invalid %s in %s",
field->name.data, type->name.data);
// Note that this assumes that reading up to token->data + 16 is safe (i.e., we do not cross a page).
if (sse_inet_aton(token->data, token->length, &parser->rdata->octets[parser->rdata->length]) != 1)
Contributor

token->length will not be available after #64 is merged. Preferably, we account for that at this stage (we can worry about it later too, but your knowledge of the algorithm is still fresh now?). In an attempt to vectorize timestamp deserialization (#22, still WiP), I opted to use the result of the subtract operation to detect where the IP address ends. The maximum string size is 16 bytes (INET_ADDRSTRLEN), we can use that to get the dot mask instead of ipv4_string_length(?), then later on use the result of the subtract operation combined with the found dots to determine what should be kept? Then we can verify with the length whether or not the character that follows the address is a valid delimiter. At least, that's what I think is a better approach than keeping the length with the token... (as always, this is open for suggestions/discussion)

Contributor

Taking a closer look at the code (there may be things I did not account for, and I'm still playing with it, but I wanted to share):

  const __m128i ascii0_9 = _mm_setr_epi8(
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 0, 0, 0, 0, 0, 0);
  const __m128i digits = _mm_cmpeq_epi8(input, _mm_shuffle_epi8(ascii0_9, input));

  // locate dots
  uint16_t dotmask;
  { 
    const __m128i dot = _mm_set1_epi8('.');
    dotmask = (uint16_t)_mm_movemask_epi8(_mm_cmpeq_epi8(input, dot));
    const uint16_t digit_mask = (uint16_t)_mm_movemask_epi8(digits);
    uint16_t mask = ~(digit_mask | dotmask);
    mask &= -mask;
    dotmask &= mask - 1;
    dotmask |= mask;
    // FIXME: testing for weird input can now be done using masks but perhaps we can
    //        catch it when determining hashcode too?
  }

We can then do the subtract before the shuffle:

  const __m128i t0 = _mm_and_si128(_mm_sub_epi8(input, ascii0), digits);

But does that allow us to drop the blend and the range check?
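
To make the mask arithmetic above easier to follow, here is a scalar sketch. The helper scan16 is a hypothetical stand-in for the SIMD compare and movemask steps, and terminate_dotmask is just a name given here to the four mask lines from the snippet; neither is part of the PR.

```c
#include <stdint.h>

/* Scalar stand-in for the compare + movemask steps: builds the digit and dot
 * bitmasks for 16 input bytes (bit i describes in[i]). */
static void scan16(const char in[16], uint16_t *digit_mask, uint16_t *dotmask)
{
  *digit_mask = 0;
  *dotmask = 0;
  for (int i = 0; i < 16; i++) {
    if (in[i] >= '0' && in[i] <= '9') *digit_mask |= (uint16_t)(1u << i);
    if (in[i] == '.')                 *dotmask    |= (uint16_t)(1u << i);
  }
}

/* Same arithmetic as the snippet above: cut the dot mask at the first byte
 * that is neither a digit nor a dot, and mark that byte as a terminator. */
static uint16_t terminate_dotmask(uint16_t digit_mask, uint16_t dotmask)
{
  uint16_t mask = (uint16_t)~(digit_mask | dotmask); /* neither digit nor dot */
  mask &= (uint16_t)-mask;          /* keep only the lowest such position */
  dotmask &= (uint16_t)(mask - 1);  /* drop dots at or after that position */
  dotmask |= mask;                  /* add the terminator bit */
  return dotmask;
}
```

With the 16 bytes "192.168.1.1 abc" (plus the terminating NUL), the dots sit at positions 3, 7 and 9 and the first foreign byte is the space at position 11, so the result is 0x0A88. Anything after the terminator, including further dots, no longer influences the mask.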

@aqrit aqrit Jun 13, 2023

But does that allow us to drop the blend and the range check?

It looks like it does, since it slices the input at the first bad char.

Here is an untested version with different bit twiddling:

uint16_t m = digit_mask | dotmask;
m ^= (m + 1); // mask of lowest clear bit and below
dotmask = ~digit_mask & m;
// string_length = popcnt or lzcnt   
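
A self-contained scalar sketch of this variant, under stated assumptions: the function names are made up for illustration, the portable popcount16 stands in for a hardware popcnt, and the length is taken as popcount(m) - 1, which is one reading of the "popcnt" hint and only holds when a terminating byte exists within the 16 bytes (always true for a valid address, which is at most 15 characters).

```c
#include <stdint.h>

/* Portable population count, standing in for a hardware popcnt. */
static unsigned popcount16(uint16_t v)
{
  unsigned n = 0;
  while (v) {
    v &= (uint16_t)(v - 1); /* clear the lowest set bit */
    n++;
  }
  return n;
}

/* Cuts the dot mask at the first byte that is neither digit nor dot, marks
 * that byte as a terminator, and reports how many bytes precede it. */
static uint16_t slice_dotmask(uint16_t digit_mask, uint16_t dotmask,
                              unsigned *length)
{
  uint16_t m = (uint16_t)(digit_mask | dotmask);
  m ^= (uint16_t)(m + 1);             /* lowest clear bit and everything below */
  *length = popcount16(m) - 1;        /* assumption: terminator within 16 bytes */
  return (uint16_t)(~digit_mask & m); /* dots before the cut, plus terminator */
}
```

For "192.168.1.1 abc" the digit mask is 0x0577 and the dot mask is 0x0288; the function returns 0x0A88 with a length of 11, the same dot mask the earlier four-line version produces for that input.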

Comment on lines +152 to +161
// check that leading digits of 2- 3- numbers are not zeros.
{
const __m128i eq0 = _mm_cmpeq_epi8(t1, ascii0);
if (!_mm_testz_si128(eq0, _mm_set_epi8(-1, 0, -1, 0, -1, 0, -1, 0,
0, 0, 0, 0, 0, 0, 0, 0))) {
return 0;
}
}
// replace null values with '0'
__m128i t1b = _mm_blendv_epi8(t1, ascii0, pattern);
Contributor

I'm wondering if this can be simplified. Next to the pattern, we could use an AND-mask and save that with the shuffle pattern? So, _mm_sub_epi8 then and with the mask (I'm assuming we need it to zero out bytes that weren't actually set in the textual presentation) and then simply do a _mm_cmpgt_epi8(0x09) (instead of _mm_max_epu8 + _mm_cmpeq_epi8 + _mm_test_all_ones). This may be a completely bogus idea (I should really study the algorithm better, I'll do so today or tomorrow, but it's something that popped into my head...)

@aqrit aqrit Jun 12, 2023

The "check that everything was in the range '0' to '9'" check could be combined with the overflow check.

_mm_maddubs_epi16 sign extends, so the trick is to set bad chars to 0x80 or some other value (or range) that would work. Then use _mm_adds_epu16() to combine the high and low 64-bit lanes.

edit: but that might put this check on the critical path

Note: the intrinsic guide shows _mm_test_all_ones() as a sequence unlike all the other PTEST mnemonics.

Collaborator Author

@k0ekk0ek An alternative, if you don't like _mm_max_epu8 is to do...

    const __m128i t2z = _mm_add_epi8(t2,_mm_set1_epi8(-128) );
    const __m128i c9 = _mm_set1_epi8('9' - '0' - 128);
    const __m128i t2me = _mm_cmpgt_epi8(t2z, c9);
    if (!_mm_test_all_zeros(t2me, t2me)) {
      return 0;
    }

At least with GCC, this results in exactly the same instruction count and no benefit performance-wise.

then simply do a _mm_cmpgt_epi8(0x09) (instead of _mm_max_epu8 + _mm_cmpeq_epi8 + _mm_test_all_ones)

Are you taking into account that _mm_cmpgt_epi8 is a signed comparison? This means that 0x80 is smaller than 0x09 (among other things).
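
The signedness pitfall can be seen on a single byte lane. The sketch below is a scalar model, not the PR code: gt9_signed mimics one lane of _mm_cmpgt_epi8(x, 9) and gt9_biased mimics the add-minus-128 variant shown earlier. Both names are hypothetical, and the casts to int8_t assume the usual two's complement wraparound.

```c
#include <stdint.h>

/* One lane of a signed "greater than 9" comparison. Bytes 0x80..0xFF are
 * negative when viewed as signed, so they are NOT reported as greater. */
static int gt9_signed(uint8_t x)
{
  return (int8_t)x > 9;
}

/* One lane of the biased comparison: adding -128 (equivalently, flipping the
 * top bit) to both operands turns the signed compare into an unsigned one. */
static int gt9_biased(uint8_t x)
{
  const int8_t xz = (int8_t)(uint8_t)(x + 0x80); /* x + (-128), wrapping */
  const int8_t c9 = (int8_t)(9 - 128);           /* '9' - '0' - 128 */
  return xz > c9;
}
```

For example, '.' - '0' wraps to 0xFE: the plain signed compare treats it as -2 and lets it pass, while the biased compare flags it, in agreement with an unsigned x > 9 test.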

Collaborator Author

I spent some time on this code and there is not a whole lot to save. If you omit the constants and the if clause, we are talking about ~2 instructions. If we could combine it with some other check, as @aqrit suggests, that might save us quite a bit, e.g., by taking out one branch. However, it is not obvious how to do it safely.

However, I am going to have to write a routine that checks that we only have dots and digits, and if I am going to do that anyhow, I will be able to save on this routine here (we just need to check that there is no '.' since the only possibilities are digits and dots).

Contributor

Not too worried about _mm_max_epu8 per se. It's more that we'll need the length (or if there's another way, that's fine too) to verify there's a delimiter right after the address (after #64 is merged). I figured picking up a __m128i (mask) to do an _mm_and_si128 to zero out non-digits (as opposed to _mm_blendv_epi8), move the check forward and base it on a bitmask instead, we may even be a little faster(?) The bitmask could also be used to count trailing zeroes (loading that from a table is another possibility). My thinking was: if there's any bit set before the delimiter, that is not set in the dotmask or digit_mask, we'd have an invalid input. Maybe even combine multiple checks (either by doing them sooner or by delaying and doing them later) that would save even more (as @aqrit suggested). And you're right, I did not take into account _mm_cmpgt_epi8 is signed 😅. Only after toying with the code yesterday, did I find out.

Collaborator Author

I just pushed a commit that provides two functions, one with a length parameter and one without (sse_inet_aton_16). The new code does not do the _mm_max_epu8 thing nor does it do blending because we know that we only have digits or dots to check (or else, the length will not match). So you save some work there, but you pay for the length computation. It is a net loss in throughput unfortunately because of the long chain of dependencies...

Collaborator Author

lemire commented Jun 12, 2023

@aqrit Thanks aqrit.

This was referenced Jun 14, 2023
Collaborator Author

lemire commented Jun 14, 2023

@aqrit My last commit should address your concern. I had indeed omitted to check the length.

5be9c78

This was always meant to be in the code, I just somehow dropped it accidentally.

I have integrated your test in my blog post's code (with credit to you).

The problem is: The hash entry expects a certain dotmask but never actually verifies that it has it.

That's what @WojciechMula did, but that's relatively expensive. You don't need to do that and that's part of why my routine is faster.

Stored with the shuffle mask, we have the expected length of the pattern. You just have to check that the input length matches it. The reason why that suffices is not immediately obvious, but it ties into another difference between what @WojciechMula did and what I do: I validate the digits after shuffling. This makes sure that if I have a '.' at an unexpected location, I catch it. So basically, I know that the input has the expected length, and that it has digits at all the expected locations. What remains is that it could have missing dots (locations where I expect a dot, but there is something else there). But you can verify that this cannot happen (you can just brute force the check, it is very fast).

Collaborator Author

lemire commented Jun 14, 2023

@aqrit @k0ekk0ek Next, I am looking at the optimizations that you are proposing. :-)

Collaborator Author

lemire commented Jun 14, 2023

Ok. If we are not provided the length, then we can do the work using the same number of instructions. Sadly, we suffer due to a longer critical path under GCC...

sse_inet_aton                  :   2.19 GB/s  165.1 Ma/s   6.06 ns/a   3.19 GHz  19.35 c/a  65.01 i/a    1.5 c/b   4.90 i/b   3.36 i/c 
sse_inet_aton_16               :   1.88 GB/s  141.4 Ma/s   7.07 ns/a   3.19 GHz  22.60 c/a  62.01 i/a    1.7 c/b   4.67 i/b   2.74 i/c 
inet_pton                      :   0.38 GB/s   28.9 Ma/s  34.61 ns/a   3.19 GHz  110.49 c/a  308.89 i/a    8.3 c/b  23.28 i/b   2.80 i/c 

Interestingly, if I switch to LLVM/clang, then the gap remains, but everything is just faster...

sse_inet_aton                  :   2.38 GB/s  179.4 Ma/s   5.57 ns/a   3.19 GHz  17.81 c/a  54.01 i/a    1.3 c/b   4.07 i/b   3.03 i/c 
sse_inet_aton_16               :   2.09 GB/s  157.6 Ma/s   6.35 ns/a   3.19 GHz  20.27 c/a  55.01 i/a    1.5 c/b   4.15 i/b   2.71 i/c 
inet_pton                      :   0.38 GB/s   29.0 Ma/s  34.54 ns/a   3.19 GHz  110.27 c/a  308.89 i/a    8.3 c/b  23.28 i/b   2.80 i/c 

This is LLVM 16. This would be worth investigating.

Collaborator Author

lemire commented Jun 14, 2023

I managed to save some instructions, but it just does not help the speed a lot...

sse_inet_aton                  :   2.19 GB/s  165.0 Ma/s   6.06 ns/a   3.19 GHz  19.36 c/a  65.01 i/a    1.5 c/b   4.90 i/b   3.36 i/c 
sse_inet_aton_16               :   1.92 GB/s  144.4 Ma/s   6.92 ns/a   3.19 GHz  22.12 c/a  60.01 i/a    1.7 c/b   4.52 i/b   2.71 i/c 
inet_pton                      :   0.38 GB/s   29.0 Ma/s  34.51 ns/a   3.19 GHz  110.18 c/a  308.89 i/a    8.3 c/b  23.28 i/b   2.80 i/c 

Still... we are 5x faster than inet_pton with sse_inet_aton_16, so that's good.

@lemire lemire requested review from aqrit and k0ekk0ek June 14, 2023 20:34
@k0ekk0ek
Contributor

To provide some background on removal of length:

#30 describes how the scanner (stage1) is simpler if we don't need to provide the length. This has to do with the fact that a field can be delimited by a space, which we don't have to keep, or a character that has significance, say a newline, quote, etc (ends one token, starts another). The latter requires us to retain that index, so extra ops for every token even when it's not used (TTL, etc). To further complicate things, the zone format allows for newlines in quoted text (never happens, but still), so in order to keep parsing functions clean, I opted to handle newline count in the scanner. To avoid storing+reading the embedded newline count if there's no need, I needed a way to signal that there's embedded newlines. I did so by storing a pointer to a specialized string instead. However, that trick is only possible if we don't have to calculate the length. My reasoning is that for every data-type but strings and names, we know what to scan for. If it's an integer it's digits, etc. Everything that follows must then be a delimiter or the value is invalid.

It's possible that there's a way that allows us to keep the length, but after much trial-and-error I wasn't able to find it (multiple tapes, more complicated indexes, etc). The zone format doesn't allow for much leeway.

So, we pay a little for removing the length here, but we regain some in the scanner. I think it's the right tradeoff, but I could be wrong...

const __m128i t6 = _mm_packus_epi16(t5, t5);
uint32_t address = (uint32_t)_mm_cvtsi128_si32(t6);
memcpy(destination, &address, 4);
return (int)(ipv4_string_length - (size_t)pat[6]);

needs coercion to bool
return (int)((ipv4_string_length - (size_t)pat[6]) == 1);

Collaborator Author

The function returns 1 on success. It returns an integer value different from 1 on failure.

Collaborator Author

Please check how it is used:

if (sse_inet_aton_16(token->data, &parser->rdata->octets[parser->rdata->length]) != 1)

(This is basically how @k0ekk0ek wrote it.)

Collaborator Author

lemire commented Jun 14, 2023

#30 describes how the scanner (stage1) is simpler if we don't need to provide the length. (...)

Yes, yes, yes. I wasn't arguing or disputing the choice.

So, we pay a little for removing the length here, but we regain some in the scanner. I think it's the right tradeoff, but I could be wrong...

I don't think it is necessarily going to be slower. We have the same instruction count. In my synthetic benchmark, it is slower because there is a longer dependency chain... But once this code gets inlined with other code, it is possible that it will all get sorted out.

The way I would address this is by benchmarking it and by looking at the instruction-per-cycle count.

I was not discouraging the architectural change to be clear.

Collaborator Author

lemire commented Jun 14, 2023

@k0ekk0ek In my last commit, I export the size parameter and I compare it against the token length. I am not sure why this is helpful, but it is my interpretation of what you asked.

@k0ekk0ek
Contributor

I didn't read it as such. (anything in simdzone is up for debate btw, up to now it's been mostly me and for many things better alternatives may exist)

One more scenario that supports why not providing the length is (probably?) the right choice (or, not the worst 😅) is described here. So, for the ipv4hint service parameter (SVCB RRTYPE), we can check if token->data[ipv4_length] == ',' and use the function as a building block.
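
A rough sketch of the building-block idea, with loud assumptions: parse_ipv4_prefix is a hypothetical scalar stand-in for a length-free parser (it is not sse_inet_aton_16 and does not share its return convention), it returns the number of characters consumed, and parse_ipv4hint treats NUL or a space as the end of the list. The point is only that the caller, not the parser, decides which byte after the address is acceptable.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: parses a dotted quad at the start of `s`, stores the
 * four octets in `out` and returns the number of characters consumed, or 0 on
 * failure. It does not inspect the character that follows the address. */
static size_t parse_ipv4_prefix(const char *s, uint8_t out[4])
{
  size_t pos = 0;
  for (int octet = 0; octet < 4; octet++) {
    unsigned value = 0;
    size_t digits = 0;
    while (s[pos] >= '0' && s[pos] <= '9' && digits < 3) {
      value = value * 10 + (unsigned)(s[pos] - '0');
      pos++;
      digits++;
    }
    if (digits == 0 || value > 255) return 0;
    if (digits > 1 && s[pos - digits] == '0') return 0; /* no leading zeros */
    out[octet] = (uint8_t)value;
    if (octet < 3) {
      if (s[pos] != '.') return 0;
      pos++;
    }
  }
  return pos;
}

/* Parses a comma-separated list of addresses. Returns the number of addresses
 * written to `out` (4 bytes each), or -1 on malformed input or overflow. */
static int parse_ipv4hint(const char *s, uint8_t *out, size_t max_addresses)
{
  size_t count = 0;
  for (;;) {
    if (count == max_addresses) return -1;
    const size_t len = parse_ipv4_prefix(s, &out[4 * count]);
    if (len == 0) return -1;
    count++;
    if (s[len] == ',') {          /* another address follows */
      s += len + 1;
      continue;
    }
    if (s[len] == '\0' || s[len] == ' ') return (int)count; /* assumed delimiters */
    return -1;                    /* e.g. a fourth digit or stray character */
  }
}
```

Because the parser reports where the address ends, the same routine serves an A record (delimiter must be whitespace or end of input) and a hint list (a comma is also acceptable) without a length being passed in.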

Great stuff @lemire! I'll happily settle for a 5x speedup 😄

I think it's best to merge this and for me to touch up #64 afterwards.

@k0ekk0ek
Contributor

@k0ekk0ek In my last commit, I export the size parameter and I compare it against the token length. I am not sure why this is helpful, but it is my interpretation of what you asked.

Your interpretation is correct. This could be an oversight on my part, I thought the length check was removed, but in the original it is still there (my apologies). Let's keep it as it is now and merge it. I'll then make #64 work with the new code tomorrow.

@k0ekk0ek k0ekk0ek merged commit 4736daf into NLnetLabs:main Jun 15, 2023

aqrit commented Jun 16, 2023

Here's a spin on this:
https://gist.github.com/aqrit/93943a7861048c32d6da05a0fcc11e5d

It is branch-free, except for the hash bound check.
Note: limited testing, no benchmarks.

@k0ekk0ek
Contributor

Thanks @aqrit! Really nice of you to bring this to our attention.

I'll study it after the weekend. If it works as advertised, are we allowed to use it under the BSD-3-Clause license?

Of course, we'd need to benchmark too. @lemire, don't feel obligated, but I expect you're interested?

@aqrit, if you like this kind of puzzle, and if you want (don't feel obligated by any means), any input on deserialization of data-types we're trying to vectorize would be appreciated. (there's certainly more than enough to go around 😅)

Collaborator Author

lemire commented Jun 16, 2023

I will review.

@k0ekk0ek
Contributor

Scanned the code for a bit. I get the idea for check_lz and check_of to cut the number of branches. @lemire is more experienced wrt benchmarking etc, so I'll let him determine if we want to adopt (some of) the code.

Development

Successfully merging this pull request may close these issues.

Vectorize IPv4 deserialization

3 participants