Adding fast IPv4 parser. #65

lemire · 2023-06-09T16:23:27Z

This adds the IPv4 parser from https://lemire.me/blog/2023/06/08/parsing-ip-addresses-crazily-fast/ to the project.

This was neither benchmarked nor tested in simdzone, though I have my own tests, some of them contributed by @aqrit

Fixes #23

aqrit · 2023-06-11T20:48:55Z

A quick test seems to show it not rejecting a number of invalid inputs (e.g., "111.1111.1.111").
https://gist.github.com/aqrit/8c5b25f079508d5417080d048a097b9e#file-test_sse_inet_aton-c
Note: Some of those errors are duplicates. Not all errors are caught (e.g. when a shuffle would grab bytes beyond strlen).

The problem is: The hash entry expects a certain dotmask but never actually verifies that it has it.

Possible solution:
To avoid another table...
The shuffle control mask has three unused bits in each byte (that are ignored by pshufb).
The expected positions of the dots_char could be marked in the pattern by setting bit6 of that byte.
Then the positions could be extracted using paddb x,x and pmovmskb x
Which could be compared against the actual dotmask that was fed into the perfect hash function.
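A scalar sketch of that idea, for illustration only. The `EXPECT_DOT` flag and the function names are made up here and are not part of the PR; the loop models what `paddb x,x` followed by `pmovmskb x` would compute on a 16-byte pattern.

```c
#include <stdint.h>

/* Hypothetical convention: bit 6 of pattern byte i means
   "a dot is expected at input position i". pshufb ignores this bit. */
#define EXPECT_DOT 0x40

/* Scalar model of `paddb x,x` + `pmovmskb x`: doubling a byte moves
   bit 6 into bit 7, and movemask collects bit 7 of every byte. */
uint16_t expected_dotmask(const uint8_t pattern[16]) {
  uint16_t mask = 0;
  for (int i = 0; i < 16; i++) {
    uint8_t doubled = (uint8_t)(pattern[i] + pattern[i]);
    mask |= (uint16_t)((doubled >> 7) << i);
  }
  return mask;
}

/* Returns 1 when the dots found in the input are exactly the dots
   the table entry expects. */
int dots_match(const uint8_t pattern[16], uint16_t actual_dotmask) {
  return expected_dotmask(pattern) == actual_dotmask;
}
```

For an input shaped like "1.2.3.4", the flag would be set on pattern bytes 1, 3 and 5, giving an expected mask of 0x2A.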

k0ekk0ek

Very nice @lemire. Thank you! I prefer we get the build errors out of the way though. Tests will not be required as I haven't added proper ones for my code, but I really prefer check marks are green before merging. Also, we should have a look at @aqrit's comment (thank you, @aqrit). I'm definitely willing to write a couple of tests for that afterwards to avoid regressions, but probably best to tackle the issue in the algorithm before merging? As for myself, I prefer we drop the use of token->length. I'll accept the code without that change and revisit it as part of my own PR, but this is as good a time as any to at least discuss it?

k0ekk0ek · 2023-06-12T09:28:19Z

src/westmere/ip4.h

+    SEMANTIC_ERROR(parser, "Invalid %s in %s",
+                   field->name.data, type->name.data);
+  // Note that this assumes that reading up to token->data + 16 is safe (i.e., we do not cross a page).
+  if (sse_inet_aton(token->data, token->length, &parser->rdata->octets[parser->rdata->length]) != 1)


token->length will not be available after #64 is merged. Preferably, we account for that at this stage (we can worry about it later too, but your knowledge of the algorithm is still fresh now?). In an attempt to vectorize timestamp deserialization (#22, still WiP), I opted to use the result of the subtract operation to detect where the IP address ends. The maximum string size is 16 bytes (INET_ADDRSTRLEN), we can use that to get the dot mask instead of ipv4_string_length(?), then later on use the result of the subtract operation combined with the found dots to determine what should be kept? Then we can verify with the length whether or not the character that follows the address is a valid delimiter. At least, that's what I think is a better approach than keeping the length with the token... (as always, this is open for suggestions/discussion)

Taking a closer look at the code (there may be things I did not account for, and I'm still playing with it, but I wanted to share):

const __m128i ascii0_9 = _mm_setr_epi8(
  '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 0, 0, 0, 0, 0, 0);
const __m128i digits = _mm_cmpeq_epi8(input, _mm_shuffle_epi8(ascii0_9, input));
// locate dots
uint16_t dotmask;
{
  const __m128i dot = _mm_set1_epi8('.');
  dotmask = (uint16_t)_mm_movemask_epi8(_mm_cmpeq_epi8(input, dot));
  const uint16_t digit_mask = (uint16_t)_mm_movemask_epi8(digits);
  uint16_t mask = ~(digit_mask | dotmask);
  mask &= -mask;
  dotmask &= mask - 1;
  dotmask |= mask;
  // FIXME: testing for weird input can now be done using masks but perhaps we can
  //        catch it when determining hashcode too?
}

We can then do the subtract before the shuffle:

const __m128i t0 = _mm_and_si128(_mm_sub_epi8(input, ascii0), digits);

But does that allow us to drop the blend and the range check?

But does that allow us to drop the blend and the range check?

It looks like it does, since it slices the input at the first bad char.

Here is an untested version with different bit twiddling:

uint16_t m = digit_mask | dotmask;
m ^= (m + 1); // mask of lowest clear bit and below
dotmask = ~digit_mask & m;
// string_length = popcnt or lzcnt
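To make the two mask formulations in this thread easier to compare, here is a portable sketch that works on plain 16-bit masks, with no intrinsics. The function names are hypothetical. Both variants keep the dots that precede the first byte that is neither a digit nor a dot, and set one extra bit at that byte's position; they agree as long as no position is flagged as both digit and dot, which cannot happen for real input.

```c
#include <stdint.h>

/* Variant 1: isolate the lowest bit that is neither digit nor dot. */
uint16_t clip_dotmask_isolate(uint16_t digit_mask, uint16_t dotmask) {
  uint16_t mask = (uint16_t)~(digit_mask | dotmask);
  mask &= (uint16_t)-mask;          /* lowest set bit = first bad char */
  dotmask &= (uint16_t)(mask - 1);  /* keep only the dots before it */
  dotmask |= mask;                  /* mark the terminator position */
  return dotmask;
}

/* Variant 2: build "lowest clear bit and everything below" with XOR. */
uint16_t clip_dotmask_xor(uint16_t digit_mask, uint16_t dotmask) {
  uint16_t m = digit_mask | dotmask;
  m ^= (uint16_t)(m + 1);           /* lowest clear bit and all bits below */
  return (uint16_t)(~digit_mask & m);
}
```

For "1.2.3.4" followed by a space and then unrelated bytes that happen to contain a dot at position 9, `digit_mask` is 0x55 and the raw `dotmask` is 0x22A; both variants return 0xAA (dots at 1, 3, 5 plus the terminator at 7).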


k0ekk0ek · 2023-06-12T10:26:58Z

src/westmere/ip4.h

+  // check that leading digits of 2- 3- numbers are not zeros.
+  {
+    const __m128i eq0 = _mm_cmpeq_epi8(t1, ascii0);
+    if (!_mm_testz_si128(eq0, _mm_set_epi8(-1, 0, -1, 0, -1, 0, -1, 0,
+                                           0, 0, 0, 0, 0, 0, 0, 0))) {
+      return 0;
+    }
+  }
+  // replace null values with '0'
+  __m128i t1b = _mm_blendv_epi8(t1, ascii0, pattern);


I'm wondering if this can be simplified. Next to the pattern, we could use an AND-mask and save that with the shuffle pattern? So, _mm_sub_epi8 then and with the mask (I'm assuming we need it to zero out bytes that weren't actually set in the textual presentation) and then simply do a _mm_cmpgt_epi8(0x09) (instead of _mm_max_epu8 + _mm_cmpeq_epi8 + _mm_test_all_ones). This may be a completely bogus idea (I should really study the algorithm better, I'll do so today or tomorrow, but it's something that popped into my head...)

The "check that everything was in the range '0' to '9'" check could be combined with the overflow check.

_mm_maddubs_epi16 sign extends, so the trick is to set bad chars to 0x80 or some other value (or range) that would work. Then use _mm_adds_epu16() to combine the high and low 64-bit lanes.

edit: but that might put this check on the critical path

Note: the intrinsic guide shows _mm_test_all_ones() as a sequence unlike all the other PTEST mnemonics.

@k0ekk0ek An alternative, if you don't like _mm_max_epu8 is to do...

const __m128i t2z = _mm_add_epi8(t2, _mm_set1_epi8(-128));
const __m128i c9 = _mm_set1_epi8('9' - '0' - 128);
const __m128i t2me = _mm_cmpgt_epi8(t2z, c9);
if (!_mm_test_all_zeros(t2me, t2me)) {
  return 0;
}

At least with GCC, this results in exactly the same instruction count and no benefit performance-wise.

then simply do a _mm_cmpgt_epi8(0x09) (instead of _mm_max_epu8 + _mm_cmpeq_epi8 + _mm_test_all_ones)

Are you taking into account that _mm_cmpgt_epi8 is a signed comparison? This means that 0x80 is smaller than 0x09 (among other things).
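A per-byte scalar model of the pitfall, and of the bias trick from the snippet above, may help here. The function names are hypothetical; each one mimics what a single lane of `_mm_cmpgt_epi8` would decide.

```c
#include <stdint.h>

/* Interpret a byte the way a signed 8-bit lane does. */
static int to_signed8(uint8_t x) {
  return x >= 128 ? (int)x - 256 : (int)x;
}

/* Naive check: a signed compare against 9. Bytes 0x80..0xFF look
   negative, so they wrongly pass as "not greater than 9". */
int gt9_naive_signed(uint8_t x) {
  return to_signed8(x) > 9;
}

/* Biased check: add 128 (mod 256) to the byte and compare against
   9 - 128. This turns the signed compare into an unsigned one. */
int gt9_biased(uint8_t x) {
  uint8_t shifted = (uint8_t)(x + 128);
  return to_signed8(shifted) > (9 - 128);
}
```

With this model, `gt9_naive_signed(0x80)` is 0 while `gt9_biased(0x80)` is 1, and `gt9_biased(x)` agrees with the unsigned test `x > 9` for every byte value.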

I spent some time on this code and there is not a whole lot to save. If you omit the constants and the if clause, we are talking about ~2 instructions. If we could combine it with some other check, as @aqrit suggests, that might save us quite a bit, e.g., by taking out one branch. However, it is not obvious how to do it safely.

However, I am going to have to write a routine that checks that we only have dots and digits, and if I am going to do that anyhow, I will be able to save on this routine here (we just need to check that there is no '.' since the only possibilities are digits and dots).

Not too worried about _mm_max_epu8 per se. It's more that we'll need the length (or if there's another way, that's fine too) to verify there's a delimiter right after the address (after #64 is merged). I figured picking up a __m128i (mask) to do an _mm_and_si128 to zero out non-digits (as opposed to _mm_blendv_epi8), move the check forward and base it on a bitmask instead, we may even be a little faster(?) The bitmask could also be used to count trailing zeroes (loading that from a table is another possibility). My thinking was: if there's any bit set before the delimiter, that is not set in the dotmask or digit_mask, we'd have an invalid input. Maybe even combine multiple checks (either by doing them sooner or by delaying and doing them later) that would save even more (as @aqrit suggested). And you're right, I did not take into account _mm_cmpgt_epi8 is signed 😅. Only after toying with the code yesterday, did I find out.

I just pushed a commit that provides two functions, one with a length parameter and one without (sse_inet_aton_16). The new code does not do the _mm_max_epu8 thing nor does it do blending because we know that we only have digits or dots to check (or else, the length will not match). So you save some work there, but you pay for the length computation. It is a net loss in throughput, unfortunately, because of the long chain of dependencies...

lemire · 2023-06-12T12:27:18Z

@aqrit Thanks aqrit.

lemire · 2023-06-14T17:36:47Z

@aqrit My last commit should address your concern. I had indeed omitted to check the length.

5be9c78

This was always meant to be in the code, I just somehow dropped it accidentally.

I have integrated your test in my blog post's code (with credit to you).

The problem is: The hash entry expects a certain dotmask but never actually verifies that it has it.

That's what @WojciechMula did, but that's relatively expensive. You don't need to do that and that's part of why my routine is faster.

Along with the shuffle mask, the table entry holds the expected length of the pattern. You just have to check that the input length matches it. The reason why that is enough is not immediately obvious, but it ties into another difference between what @WojciechMula did and what I do: I validate the digits after shuffling. This makes sure that if I have a '.' at an unexpected location, I catch it. So basically, I know that the input has the expected length, and that it has digits at all the expected locations. What remains is that it could have missing dots (locations where I expect a dot, but there is something else there). But you can verify that it cannot happen (you can just brute force the check, it is very fast).
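For brute-force checks of this kind, a strict scalar reference is handy as an oracle to compare the SIMD routine against. The sketch below is an assumption about what such an oracle could look like (the name `ref_inet_aton` is hypothetical and it is not the code in this PR). It rejects empty octets, leading zeros, values above 255, missing dots and trailing bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 and fills out[0..3] on success, 0 otherwise.
   On failure, out may have been partially written. */
int ref_inet_aton(const char *s, size_t len, uint8_t out[4]) {
  size_t i = 0;
  for (int octet = 0; octet < 4; octet++) {
    size_t start = i;
    unsigned value = 0;
    /* read at most three decimal digits */
    while (i < len && s[i] >= '0' && s[i] <= '9' && i - start < 3) {
      value = value * 10 + (unsigned)(s[i] - '0');
      i++;
    }
    size_t ndigits = i - start;
    if (ndigits == 0) return 0;                   /* empty octet */
    if (ndigits > 1 && s[start] == '0') return 0; /* leading zero */
    if (value > 255) return 0;                    /* overflow */
    out[octet] = (uint8_t)value;
    if (octet < 3) {
      /* a fourth digit or any other byte lands here and is rejected */
      if (i >= len || s[i] != '.') return 0;
      i++;
    }
  }
  return i == len; /* nothing may follow the last octet */
}
```

With this oracle, "111.1111.1.111" is rejected because the second octet is followed by a fourth digit where a dot is required.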

lemire · 2023-06-14T17:40:26Z

@aqrit @k0ekk0ek Next, I am looking at the optimizations that you are proposing. :-)

lemire · 2023-06-14T20:07:53Z

Ok. If we are not provided the length, then we can do the work using the same number of instructions. Sadly, we suffer due to a longer critical path under GCC...

sse_inet_aton                  :   2.19 GB/s  165.1 Ma/s   6.06 ns/a   3.19 GHz  19.35 c/a  65.01 i/a    1.5 c/b   4.90 i/b   3.36 i/c 
sse_inet_aton_16               :   1.88 GB/s  141.4 Ma/s   7.07 ns/a   3.19 GHz  22.60 c/a  62.01 i/a    1.7 c/b   4.67 i/b   2.74 i/c 
inet_pton                      :   0.38 GB/s   28.9 Ma/s  34.61 ns/a   3.19 GHz  110.49 c/a  308.89 i/a    8.3 c/b  23.28 i/b   2.80 i/c

Interestingly, if I switch to LLVM/clang, then the gap remains, but everything is just faster...

sse_inet_aton                  :   2.38 GB/s  179.4 Ma/s   5.57 ns/a   3.19 GHz  17.81 c/a  54.01 i/a    1.3 c/b   4.07 i/b   3.03 i/c 
sse_inet_aton_16               :   2.09 GB/s  157.6 Ma/s   6.35 ns/a   3.19 GHz  20.27 c/a  55.01 i/a    1.5 c/b   4.15 i/b   2.71 i/c 
inet_pton                      :   0.38 GB/s   29.0 Ma/s  34.54 ns/a   3.19 GHz  110.27 c/a  308.89 i/a    8.3 c/b  23.28 i/b   2.80 i/c

This is LLVM 16. This would be worth investigating.

lemire · 2023-06-14T20:34:03Z

I managed to save some instructions, but it just does not help the speed a lot...

sse_inet_aton                  :   2.19 GB/s  165.0 Ma/s   6.06 ns/a   3.19 GHz  19.36 c/a  65.01 i/a    1.5 c/b   4.90 i/b   3.36 i/c 
sse_inet_aton_16               :   1.92 GB/s  144.4 Ma/s   6.92 ns/a   3.19 GHz  22.12 c/a  60.01 i/a    1.7 c/b   4.52 i/b   2.71 i/c 
inet_pton                      :   0.38 GB/s   29.0 Ma/s  34.51 ns/a   3.19 GHz  110.18 c/a  308.89 i/a    8.3 c/b  23.28 i/b   2.80 i/c

Still... we are 5x faster than inet_pton with sse_inet_aton_16, so that's good.
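As a quick sanity check, the speedup and the instructions-per-cycle column can be recomputed from the other columns of the table just above (rounded values, taken from the sse_inet_aton_16 and inet_pton rows):

```latex
\frac{34.51\ \text{ns/a}}{6.92\ \text{ns/a}} \approx 4.99,
\qquad
\frac{60.01\ \text{i/a}}{22.12\ \text{c/a}} \approx 2.71\ \text{i/c},
\qquad
\frac{65.01\ \text{i/a}}{19.36\ \text{c/a}} \approx 3.36\ \text{i/c}
```

The last two ratios show that sse_inet_aton_16 retires fewer instructions per cycle than sse_inet_aton even though it executes fewer instructions per address.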

k0ekk0ek · 2023-06-14T20:48:07Z

To provide some background on removal of length:

#30 describes how the scanner (stage1) is simpler if we don't need to provide the length. This has to do with the fact that a field can be delimited by a space, which we don't have to keep, or a character that has significance, say a newline, quote, etc (ends one token, starts another). The latter requires us to retain that index, so extra ops for every token even when it's not used (TTL, etc). To further complicate things, the zone format allows for newlines in quoted text (never happens, but still), so in order to keep parsing functions clean, I opted to handle newline count in the scanner. To avoid storing+reading the embedded newline count if there's no need, I needed a way to signal that there's embedded newlines. I did so by storing a pointer to a specialized string instead. However, that trick is only possible if we don't have to calculate the length. My reasoning is that for every data-type but strings and names, we know what to scan for. If it's an integer it's digits, etc. Everything that follows must then be a delimiter or the value is invalid.

It's possible that there's a way that allows us to keep the length, but after much trial-and-error I wasn't able to find it (multiple tapes, more complicated indexes, etc). The zone format doesn't allow for much leeway.

So, we pay a little for removing the length here, but we regain some in the scanner. I think it's the right tradeoff, but I could be wrong...

aqrit · 2023-06-14T20:55:03Z

src/westmere/ip4.h

+  const __m128i t6 = _mm_packus_epi16(t5, t5);
+  uint32_t address =  (uint32_t)_mm_cvtsi128_si32(t6);
+  memcpy(destination, &address, 4);
+  return (int)(ipv4_string_length - (size_t)pat[6]);


needs coercion to bool
return (int)((ipv4_string_length - (size_t)pat[6]) == 1);

The function returns 1 on success. It returns an integer value different from 1 on failure.

Please check how it is used:

if (sse_inet_aton_16(token->data, &parser->rdata->octets[parser->rdata->length]) != 1)

(This is basically how @k0ekk0ek wrote it.)

lemire · 2023-06-14T21:39:35Z

#30 describes how the scanner (stage1) is simpler if we don't need to provide the length. (...)

Yes, yes, yes. I wasn't arguing or disputing the choice.

So, we pay a little for removing the length here, but we regain some in the scanner. I think it's the right tradeoff, but I could be wrong...

I don't think it is necessarily going to be slower. We have the same instruction count. In my synthetic benchmark, it is slower because there is a longer dependency chain... But once this code gets inlined with other code, it is possible that it will all get sorted out.

The way I would address this is by benchmarking it and by looking at the instruction-per-cycle count.

I was not discouraging the architectural change to be clear.

lemire · 2023-06-14T21:48:30Z

@k0ekk0ek In my last commit, I export the size parameter and I compare it against the token length. I am not sure why this is helpful, but it is my interpretation of what you asked.

k0ekk0ek · 2023-06-14T21:59:46Z

I didn't read it as such. (anything in simdzone is up for debate btw, up to now it's been mostly me and for many things better alternatives may exist)

One more scenario that supports why not providing the length is (probably?) the right choice (or, not the worst 😅) is described here. So, for the ipv4hint service parameter (SVCB RRTYPE), we can check if token->data[ipv4_length] == ',' and use the function as a building block.
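A rough sketch of that building-block idea, under stated assumptions: `parse_ipv4_prefix` is a hypothetical scalar stand-in for a length-free parser (it returns the number of characters consumed, which is not the return convention of sse_inet_aton_16), and the delimiter set accepted at the end of the list is a guess for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: parses one address at the start of a
   NUL-terminated string, returns characters consumed or 0 on failure. */
size_t parse_ipv4_prefix(const char *s, uint8_t out[4]) {
  size_t i = 0;
  for (int octet = 0; octet < 4; octet++) {
    size_t start = i;
    unsigned value = 0;
    while (s[i] >= '0' && s[i] <= '9' && i - start < 3) {
      value = value * 10 + (unsigned)(s[i] - '0');
      i++;
    }
    if (i == start || value > 255) return 0;        /* empty or overflow */
    if (i - start > 1 && s[start] == '0') return 0; /* leading zero */
    out[octet] = (uint8_t)value;
    if (octet < 3) {
      if (s[i] != '.') return 0;
      i++;
    }
  }
  return i;
}

/* Parses "a.b.c.d,e.f.g.h,..." into out; returns the number of
   addresses, or -1 on malformed input or when more than max are given. */
int parse_ipv4hint(const char *s, uint8_t out[][4], int max) {
  int count = 0;
  for (;;) {
    if (count == max) return -1;
    size_t n = parse_ipv4_prefix(s, out[count]);
    if (n == 0) return -1;
    count++;
    if (s[n] == ',') {        /* another address follows */
      s += n + 1;
      continue;
    }
    /* assumed delimiters that end the list */
    if (s[n] == '\0' || s[n] == ' ' || s[n] == '\t' || s[n] == '\n')
      return count;
    return -1;                /* anything else right after the address */
  }
}
```

The caller never needs the token length: the byte right after each address decides whether the list continues, ends, or is invalid.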

Great stuff @lemire! I'll happily settle for a 5x speedup 😄

I think it's best to merge this and for me to touch up #64 afterwards.

k0ekk0ek · 2023-06-14T22:04:02Z

@k0ekk0ek In my last commit, I export the size parameter and I compare it against the token length. I am not sure why this is helpful, but it is my interpretation of what you asked.

Your interpretation is correct. This could be an oversight on my part: I thought the length check was removed, but in the original it is still there (my apologies). Let's keep it as it is now and merge it. I'll then make #64 work with the new code tomorrow.

aqrit · 2023-06-16T01:34:06Z

Here's a spin on this:
https://gist.github.com/aqrit/93943a7861048c32d6da05a0fcc11e5d

It is branch-free, except for the hash bound check.
Note: limited testing, no benchmarks.

k0ekk0ek · 2023-06-16T07:40:29Z

Thanks @aqrit! Really nice of you to bring this to our attention.

I'll study it after the weekend. If it works as advertised, are we allowed to use it under the BSD-3-Clause license?

Of course, we'd need to benchmark too. @lemire, don't feel obligated, but I expect you're interested?

@aqrit, if you like this kind of puzzle, and if you want (don't feel obligated by any means), any input on deserialization of data-types we're trying to vectorize would be appreciated. (there's certainly more than enough to go around 😅)

lemire · 2023-06-16T11:09:16Z

I will review.

k0ekk0ek · 2023-06-19T09:23:00Z

Scanned the code for a bit. I get the idea for check_lz and check_of to cut the number of branches. @lemire is more experienced wrt benchmarking etc., so I'll let him determine if we want to adopt (some of) the code.

Daniel Lemire added 2 commits June 9, 2023 12:20

Adding fast IPv4 parser.

555cd41

Adding empty line.

e13f244

lemire requested a review from k0ekk0ek June 9, 2023 16:23

lemire mentioned this pull request Jun 9, 2023

Vectorize IPv4 deserialization #23

Closed

k0ekk0ek suggested changes Jun 12, 2023

View reviewed changes

k0ekk0ek reviewed Jun 12, 2023

View reviewed changes

This was referenced Jun 14, 2023

Faster parsing of domain names #66

Closed

Drop secondary index #64

Merged

I had omitted testing the output length

5be9c78

Adding a bunch of casts.

923c0de

Daniel Lemire added 2 commits June 14, 2023 16:18

Some fixes and a version without length parameter.

1518b3d

An optimization

41a45b7

lemire requested review from aqrit and k0ekk0ek June 14, 2023 20:34

aqrit reviewed Jun 14, 2023

View reviewed changes

Exposing the length and adding an additional check.

69d06cf

k0ekk0ek approved these changes Jun 14, 2023

View reviewed changes

Fix compiler warning

7f19be5

k0ekk0ek merged commit 4736daf into NLnetLabs:main Jun 15, 2023
