-
-
Notifications
You must be signed in to change notification settings - Fork 22
Adding fast IPv4 parser. #65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
A quick test seems to show it not rejecting a number of invalid inputs (e.g. "111.1111.1.111" ). The problem is: The hash entry expects a certain dotmask but never actually verifies that it has it. Possible solution: |
k0ekk0ek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice @lemire. Thank you! I prefer we get the build errors out of the way though. Tests will not be required as I haven't added proper ones for my code, but I really prefer check marks are green before merging. Also, we should have a look at @aqrit's comment (thank you, @aqrit). I'm definitely willing to write a couple of tests for that afterwards to avoid regressions, but probably best to tackle the issue in the algorithm before merging? As for myself, I prefer we drop the use of token->length. I'll accept the code without that change and revisit it as part of my own PR, but this is as good a time as any to at least discuss it?
src/westmere/ip4.h
Outdated
| SEMANTIC_ERROR(parser, "Invalid %s in %s", | ||
| field->name.data, type->name.data); | ||
| // Note that this assumes that reading up to token->data + 16 is safe (i.e., we do not cross a page). | ||
| if (sse_inet_aton(token->data, token->length, &parser->rdata->octets[parser->rdata->length]) != 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
token->length will not be available after #64 is merged. Preferably, we account for that at this stage (we can worry about it later too, but your knowledge of the algorithm is still fresh now?). In an attempt to vectorize timestamp deserialization (#22, still WiP), I opted to use the result of the subtract operation to detect where the IP address ends. The maximum string size is 16 bytes (INET_ADDRSTRLEN), we can use that to get the dot mask instead of ipv4_string_length(?), then later on use the result of the subtract operation combined with the found dots to determine what should be kept? Then we can verify with the length wether or not the character that follows the address is a valid delimiter. At least, that's what I think is a better approach than keeping the length with the token... (as always, this is open for suggestions/discussion)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking a closer look at the code (there may be things I did not account for, and I'm still playing with it, but I wanted to share):
const __m128i ascii0_9 = _mm_setr_epi8(
'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 0, 0, 0, 0, 0, 0);
const __m128i digits = _mm_cmpeq_epi8(input, _mm_shuffle_epi8(ascii0_9, input));
// locate dots
uint16_t dotmask;
{
const __m128i dot = _mm_set1_epi8('.');
dotmask = (uint16_t)_mm_movemask_epi8(_mm_cmpeq_epi8(input, dot));
const uint16_t digit_mask = (uint16_t)_mm_movemask_epi8(digits);
uint16_t mask = ~(digit_mask | dotmask);
mask &= -mask;
dotmask &= mask - 1;
dotmask |= mask;
// FIXME: testing for weird input can now be done using masks but perhaps we can
// catch it when determining hashcode too?
}
We can then do the subtract before the shuffle:
const __m128i t0 = _mm_and_si128(_mm_sub_epi8(input, ascii0), digits);
But that does allow us to drop the blend and the range check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that does allow us to drop the blend and the range check?
It looks like it does, since it slices the input at the first bad char.
Here is a untested version with different bit twiddling:
uint16_t m = digit_mask | dotmask;
m ^= (m + 1); // mask of lowest clear bit and below
dotmask = ~digitmask & m;
// string_length = popcnt or lzcnt
| // check that leading digits of 2- 3- numbers are not zeros. | ||
| { | ||
| const __m128i eq0 = _mm_cmpeq_epi8(t1, ascii0); | ||
| if (!_mm_testz_si128(eq0, _mm_set_epi8(-1, 0, -1, 0, -1, 0, -1, 0, | ||
| 0, 0, 0, 0, 0, 0, 0, 0))) { | ||
| return 0; | ||
| } | ||
| } | ||
| // replace null values with '0' | ||
| __m128i t1b = _mm_blendv_epi8(t1, ascii0, pattern); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if this can be simplified. Next to the pattern, we could use an AND-mask and save that with the shuffle pattern? So, _mm_sub_epi8 then and with the mask (I'm assuming we need it to zero out bytes that weren't actually set in the textual presentation) and then simply do a _mm_cmpgt_epi8(0x09) (instead of _mm_max_epu8 + _mm_cmpeq_epi8 + _mm_test_all_ones). This may be a completely bogus idea (I should really study the algorithm better, I'll do so today or tomorrow, but it's something that popped into my head...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "check that everything was in the range '0' to '9'" check could be combined with with the overflow check.
_mm_maddubs_epi16 sign extends, so the trick is to set bad chars to 0x80 or some other value (or range) that would work. Then use _mm_adds_epu16() to combine the high and low 64-bit lanes.
edit: but that might put this check on the critical path
Note: the intrinsic guide shows _mm_test_all_ones() as a sequence unlike all the other PTEST mnemonics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@k0ekk0ek An alternative, if you don't like _mm_max_epu8 is to do...
const __m128i t2z = _mm_add_epi8(t2,_mm_set1_epi8(-128) );
const __m128i c9 = _mm_set1_epi8('9' - '0' - 128);
const __m128i t2me = _mm_cmpgt_epi8(t2z, c9);
if (!_mm_test_all_zeros(t2me, t2me)) {
return 0;
}At least with GCC, this results in exactly the same instruction count and no benefit performance-wise.
then simply do a _mm_cmpgt_epi8(0x09) (instead of _mm_max_epu8 + _mm_cmpeq_epi8 + _mm_test_all_ones)
Are you taking into account that _mm_cmpgt_epi8 is a signed comparison? This means that 0x80 is smaller than 0x09 (among other things).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent some time on this code and there is not a whole lot to save. If you omit the constants and the if clause, we are talking about ~2 instructions. If we could combine it with some other check, like @aqrit, that might save us quite a bit, e.g., by taking out one branch. However, it is not obvious how to do it safely.
However, I am going to have to write a routine that checks that we only have dots and digits, and if I am going to do that anyhow, I will be able to save on this routine here (we just need to check that there is no '.' since the only possibilities are digits and dots).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not too worried about _mm_max_epu8 per se. It's more that we'll need the length (or if there's another way, that's fine too) to verify there's a delimiter right after the address (after #64 is merged). I figured picking up a __m128i (mask) to do an _mm_and_si128 to zero out non-digits (as opposed to _mm_blendv_epi8), move the check forward and base it on a bitmask instead, we may even be a little faster(?) The bitmask could also be used to count trailing zeroes (loading that from a table is another possibility). My thinking was: if there's any bit set before the delimiter, that is not set in the dotmask or digit_mask, we'd have an invalid input. Maybe even combine multiple checks (either by doing them sooner or by delaying and doing them later) that would save even more (as @aqrit suggested). And you're right, I did not take into account _mm_cmpgt_epi8 is signed 😅. Only after toying with the code yesterday, did I find out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just pushed a commit that provides two functions, one with a length parameter and one without (sse_inet_aton_16). The new code does not do the _mm_max_epu8 thing nor does it do blending because we know that we only have digits or dots to check (or else, the length will not match). So you save some work there, but you pay for the length computation. It a net loss in throughput unfortunately because of the long chain of dependencies...
|
@aqrit Thanks aqrit. |
|
@aqrit My last commit should address your concern. I had indeed omitted to check the length. This was always meant to be in the code, I just somehow dropped it accidentally. I have integrated your test in my blog post's code (with credit to you).
That's what @WojciechMula did, but that's relatively expensive. You don't need to do that and that's part of why my routine is faster. With the shuffle mask, in there, we have the expected length of the pattern. You just have to check that it matches what is expected. The reason why that is so is not immediately obvious, but it ties in into another difference between what @WojciechMula did and what I do: I validate the digits after shuffling. This makes sure that if I have a '.' at an unexpected location, I catch it. So basically, I know that the input as the expected length, and that it has digits at all the expected locations. What remains is that it could have missing dots (locations where I expect a dot, but there is something there). But you can verify that it cannot happen (you can just brute force the check, it is very fast). |
|
Ok. If we are not provided the length, then we can do the work using the same number of instructions. Sadly, we suffer due to a longer critical path under GCC... Interestingly, if I switch to LLVM/clang, then the gap remains, but everything is just faster... This is LLVM 16. This would be worth investigating. |
|
I managed to save some instructions, but it just does not help the speed a lot... Still... we are 5x faster than |
|
To provide some background on removal of length: #30 describes how the scanner (stage1) is simpler if we don't need to provide the length. This has to do with the fact that a field can be delimited by a space, which we don't have to keep, or a character that has significance, say a newline, quote, etc (ends one token, starts another). The latter requires us to retain that index, so extra ops for every token even when it's not used (TTL, etc). To further complicate things, the zone format allows for newlines in quoted text (never happens, but still), so in order to keep parsing functions clean, I opted to handle newline count in the scanner. To avoid storing+reading the embedded newline count if there's no need, I needed a way to signal that there's embedded newlines. I did so by storing a pointer to a specialized string instead. However, that trick is only possible if we don't have to calculate the length. My reasoning is that for every data-type but strings and names, we know what to scan for. If it's an integer it's digits, etc. Everything that follows must then be a delimiter or the value is invalid. It's possible that there's a way that allows us to keep the length, but after much trail-and-error I wasn't able to find it (multiple tapes, more complicated indexes, etc). The zone format doesn't allow for much leeway. So, we pay a little for removing the length here, but we regain some in the scanner. I think it's the right tradeoff, but I could be wrong... |
| const __m128i t6 = _mm_packus_epi16(t5, t5); | ||
| uint32_t address = (uint32_t)_mm_cvtsi128_si32(t6); | ||
| memcpy(destination, &address, 4); | ||
| return (int)(ipv4_string_length - (size_t)pat[6]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs coercion to bool
return (int)((ipv4_string_length - (size_t)pat[6]) == 1);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function returns 1 on success. It returns an integer value different from 1 on failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check how it is used:
if (sse_inet_aton_16(token->data, &parser->rdata->octets[parser->rdata->length]) != 1)
(This is basically how @k0ekk0ek wrote it.)
Yes, yes, yes. I wasn't arguing or disputing the choice.
I don't think it is necessarily going to be slower. We have the same instruction count. In my synthetic benchmark, it is slower because there is a longer dependency chain... But once this code gets inlined with other code, it is possible that it will all get sorted out. The way I would address this is by benchmarking it and by looking at the instruction-per-cycle count. I was not discouraging the architectural change to be clear. |
|
@k0ekk0ek In my last commit, I export the size parameter and I compare it against the token length. I am not sure why this is helpful, but it is my interpretation of what you asked. |
|
I didn't read it as such. (anything in simdzone is up for debate btw, up to now it's been mostly me and for many things better alternatives may exist) One more scenario that supports why not providing the length is (probably?) the right choice (or, not the worst 😅) is described here. So, for the ipv4hint service parameter (SVCB RRTYPE), we can check if Great stuff @lemire! I'll happily settle for a 5x speedup 😄 I think it's best to merge this and for me to touch up #64 afterwards. |
Your interpretation is correct. This could be an oversight on my part, I thought the length check was removed, but in the original is still there (my apologies). Let's keep it as it is now, merge it. I'll then make #64 work with the new code tomorrow. |
|
Here's a spin on this: It is branch-free, except for the hash bound check. |
|
Thanks @aqrit! Really nice of you to bring this to our attention. I'll study it after the weekend. If it works as advertised, are we allowed to use it under the BSD-3-Clause license? Of course, we'd need to benchmark too. @lemire, don't feel obligated, but I expect you're interested? @aqrit, if you like this kind of puzzle, and if you want (don't feel obligated by any means), any input on deserialization of data-types we're trying to vectorize would be appreciated. (there's certainly more than enough to go around 😅) |
|
I will review. |
|
Scanned the code for a bit. I get the idea for |
This adds the IPv4 parser from https://lemire.me/blog/2023/06/08/parsing-ip-addresses-crazily-fast/ to the project.
This was neither benchmarked nor tested in simdzone, though I have my own tests, some of them contributed by @aqrit
Fixes #23