-
-
Notifications
You must be signed in to change notification settings - Fork 21
Faster parsing of domain names #66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Rules regarding domain names and behavior of other implementations is now documented in FORMAT.md (#71), so that we at least have an idea of the constraints are. |
Checking for delimiters can be as simple as outlined on #6 (specific comment). The real speed-up will be in replacing the dots by the label lengths. Currently that is a scalar loop using bit manipulation instructions. I've been contemplating if there's a convenient way to write them out in bulk. So, first do vectorized classification for delimiters, backslashes (escape sequences) and dots. Then, something like broadcasting the current length to a register, add 0-15 using simd, get the relevant numbers next to each other and use something like hsub (vpcompress would sure help, but maybe there's another way? Perhaps PEXT? 64-bits is enough to hold 0-15 values for a 16-byte vector...). We'd need to save the last position in case the label spans blocks (very likely). @lemire, @aqrit, I know #28 is still in progress, but once that's done (great work btw, very much appreciated!), this is probably a very interesting issue to dive into? This format is so common and yet no results when searching for optimized parsing... It goes without saying, but only if you're interested and can spare the time. |
I will have a look. |
I have thoughts on emulating vpcompressb. However, I doubt that anything would beat a simdjson style bitmask_to_offset scalar loop here...? The general problem with PEXT is that it isn't available pre-avx2 and is very slow (~250 cycles) on AMD before Zen3. |
If you require AVX-512 then you can safely use pext. |
Possibly, but if we require AVX-512 then just using compress would be easier? @aqrit, indeed, it's likely the algorithm cannot be made faster, but its worth taking a second look? Ideally we can speed it up using SSE/AVX, but even if there's only a speedup for AVX-512, that'd be very interesting. It may be possible to use a trick like the one @lemire used for IPv4 conversion (generate hash from the bitmask), but my initial thought is there are too much possibilities and the table would get too big? Benchmarking is probably a good way to start. This was my first stab, so having a feel for the overall speed is probably the way to get started. At least, that's what I intend to do once I get some other things done. |
@k0ekk0ek It would be helpful to define precisely exactly what the problem. I thought I understood, but now I am very confused. Can you describe a benchmark? That is, describe (at least with examples) what the input is, and what the desired output is.
Just so we are clear... this is the function we are talking about, right? zone_nonnull_all
static zone_really_inline int32_t parse_name(
zone_parser_t *parser,
const zone_type_info_t *type,
const zone_field_info_t *field,
const token_t *token)
{
int32_t r;
size_t n = 0;
uint8_t *o = &parser->rdata->octets[parser->rdata->length];
if (zone_likely(token->code == CONTIGUOUS)) {
// a freestanding "@" denotes the current origin
if (token->data[0] == '@' && !is_contiguous((uint8_t)token->data[1]))
goto relative;
r = scan_contiguous_name(parser, type, field, token, o, &n);
if (r == 0)
return (void)(parser->rdata->length += n), ZONE_NAME;
if (r < 0)
return r;
} else if (token->code == QUOTED) {
r = scan_quoted_name(parser, type, field, token, o, &n);
if (r == 0)
return (void)(parser->rdata->length += n), ZONE_NAME;
if (r < 0)
return r;
} else {
return have_string(parser, type, field, token);
}
relative:
if (n > 255 - parser->file->origin.length)
SYNTAX_ERROR(parser, "Invalid %s in %s", NAME(field), TNAME(type));
memcpy(o+n, parser->file->origin.octets, parser->file->origin.length);
parser->rdata->length += n + parser->file->origin.length;
return ZONE_NAME;
} Presumably, the hot path is |
@lemire, indeed, that's the function. Domain names are usually non-quoted (I haven't seen any instance where they're quoted, but it's allowed according to the specification). The input would be domain names (99.99% of the time it will just be valid host names, but we need to handle all inputs). Domain names may contain any octet between 0-255, while host names are limited to The output would be wire format. e.g. The goal is to convert the dots to lengths of the labels (e.g. My thinking is to write out the lengths out as much as possible using the same I should note that this is just an idea, I have not tested this yet. I thought I'd share to see if others people have ideas.... |
@k0ekk0ek Very helpful. |
For SSE4.1: A table to cover all combinations 16 bytes would require 1 MiB.
edit2: A table that covers all combinations of 8 bytes would require ~440 bytes. |
With domain names having dots next to each other is illegal. Each fully qualified domain name has exactly one null-label (the |
For SSE4.1, Previously, I was thinking we'd want the lengths pre-computed and pre-positioned. Edit: An pseudocode:
|
I have a prototype. I have not looked at @aqrit code yet. My results show some promise...
My approach is not something I would propose, but it works... with some limitations. |
Here is my code.... static inline __m128i left_shift_bytes(__m128i x, int count) {
// We would like to shift by count bytes, but it cannot be done directly
// without immediates
__m128i p1 = _mm_sll_epi64(x, _mm_cvtsi64_si128(count * 8));
__m128i p2 = _mm_srl_epi64(_mm_unpacklo_epi64(_mm_setzero_si128(), x),
_mm_cvtsi64_si128(64 - count * 8));
return _mm_or_si128(p1, p2);
}
// This version processes at most 15 bytes from the input. A fallback would
// be necessary to use such code in production. TODO.
size_t name_to_dnswire_simd(const char *src, uint8_t *dst) {
const char *srcinit = src;
// Each label may contain from 1 to 63 octets. The empty label is
// reserved for the root node and when fully qualified is expressed
// as the empty label terminated by a dot.
// The full domain name may not exceed a total length of 253 ASCII characters
// in its textual representation.
//
// It is likely that many name fit under 16 bytes, however.
// We do vectorized classification to validate the content.
// We want to allow 0x30 to 0x39 (digits)
// The hyphen 0x2d.
// The dot 0x2e.
// The lower-cased letters 0x61-0x6f (a-o), 0x70-0x7a (p-z)
// The upper-cased letters 0x41-0x4f (A-O), 0x50-0x5a (P-Z)
const char DIGIT = 0x01;
const char HYPHENDOT = 0x02;
const char LETTERAO = 0x04;
const char LETTERPZ = 0x08;
static int8_t zero_masks[32] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
__m128i highnibbles =
_mm_setr_epi8(0, 0, HYPHENDOT, DIGIT, LETTERAO, LETTERPZ, LETTERAO,
LETTERPZ, 0, 0, 0, 0, 0, 0, 0, 0);
__m128i lownibbles =
_mm_setr_epi8(LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
LETTERAO | LETTERPZ, LETTERAO, LETTERAO,
LETTERAO | HYPHENDOT, LETTERAO | HYPHENDOT, LETTERAO);
// we always insert a fake '.' initially.
__m128i input = _mm_loadu_si128((const __m128i *)src);
input = _mm_alignr_epi8(input, _mm_set1_epi8('.'), 15);
src -= 1; // we pretend that we are pointing at the virtual '.'.
// We could possibly 'upper case everything' if we wanted to.
// __m128i inputlc = _mm_or_si128(input, _mm_set1_epi8(0x20));
__m128i low = _mm_shuffle_epi8(
lownibbles,
input); // no need for _mm_and_si128(input,_mm_set1_epi8(0xf)) because
// if high bit is set, there is no match
__m128i high = _mm_shuffle_epi8(
highnibbles, _mm_and_si128(_mm_srli_epi64(input, 4), _mm_set1_epi8(0xf)));
__m128i classified =
_mm_cmpeq_epi8(_mm_and_si128(low, high), _mm_setzero_si128());
// m cannot be zero!!!
unsigned m = (unsigned)_mm_movemask_epi8(
classified); // should be 1 wherever we have a match.
uint16_t length = (uint16_t)__builtin_ctz((unsigned int)m);
src += length;
__m128i zero_mask = _mm_loadu_si128((__m128i *)(zero_masks + 16 - length));
// masking with '.'
input = _mm_blendv_epi8(input, _mm_set1_epi8('.'), zero_mask);
//
__m128i dots = _mm_cmpeq_epi8(input, _mm_set1_epi8('.'));
unsigned int mask = (unsigned)_mm_movemask_epi8(dots);
__m128i sequential =
_mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
const __m128i dotscounts = _mm_and_si128(dots, sequential);
// We proceed to compute a shuffle mask that brings all counters/dots
// together. We could do it with a single large table (2**16 * 128 bytes)
// or 8 MB. Instead, we work from a 2 kB table and do more computation.
mask = mask ^ 0xFFFF; // negate
unsigned mask1 = (unsigned)(mask & 0xFF);
int pop = 8 - __builtin_popcount(mask1);
unsigned mask2 = (mask >> 8);
__m128i m1 = _mm_loadl_epi64((const __m128i *)(thintable_epi8 + mask1));
__m128i m2 = _mm_loadl_epi64((const __m128i *)(thintable_epi8 + mask2));
__m128i m2add = _mm_add_epi8(m2, _mm_set1_epi8(8));
__m128i m2shifted = left_shift_bytes(m2add, pop);
__m128i shufmask = _mm_or_si128(m2shifted, m1);
// The shuffle mask has been computed.
// We also need the *reverse* mask which we compute with a prefix sum !
__m128i dotones = _mm_and_si128(dots, _mm_set1_epi8(1));
dotones = _mm_add_epi8(dotones,
_mm_alignr_epi8(dotones, _mm_setzero_si128(), 16 - 1));
dotones = _mm_add_epi8(dotones,
_mm_alignr_epi8(dotones, _mm_setzero_si128(), 16 - 2));
dotones = _mm_add_epi8(dotones,
_mm_alignr_epi8(dotones, _mm_setzero_si128(), 16 - 4));
dotones = _mm_add_epi8(dotones,
_mm_alignr_epi8(dotones, _mm_setzero_si128(), 16 - 8));
dotones = _mm_sub_epi8(dotones, _mm_set1_epi8(1));
// Ok, dotones contains the reverse shuffle mask
// Pheeewww... This was a lot of work.
const __m128i packed_dotscounts = _mm_shuffle_epi8(dotscounts, shufmask);
// Need to subtract the counters.
// If there is an overflow, then we had two successive dots, we should error:
// TODO.
__m128i diffed_packed_dotscounts =
_mm_sub_epi8(_mm_alignr_epi8(_mm_setzero_si128(), packed_dotscounts, 1),
packed_dotscounts);
// need to subtract one to the counters.
diffed_packed_dotscounts =
_mm_sub_epi8(diffed_packed_dotscounts, _mm_set1_epi8(1));
// send it back...
__m128i magic = _mm_shuffle_epi8(diffed_packed_dotscounts, dotones);
// shift it
__m128i marked_input = _mm_blendv_epi8(input, magic, dots);
_mm_storeu_si128((__m128i *)dst, marked_input);
// dst += 16;
return (size_t)(src - srcinit);
} It only processes the first 15 bytes of the input. |
(Please note that this code is not something I seriously consider, I am only stating that it works and it is faster than a scalar implementation.) |
My current thoughts are that a compress/expand routine is probably not great for SSE. This would work well with AVX-512, but not so well here. I am going to try another design. |
Ok. I have an approach without compress/expand. No table needed. It is 2.7 times faster than the scalar approach. It is also relatively simple.
The code is as follows... No doubt it can be made better... size_t name_to_dnswire_simd_fast(const char *src, uint8_t *dst) {
const char *srcinit = src;
// Each label may contain from 1 to 63 octets. The empty label is
// reserved for the root node and when fully qualified is expressed
// as the empty label terminated by a dot.
// The full domain name may not exceed a total length of 253 ASCII characters
// in its textual representation.
//
// It is likely that many name fit under 16 bytes, however.
// We do vectorized classification to validate the content.
// We want to allow 0x30 to 0x39 (digits)
// The hyphen 0x2d.
// The dot 0x2e.
// The lower-cased letters 0x61-0x6f (a-o), 0x70-0x7a (p-z)
// The upper-cased letters 0x41-0x4f (A-O), 0x50-0x5a (P-Z)
const char DIGIT = 0x01;
const char HYPHENDOT = 0x02;
const char LETTERAO = 0x04;
const char LETTERPZ = 0x08;
static int8_t zero_masks[32] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
__m128i highnibbles =
_mm_setr_epi8(0, 0, HYPHENDOT, DIGIT, LETTERAO, LETTERPZ, LETTERAO,
LETTERPZ, 0, 0, 0, 0, 0, 0, 0, 0);
__m128i lownibbles =
_mm_setr_epi8(LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
LETTERAO | LETTERPZ, LETTERAO, LETTERAO,
LETTERAO | HYPHENDOT, LETTERAO | HYPHENDOT, LETTERAO);
// we always insert a fake '.' initially.
__m128i input = _mm_loadu_si128((const __m128i *)src);
input = _mm_alignr_epi8(input, _mm_set1_epi8('.'), 15);
src -= 1; // we pretend that we are pointing at the virtual '.'.
// We could possibly 'upper case everything' if we wanted to.
// __m128i inputlc = _mm_or_si128(input, _mm_set1_epi8(0x20));
__m128i low = _mm_shuffle_epi8(
lownibbles,
input); // no need for _mm_and_si128(input,_mm_set1_epi8(0xf)) because
// if high bit is set, there is no match
__m128i high = _mm_shuffle_epi8(
highnibbles, _mm_and_si128(_mm_srli_epi64(input, 4), _mm_set1_epi8(0xf)));
__m128i classified =
_mm_cmpeq_epi8(_mm_and_si128(low, high), _mm_setzero_si128());
// m cannot be zero!!!
unsigned m = (unsigned)_mm_movemask_epi8(
classified); // should be 1 wherever we have a match.
uint16_t length = (uint16_t)__builtin_ctz((unsigned int)m);
src += length;
__m128i zero_mask = _mm_loadu_si128((__m128i *)(zero_masks + 16 - length));
// masking with '.'
input = _mm_blendv_epi8(input, _mm_set1_epi8('.'), zero_mask);
//
__m128i dots = _mm_cmpeq_epi8(input, _mm_set1_epi8('.'));
__m128i sequential =
_mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
__m128i dotscounts = _mm_and_si128(dots, sequential);
__m128i marked = dots;
dotscounts = _mm_blendv_epi8(
_mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 1), dotscounts, marked);
marked =
_mm_or_si128(marked, _mm_alignr_epi8(_mm_setzero_si128(), marked, 1));
dotscounts = _mm_blendv_epi8(
_mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 2), dotscounts, marked);
marked =
_mm_or_si128(marked, _mm_alignr_epi8(_mm_setzero_si128(), marked, 2));
dotscounts = _mm_blendv_epi8(
_mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 4), dotscounts, marked);
marked =
_mm_or_si128(marked, _mm_alignr_epi8(_mm_setzero_si128(), marked, 4));
dotscounts = _mm_blendv_epi8(
_mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 8), dotscounts, marked);
__m128i next = _mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 1);
dotscounts = _mm_sub_epi8(next, dotscounts);
// need to subtract one to the counters.
dotscounts = _mm_sub_epi8(dotscounts, _mm_set1_epi8(1));
// shift it
__m128i marked_input = _mm_blendv_epi8(input, dotscounts, dots);
_mm_storeu_si128((__m128i *)dst, marked_input);
// dst += 16;
return (size_t)(src - srcinit);
} |
Ok. I am now more than 3x faster...
Code: size_t name_to_dnswire_simd_fast(const char *src, uint8_t *dst) {
const char *srcinit = src;
// Each label may contain from 1 to 63 octets. The empty label is
// reserved for the root node and when fully qualified is expressed
// as the empty label terminated by a dot.
// The full domain name may not exceed a total length of 253 ASCII characters
// in its textual representation.
//
// It is likely that many name fit under 16 bytes, however.
// We do vectorized classification to validate the content.
// We want to allow 0x30 to 0x39 (digits)
// The hyphen 0x2d.
// The dot 0x2e.
// The lower-cased letters 0x61-0x6f (a-o), 0x70-0x7a (p-z)
// The upper-cased letters 0x41-0x4f (A-O), 0x50-0x5a (P-Z)
const char DIGIT = 0x01;
const char HYPHENDOT = 0x02;
const char LETTERAO = 0x04;
const char LETTERPZ = 0x08;
static int8_t zero_masks[32] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
__m128i highnibbles =
_mm_setr_epi8(0, 0, HYPHENDOT, DIGIT, LETTERAO, LETTERPZ, LETTERAO,
LETTERPZ, 0, 0, 0, 0, 0, 0, 0, 0);
__m128i lownibbles =
_mm_setr_epi8(LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
LETTERAO | LETTERPZ | DIGIT, LETTERAO | LETTERPZ | DIGIT,
LETTERAO | LETTERPZ, LETTERAO, LETTERAO,
LETTERAO | HYPHENDOT, LETTERAO | HYPHENDOT, LETTERAO);
// we always insert a fake '.' initially.
__m128i input = _mm_loadu_si128((const __m128i *)src);
input = _mm_alignr_epi8(input, _mm_set1_epi8('.'), 15);
src -= 1; // we pretend that we are pointing at the virtual '.'.
// We could possibly 'upper case everything' if we wanted to.
// __m128i inputlc = _mm_or_si128(input, _mm_set1_epi8(0x20));
__m128i low = _mm_shuffle_epi8(
lownibbles,
input); // no need for _mm_and_si128(input,_mm_set1_epi8(0xf)) because
// if high bit is set, there is no match
__m128i high = _mm_shuffle_epi8(
highnibbles, _mm_and_si128(_mm_srli_epi64(input, 4), _mm_set1_epi8(0xf)));
__m128i classified =
_mm_cmpeq_epi8(_mm_and_si128(low, high), _mm_setzero_si128());
// m cannot be zero!!!
unsigned m = (unsigned)_mm_movemask_epi8(
classified); // should be 1 wherever we have a match.
uint16_t length = (uint16_t)__builtin_ctz((unsigned int)m);
src += length;
__m128i zero_mask = _mm_loadu_si128((__m128i *)(zero_masks + 16 - length));
// masking with '.'
input = _mm_blendv_epi8(input, _mm_set1_epi8('.'), zero_mask);
//
__m128i dots = _mm_cmpeq_epi8(input, _mm_set1_epi8('.'));
__m128i sequential =
_mm_setr_epi8(-128, -127, -126, -125, -124, -123, -122, -121, -120, -119,
-118, -117, -116, -115, -114, -113);
__m128i dotscounts = _mm_and_si128(dots, sequential);
dotscounts =
_mm_blendv_epi8(_mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 1),
dotscounts, dotscounts);
dotscounts =
_mm_blendv_epi8(_mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 2),
dotscounts, dotscounts);
dotscounts =
_mm_blendv_epi8(_mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 4),
dotscounts, dotscounts);
dotscounts =
_mm_blendv_epi8(_mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 8),
dotscounts, dotscounts);
__m128i next = _mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 1);
dotscounts = _mm_sub_epi8(next, dotscounts);
// need to subtract one to the counters.
dotscounts = _mm_sub_epi8(dotscounts, _mm_set1_epi8(1));
// shift it
__m128i marked_input = _mm_blendv_epi8(input, dotscounts, dots);
_mm_storeu_si128((__m128i *)dst, marked_input);
// dst += 16;
return (size_t)(src - srcinit);
} Possible assembly (GCC): name_to_dnswire_simd_fast(char const*, unsigned char*):
movdqu xmm1, XMMWORD PTR [rdi]
mov edx, 16
palignr xmm1, XMMWORD PTR .LC0[rip], 15
movdqa xmm0, XMMWORD PTR .LC1[rip]
movdqa xmm2, XMMWORD PTR .LC3[rip]
movdqa xmm3, xmm1
movdqa xmm5, xmm1
pshufb xmm0, xmm1
psrlq xmm3, 4
pxor xmm1, xmm1
pand xmm3, XMMWORD PTR .LC2[rip]
pshufb xmm2, xmm3
pand xmm0, xmm2
pxor xmm2, xmm2
pcmpeqb xmm0, xmm2
movdqa xmm3, XMMWORD PTR .LC4[rip]
movdqa xmm2, xmm1
movdqa xmm4, xmm1
pmovmskb eax, xmm0
bsf eax, eax
sub rdx, rax
sub rax, 1
movdqu xmm0, XMMWORD PTR name_to_dnswire_simd_fast(char const*, unsigned char*)::zero_masks[rdx]
pblendvb xmm5, xmm3, xmm0
movdqa xmm0, XMMWORD PTR .LC5[rip]
pcmpeqb xmm3, xmm5
pand xmm0, xmm3
palignr xmm2, xmm0, 1
pblendvb xmm2, xmm0, xmm0
movdqa xmm0, xmm2
palignr xmm4, xmm2, 2
pblendvb xmm4, xmm2, xmm0
movdqa xmm2, xmm1
movdqa xmm0, xmm4
palignr xmm2, xmm4, 4
pblendvb xmm2, xmm4, xmm0
movdqa xmm4, xmm1
palignr xmm4, xmm2, 8
movdqa xmm0, xmm2
movdqa xmm6, xmm4
pblendvb xmm6, xmm2, xmm0
pcmpeqd xmm2, xmm2
movdqa xmm0, xmm3
palignr xmm1, xmm6, 1
paddb xmm1, xmm2
psubb xmm1, xmm6
pblendvb xmm5, xmm1, xmm0
movups XMMWORD PTR [rsi], xmm5
ret Possible assembly (LLVM): name_to_dnswire_simd_fast(char const*, unsigned char*): # @name_to_dnswire_simd_fast(char const*, unsigned char*)
movdqu xmm4, xmmword ptr [rdi]
palignr xmm4, xmmword ptr [rip + .LCPI0_0], 15 # xmm4 = mem[15],xmm4[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14]
movdqa xmm0, xmmword ptr [rip + .LCPI0_1] # xmm0 = [9,13,13,13,13,13,13,13,13,13,12,4,4,6,6,4]
pshufb xmm0, xmm4
movdqa xmm1, xmm4
psrlq xmm1, 4
pand xmm1, xmmword ptr [rip + .LCPI0_2]
movdqa xmm2, xmmword ptr [rip + .LCPI0_3] # xmm2 = [0,0,2,1,4,8,4,8,0,0,0,0,0,0,0,0]
pshufb xmm2, xmm1
pand xmm2, xmm0
pxor xmm0, xmm0
pcmpeqb xmm0, xmm2
pmovmskb eax, xmm0
rep bsf ecx, eax
lea rax, [rip + name_to_dnswire_simd_fast(char const*, unsigned char*)::zero_masks]
sub rax, rcx
movups xmm0, xmmword ptr [rax + 16]
movdqa xmm1, xmmword ptr [rip + .LCPI0_4] # xmm1 = [46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46]
pblendvb xmm4, xmm1, xmm0
pcmpeqb xmm1, xmm4
movdqa xmm3, xmmword ptr [rip + .LCPI0_5] # xmm3 = [128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143]
pand xmm3, xmm1
movdqa xmm2, xmm3
psrldq xmm2, 1 # xmm2 = xmm2[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],zero
movdqa xmm0, xmm1
pblendvb xmm2, xmm3, xmm0
movdqa xmm3, xmm2
psrldq xmm3, 2 # xmm3 = xmm3[2,3,4,5,6,7,8,9,10,11,12,13,14,15],zero,zero
movdqa xmm0, xmm2
pblendvb xmm3, xmm2, xmm0
movdqa xmm2, xmm3
psrldq xmm2, 4 # xmm2 = xmm2[4,5,6,7,8,9,10,11,12,13,14,15],zero,zero,zero,zero
movdqa xmm0, xmm3
pblendvb xmm2, xmm3, xmm0
movdqa xmm3, xmm2
psrldq xmm3, 8 # xmm3 = xmm3[8,9,10,11,12,13,14,15],zero,zero,zero,zero,zero,zero,zero,zero
movdqa xmm0, xmm2
pblendvb xmm3, xmm2, xmm0
pcmpeqd xmm2, xmm2
pxor xmm2, xmm3
psrldq xmm3, 1 # xmm3 = xmm3[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],zero
paddb xmm2, xmm3
movdqa xmm0, xmm1
pblendvb xmm4, xmm2, xmm0
movdqu xmmword ptr [rsi], xmm4
mov rax, rdi
not rax
add rax, rdi
add rax, rcx
ret |
My benchmark is limited to small inputs (like google.com) so being 3x faster is actually good. I think that this routine could be made general, and ported to AVX. Comments invited. |
I like it. Maybe use |
Min would work, yes. It appears to have better performance on paper. |
Ok. Min is faster...
I am not going to repost the code, it is basically the same thing but some blendv are replaced by min instructions. It does not change the instruction count, but we increase the number of instructions per cycle. The data source for the benchmark are these guys...
So if we can be highly efficient over such short strings, I am certain that we can be fast over more general strings (the longer the string, the faster you can go generally). So now it is a matter of making the code more robust and including AVX2. Let me restate that the current code assumes that the input fits in 15 characters... handling the more general case is not super hard... but not entirely trivial either. |
Nice! I'm going to study the code and come back with technical comments, but at this point it's probably also worth sharing some of the intricacies of zone data(?)
(background information, for completeness sake, feel free to skip) An RR (resource record) is expressed as Zone files consist of a sequence of RRs. The presentation format (what we parse), is used to express them in plain text. The format is most frequently used for defining zones, but is used in other scenarios too. A zone contains authoritative data for a subtree of the DNS. What that data is like depends on the zone. e.g. For brevity sake, users may put
Or:
Or (owner is copied if line starts with space):
The latest |
I will be extending the code to go over 15 bytes, obviously. My recommendation is to provide a fast path that handles the common and simple cases and to use a fallback for the general case. |
Sounds good @lemire! I saw the code is available in your repo, I'll study it asap to see if I can help out. |
No need. I need to fix it up first. I just need time. |
It works now over general inputs...
|
Finally found time to take a decent look at this. Makes sense, starting from the back. Simplifies the state problem a lot. I love it, great stuff @lemire! I only did minimum testing, but the returned length is off-by-one and null-labels are accepted. e.g. if I input It's probably worth keeping the current implementation in there for reference? At least, it'd be good to see how it stacks up with different iterations. Lastly, and this is really a note to self, but it may be worth reintroducing the length with the token. For basically every input we're now first determining the length and only then can we proceed to do the actual operations we want. Before, name parsing was modeled after string parsing and we did a best effort thing. So, copy, then see if we need to add less than , not having a length kinda made sense. But now that @lemire has (very likely) cracked faster parsing of names, I should really take another look. I'm sure storing both indexes in the same tape is not going to work, but it may work to keep two tapes (simdjson jargon). So, one for the start, one for the end. We can do all logic based on the start index and make sure there's always a terminating index (we guarantee that now too). |
Would you produce a pull request? |
The newer algorithm is faster, even with real-world data, see PR#81. |
That detects
It adds 1 instruction to the critical path. |
Blog post: Coding of domain names to wire format at gigabytes per second. Code: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/08/09 (new location) The code does include a draft of @k0ekk0ek 's implementation, but it is not discussed in the blog post. |
Thanks @aqrit! The spec merely states all delimiters must be quoted, as a result we treat all non-delimiter bytes as valid input. In other words, it's a bug 😅 Thanks @lemire! I liked the post and it shows there's ways to improve, even if it's not perfect yet. Hopefully readers will have additional suggestions(?) I'll work on this as soon as I get a chance (my availability for the next two weeks is minimal). Your input and suggestions are much appreciated, as always. |
I'll try to see if I can efficiently get the length back. If it's feasible I'll create a new ticket outlining the idea. I'll also try to integrate @lemire's algorithm (work on detection internal null-labels for a bit too). @lemire, one of the reactions to your post outlines another idea, did you have a chance to work on it? |
@k0ekk0ek Kendall was making a reference to what I call the prefix-minimum approach. It is faster, but it does not include validation. His approach could potentially save a few instructions, maybe, on the prefix-minimum approach, but it won't get you validation, which I expect you want to have. It is also weirder code, harder to maintain. I would discourage you from considering it. I recommend the simdjson-like approach (i.e., name_to_dnswire_idx_avx): It is fast and it includes full validation. And it is maintainable. My code is not the nicest in the world, but it is not bad even as is. |
@lemire, thank you for clarifying Kendall's approach. Interesting EDIT: Ah, yes, illegal sequences like |
@k0ekk0ek Yes, it is similar and uses much of your own code. But I think it reflects the fact that it might be the right overall design. |
I think it's the right approach indeed. I'll see if I can clean it up a little and issue a PR against your repo so we have benchmarks. |
Fantastic! |
This is what I have so far (issue a PR too). It works sort-off the same as @lemire's recommendation, but it's easier to integrate handling of escape sequences (i.e. #define likely(params) __builtin_expect(!!(params), 1)
#define unlikely(params) __builtin_expect(!!(params), 0)
// simplified version of name_to_dnswire_idx_avx
size_t name_to_dnswire_loop(const char *src, uint8_t *dst)
{
const char *text = src;
uint8_t *octets = dst, *wire = octets + 1;
uint64_t label = 0;
octets[label] = 0;
// real world domain names quickly exceed 16 octets (www.example.com is
// encoded as 3www7example3com0, or 18 octets), but rarely exceed 32
// octets. encode in 32-byte blocks.
__m256i input = _mm256_loadu_si256((const __m256i *)text);
_mm256_storeu_si256((__m256i *)wire, input);
const __m256i dot = _mm256_set1_epi8('.');
uint64_t delimiter = delimiter_mask_avx(input);
uint64_t dots = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(input, dot));
uint64_t limit = delimiter | (1llu << 32);
uint64_t length = _tzcnt_u64(limit);
if (unlikely(dots & 1llu)) // root (single ".") is a special case
return length == 1;
// FIXME: account for escape sequences
dots &= (1llu << length) - 1;
text += length;
wire += length;
// check for null labels, i.e. ".."
if (unlikely(dots & (dots >> 1)))
return 0;
if (dots) {
uint64_t base, count = _tzcnt_u64(dots);
dots &= dots - 1;
octets[label] = (uint8_t)count;
label += count + 1;
while (dots) {
base = count;
count = _tzcnt_u64(dots);
const uint64_t diff = count - base;
dots &= dots - 1;
octets[label] = (uint8_t)(diff - 1);
label += diff;
}
octets[label] = (uint8_t)((length - count) - 1);
} else {
octets[label] = (uint8_t)length;
}
if (likely(delimiter))
return length + 1;
// labels in domain names are limited to 63 octets. track length octets
// (dots) in 64-bit wide bitmap. shift by length of block last copied to
// detect null-labels and labels exceeding 63 octets (zero)
uint64_t labels = (dots << (64 - length)) | ((1llu << 63) >> length);
do {
input = _mm256_loadu_si256((const __m256i *)text);
_mm256_storeu_si256((__m256i *)wire, input);
delimiter = delimiter_mask_avx(input);
dots = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(input, dot));
limit = delimiter | (1llu << 32);
length = _tzcnt_u64(limit);
// FIXME: account for escape sequences
dots &= (1llu << length) - 1;
text += length;
wire += length;
labels = (dots << (64 - length)) | (labels >> length);
// check for null labels, i.e. ".."
if (unlikely(labels & (labels >> 1)))
return 0;
if (dots) {
uint64_t base, count = _tzcnt_u64(dots);
dots &= dots - 1;
octets[label] += (uint8_t)count;
label += count + 1;
while (dots) {
base = count;
count = _tzcnt_u64(dots);
const uint64_t diff = count - base;
dots &= dots - 1;
octets[label] = (uint8_t)(diff - 1);
label += diff;
}
octets[label] = (uint8_t)((length - count) - 1);
} else {
// check if label exceeds 63 octets
if (!labels)
return 0;
octets[label] += (uint8_t)length;
}
} while (!delimiter);
return (size_t)(wire - dst);
} EDIT: I should note that the 32-byte block is a guestimate, 64-byte blocks work better for top-1m.csv, but I suspect 32-bytes is better overall. It's a hunch, I've not measured. |
Reproducing here the gist of the results... (
So your approach use about 15 extra instruction per parsed input. Overall, it is a performance difference of 20%. It probably does not matter. I think you can go with your approach and it should serve you well. I don't think you should be anxious about a 20% difference in a microbenchmark. |
@lemire, I updated my code to give it a slight boost and did a quick test with all owners from a |
@k0ekk0ek Would you give me access to this |
@lemire, you can obtain a current copy via zone transfer. From the looks of it, the data is very similar to the one I have, meaning lots of DNSSEC related data and the usual glue RRs. CONTRIBUTING.md in this repository also links to the Centralized Zone Data Service (CZDS), which allows you to download zone data for many more TLDs ( |
Good! Noted. |
#64 discards the secondary index (specifically for length) that was previously used for parsing domain names (a data type that occurs quite a lot in zone data 😅). With that change I focused on scanning for (described on #30) characters that are not part of the contiguous set. Since some of the characters overlap we need 2*
pshufb
+ 2*cmpeq
+or
(the way it is now, #30 describes filtering the input), we then follow-up with doing acmpeq('\\')
and acmpeq('"')
(this is all somewhat modeled after the string parsing in simdjson btw). Reading up on #65, I also read Wojciech Muła's SIMD-ized faster parse of IPv4 addresses article. In there he uses a clever trick to find digits.Maybe we can use a range check too to parse domain names faster. Most names use the Preferred name syntax and rarely do names contain escape sequences. If an escape sequence is encountered, the algorithm stops at that character, processes the single escaped character and picks-up from the first character after the sequence.
And instead of checking for
\
separately we can replace that with checking if the block contains something interesting to start with?The text was updated successfully, but these errors were encountered: