cmd/compile: combine BSWAP on amd64 #41684

josharian · 2020-09-28T20:38:12Z

package p

import "encoding/binary"

func f(b []byte, x *[8]byte) {
	_ = b[8]
	t := binary.BigEndian.Uint64(x[:])
	binary.BigEndian.PutUint64(b, t)
}

This should compile down to two MOVQs on amd64, one to load from x and one to write to b.

Instead, it contains two successive BSWAP instructions. We should eliminate those. As a bonus, we should eliminate any BSWAPS separated only by bitwise operations. For example:

package p

import "encoding/binary"

func f(b []byte, x *[8]byte) {
	_ = b[8]
	t := binary.BigEndian.Uint64(x[:])
	t = ^t
	binary.BigEndian.PutUint64(b, t)
}

(In addition to unary ^, there's binary &, |, and ^. Maybe some others?)

One advantage to this kind of optimization is that it reduces the important of whether the endianness selected in the code matches the endianness of the architecture the code is being executed on.

Related: #41663

cc @agarciamontoro

This allows IP manipulation operations to operate on IPs as two registers, which empirically leads to significant speedups, in particular on OOO superscalars where the two halves can be processed in parallel. You might expect that we could keep the representation as [16]byte, and do a cycle of BigEndian.Uint64+tweak+BigEndian.PutUint64, and that would compile down to efficient code. Unfortunately, due to a variety of missing optimizations in the Go compiler, that is not the case and that code turns into byte-wise operations. On the other hand, converting to a uint64 pair at construction results in efficient construction (a pair of MOVQ+BSWAPQ) and efficient twiddling operations (single-cycle arithmetic on 64-bit pairs). See also: - golang/go#41663 - golang/go#41684 - golang/go#42958 - The discussion in #63 name old time/op new time/op delta StdIPv4-8 146ns ± 2% 141ns ± 2% -3.42% (p=0.016 n=5+5) IPv4-8 120ns ± 1% 107ns ± 2% -10.65% (p=0.008 n=5+5) IPv4_inline-8 120ns ± 0% 118ns ± 1% -1.67% (p=0.016 n=4+5) StdIPv6-8 211ns ± 2% 215ns ± 1% +2.18% (p=0.008 n=5+5) IPv6-8 281ns ± 1% 252ns ± 1% -10.19% (p=0.008 n=5+5) IPv4Contains-8 11.8ns ± 4% 4.7ns ± 2% -60.00% (p=0.008 n=5+5) ParseIPv4-8 68.1ns ± 4% 78.8ns ± 1% +15.74% (p=0.008 n=5+5) ParseIPv6-8 419ns ± 1% 409ns ± 0% -2.40% (p=0.016 n=4+5) StdParseIPv4-8 73.7ns ± 1% 88.8ns ± 2% +20.50% (p=0.008 n=5+5) StdParseIPv6-8 132ns ± 2% 134ns ± 1% ~ (p=0.079 n=5+5) IPPrefixMasking/IPv4_/32-8 36.3ns ± 3% 4.8ns ± 4% -86.72% (p=0.008 n=5+5) IPPrefixMasking/IPv4_/17-8 39.0ns ± 0% 4.8ns ± 3% -87.78% (p=0.008 n=5+5) IPPrefixMasking/IPv4_/0-8 36.9ns ± 2% 4.8ns ± 4% -87.07% (p=0.008 n=5+5) IPPrefixMasking/IPv6_/128-8 32.7ns ± 1% 4.7ns ± 2% -85.47% (p=0.008 n=5+5) IPPrefixMasking/IPv6_/65-8 39.8ns ± 1% 4.7ns ± 1% -88.13% (p=0.008 n=5+5) IPPrefixMasking/IPv6_/0-8 40.7ns ± 1% 4.7ns ± 2% -88.41% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/128-8 136ns ± 3% 5ns ± 2% -96.53% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/65-8 142ns ± 2% 5ns ± 1% -96.65% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/0-8 143ns ± 2% 5ns ± 3% -96.67% (p=0.008 n=5+5) IPSetFuzz-8 22.7µs ± 2% 16.4µs ± 2% -27.84% (p=0.008 n=5+5) name old alloc/op new alloc/op delta StdIPv4-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) IPv4-8 0.00B 0.00B ~ (all equal) IPv4_inline-8 0.00B 0.00B ~ (all equal) StdIPv6-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) IPv6-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) IPv4Contains-8 0.00B 0.00B ~ (all equal) ParseIPv4-8 0.00B 0.00B ~ (all equal) ParseIPv6-8 48.0B ± 0% 48.0B ± 0% ~ (all equal) StdParseIPv4-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) StdParseIPv6-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) IPPrefixMasking/IPv4_/32-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv4_/17-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv4_/0-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv6_/128-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv6_/65-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv6_/0-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv6_zone_/128-8 16.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/65-8 16.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/0-8 16.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5) IPSetFuzz-8 2.60kB ± 0% 2.60kB ± 0% ~ (p=1.000 n=5+5) name old allocs/op new allocs/op delta StdIPv4-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) IPv4-8 0.00 0.00 ~ (all equal) IPv4_inline-8 0.00 0.00 ~ (all equal) StdIPv6-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) IPv6-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) IPv4Contains-8 0.00 0.00 ~ (all equal) ParseIPv4-8 0.00 0.00 ~ (all equal) ParseIPv6-8 3.00 ± 0% 3.00 ± 0% ~ (all equal) StdParseIPv4-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) StdParseIPv6-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) IPPrefixMasking/IPv4_/32-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv4_/17-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv4_/0-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv6_/128-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv6_/65-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv6_/0-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv6_zone_/128-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/65-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/0-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) IPSetFuzz-8 33.0 ± 0% 33.0 ± 0% ~ (all equal) Signed-off-by: David Anderson <[email protected]>

This allows IP manipulation operations to operate on IPs as two registers, which empirically leads to significant speedups, in particular on OOO superscalars where the two halves can be processed in parallel. You might expect that we could keep the representation as [16]byte, and do a cycle of BigEndian.Uint64+tweak+BigEndian.PutUint64, and that would compile down to efficient code. Unfortunately, due to a variety of missing optimizations in the Go compiler, that is not the case and that code turns into byte-wise operations. On the other hand, converting to a uint64 pair at construction results in efficient construction (a pair of MOVQ+BSWAPQ) and efficient twiddling operations (single-cycle arithmetic on 64-bit pairs). See also: - golang/go#41663 - golang/go#41684 - golang/go#42958 - The discussion in inetaf#63 name old time/op new time/op delta StdIPv4-8 146ns ± 2% 141ns ± 2% -3.42% (p=0.016 n=5+5) IPv4-8 120ns ± 1% 107ns ± 2% -10.65% (p=0.008 n=5+5) IPv4_inline-8 120ns ± 0% 118ns ± 1% -1.67% (p=0.016 n=4+5) StdIPv6-8 211ns ± 2% 215ns ± 1% +2.18% (p=0.008 n=5+5) IPv6-8 281ns ± 1% 252ns ± 1% -10.19% (p=0.008 n=5+5) IPv4Contains-8 11.8ns ± 4% 4.7ns ± 2% -60.00% (p=0.008 n=5+5) ParseIPv4-8 68.1ns ± 4% 78.8ns ± 1% +15.74% (p=0.008 n=5+5) ParseIPv6-8 419ns ± 1% 409ns ± 0% -2.40% (p=0.016 n=4+5) StdParseIPv4-8 73.7ns ± 1% 88.8ns ± 2% +20.50% (p=0.008 n=5+5) StdParseIPv6-8 132ns ± 2% 134ns ± 1% ~ (p=0.079 n=5+5) IPPrefixMasking/IPv4_/32-8 36.3ns ± 3% 4.8ns ± 4% -86.72% (p=0.008 n=5+5) IPPrefixMasking/IPv4_/17-8 39.0ns ± 0% 4.8ns ± 3% -87.78% (p=0.008 n=5+5) IPPrefixMasking/IPv4_/0-8 36.9ns ± 2% 4.8ns ± 4% -87.07% (p=0.008 n=5+5) IPPrefixMasking/IPv6_/128-8 32.7ns ± 1% 4.7ns ± 2% -85.47% (p=0.008 n=5+5) IPPrefixMasking/IPv6_/65-8 39.8ns ± 1% 4.7ns ± 1% -88.13% (p=0.008 n=5+5) IPPrefixMasking/IPv6_/0-8 40.7ns ± 1% 4.7ns ± 2% -88.41% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/128-8 136ns ± 3% 5ns ± 2% -96.53% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/65-8 142ns ± 2% 5ns ± 1% -96.65% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/0-8 143ns ± 2% 5ns ± 3% -96.67% (p=0.008 n=5+5) IPSetFuzz-8 22.7µs ± 2% 16.4µs ± 2% -27.84% (p=0.008 n=5+5) name old alloc/op new alloc/op delta StdIPv4-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) IPv4-8 0.00B 0.00B ~ (all equal) IPv4_inline-8 0.00B 0.00B ~ (all equal) StdIPv6-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) IPv6-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) IPv4Contains-8 0.00B 0.00B ~ (all equal) ParseIPv4-8 0.00B 0.00B ~ (all equal) ParseIPv6-8 48.0B ± 0% 48.0B ± 0% ~ (all equal) StdParseIPv4-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) StdParseIPv6-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) IPPrefixMasking/IPv4_/32-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv4_/17-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv4_/0-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv6_/128-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv6_/65-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv6_/0-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv6_zone_/128-8 16.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/65-8 16.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/0-8 16.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5) IPSetFuzz-8 2.60kB ± 0% 2.60kB ± 0% ~ (p=1.000 n=5+5) name old allocs/op new allocs/op delta StdIPv4-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) IPv4-8 0.00 0.00 ~ (all equal) IPv4_inline-8 0.00 0.00 ~ (all equal) StdIPv6-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) IPv6-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) IPv4Contains-8 0.00 0.00 ~ (all equal) ParseIPv4-8 0.00 0.00 ~ (all equal) ParseIPv6-8 3.00 ± 0% 3.00 ± 0% ~ (all equal) StdParseIPv4-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) StdParseIPv6-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) IPPrefixMasking/IPv4_/32-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv4_/17-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv4_/0-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv6_/128-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv6_/65-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv6_/0-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv6_zone_/128-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/65-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/0-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) IPSetFuzz-8 33.0 ± 0% 33.0 ± 0% ~ (all equal) Signed-off-by: David Anderson <[email protected]> Signed-off-by: Stefan Majer <[email protected]>

agarciamontoro · 2020-12-30T19:41:57Z

@josharian, I'd like to work on this one, too, as I guess it's quite similar to #41663

By the way, should this one be added to go1.17 too?

josharian · 2020-12-31T02:24:15Z

Great! Let me know if you have any questions.

agarciamontoro · 2021-05-07T12:00:49Z

I am working on this, finally, and the first part of the issue is easily solvable with a rule such as:

(BSWAPQ (BSWAPQ p)) => p

@josharian, do we need to check that the inner BSWAPQ is used only once? Something like this:

(BSWAPQ a:(BSWAPQ p))
    && a.Uses == 1
    && clobber(a) =>
    p

I think I understand what the Uses field is for and what the clobber function does, but I'm not sure I got it 100%.

Thanks!

EDIT: All tests pass with both versions.

josharian · 2021-05-07T13:54:15Z

No need to check uses or to clobber; your simple version suffices. BSWAP works on a regular value (held in a register), not directly on memory.

Feel free to stop at the double BSWAP elimination—eliminating the bitwise operations is, as I said, a bonus. :)

Are there other architectures to apply it to? Worth doing BSWAPL?

Note the 1.17 freeze has begun. Feel free to mail your changes, but they won't go in until the 1.18 cycle.

agarciamontoro · 2021-05-07T14:23:22Z

No need to check uses or to clobber; your simple version suffices. BSWAP works on a regular value (held in a register), not directly on memory.

Cool, thanks for the info!

Feel free to stop at the double BSWAP elimination—eliminating the bitwise operations is, as I said, a bonus. :)

I actually have the unary ^ kind of solved, too. I was wondering what the other cases are. You talk about some binary bitwise operations, but between what values? Should I test that with another array argument similar to x?

Are there other architectures to apply it to?

That's my next step, I wanted to take a look!

Worth doing BSWAPL?

Yup, I was actually thinking about doing something a bit more general, like

(BSWAP(Q|L) (BSWAP(Q|L) p)) => p

Note the 1.17 freeze has begun. Feel free to mail your changes, but they won't go in until the 1.18 cycle.

I konw, thanks for the heads-up!

josharian · 2021-05-07T14:33:20Z

You talk about some binary bitwise operations, but between what values? Should I test that with another array argument similar to x?

Yes, exactly.

#31586 will probably be useful context for this.

agarciamontoro · 2021-05-14T17:33:57Z

I'm sending a CL covering only the double swap case. I added rules for both amd64 and arm64. I am not familiar at all with s390x nor with ppc64le, but I think the double swap was not happening there. The test does cover s390x (this should be reviewed with even more care, though), but I had no clue on what to do with ppc64le. We can discuss in the CL itself, if that works for you.

Regarding the other cases (the unary and binary bitwise operations): should we open a new issue for them? I can probably start taking a deeper look next Friday.

gopherbot · 2021-05-14T17:34:53Z

Change https://golang.org/cl/320073 mentions this issue: cmd/compile: eliminate successive swaps

agarciamontoro · 2021-10-05T15:24:49Z

@josharian, should we add this to the 1.18 milestone? I'm planning on finishing the work here this or next week.

josharian · 2021-10-05T16:50:18Z

@agarciamontoro this won't block the 1.18 milestone, so there's no need, but thanks for checking in. (And for working on it.)

josharian added help wanted Performance labels Sep 28, 2020

josharian added this to the Unplanned milestone Sep 28, 2020

ianlancetaylor changed the title ~~cmd/compile: combine BWAPS on amd64~~ cmd/compile: combine BSWAP on amd64 Sep 28, 2020

andybons added the NeedsFix The path to resolution is known, but the work has not been done. label Sep 29, 2020

danderson mentioned this issue Dec 25, 2020

Change the representation of IP to a pair of uint64s. inetaf/netaddr#63

Merged

gopherbot closed this as completed in e1c294a Oct 9, 2021

golang locked and limited conversation to collaborators Oct 9, 2022

gopherbot added the FrozenDueToAge label Oct 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: combine BSWAP on amd64 #41684

cmd/compile: combine BSWAP on amd64 #41684

josharian commented Sep 28, 2020

agarciamontoro commented Dec 30, 2020

josharian commented Dec 31, 2020

agarciamontoro commented May 7, 2021 •

edited

Loading

josharian commented May 7, 2021

agarciamontoro commented May 7, 2021 •

edited

Loading

josharian commented May 7, 2021

agarciamontoro commented May 14, 2021

gopherbot commented May 14, 2021

agarciamontoro commented Oct 5, 2021

josharian commented Oct 5, 2021

cmd/compile: combine BSWAP on amd64 #41684

cmd/compile: combine BSWAP on amd64 #41684

Comments

josharian commented Sep 28, 2020

agarciamontoro commented Dec 30, 2020

josharian commented Dec 31, 2020

agarciamontoro commented May 7, 2021 • edited Loading

josharian commented May 7, 2021

agarciamontoro commented May 7, 2021 • edited Loading

josharian commented May 7, 2021

agarciamontoro commented May 14, 2021

gopherbot commented May 14, 2021

agarciamontoro commented Oct 5, 2021

josharian commented Oct 5, 2021

agarciamontoro commented May 7, 2021 •

edited

Loading

agarciamontoro commented May 7, 2021 •

edited

Loading