cmd/compile: optimize unaligned load-XOR-store on byte slices #25111

Open
mundaym opened this issue Apr 26, 2018 · 3 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Performance
Milestone

Comments

@mundaym
Member

mundaym commented Apr 26, 2018

It would be nice if the following snippets of code compiled to equivalent code on platforms that support unaligned loads/stores (386, amd64, arm64, ppc64le, s390x, ...). I've used XOR in these examples, but the same holds for the other logical operators:

(1)

binary.LittleEndian.PutUint32(dst, binary.LittleEndian.Uint32(src) ^ x)

(2)

dst[0] = src[0] ^ byte(x)
dst[1] = src[1] ^ byte(x>>8)
dst[2] = src[2] ^ byte(x>>16)
dst[3] = src[3] ^ byte(x>>24)

(3) [less important]

x ^= uint32(src[0])
x ^= uint32(src[1]) << 8
x ^= uint32(src[2]) << 16
x ^= uint32(src[3]) << 24
binary.LittleEndian.PutUint32(dst, x)

Currently (1) is optimal on platforms with unaligned loads and (2) is optimal on other platforms. It would be nice if the compiler could optimize (2) into (1). I've added (3) as an additional case where the current rules are suboptimal.

If this is ever done it will help simplify the generic golang.org/x/crypto/internal/chacha20 implementation.
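For reference, here is a self-contained sketch (function and variable names are mine) wrapping the three variants so they can be compared; all three compute identical results:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// xor1 is variant (1): word-sized load, XOR, word-sized store.
func xor1(dst, src []byte, x uint32) {
	binary.LittleEndian.PutUint32(dst, binary.LittleEndian.Uint32(src)^x)
}

// xor2 is variant (2): byte-by-byte loads, XORs and stores.
func xor2(dst, src []byte, x uint32) {
	dst[0] = src[0] ^ byte(x)
	dst[1] = src[1] ^ byte(x>>8)
	dst[2] = src[2] ^ byte(x>>16)
	dst[3] = src[3] ^ byte(x>>24)
}

// xor3 is variant (3): byte-by-byte load, word-sized XOR and store.
func xor3(dst, src []byte, x uint32) {
	x ^= uint32(src[0])
	x ^= uint32(src[1]) << 8
	x ^= uint32(src[2]) << 16
	x ^= uint32(src[3]) << 24
	binary.LittleEndian.PutUint32(dst, x)
}

func main() {
	src := []byte{0x01, 0x23, 0x45, 0x67}
	d1, d2, d3 := make([]byte, 4), make([]byte, 4), make([]byte, 4)
	xor1(d1, src, 0xdeadbeef)
	xor2(d2, src, 0xdeadbeef)
	xor3(d3, src, 0xdeadbeef)
	fmt.Println(d1, d2, d3)
}
```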

@mundaym mundaym added this to the Unplanned milestone Apr 26, 2018
@bcmills
Contributor

bcmills commented Apr 27, 2018

I think I'm missing something: why is (1) not optimal on other platforms?

@mundaym
Member Author

mundaym commented Apr 27, 2018

Example (1) is equivalent to the following code, which contains three extra shifts:

v := uint32(src[0])
v |= uint32(src[1]) << 8
v |= uint32(src[2]) << 16
v |= uint32(src[3]) << 24
v ^= u
dst[0] = byte(v)
dst[1] = byte(v>>8)
dst[2] = byte(v>>16)
dst[3] = byte(v>>24)

On the other hand, this doesn't actually cost many extra instructions on arm, because of its shifted register operands. The mips assembly benefits a bit more from (2), though. I don't know whether there is a speed difference.

@bcmills
Contributor

bcmills commented Apr 27, 2018

It seems like (1) is the code we should expect people to write, and it may be a bit easier to recognize and rewrite than (2). (It's usually easier to distribute mathematical operations than to un-distribute them, especially binary operations that don't carry.)
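For what it's worth, the "don't carry" property is easy to demonstrate (the constants below are my own illustration): extracting a byte distributes over XOR, because XOR never moves information between bit positions, whereas the same rewrite is invalid for addition, where carries cross byte boundaries:

```go
package main

import "fmt"

func main() {
	a, b := uint32(0x01ff), uint32(0x0001)

	// XOR has no carries, so extracting any byte distributes over it:
	// byte((a^b)>>8) == byte(a>>8)^byte(b>>8) holds for all a, b.
	fmt.Println(byte((a^b)>>8) == byte(a>>8)^byte(b>>8))

	// Addition does carry between byte positions, so the same rewrite
	// fails: 0x01ff + 0x0001 = 0x0200, whose second byte is 0x02,
	// but byte(a>>8) + byte(b>>8) = 0x01 + 0x00 = 0x01.
	fmt.Println(byte((a+b)>>8) == byte(a>>8)+byte(b>>8))
}
```

This is why the compiler can soundly merge the four byte-wise XORs in (2) into the single word-wide XOR in (1).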

Projects
None yet
Development

No branches or pull requests

3 participants