cmd/compile: optimize unaligned load-XOR-store on byte slices #25111

Open
mundaym opened this issue Apr 26, 2018 · 3 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Performance
Milestone

Comments

@mundaym
Member

mundaym commented Apr 26, 2018

It would be nice if the following snippets of code compiled to equivalent code on platforms that support unaligned loads/stores (386, amd64, arm64, ppc64le, s390x, ...). I've used XOR in these examples, but the same holds for the other logical operators:

(1)

binary.LittleEndian.PutUint32(dst, binary.LittleEndian.Uint32(src) ^ x)

(2)

dst[0] = src[0] ^ byte(x)
dst[1] = src[1] ^ byte(x>>8)
dst[2] = src[2] ^ byte(x>>16)
dst[3] = src[3] ^ byte(x>>24)

(3) [less important]

x ^= uint32(src[0])
x ^= uint32(src[1]) << 8
x ^= uint32(src[2]) << 16
x ^= uint32(src[3]) << 24
binary.LittleEndian.PutUint32(dst, x)

Currently (1) is optimal on platforms with unaligned loads and (2) is optimal on other platforms. It would be nice if the compiler could optimize (2) into (1). I've added (3) as an additional case where the current rules are suboptimal.

If this is ever done it will help simplify the generic golang.org/x/crypto/internal/chacha20 implementation.
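For reference, here is a self-contained sketch (function and variable names are mine) wrapping the three variants so they can be compared; all three compute identical results:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// xor1 is variant (1): word-sized load, XOR, word-sized store.
func xor1(dst, src []byte, x uint32) {
	binary.LittleEndian.PutUint32(dst, binary.LittleEndian.Uint32(src)^x)
}

// xor2 is variant (2): byte-by-byte loads, XORs and stores.
func xor2(dst, src []byte, x uint32) {
	dst[0] = src[0] ^ byte(x)
	dst[1] = src[1] ^ byte(x>>8)
	dst[2] = src[2] ^ byte(x>>16)
	dst[3] = src[3] ^ byte(x>>24)
}

// xor3 is variant (3): byte-by-byte load, word-sized XOR and store.
func xor3(dst, src []byte, x uint32) {
	x ^= uint32(src[0])
	x ^= uint32(src[1]) << 8
	x ^= uint32(src[2]) << 16
	x ^= uint32(src[3]) << 24
	binary.LittleEndian.PutUint32(dst, x)
}

func main() {
	src := []byte{0x01, 0x23, 0x45, 0x67}
	d1, d2, d3 := make([]byte, 4), make([]byte, 4), make([]byte, 4)
	xor1(d1, src, 0xdeadbeef)
	xor2(d2, src, 0xdeadbeef)
	xor3(d3, src, 0xdeadbeef)
	fmt.Println(d1, d2, d3)
}
```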

@mundaym mundaym added this to the Unplanned milestone Apr 26, 2018
@bcmills
Contributor

bcmills commented Apr 27, 2018

I think I'm missing something: why is (1) not optimal on other platforms?

@mundaym
Member Author

mundaym commented Apr 27, 2018

Example (1) is equivalent to the following code, which contains three extra shifts:

v := uint32(src[0])
v |= uint32(src[1]) << 8
v |= uint32(src[2]) << 16
v |= uint32(src[3]) << 24
v ^= u
dst[0] = byte(v)
dst[1] = byte(v>>8)
dst[2] = byte(v>>16)
dst[3] = byte(v>>24)

On the other hand, this doesn't actually cost many extra instructions on arm, because of its shifted register operands. The mips assembly benefits a bit more from (2), though. I don't know whether there is a speed difference.

@bcmills
Contributor

bcmills commented Apr 27, 2018

It seems like (1) is the code we should expect people to write, and it may be a bit easier to recognize and rewrite than (2). (It's usually easier to distribute mathematical operations than to un-distribute them, especially binary operations that don't carry.)
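For what it's worth, the "don't carry" property is easy to demonstrate (the constants below are my own illustration): extracting a byte distributes over XOR, because XOR never moves information between bit positions, whereas the same rewrite is invalid for addition, where carries cross byte boundaries:

```go
package main

import "fmt"

func main() {
	a, b := uint32(0x01ff), uint32(0x0001)

	// XOR has no carries, so extracting any byte distributes over it:
	// byte((a^b)>>8) == byte(a>>8)^byte(b>>8) holds for all a, b.
	fmt.Println(byte((a^b)>>8) == byte(a>>8)^byte(b>>8))

	// Addition does carry between byte positions, so the same rewrite
	// fails: 0x01ff + 0x0001 = 0x0200, whose second byte is 0x02,
	// but byte(a>>8) + byte(b>>8) = 0x01 + 0x00 = 0x01.
	fmt.Println(byte((a+b)>>8) == byte(a>>8)+byte(b>>8))
}
```

This is why the compiler can soundly merge the four byte-wise XORs in (2) into the single word-wide XOR in (1).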

Projects
None yet
Development

No branches or pull requests

3 participants