crypto/aes: implement ctrAble when using AES-NI on amd64 #20967
Comments
Hi @Yawning! You might know this, but for context on the issue: I'd be happy to review a CL adding assembly to optimize CTR, and I assume it would be merged, but someone needs to do the work. Let us know if you'd like to. I suggest adding "amd64" to the title. Label suggestions: NeedsFix HelpWanted Performance. |
Sounds fun. I'd be interested in taking this on. |
It's not taken, I'd be happy to review your CL!
|
@Yawning or anyone else: I don't want to duplicate work, so let me know if you've made any progress in this direction. I see some work here on AES https://git.schwanenlied.me/yawning/bsaes.git. You also suggest in katzenpost/core#1 that you're going to add the AES-NI instructions to that package. I actually quite enjoy messing around with assembly every now and then :) As you say, since GCM has been done this shouldn't be hard in theory. Famous last words. |
@mmcloughlin Don't hold off on doing this on my account. I haven't gotten around to it yet, and the way I end up integrating it into my AES package will probably be quite different from something that's upstreamable. |
Thanks for clarifying @Yawning |
I have just spent longer than I should investigating a concern I had about The function My concern was that GCM also requires CLMUL which is indicated by bit 1 of the same CPUINFO word. Note that Anyway it seemed to me there might be an issue if there is a processor with AES available and CLMUL off. So I looked to see if that ever happens. I searched through a database of processor CPUIDs and it turns out it doesn't happen: https://gist.github.com/mmcloughlin/66488e42a8fdbd9ab39c3f6438bb8ed7 |
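The feature-bit question above can be sketched as a small predicate. This is only an illustration, not the check used in crypto/aes: the names `ecxAESNI`, `ecxPCLMULQDQ`, and `hasAESWithoutCLMUL` are hypothetical, and the sample ECX values are made up. The bit positions are those of ECX from CPUID leaf 1 (bit 25 for AES-NI, bit 1 for PCLMULQDQ).

```go
package main

import "fmt"

// Bit positions in ECX as returned by CPUID leaf 1 (EAX=1).
const (
	ecxPCLMULQDQ = 1 << 1  // carry-less multiply, used by GCM's GHASH
	ecxAESNI     = 1 << 25 // AES-NI instructions
)

// hasAESWithoutCLMUL reports whether an ECX feature word advertises AES-NI
// but not PCLMULQDQ, the combination the comment above was worried about.
// (Hypothetical helper, not part of the standard library.)
func hasAESWithoutCLMUL(ecx uint32) bool {
	return ecx&ecxAESNI != 0 && ecx&ecxPCLMULQDQ == 0
}

func main() {
	// Made-up feature words for illustration only.
	samples := []uint32{
		ecxAESNI | ecxPCLMULQDQ, // both present
		ecxAESNI,                // AES only
		ecxPCLMULQDQ,            // CLMUL only
		0,                       // neither
	}
	for _, ecx := range samples {
		fmt.Printf("ecx=%#010x aes-without-clmul=%v\n", ecx, hasAESWithoutCLMUL(ecx))
	}
}
```

Running a predicate like this over a list of known CPUID values is one way to reproduce the survey linked in the comment.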
Some experiments on the performance of encrypting multiple blocks concurrently with the same key: https://github.com/mmcloughlin/aesnix The work by Gueron et al. referenced in gcm_amd64.s suggests 8 concurrent encryptions is optimal. The experiments I've run generally agree, but results for 4 or more appear more or less the same. |
How many concurrent encryptions is optimal depends on the processor.
Intel Celeron N3050
|
I have tried to optimize AES-CTR on AMD64. The result is:
|
Change https://golang.org/cl/51670 mentions this issue: |
Change https://golang.org/cl/51790 mentions this issue: |
I compared the two CLs for speed on my MacBook Pro 3.1 GHz Intel Core i5. 51f9e92 - CL 51670 by @crvv
1d4a2c8 - CL 51790 by
CL 51670 is ~7-30% faster. I also marginally prefer the assembly macros and not touching the non-amd64 files. So unless I'm missing something I'd suggest focusing on that one.
|
@FiloSottile I did some early work on this but stopped when I saw other people had taken it. The two CLs were from @crvv and Ilya Tocar, who does not appear to be on this thread. |
Ouch, this is what I get for writing replies on planes. @TocarIP ^ |
I want to use this patch with golang 1.9 right now. |
@FiloSottile - CL 51670 has a +1 now and it is waiting for a final +2 from you. Would you be able to take a look? I ask because I would love to add an AVX2 flavor to this and see what sort of performance improvements we get. |
Change https://golang.org/cl/136896 mentions this issue: |
CL https://golang.org/cl/136896 brings CTR mode performance in line with GCM, as you would expect.
@FiloSottile please let me know what would make this CL easiest to verify and review. |
Change https://golang.org/cl/136897 mentions this issue: |
Some parallelizable cipher modes may achieve peak performance for larger block sizes. For this reason the AES-GCM mode already has an 8K benchmark alongside the 1K version. This change introduces 8K benchmarks for additional AES stream cipher modes.
Updates #20967
Change-Id: If97c6fbf31222602dcc200f8f418d95908ec1202
Reviewed-on: https://go-review.googlesource.com/136897
Reviewed-by: Brad Fitzpatrick <[email protected]>
Reviewed-by: Filippo Valsorda <[email protected]>
Run-TryBot: Brad Fitzpatrick <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
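For readers who want to compare 1K and 8K buffers themselves, here is a minimal sketch that does not depend on the benchmarks added by that change. It drives `testing.Benchmark` from a plain program; the helper name `benchCTR` and the all-zero key and IV are assumptions made for the example, and the numbers it prints will depend entirely on the machine.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
	"testing"
)

// benchCTR measures XORKeyStream throughput for one buffer size.
// (Hypothetical helper; the standard library benchmarks are organized differently.)
func benchCTR(size int) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		key := make([]byte, 16)           // all-zero key: fine for timing, never for real data
		iv := make([]byte, aes.BlockSize) // all-zero IV, same caveat
		block, err := aes.NewCipher(key)
		if err != nil {
			b.Fatal(err)
		}
		buf := make([]byte, size)
		stream := cipher.NewCTR(block, iv)
		b.SetBytes(int64(size)) // lets the result report MB/s
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			stream.XORKeyStream(buf, buf)
		}
	})
}

func main() {
	for _, size := range []int{1 << 10, 8 << 10} {
		r := benchCTR(size)
		fmt.Printf("AES-CTR %5d B: %s\n", size, r.String())
	}
}
```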
Hi, after applying the patch here: https://go-review.googlesource.com/c/go/+/51670 SFTPGo (https://github.com/drakkan/sftpgo) has a 20% performance improvement when using an AES CTR based cipher, can you please merge? Thanks! |
I wrote a fast implementation of seekable AES-CTR for amd64 that produces the same stream as the standard crypto/cipher.NewCTR. It supports passing an arbitrary offset, which is useful for doing I/O in the middle of a file. Reviews, bug reports, and feedback are welcome. https://github.com/starius/aesctrat I used the idea from https://github.com/mmcloughlin/aesnix of processing multiple blocks in the same assembly routine. On my machines (desktop Ryzen 5 and VPS) it processes ~5000 megabytes per second. |
The implementation runs up to 8 AES instructions in different registers one after another in the assembly code. Because the CPU pipelines instructions and these instructions do not depend on each other, they can execute in parallel with this code layout. This gives a significant speedup over the regular implementation, in which blocks are processed in the same registers so the AES instructions cannot run in parallel. GCM mode already uses this approach.
The type implementing ctrAble has most of its code in the XORKeyStreamAt() method, which takes an additional argument, offset. This allows it to be used statelessly and to jump to any location in the stream.
AES CTR benchmark delta.
$ go test crypto/cipher -bench 'BenchmarkAESCTR*'
AMD64. Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
name                 old time/op  new time/op  delta
BenchmarkAESCTR1K-2  1259ns       266.9ns      -78.8%
BenchmarkAESCTR8K-2  9859ns       1953ns       -80.1%
ARM64. ARM Neoverse-N1 (AWS EC2 t4g.small instance)
name                 old time/op  new time/op  delta
BenchmarkAESCTR1K-2  1098ns       481.1ns      -56.2%
BenchmarkAESCTR8K-2  8447ns       3452ns       -59.1%
Original issue: golang#20967
Investigation and initial implementation: https://github.com/mmcloughlin/aesnix/
Full implementation in external repo: https://github.com/starius/aesctrat
I added the multiblock implementation to the standard library for amd64 and arm64 in #53503. It boosts performance 2-13x depending on the machine. |
Change https://go.dev/cl/413594 mentions this issue: |
The implementation runs up to 8 AES instructions in different registers one after another in the assembly code. Because the CPU pipelines instructions and these instructions do not depend on each other, they can execute in parallel with this code layout. This gives a significant speedup over the regular implementation, in which blocks are processed in the same registers so the AES instructions cannot run in parallel. GCM mode already uses this approach.
The assembly type implementing ctrAble has most of its code in the XORKeyStreamAt method, which takes an additional argument, offset. This allows it to be used statelessly and to jump to any location in the stream. The method does not exist in the pure Go and boringcrypto implementations.
AES CTR benchmark delta.
$ go test crypto/cipher -bench 'BenchmarkAESCTR*'
AMD64. Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
name                 old time/op  new time/op  delta
BenchmarkAESCTR1K-2  1259ns       266.9ns      -78.8%
BenchmarkAESCTR8K-2  9859ns       1953ns       -80.1%
ARM64. ARM Neoverse-N1 (AWS EC2 t4g.small instance)
name                 old time/op  new time/op  delta
BenchmarkAESCTR1K-2  1098ns       481.1ns      -56.2%
BenchmarkAESCTR8K-2  8447ns       3452ns       -59.1%
Original issue: golang#20967
Investigation and initial implementation: https://github.com/mmcloughlin/aesnix/
Full implementation in external repo: https://github.com/starius/aesctrat
Change https://go.dev/cl/621958 mentions this issue: |
What version of Go are you using (go version)?
1.8.3
What operating system and processor architecture are you using (go env)?
linux/amd64
What did you do?
Benchmarked CTR-AES128 backed by crypto/aes on a system with AES-NI.
What did you expect to see?
Acceptable performance.
What did you see instead?
~400 MB/s for 1 KiB writes. For reference, the same system using crypto/aes's gcmAble does GCM-AES128 Seal() at ~1300 MB/s. A cursory look at the source suggests that there is no special-case ctrAble implementation for AES-NI.