crypto/aes: implement ctrAble when using AES-NI on amd64 #20967

Yawning · 2017-07-10T07:19:29Z

What version of Go are you using (`go version`)?

1.8.3

What operating system and processor architecture are you using (`go env`)?

linux/amd64

What did you do?

Benchmarked CTR-AES128 backed by crypto/aes on a system with AES-NI.

What did you expect to see?

Acceptable performance.

What did you see instead?

~400 MB/s for 1 KiB writes. For reference the same system using crypto/aes's gcmAble does GCM-AES128 Seal() at ~1300 MB/s. A cursory look at the source suggests that there is no special case ctrAble implementation for AES-NI.
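
For anyone wanting to reproduce a measurement of this kind, the following is a minimal sketch rather than the benchmark actually used in this report. The helper name `xorChunks`, the all-zero key and IV, and the number of writes are assumptions chosen for illustration; only `crypto/aes` and `crypto/cipher` calls from the standard library are used.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
	"time"
)

// xorChunks (hypothetical helper) runs one continuous AES-CTR stream over buf,
// in place, `writes` times. The key length selects AES-128/192/256.
func xorChunks(key, iv, buf []byte, writes int) error {
	block, err := aes.NewCipher(key)
	if err != nil {
		return err
	}
	stream := cipher.NewCTR(block, iv)
	for i := 0; i < writes; i++ {
		stream.XORKeyStream(buf, buf)
	}
	return nil
}

func main() {
	key := make([]byte, 16)           // AES-128; all-zero key is for measurement only
	iv := make([]byte, aes.BlockSize) // all-zero IV, likewise for measurement only
	buf := make([]byte, 1024)         // 1 KiB per write, as in the report
	const writes = 1 << 18            // arbitrary: 256 MiB in total

	start := time.Now()
	if err := xorChunks(key, iv, buf, writes); err != nil {
		panic(err)
	}
	elapsed := time.Since(start).Seconds()
	fmt.Printf("%.0f MB/s\n", float64(writes*len(buf))/elapsed/1e6)
}
```

A `testing.B` benchmark with `b.SetBytes` would give more stable numbers; the wall-clock loop above only shows the shape of the measurement.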

mvdan · 2017-07-10T11:55:27Z

/cc @agl @rsc

FiloSottile · 2017-07-10T12:00:40Z

Hi @Yawning!

You might know this, but for context on the issue: ctrAble and gcmAble are interfaces that let a cipher.Block implementation bypass the generic block-mode code and supply its own implementation of the whole mode. AES-CTR on linux/amd64 does use AES-NI for block encryption, but the block mode itself is handled in Go, which is of course slower than modes implemented in assembly, such as GCM on amd64 or CTR on s390x.
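
To make the mechanism concrete, here is a hedged sketch of the interface-upgrade pattern. The `ctrAble` declaration follows the shape of the unexported interface in crypto/cipher in the Go versions discussed in this thread; `fastBlock`, `newCTR`, and `demo` are hypothetical names for illustration, and the "optimized" stream here simply delegates to the generic one.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

// ctrAble: a Block that can construct its own CTR stream.
type ctrAble interface {
	NewCTR(iv []byte) cipher.Stream
}

// fastBlock is a hypothetical Block that claims to have an optimized CTR mode.
type fastBlock struct {
	cipher.Block
	used bool // records whether the specialized path was taken
}

func (b *fastBlock) NewCTR(iv []byte) cipher.Stream {
	b.used = true
	// Stand-in: a real implementation would return an assembly-backed stream.
	return cipher.NewCTR(b.Block, iv)
}

// newCTR sketches the dispatch: prefer the Block's own CTR if it has one,
// otherwise fall back to the generic mode.
func newCTR(block cipher.Block, iv []byte) cipher.Stream {
	if c, ok := block.(ctrAble); ok {
		return c.NewCTR(iv)
	}
	return cipher.NewCTR(block, iv)
}

// demo reports whether the specialized path ran and whether both paths
// produced the same ciphertext.
func demo() (fastPathUsed, sameOutput bool) {
	block, err := aes.NewCipher(make([]byte, 16))
	if err != nil {
		panic(err)
	}
	iv := make([]byte, aes.BlockSize)
	msg := []byte("the same keystream either way")

	plain := make([]byte, len(msg))
	newCTR(block, iv).XORKeyStream(plain, msg)

	fb := &fastBlock{Block: block}
	fast := make([]byte, len(msg))
	newCTR(fb, iv).XORKeyStream(fast, msg)

	return fb.used, bytes.Equal(plain, fast)
}

func main() {
	used, same := demo()
	fmt.Println("fast path used:", used, "outputs match:", same)
}
```

The point of the pattern is that callers keep using the generic constructor while a Block with a faster whole-mode implementation gets picked up through a type assertion.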

I'd be happy to review a CL adding assembly to optimize CTR, and I assume it would be merged, but someone needs to do the work. Let us know if you'd like to.

I suggest adding "amd64" to the title. Label suggestions: NeedsFix HelpWanted Performance.

mmcloughlin · 2017-07-18T07:49:41Z

Sounds fun. I'd be interested in taking this on.

FiloSottile · 2017-07-18T14:05:07Z

It's not taken, I'd be happy to review your CL!

mmcloughlin · 2017-07-18T21:27:21Z

@Yawning or anyone else: I don't want to duplicate work, so let me know if you've made any progress in this direction.

I see some work here on AES https://git.schwanenlied.me/yawning/bsaes.git. You also suggest in katzenpost/core#1 that you're going to add the AES-NI instructions to that package.

I actually quite enjoy messing around with assembly every now and then :) As you say, since GCM has been done this shouldn't be hard in theory. Famous last words.

Yawning · 2017-07-19T08:29:06Z

@mmcloughlin Don't hold off on doing this on my account. I haven't gotten around to it yet, and the way I end up integrating it into my AES package will probably be quite different from something that's upstreamable.

mmcloughlin · 2017-07-19T16:55:24Z

Thanks for clarifying @Yawning

mmcloughlin · 2017-07-22T03:47:47Z

I have just spent longer than I should investigating a concern I had about cipherhw.AESGCMSupport. Ultimately not fruitful, but I'll document it here in case someone somewhere on the internet ever has the same question.

The function cipherhw.AESGCMSupport delegates to hasAESNI. This checks bit 25 of the CPUID feature word (leaf 1, ECX).

My concern was that GCM also requires CLMUL, which is indicated by bit 1 of the same CPUID word. Note that aes.hasGCMAsm checks both bits. It is not clear why we need both cipherhw.AESGCMSupport and aes.hasGCMAsm.

Anyway it seemed to me there might be an issue if there is a processor with AES available and CLMUL off. So I looked to see if that ever happens. I searched through a database of processor CPUIDs and it turns out it doesn't happen:

https://gist.github.com/mmcloughlin/66488e42a8fdbd9ab39c3f6438bb8ed7
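
As an illustration of the two bits in question, a small sketch follows. It takes the ECX value of CPUID leaf 1 as a plain parameter instead of executing CPUID, and the names `supportsAES` and `supportsAESGCM` are hypothetical, not the functions in the Go tree.

```go
package main

import "fmt"

const (
	ecxPCLMULQDQ = 1 << 1  // CPUID leaf 1, ECX bit 1: carry-less multiply
	ecxAES       = 1 << 25 // CPUID leaf 1, ECX bit 25: AES-NI
)

// supportsAES reports whether the AES-NI bit is set in the given ECX word.
func supportsAES(ecx uint32) bool {
	return ecx&ecxAES != 0
}

// supportsAESGCM reports whether both AES-NI and PCLMULQDQ are set,
// which is what an assembly GCM implementation needs.
func supportsAESGCM(ecx uint32) bool {
	return ecx&ecxAES != 0 && ecx&ecxPCLMULQDQ != 0
}

func main() {
	// A hypothetical CPU with AES-NI but without CLMUL: the case of concern.
	var ecx uint32 = ecxAES
	fmt.Println("AES:", supportsAES(ecx), "AES+CLMUL:", supportsAESGCM(ecx))
}
```

On such a hypothetical processor the AES-only check would pass while the combined check would fail, which is the mismatch the database search ruled out in practice.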

mmcloughlin · 2017-07-25T01:30:31Z

Some experiments on performance of encrypting multiple blocks concurrently with the same key:

https://github.com/mmcloughlin/aesnix

The work by Gueron et al. referenced in gcm_amd64.s suggests 8 concurrent encryptions is optimal. The experiments I've run generally tend to agree, but results for 4+ appear more or less the same.

crvv · 2017-07-28T07:58:06Z

The optimal number of concurrent encryptions depends on the processor.
I tested mmcloughlin's code on two different CPUs. The results are:

Intel Core i7-4770HQ

BenchmarkSingle-8   	200000000	         8.69 ns/op	1841.41 MB/s
BenchmarkMulti/2-8  	100000000	        10.4 ns/op	3063.63 MB/s
BenchmarkMulti/4-8  	100000000	        17.5 ns/op	3661.52 MB/s
BenchmarkMulti/6-8  	100000000	        22.9 ns/op	4193.59 MB/s
BenchmarkMulti/8-8  	50000000	        28.6 ns/op	4476.84 MB/s
BenchmarkMulti/10-8 	50000000	        35.3 ns/op	4529.11 MB/s
BenchmarkMulti/12-8 	30000000	        40.5 ns/op	4740.57 MB/s
BenchmarkMulti/14-8 	30000000	        45.2 ns/op	4955.40 MB/s

Intel Celeron N3050

BenchmarkSingle-2   	30000000	        47.1 ns/op	 339.35 MB/s
BenchmarkMulti/2-2  	20000000	        70.3 ns/op	 455.03 MB/s
BenchmarkMulti/4-2  	10000000	       117 ns/op	 543.86 MB/s
BenchmarkMulti/6-2  	10000000	       168 ns/op	 571.04 MB/s
BenchmarkMulti/8-2  	 5000000	       326 ns/op	 391.92 MB/s
BenchmarkMulti/10-2 	 3000000	       400 ns/op	 399.27 MB/s
BenchmarkMulti/12-2 	 3000000	       480 ns/op	 399.73 MB/s
BenchmarkMulti/14-2 	 3000000	       557 ns/op	 402.03 MB/s

crvv · 2017-07-28T09:58:16Z

I have tried to optimize AES-CTR on AMD64. The result is:

1. Encrypt the counter values in bulk.
crvv@f33f081

name        old speed     new speed      delta
AESCTR1K-8  520MB/s ± 1%  1210MB/s ± 2%  +132.58%

2. Implement big-endian 128-bit integer addition in assembly.
crvv@8b203f5

name        old speed     new speed      delta
AESCTR1K-8  520MB/s ± 1%  2014MB/s ± 3%  +286.96%

3. Merge 1 and 2. This is not significant.
crvv@cb9cb27

name        old speed     new speed      delta
AESCTR1K-8  520MB/s ± 1%  2104MB/s ± 1%  +304.31%

4. Implement XOR in assembly, like ctr_s390x.go.
crvv@958e355

name        old speed     new speed      delta
AESCTR1K-8  520MB/s ± 1%  2505MB/s ± 3%  +381.43%
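
For readers unfamiliar with the counter arithmetic involved, here is a pure-Go sketch of big-endian 128-bit addition on a CTR counter block. It is not the assembly from the commits above; the function name `addCounter` and the representation as two 64-bit halves are assumptions made for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// addCounter adds n to the 16-byte big-endian counter in place,
// wrapping modulo 2^128.
func addCounter(ctr *[16]byte, n uint64) {
	hi := binary.BigEndian.Uint64(ctr[:8])
	lo := binary.BigEndian.Uint64(ctr[8:])

	// Add to the low half and propagate the carry into the high half.
	lo, carry := bits.Add64(lo, n, 0)
	hi, _ = bits.Add64(hi, 0, carry)

	binary.BigEndian.PutUint64(ctr[:8], hi)
	binary.BigEndian.PutUint64(ctr[8:], lo)
}

func main() {
	var ctr [16]byte
	for i := 8; i < 16; i++ {
		ctr[i] = 0xff // low 64 bits all ones, so adding 1 must carry
	}
	addCounter(&ctr, 1)
	fmt.Printf("%x\n", ctr)
}
```

Treating the counter as two machine words instead of incrementing byte by byte is what makes it cheap to produce several consecutive counter values at once.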

gopherbot · 2017-07-28T10:03:05Z

Change https://golang.org/cl/51670 mentions this issue: crypto/aes: add optimized implementation of AES-CTR for AMD64

gopherbot · 2017-07-28T19:45:30Z

Change https://golang.org/cl/51790 mentions this issue: crypto/aes: add encryptMany and use it to speed up ctr

FiloSottile · 2017-08-30T18:49:37Z

I compared the two CLs for speed on my MacBook Pro 3.1 GHz Intel Core i5.

51f9e92 - CL 51670 by @crvv

name        old time/op   new time/op    delta
AESCTR1K-4   1.37µs ± 2%    0.37µs ± 1%   -73.30%  (p=0.002 n=6+6)
AESCTR32-4   57.4ns ± 1%    21.1ns ± 3%   -63.14%  (p=0.002 n=6+6)

name        old speed     new speed      delta
AESCTR1K-4  742MB/s ± 2%  2776MB/s ± 1%  +274.21%  (p=0.002 n=6+6)
AESCTR32-4  558MB/s ± 1%  1514MB/s ± 3%  +171.39%  (p=0.002 n=6+6)

1d4a2c8 - CL 51790 by ~~@mmcloughlin~~ @TocarIP

name        old time/op   new time/op    delta
AESCTR1K-4   1.37µs ± 2%    0.48µs ± 2%   -64.81%  (p=0.002 n=6+6)
AESCTR32-4   57.4ns ± 1%    22.6ns ± 1%   -60.53%  (p=0.002 n=6+6)

name        old speed     new speed      delta
AESCTR1K-4  742MB/s ± 2%  2107MB/s ± 2%  +184.03%  (p=0.002 n=6+6)
AESCTR32-4  558MB/s ± 1%  1413MB/s ± 1%  +153.20%  (p=0.002 n=6+6)

CL 51670 is ~7-30% faster. I also marginally prefer the assembly macros and not touching the non-amd64 files. So unless I'm missing something I'd suggest focusing on that one.

~~@mmcloughlin~~ @TocarIP Sorry you folks duplicated work :( ~~I would of course appreciate it if you could help review.~~ You already did, thanks, my bad! (But a +1 would be awesome.)

mmcloughlin · 2017-08-30T19:00:24Z

@FiloSottile I did some early work on this but stopped when I saw other people had taken it. The two CLs were from @crvv and Ilya Tocar, who does not appear to be on this thread.

FiloSottile · 2017-08-30T19:48:03Z

Ouch, this is what I get for writing replies on planes.

@TocarIP ^

bronze1man · 2017-12-29T15:39:54Z

I want to use this patch with Go 1.9 right now, so I forked CL 51670 and made it a standalone package. I think it may help others.
https://github.com/bronze1man/AesCtr

agnivade · 2018-03-26T18:27:55Z

@FiloSottile - CL 51670 has a +1 now and it is waiting for a final +2 from you.

Would you be able to take a look? I ask because I would love to add an AVX2 flavor to this and see what sort of performance improvements we get.

gopherbot · 2018-09-24T02:48:04Z

Change https://golang.org/cl/136896 mentions this issue: crypto/aes: optimize AES-CTR mode on amd64

mmcloughlin · 2018-09-24T02:52:54Z

CL https://golang.org/cl/136896 brings CTR mode performance in line with GCM, as you would expect.

$ ./bin/go test -bench '(GCM|CTR).*1K' crypto/cipher
goos: darwin
goarch: amd64
pkg: crypto/cipher
BenchmarkAESGCMSeal1K-4   	 5000000	       269 ns/op	3803.33 MB/s
BenchmarkAESGCMOpen1K-4   	 5000000	       244 ns/op	4189.12 MB/s
BenchmarkAESCTR1K-4       	10000000	       230 ns/op	4426.85 MB/s
PASS
ok  	crypto/cipher	6.357s

@FiloSottile please let me know what would make this CL easiest to verify and review.

gopherbot · 2018-09-24T03:36:49Z

Change https://golang.org/cl/136897 mentions this issue: crypto/cipher: 8K benchmarks for AES stream modes

Some parallelizable cipher modes may achieve peak performance for larger block sizes. For this reason the AES-GCM mode already has an 8K benchmark alongside the 1K version. This change introduces 8K benchmarks for additional AES stream cipher modes.

Updates #20967

Change-Id: If97c6fbf31222602dcc200f8f418d95908ec1202
Reviewed-on: https://go-review.googlesource.com/136897
Reviewed-by: Brad Fitzpatrick <[email protected]>
Reviewed-by: Filippo Valsorda <[email protected]>
Run-TryBot: Brad Fitzpatrick <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>

drakkan · 2020-02-12T09:27:01Z

Hi, after applying the patch here:

https://go-review.googlesource.com/c/go/+/51670

SFTPGo (https://github.com/drakkan/sftpgo) has a 20% performance improvement when using an AES-CTR based cipher. Can you please merge? Thanks!

starius · 2021-11-22T02:48:10Z

I wrote a fast implementation of seekable AES-CTR for amd64 that produces the same stream as the standard crypto/cipher.NewCTR. This implementation supports passing an arbitrary offset, which is useful for doing I/O in the middle of a file. Reviews, bug reports, and feedback are welcome. https://github.com/starius/aesctrat

I used the idea from https://github.com/mmcloughlin/aesnix of processing multiple blocks in the same ASM.

On my machines (desktop Ryzen 5 and VPS) it processes ~5000 megabytes per second.

The implementation runs up to 8 AES instructions in different registers one after another in ASM code. Because CPU has instruction pipelining and the instructions do not depend on each other, they can run in parallel with this layout of code. This results in significant speedup compared to the regular implementation in which blocks are processed in the same registers so AES instructions do not run in parallel.

GCM mode already utilizes the approach.

The type implementing ctrAble has most of its code in XORKeyStreamAt() method which has an additional argument, offset. It allows to use it in a stateless way and to jump to any location in the stream.

AES CTR benchmark delta.
$ go test crypto/cipher -bench 'BenchmarkAESCTR*'

AMD64. Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
name                 old time/op  new time/op  delta
BenchmarkAESCTR1K-2  1259ns       266.9ns      -78.8%
BenchmarkAESCTR8K-2  9859ns       1953ns       -80.1%

ARM64. ARM Neoverse-N1 (AWS EC2 t4g.small instance)
name                 old time/op  new time/op  delta
BenchmarkAESCTR1K-2  1098ns       481.1ns      -56.2%
BenchmarkAESCTR8K-2  8447ns       3452ns       -59.1%

Original issue: golang#20967
Investigation and initial implementation: https://github.com/mmcloughlin/aesnix/
Full implementation in external repo: https://github.com/starius/aesctrat
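
The idea of a stateless, offset-addressed keystream can be sketched with the standard library alone. This is not the code from aesctrat or the CL; the lowercase `xorKeyStreamAt` and `seekMatches` below are hypothetical helpers, and the sketch assumes the IV is treated as a 128-bit big-endian counter that advances by one per 16-byte block, as the generic CTR mode does.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// xorKeyStreamAt XORs src into dst using the keystream that starts at the
// given byte offset of the CTR stream defined by block and iv.
func xorKeyStreamAt(block cipher.Block, iv, dst, src []byte, offset uint64) {
	// Advance the big-endian 128-bit counter by the number of whole blocks skipped.
	hi := binary.BigEndian.Uint64(iv[:8])
	lo := binary.BigEndian.Uint64(iv[8:])
	lo, carry := bits.Add64(lo, offset/aes.BlockSize, 0)
	hi += carry

	ctr := make([]byte, aes.BlockSize)
	binary.BigEndian.PutUint64(ctr[:8], hi)
	binary.BigEndian.PutUint64(ctr[8:], lo)

	stream := cipher.NewCTR(block, ctr)
	// Discard the keystream bytes of the first block that lie before offset.
	skip := make([]byte, offset%aes.BlockSize)
	stream.XORKeyStream(skip, skip)
	stream.XORKeyStream(dst, src)
}

// seekMatches encrypts a 100-byte message sequentially, then re-encrypts the
// tail starting at offset and reports whether the two results agree.
func seekMatches(offset uint64) bool {
	block, err := aes.NewCipher(make([]byte, 16))
	if err != nil {
		panic(err)
	}
	iv := make([]byte, aes.BlockSize)
	for i := 8; i < 16; i++ {
		iv[i] = 0xff // force a carry into the high half after the first block
	}
	msg := bytes.Repeat([]byte{0x5a}, 100)

	full := make([]byte, len(msg))
	cipher.NewCTR(block, iv).XORKeyStream(full, msg)

	tail := make([]byte, len(msg)-int(offset))
	xorKeyStreamAt(block, iv, tail, msg[offset:], offset)
	return bytes.Equal(full[offset:], tail)
}

func main() {
	fmt.Println(seekMatches(0), seekMatches(16), seekMatches(37))
}
```

Because each call derives the counter from the offset, no state is carried between calls, which is what makes reads and writes in the middle of a file possible.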

starius · 2022-06-22T16:30:07Z

I added the multiblock implementation to std lib for amd64 and arm64 in #53503

It boosts performance 2-13x depending on machine.

gopherbot · 2022-06-22T16:30:59Z

Change https://go.dev/cl/413594 mentions this issue: crypto/aes: speedup CTR mode on AMD64 and ARM64

The implementation runs up to 8 AES instructions in different registers one after another in ASM code. Because CPU has instruction pipelining and the instructions do not depend on each other, they can run in parallel with this layout of code. This results in significant speedup compared to the regular implementation in which blocks are processed in the same registers so AES instructions do not run in parallel.

GCM mode already utilizes the approach.

The type implementing ctrAble in ASM has most of its code in XORKeyStreamAt method which has an additional argument, offset. It allows to use it in a stateless way and to jump to any location in the stream. The method does not exist in pure Go and boringcrypto implementations.

AES CTR benchmark delta.
$ go test crypto/cipher -bench 'BenchmarkAESCTR*'

AMD64. Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
name                 old time/op  new time/op  delta
BenchmarkAESCTR1K-2  1259ns       266.9ns      -78.8%
BenchmarkAESCTR8K-2  9859ns       1953ns       -80.1%

ARM64. ARM Neoverse-N1 (AWS EC2 t4g.small instance)
name                 old time/op  new time/op  delta
BenchmarkAESCTR1K-2  1098ns       481.1ns      -56.2%
BenchmarkAESCTR8K-2  8447ns       3452ns       -59.1%

Original issue: golang#20967
Investigation and initial implementation: https://github.com/mmcloughlin/aesnix/
Full implementation in external repo: https://github.com/starius/aesctrat

gopherbot · 2024-10-26T22:13:35Z

Change https://go.dev/cl/621958 mentions this issue: crypto/aes: speedup CTR mode on AMD64 and ARM64

mvdan changed the title ~~crypto/aes: Implement ctrAble when using AES-NI.~~ crypto/aes: implement ctrAble when using AES-NI Jul 10, 2017

bradfitz added help wanted, NeedsFix, Performance labels Jul 10, 2017

bradfitz changed the title ~~crypto/aes: implement ctrAble when using AES-NI~~ crypto/aes: implement ctrAble when using AES-NI on amd64 Jul 10, 2017

bradfitz added this to the Go1.10 milestone Jul 10, 2017

Yawning mentioned this issue Jul 15, 2017

sphinx/crypto: Write a fast AES-NI CTR-AES128. katzenpost/core#1

Open

FiloSottile mentioned this issue Aug 3, 2017

crypto/cipher: use SIMD to improve xor performance #21269

Closed

FiloSottile mentioned this issue Aug 28, 2017

internal/crypto: Decrypt overlap rules are a bit off restic/restic#1190

Closed

aead mentioned this issue Sep 1, 2017

crypto: understand performance differences compared to BoringSSL #21525

Open

bradfitz modified the milestones: Go1.10, Go1.11 Nov 15, 2017

bradfitz mentioned this issue Feb 7, 2018

net/http: Slow HTTPS #23727

Closed

bradfitz modified the milestones: Go1.11, Unplanned May 18, 2018

Yawning mentioned this issue Jul 20, 2018

(Go) Implement the AES/SHA2 based SIV MRAE. oasisprotocol/oasis-core#695

Merged

mundaym mentioned this issue Mar 4, 2019

proposal: bytes.Xor #30553

Closed

starius mentioned this issue Jun 22, 2022

crypto/aes: speedup CTR mode on AMD64 and ARM64 #53503

Open

drakkan mentioned this issue Jun 27, 2022

Offer Option to use HPN-SSH drakkan/sftpgo#892

Closed

dmitshur modified the milestones: Unplanned, Go1.24 Nov 19, 2024

gopherbot closed this as completed in 0240c91 Nov 19, 2024
