
crypto/aes: implement ctrAble when using AES-NI on amd64 #20967


Closed
Yawning opened this issue Jul 10, 2017 · 26 comments · May be fixed by #53503
Labels
help wanted, NeedsFix (the path to resolution is known, but the work has not been done), Performance
Comments

@Yawning

Yawning commented Jul 10, 2017

What version of Go are you using (go version)?

1.8.3

What operating system and processor architecture are you using (go env)?

linux/amd64

What did you do?

Benchmarked CTR-AES128 backed by crypto/aes on a system with AES-NI.

What did you expect to see?

Acceptable performance.

What did you see instead?

~400 MB/s for 1 KiB writes. For reference the same system using crypto/aes's gcmAble does GCM-AES128 Seal() at ~1300 MB/s. A cursory look at the source suggests that there is no special case ctrAble implementation for AES-NI.
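A measurement like the one described above can be approximated with a small standalone program. This is only a rough sketch: the helper name `ctrThroughput`, the all-zero key and IV, and the iteration count are assumptions made for illustration, and `go test -bench` on `crypto/cipher` remains the more reliable tool.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
	"time"
)

// ctrThroughput (hypothetical helper) encrypts a buffer of the given size
// iters times with AES-128-CTR and returns the observed throughput in MB/s.
func ctrThroughput(size, iters int) (float64, error) {
	// An all-zero key and IV are fine for timing, never for real data.
	block, err := aes.NewCipher(make([]byte, 16))
	if err != nil {
		return 0, err
	}
	stream := cipher.NewCTR(block, make([]byte, aes.BlockSize))
	buf := make([]byte, size)

	start := time.Now()
	for i := 0; i < iters; i++ {
		stream.XORKeyStream(buf, buf) // in-place encryption of one "write"
	}
	elapsed := time.Since(start)
	if elapsed <= 0 {
		elapsed = time.Nanosecond // guard against coarse timers
	}
	return float64(size) * float64(iters) / 1e6 / elapsed.Seconds(), nil
}

func main() {
	mbps, err := ctrThroughput(1024, 1_000_000)
	if err != nil {
		panic(err)
	}
	fmt.Printf("AES-128-CTR, 1 KiB writes: %.0f MB/s\n", mbps)
}
```

The printed number depends on the CPU and the Go version, so it is only meaningful relative to other runs on the same machine.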

@mvdan changed the title from "crypto/aes: Implement ctrAble when using AES-NI." to "crypto/aes: implement ctrAble when using AES-NI" on Jul 10, 2017
@mvdan
Member

mvdan commented Jul 10, 2017

/cc @agl @rsc

@FiloSottile
Contributor

Hi @Yawning!

You might know this, but for context on the issue: ctrAble and gcmAble are interfaces to break the cipher.Block abstraction. AES-CTR on linux/amd64 does use AES-NI for block encryption, but the block mode is handled in Go, which is of course slower, as opposed to GCM or CTR on s390x.
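The interface-upgrade pattern described above can be sketched as follows. The names `ctrCapable`, `genericBlock`, `fastBlock`, and `newCTR` are hypothetical and only mirror the idea; the real `ctrAble` interface is unexported, and a real fast path would return an assembly-backed stream rather than delegating to the generic one.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

// ctrCapable (hypothetical name) is implemented by blocks that can supply
// their own CTR stream instead of the generic block-at-a-time one.
type ctrCapable interface {
	NewCTR(iv []byte) cipher.Stream
}

// genericBlock exposes only the cipher.Block methods of the wrapped value.
type genericBlock struct{ cipher.Block }

// fastBlock additionally offers NewCTR. For illustration it just delegates
// to the generic implementation.
type fastBlock struct{ cipher.Block }

func (f fastBlock) NewCTR(iv []byte) cipher.Stream {
	return cipher.NewCTR(f.Block, iv)
}

// newCTR takes the specialized path when the block offers one and reports
// which path was chosen.
func newCTR(b cipher.Block, iv []byte) (cipher.Stream, bool) {
	if c, ok := b.(ctrCapable); ok {
		return c.NewCTR(iv), true
	}
	return cipher.NewCTR(b, iv), false
}

// demo reports which path each wrapper type takes.
func demo() (genericFast, wrappedFast bool) {
	block, err := aes.NewCipher(make([]byte, 16))
	if err != nil {
		panic(err)
	}
	iv := make([]byte, aes.BlockSize)
	_, genericFast = newCTR(genericBlock{block}, iv)
	_, wrappedFast = newCTR(fastBlock{block}, iv)
	return genericFast, wrappedFast
}

func main() {
	g, w := demo()
	fmt.Println("genericBlock takes fast path:", g)
	fmt.Println("fastBlock takes fast path:", w)
}
```

The type assertion is what "breaks the abstraction": callers still pass a plain `cipher.Block`, and the mode constructor quietly upgrades when the concrete type can do better.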

I'd be happy to review a CL adding assembly to optimize CTR, and I assume it would be merged, but someone needs to do the work. Let us know if you'd like to.

I suggest adding "amd64" to the title. Label suggestions: NeedsFix HelpWanted Performance.

@bradfitz added the help wanted, NeedsFix, and Performance labels on Jul 10, 2017
@bradfitz changed the title from "crypto/aes: implement ctrAble when using AES-NI" to "crypto/aes: implement ctrAble when using AES-NI on amd64" on Jul 10, 2017
@bradfitz bradfitz added this to the Go1.10 milestone Jul 10, 2017
@mmcloughlin
Contributor

Sounds fun. I'd be interested in taking this on.

@FiloSottile
Contributor

FiloSottile commented Jul 18, 2017 via email

@mmcloughlin
Contributor

@Yawning or anyone else: I don't want to duplicate work, so let me know if you've made any progress in this direction.

I see some work here on AES https://git.schwanenlied.me/yawning/bsaes.git. You also suggest in katzenpost/core#1 that you're going to add the AES-NI instructions to that package.

I actually quite enjoy messing around with assembly every now and then :) As you say, since GCM has been done this shouldn't be hard in theory. Famous last words.

@Yawning
Author

Yawning commented Jul 19, 2017

@mmcloughlin Don't hold off on doing this on my account. I haven't gotten around to it yet, and the way I end up integrating it into my AES package will probably be quite different from something that's upstreamable.

@mmcloughlin
Contributor

Thanks for clarifying @Yawning

@mmcloughlin
Contributor

I have just spent longer than I should investigating a concern I had about cipherhw.AESGCMSupport. Ultimately not fruitful, but I'll document it here in case someone somewhere on the internet ever has the same question.

The function cipherhw.AESGCMSupport delegates to hasAESNI. This checks bit 25 of the CPUID feature word (leaf 1, ECX).

My concern was that GCM also requires CLMUL, which is indicated by bit 1 of the same CPUID word. Note that aes.hasGCMAsm checks both bits. It is not clear why we need both cipherhw.AESGCMSupport and aes.hasGCMAsm.

Anyway it seemed to me there might be an issue if there is a processor with AES available and CLMUL off. So I looked to see if that ever happens. I searched through a database of processor CPUIDs and it turns out it doesn't happen:

https://gist.github.com/mmcloughlin/66488e42a8fdbd9ab39c3f6438bb8ed7
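The two feature checks discussed above amount to simple bit tests on the ECX value returned by CPUID leaf 1. The sketch below takes that value as a plain argument instead of executing CPUID, and the function names `aesOnly` and `aesAndCLMUL` are hypothetical stand-ins for the two checks, not the standard library's own code.

```go
package main

import "fmt"

const (
	bitPCLMULQDQ = 1 << 1  // CPUID leaf 1, ECX bit 1: carry-less multiply
	bitAES       = 1 << 25 // CPUID leaf 1, ECX bit 25: AES-NI
)

// aesOnly mirrors a check that looks at the AES-NI bit alone.
func aesOnly(ecx uint32) bool {
	return ecx&bitAES != 0
}

// aesAndCLMUL mirrors a check that requires both AES-NI and CLMUL,
// which is what an assembly GCM implementation needs.
func aesAndCLMUL(ecx uint32) bool {
	return ecx&bitAES != 0 && ecx&bitPCLMULQDQ != 0
}

func main() {
	// Illustrative ECX values, not readings from real processors.
	for _, ecx := range []uint32{0, bitAES, bitAES | bitPCLMULQDQ} {
		fmt.Printf("ecx=%#010x aesOnly=%v aesAndCLMUL=%v\n",
			ecx, aesOnly(ecx), aesAndCLMUL(ecx))
	}
}
```

The two checks disagree only for the middle value (AES set, CLMUL clear), which is the combination the CPUID database search found no processor to have.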

@mmcloughlin
Contributor

Some experiments on performance of encrypting multiple blocks concurrently with the same key:

https://github.com/mmcloughlin/aesnix

The work by Gueron et al. referenced in gcm_amd64.s suggests 8 concurrent encryptions is optimal. The experiments I've run generally tend to agree, but results for 4+ appear more or less the same.

@crvv
Contributor

crvv commented Jul 28, 2017

How many concurrent encryptions is optimal depends on the processor.
I tested mmcloughlin's code on two different CPUs. The result is:
Intel Core i7-4770HQ

BenchmarkSingle-8   	200000000	         8.69 ns/op	1841.41 MB/s
BenchmarkMulti/2-8  	100000000	        10.4 ns/op	3063.63 MB/s
BenchmarkMulti/4-8  	100000000	        17.5 ns/op	3661.52 MB/s
BenchmarkMulti/6-8  	100000000	        22.9 ns/op	4193.59 MB/s
BenchmarkMulti/8-8  	50000000	        28.6 ns/op	4476.84 MB/s
BenchmarkMulti/10-8 	50000000	        35.3 ns/op	4529.11 MB/s
BenchmarkMulti/12-8 	30000000	        40.5 ns/op	4740.57 MB/s
BenchmarkMulti/14-8 	30000000	        45.2 ns/op	4955.40 MB/s

Intel Celeron N3050

BenchmarkSingle-2   	30000000	        47.1 ns/op	 339.35 MB/s
BenchmarkMulti/2-2  	20000000	        70.3 ns/op	 455.03 MB/s
BenchmarkMulti/4-2  	10000000	       117 ns/op	 543.86 MB/s
BenchmarkMulti/6-2  	10000000	       168 ns/op	 571.04 MB/s
BenchmarkMulti/8-2  	 5000000	       326 ns/op	 391.92 MB/s
BenchmarkMulti/10-2 	 3000000	       400 ns/op	 399.27 MB/s
BenchmarkMulti/12-2 	 3000000	       480 ns/op	 399.73 MB/s
BenchmarkMulti/14-2 	 3000000	       557 ns/op	 402.03 MB/s

@crvv
Contributor

crvv commented Jul 28, 2017

I have tried to optimize AES-CTR on AMD64. The result is:

  1. Encrypt the counter values in bulk.
    crvv@f33f081
name        old speed     new speed      delta
AESCTR1K-8  520MB/s ± 1%  1210MB/s ± 2%  +132.58%
  2. Implement big-endian 128-bit integer addition in assembly.
    crvv@8b203f5
name        old speed     new speed      delta
AESCTR1K-8  520MB/s ± 1%  2014MB/s ± 3%  +286.96%
  3. Merge 1 and 2. The additional gain is not significant.
    crvv@cb9cb27
name        old speed     new speed      delta
AESCTR1K-8  520MB/s ± 1%  2104MB/s ± 1%  +304.31%
  4. Like ctr_s390x.go, implement XOR in assembly.
    crvv@958e355
name        old speed     new speed      delta
AESCTR1K-8  520MB/s ± 1%  2505MB/s ± 3%  +381.43%
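The structure of these steps (bulk counter encryption, big-endian 128-bit counter addition, then XOR) can be sketched in pure Go. This is not the code from the commits above: the names `addCounter`, `xorCTR`, and `matchesStdlib` and the batch width of 8 are assumptions for illustration, and the sketch shows only the data flow, not the assembly speedup.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"encoding/binary"
	"fmt"
	"math/bits"
)

const batch = 8 // blocks per bulk step (assumed width)

// addCounter returns iv+n, treating iv as a big-endian 128-bit integer.
func addCounter(iv [16]byte, n uint64) [16]byte {
	hi := binary.BigEndian.Uint64(iv[:8])
	lo := binary.BigEndian.Uint64(iv[8:])
	lo, carry := bits.Add64(lo, n, 0)
	hi += carry // propagate the carry into the high half
	var out [16]byte
	binary.BigEndian.PutUint64(out[:8], hi)
	binary.BigEndian.PutUint64(out[8:], lo)
	return out
}

// xorCTR XORs src with the AES-CTR keystream that starts at counter iv.
// dst must be at least as long as src.
func xorCTR(block cipher.Block, iv [16]byte, dst, src []byte) {
	var ks [batch * 16]byte
	var next uint64 // index of the next counter block
	for len(src) > 0 {
		// Encrypt a whole batch of counter values at once.
		for i := 0; i < batch; i++ {
			ctr := addCounter(iv, next+uint64(i))
			block.Encrypt(ks[i*16:(i+1)*16], ctr[:])
		}
		next += batch
		// XOR the keystream into the data.
		n := len(src)
		if n > len(ks) {
			n = len(ks)
		}
		for i := 0; i < n; i++ {
			dst[i] = src[i] ^ ks[i]
		}
		dst, src = dst[n:], src[n:]
	}
}

// matchesStdlib compares xorCTR against cipher.NewCTR on n bytes, using an
// IV whose low 64 bits are all ones so the carry path is exercised.
func matchesStdlib(n int) bool {
	block, err := aes.NewCipher(make([]byte, 16))
	if err != nil {
		return false
	}
	iv := [16]byte{0, 1, 2, 3, 4, 5, 6, 7,
		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff}
	src := make([]byte, n)
	for i := range src {
		src[i] = byte(i)
	}
	want := make([]byte, n)
	cipher.NewCTR(block, iv[:]).XORKeyStream(want, src)
	got := make([]byte, n)
	xorCTR(block, iv, got, src)
	return bytes.Equal(got, want)
}

func main() {
	fmt.Println("matches crypto/cipher:", matchesStdlib(1000))
}
```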

@gopherbot
Contributor

Change https://golang.org/cl/51670 mentions this issue: crypto/aes: add optimized implementation of AES-CTR for AMD64

@gopherbot
Contributor

Change https://golang.org/cl/51790 mentions this issue: crypto/aes: add encryptMany and use it to speed up ctr

@FiloSottile
Contributor

FiloSottile commented Aug 30, 2017

I compared the two CLs for speed on my MacBook Pro 3.1 GHz Intel Core i5.

51f9e92 - CL 51670 by @crvv

name        old time/op   new time/op    delta
AESCTR1K-4   1.37µs ± 2%    0.37µs ± 1%   -73.30%  (p=0.002 n=6+6)
AESCTR32-4   57.4ns ± 1%    21.1ns ± 3%   -63.14%  (p=0.002 n=6+6)

name        old speed     new speed      delta
AESCTR1K-4  742MB/s ± 2%  2776MB/s ± 1%  +274.21%  (p=0.002 n=6+6)
AESCTR32-4  558MB/s ± 1%  1514MB/s ± 3%  +171.39%  (p=0.002 n=6+6)

1d4a2c8 - CL 51790 by @mmcloughlin @TocarIP

name        old time/op   new time/op    delta
AESCTR1K-4   1.37µs ± 2%    0.48µs ± 2%   -64.81%  (p=0.002 n=6+6)
AESCTR32-4   57.4ns ± 1%    22.6ns ± 1%   -60.53%  (p=0.002 n=6+6)

name        old speed     new speed      delta
AESCTR1K-4  742MB/s ± 2%  2107MB/s ± 2%  +184.03%  (p=0.002 n=6+6)
AESCTR32-4  558MB/s ± 1%  1413MB/s ± 1%  +153.20%  (p=0.002 n=6+6)

CL 51670 is ~7-30% faster. I also marginally prefer the assembly macros and not touching the non-amd64 files. So unless I'm missing something I'd suggest focusing on that one.

@mmcloughlin @TocarIP Sorry you folks duplicated work :( I would of course appreciate it if you could help review. You already did, thanks, my bad! (But a +1 would be awesome.)

@mmcloughlin
Contributor

@FiloSottile I did some early work on this but stopped when I saw other people had taken it. The two CLs were from @crvv and Ilya Tocar, who does not appear to be on this thread.

@FiloSottile
Contributor

Ouch, this is what I get for writing replies on planes.

@TocarIP ^

@bronze1man
Contributor

I want to use this patch with Go 1.9 right now, so I forked CL 51670 and made it a standalone package. I think it may help others.
https://github.com/bronze1man/AesCtr

@agnivade
Contributor

@FiloSottile - CL 51670 has a +1 now and it is waiting for a final +2 from you.

Would you be able to take a look? I ask because I would love to add an AVX2 flavor to this and see what sort of performance improvements we get.

@gopherbot
Contributor

Change https://golang.org/cl/136896 mentions this issue: crypto/aes: optimize AES-CTR mode on amd64

@mmcloughlin
Contributor

CL https://golang.org/cl/136896 brings CTR mode performance in line with GCM, as you would expect.

$ ./bin/go test -bench '(GCM|CTR).*1K' crypto/cipher
goos: darwin
goarch: amd64
pkg: crypto/cipher
BenchmarkAESGCMSeal1K-4   	 5000000	       269 ns/op	3803.33 MB/s
BenchmarkAESGCMOpen1K-4   	 5000000	       244 ns/op	4189.12 MB/s
BenchmarkAESCTR1K-4       	10000000	       230 ns/op	4426.85 MB/s
PASS
ok  	crypto/cipher	6.357s

@FiloSottile please let me know what would make this CL easiest to verify and review.

@gopherbot
Contributor

Change https://golang.org/cl/136897 mentions this issue: crypto/cipher: 8K benchmarks for AES stream modes

gopherbot pushed a commit that referenced this issue Sep 25, 2018
Some parallelizable cipher modes may achieve peak performance for larger
block sizes. For this reason the AES-GCM mode already has an 8K
benchmark alongside the 1K version. This change introduces 8K benchmarks
for additional AES stream cipher modes.

Updates #20967

Change-Id: If97c6fbf31222602dcc200f8f418d95908ec1202
Reviewed-on: https://go-review.googlesource.com/136897
Reviewed-by: Brad Fitzpatrick
Reviewed-by: Filippo Valsorda
Run-TryBot: Brad Fitzpatrick
TryBot-Result: Gobot Gobot
@drakkan
Member

drakkan commented Feb 12, 2020

Hi, after applying the patch here:

https://go-review.googlesource.com/c/go/+/51670

SFTPGo (https://github.com/drakkan/sftpgo) sees a 20% performance improvement when using an AES-CTR-based cipher. Can you please merge it? Thanks!

@starius
Contributor

starius commented Nov 22, 2021

I wrote a fast implementation of seekable AES-CTR for amd64 that produces the same stream as the standard crypto/cipher.NewCTR. It accepts an arbitrary offset, which is useful for doing I/O in the middle of a file. Reviews, bug reports, and feedback are welcome. https://github.com/starius/aesctrat

I used the idea from https://github.com/mmcloughlin/aesnix of processing multiple blocks in the same ASM.

On my machines (desktop Ryzen 5 and VPS) it processes ~5000 megabytes per second.
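The seeking idea described above can be illustrated with only the standard library: advance the counter by the number of whole blocks skipped, then discard the leftover bytes of the first block. This is a sketch of the concept and not how aesctrat implements it; the function names `xorKeyStreamAt` and `seekMatches` are hypothetical here.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// xorKeyStreamAt XORs src into dst using the keystream bytes that begin at
// the given byte offset of the AES-CTR stream defined by block and iv.
func xorKeyStreamAt(block cipher.Block, iv, dst, src []byte, offset uint64) {
	// Add the number of skipped blocks to the big-endian 128-bit counter.
	hi := binary.BigEndian.Uint64(iv[:8])
	lo := binary.BigEndian.Uint64(iv[8:])
	lo, carry := bits.Add64(lo, offset/aes.BlockSize, 0)
	hi += carry
	ctr := make([]byte, aes.BlockSize)
	binary.BigEndian.PutUint64(ctr[:8], hi)
	binary.BigEndian.PutUint64(ctr[8:], lo)

	stream := cipher.NewCTR(block, ctr)
	// Throw away the keystream bytes before offset within the first block.
	skip := make([]byte, offset%aes.BlockSize)
	stream.XORKeyStream(skip, skip)
	stream.XORKeyStream(dst, src)
}

// seekMatches encrypts 100 bytes from the start, then re-encrypts only the
// tail beginning at offset and checks that both results agree.
func seekMatches(offset uint64) bool {
	block, err := aes.NewCipher(make([]byte, 16))
	if err != nil {
		return false
	}
	iv := make([]byte, aes.BlockSize)
	for i := 8; i < 16; i++ {
		iv[i] = 0xff // force a carry into the high half after one block
	}
	plain := make([]byte, 100)
	for i := range plain {
		plain[i] = byte(i)
	}
	if offset > uint64(len(plain)) {
		return false
	}
	full := make([]byte, len(plain))
	cipher.NewCTR(block, iv).XORKeyStream(full, plain)

	part := make([]byte, uint64(len(plain))-offset)
	xorKeyStreamAt(block, iv, part, plain[offset:], offset)
	return bytes.Equal(part, full[offset:])
}

func main() {
	for _, off := range []uint64{0, 5, 16, 37} {
		fmt.Printf("offset %d matches: %v\n", off, seekMatches(off))
	}
}
```

Because each call derives its counter from the IV and the offset alone, no state is carried between calls, which is what makes reads and writes in the middle of a file possible.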

starius added a commit to starius/go that referenced this issue Jun 22, 2022
The implementation runs up to 8 AES instructions in different registers one
after another in ASM code. Because CPU has instruction pipelining and the
instructions do not depend on each other, they can run in parallel with this
layout of code. This results in significant speedup compared to the regular
implementation in which blocks are processed in the same registers so AES
instructions do not run in parallel.

GCM mode already utilizes the approach.

The type implementing ctrAble has most of its code in XORKeyStreamAt()
method which has an additional argument, offset. It allows to use it
in a stateless way and to jump to any location in the stream.

AES CTR benchmark delta.
$ go test crypto/cipher -bench 'BenchmarkAESCTR*'

AMD64. Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
name                   old time/op    new time/op   delta
BenchmarkAESCTR1K-2       1259ns        266.9ns     -78.8%
BenchmarkAESCTR8K-2       9859ns         1953ns     -80.1%

ARM64. ARM Neoverse-N1 (AWS EC2 t4g.small instance)
name                   old time/op    new time/op   delta
BenchmarkAESCTR1K-2       1098ns        481.1ns     -56.2%
BenchmarkAESCTR8K-2       8447ns         3452ns     -59.1%

Original issue: golang#20967
Investigation and initial implementation: https://github.com/mmcloughlin/aesnix/
Full implementation in external repo: https://github.com/starius/aesctrat
@starius
Contributor

starius commented Jun 22, 2022

I added the multiblock implementation to std lib for amd64 and arm64 in #53503

It boosts performance 2-13x depending on machine.

@gopherbot
Contributor

Change https://go.dev/cl/413594 mentions this issue: crypto/aes: speedup CTR mode on AMD64 and ARM64

starius added a commit to starius/go that referenced this issue Aug 30, 2023
starius added a commit to starius/go that referenced this issue Sep 19, 2023
starius added a commit to starius/go that referenced this issue Dec 6, 2023
starius added a commit to starius/go that referenced this issue Dec 6, 2023
The implementation runs up to 8 AES instructions in different registers one
after another in ASM code. Because CPU has instruction pipelining and the
instructions do not depend on each other, they can run in parallel with this
layout of code. This results in significant speedup compared to the regular
implementation in which blocks are processed in the same registers so AES
instructions do not run in parallel.

GCM mode already utilizes the approach.

The type implementing ctrAble in ASM has most of its code in XORKeyStreamAt
method which has an additional argument, offset. It allows to use it
in a stateless way and to jump to any location in the stream. The method
does not exist in pure Go and boringcrypto implementations.

AES CTR benchmark delta.
$ go test crypto/cipher -bench 'BenchmarkAESCTR*'

AMD64. Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
name                   old time/op    new time/op   delta
BenchmarkAESCTR1K-2       1259ns        266.9ns     -78.8%
BenchmarkAESCTR8K-2       9859ns         1953ns     -80.1%

ARM64. ARM Neoverse-N1 (AWS EC2 t4g.small instance)
name                   old time/op    new time/op   delta
BenchmarkAESCTR1K-2       1098ns        481.1ns     -56.2%
BenchmarkAESCTR8K-2       8447ns         3452ns     -59.1%

Original issue: golang#20967
Investigation and initial implementation: https://github.com/mmcloughlin/aesnix/
Full implementation in external repo: https://github.com/starius/aesctrat
starius added a commit to starius/go that referenced this issue Feb 8, 2024
@gopherbot
Contributor

Change https://go.dev/cl/621958 mentions this issue: crypto/aes: speedup CTR mode on AMD64 and ARM64

@dmitshur dmitshur modified the milestones: Unplanned, Go1.24 Nov 19, 2024