Consider removing x86_64 field assembly #726

Closed
@jonasnick

I noticed that turning off x86_64 assembly on my laptop actually speeds up ecdsa_verify. The internal benchmarks show that with --without-asm, scalar operations are slower but field operations are faster. To investigate this, I created a branch (https://github.com/jonasnick/secp256k1/tree/eval-asm) that includes the configurable benchmark iterations from #722, adds a test_matrix.sh script, and allows turning on field assembly individually.

Here are the results with gcc 9.3.0 (got similar results with clang 9.0.1):

SECP256K1_BENCH_ITERS=200000

bench config CFLAGS=-DUSE_ASM_X86_64_FIELD ./configure --disable-openssl-tests --with-asm=x86_64
scalar_sqr: min 0.0331us / avg 0.0332us / max 0.0337us
scalar_mul: min 0.0342us / avg 0.0343us / max 0.0345us
field_sqr: min 0.0165us / avg 0.0165us / max 0.0167us
field_mul: min 0.0204us / avg 0.0205us / max 0.0209us
ecdsa_sign: min 40.3us / avg 40.3us / max 40.4us
ecdsa_verify: min 56.9us / avg 56.9us / max 56.9us

bench config CFLAGS= ./configure --disable-openssl-tests --without-asm
scalar_sqr: min 0.0375us / avg 0.0376us / max 0.0383us
scalar_mul: min 0.0362us / avg 0.0366us / max 0.0396us
field_sqr: min 0.0152us / avg 0.0152us / max 0.0152us
field_mul: min 0.0177us / avg 0.0178us / max 0.0178us
ecdsa_sign: min 41.8us / avg 41.8us / max 41.9us
ecdsa_verify: min 54.6us / avg 54.7us / max 54.7us

bench config CFLAGS= ./configure --disable-openssl-tests --with-asm=x86_64
scalar_sqr: min 0.0331us / avg 0.0331us / max 0.0333us
scalar_mul: min 0.0342us / avg 0.0343us / max 0.0347us
field_sqr: min 0.0152us / avg 0.0153us / max 0.0154us
field_mul: min 0.0178us / avg 0.0178us / max 0.0180us
ecdsa_sign: min 40.3us / avg 40.3us / max 40.4us
ecdsa_verify: min 53.2us / avg 53.2us / max 53.2us
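
To make the comparison between these configurations explicit, here is a minimal sketch of how the relative speedup can be computed from two timings. The function names `speedup_percent` and `print_speedup` are made up for illustration and are not part of the library or its benchmark code.

```c
#include <stdio.h>

/* Relative speedup of `after_us` over `before_us`, as a percentage of
 * `before_us`. Both arguments are timings, so lower is better and a
 * positive result means `after_us` is faster. */
double speedup_percent(double before_us, double after_us) {
    return (before_us - after_us) / before_us * 100.0;
}

/* Hypothetical helper: print one comparison line for a benchmark. */
void print_speedup(const char *name, double before_us, double after_us) {
    printf("%s: %.4gus -> %.4gus (%.1f%% faster)\n",
           name, before_us, after_us, speedup_percent(before_us, after_us));
}
```

For example, `speedup_percent(56.9, 53.2)` compares the ecdsa_verify averages of the first and third configuration (field assembly on vs. scalar assembly only) and evaluates to roughly 6.5.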

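Note the 6.5% ecdsa_verify speedup from dropping only the field assembly (56.9us with field assembly vs. 53.2us with scalar assembly only). However, there are two things I don't fully understand: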

  1. There's assembly for field_sqr and field_mul. If we remove it, both functions get faster, but some other internal functions get slower. For example:
    SECP256K1_BENCH_ITERS=200000
    group_add_affine: min 0.257us / avg 0.257us / max 0.259us
    vs.
    group_add_affine: min 0.263us / avg 0.263us / max 0.264us
    
    This could just be an artifact of micro-benchmarking, and I have not tested this with #667 ("Is the compiler optimizing out some of the benchmarks?").
  2. Removing the field arithmetic assembly also makes ECDSA verification slower if endomorphism is enabled:
    SECP256K1_BENCH_ITERS=200000
    ecdsa_verify: min 41.1us / avg 41.1us / max 41.1us
    vs.
    ecdsa_verify: min 41.5us / avg 41.6us / max 41.6us
    

It should be noted that without field arithmetic assembly, 64-bit field arithmetic requires __int128 support (the alternative is field=32bit, with a 40% verification slowdown). I did not check which compilers support this (MSVC?). We should also try this with older compilers.
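
To illustrate why __int128 matters here, below is a minimal sketch of a full 64x64 -> 128-bit multiplication, the kind of primitive that 64-bit field arithmetic in C depends on. It is not the library's code: the name `mul64_wide` is hypothetical, and detecting the type through the `__SIZEOF_INT128__` macro (predefined by GCC and Clang when the type is available) is an assumption about how one might check for support, not a description of what configure does. The fallback branch shows the extra work needed when no 128-bit type exists.

```c
#include <stdint.h>

#if defined(__SIZEOF_INT128__)
/* The compiler provides a native 128-bit integer type. */
typedef unsigned __int128 u128;

/* Full 64x64 -> 128-bit product, returned as two 64-bit halves. */
void mul64_wide(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    u128 p = (u128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}
#else
/* Portable fallback: build the product from four 32x32 -> 64-bit
 * partial products and propagate the carries by hand. */
void mul64_wide(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
    uint64_t ll = a_lo * b_lo;
    uint64_t lh = a_lo * b_hi;
    uint64_t hl = a_hi * b_lo;
    uint64_t hh = a_hi * b_hi;
    /* Sum of the three terms that land in bits 32..63; cannot overflow. */
    uint64_t mid = (ll >> 32) + (lh & 0xFFFFFFFFu) + (hl & 0xFFFFFFFFu);
    *lo = (mid << 32) | (ll & 0xFFFFFFFFu);
    *hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
#endif
```

With a native 128-bit type the compiler can typically lower the first variant to a single widening multiply, which may be part of why the C field code is competitive with the hand-written assembly; I have not verified this against the generated code.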
