Implementation of dasum, sasum with AVX2 & AVX512 intrinsics #2803

xiegengxin · 2020-08-31T03:49:39Z

Implement dasum and sasum with AVX2 and AVX512 intrinsics for the x86_64 architecture to improve performance.

brada4 · 2020-08-31T06:04:13Z

I think you need to force-align presumed jump targets to palign(5) or so, but measure.
What about AMD ZEN and Intel COOPERLAKE?

brada4 · 2020-08-31T07:40:41Z

AVX2 and AVX512 need some warm-up; otherwise, for small samples, hundreds of cycles are unproductive.

xiegengxin · 2020-09-03T07:40:03Z

Aligned to 64 bytes or 32 bytes.
When the input array size is small (less than 256), SSE intrinsics are used instead of AVX512/AVX2. The performance should not get much worse.
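For readers following along, here is a minimal sketch of the size-based dispatch described above. It is not the code from this PR: the names `dasum_sketch`, `dasum_sse2`, `dasum_avx2`, and `SMALL_N_THRESHOLD` are hypothetical, the kernels use unaligned loads for simplicity, and a scalar fallback is assumed so the file also builds on targets without SSE2.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical cutoff, taken from the "less than 256" remark in the thread. */
#define SMALL_N_THRESHOLD 256

/* Scalar reference; also used for the leftover elements of the vector loops. */
static double dasum_scalar(size_t n, const double *x) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += (x[i] < 0.0) ? -x[i] : x[i];
    return s;
}

#if defined(__SSE2__)
/* 128-bit path: two doubles per step, sign bit cleared with andnot. */
static double dasum_sse2(size_t n, const double *x) {
    const __m128d sign_mask = _mm_set1_pd(-0.0);
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d v = _mm_loadu_pd(x + i);
        acc = _mm_add_pd(acc, _mm_andnot_pd(sign_mask, v)); /* |v| */
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    return lanes[0] + lanes[1] + dasum_scalar(n - i, x + i);
}
#endif

#if defined(__AVX2__)
/* 256-bit path: four doubles per step. */
static double dasum_avx2(size_t n, const double *x) {
    const __m256d sign_mask = _mm256_set1_pd(-0.0);
    __m256d acc = _mm256_setzero_pd();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d v = _mm256_loadu_pd(x + i);
        acc = _mm256_add_pd(acc, _mm256_andnot_pd(sign_mask, v));
    }
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3]
         + dasum_scalar(n - i, x + i);
}
#endif

/* Dispatch: wide vectors only when n is large enough to amortize warm-up. */
double dasum_sketch(size_t n, const double *x) {
#if defined(__AVX2__)
    if (n >= SMALL_N_THRESHOLD) return dasum_avx2(n, x);
#endif
#if defined(__SSE2__)
    return dasum_sse2(n, x);
#else
    return dasum_scalar(n, x);
#endif
}
```

The AVX2 path is only compiled when the compiler defines `__AVX2__` (for example with `-mavx2`); otherwise every size goes through the SSE2 or scalar path. Where the best cutoff lies is something to measure per CPU, not something this sketch decides.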

brada4 · 2020-09-03T15:57:18Z

From the excellent resource at agner.org:

The first YMM or ZMM instruction takes 150-250 clock cycles -probably to start a power-up process

How does it work with intentionally unaligned input, and with a length that is not a round number?
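One common way to handle both cases raised in this question, shown here only as a sketch and not necessarily what the PR does: sum a scalar head until the pointer reaches a 16-byte boundary, run aligned vector loads over the middle, then finish the leftover tail in scalar code. The function name `sasum_peel_sketch` and the helper `absf` are hypothetical, and a plain scalar loop is assumed as the fallback when SSE2 is not available.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: absolute value without pulling in libm. */
static float absf(float v) { return v < 0.0f ? -v : v; }

float sasum_peel_sketch(size_t n, const float *x) {
    float sum = 0.0f;
    size_t i = 0;
#if defined(__SSE2__)
    /* Head: scalar steps until x + i sits on a 16-byte boundary. */
    while (i < n && ((uintptr_t)(x + i) % 16) != 0) {
        sum += absf(x[i]);
        i++;
    }

    /* Body: aligned loads are now safe; four floats per step. */
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_load_ps(x + i);
        acc = _mm_add_ps(acc, _mm_andnot_ps(sign_mask, v)); /* |v| */
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    sum += lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    /* Tail: whatever did not fill a whole vector (or everything, without SSE2). */
    for (; i < n; i++) sum += absf(x[i]);
    return sum;
}
```

Because the head, body, and tail are accumulated separately, the summation order differs from a plain left-to-right loop, so results on general floating-point data can differ in the last bits depending on the input alignment.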

Implementation of dasum, sasum with AVX2 & AVX512 intrinsics

cb3c190

define __AVX2__ to ensure the haswell code compiled with avx2

448152c

align to 64, using SSE when input size is small

1b0f17e

martin-frbg merged commit e72430f into OpenMathLib:develop Sep 6, 2020
