
Some thoughts for improving Haswell sgemm performance #2210

Closed
@wjc404


I tested the SGEMM performance of OpenBLAS (Haswell kernel) against Intel MKL (2018) with large matrices on an i9-9900K. The single-thread speed of OpenBLAS was 90% of MKL's, and the 8-thread speed was 80% of MKL's.

Using Linux perf, I found that the performance penalty was mainly due to private cache misses. However, adding prefetch instructions did not help. The problem appears to originate from cache bandwidth.

The current SGEMM kernel for Haswell (KERNEL16x6) reads 10.7 bytes from packed matrix A and 4 bytes from buffered matrix B per CPU cycle if both FMA units are busy all the time: each k-step loads 16 floats (64 bytes) of A and 6 floats (24 bytes) of B and issues 12 FMA instructions, which take 6 cycles on 2 FMA units. For the i9-9900K at 4 GHz, this means 43 GB/s read from packed A and 16 GB/s read from buffered B. Under 64-bit Linux, packed matrix A occupies 1152 kB, so it should sit in L3 cache; buffered matrix B occupies 9 kB, so it should stay in L1 cache.
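The per-cycle figures above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name `kernel_read_rates` and its parameters are made up for illustration, and it assumes one k-step loads m floats of A and n floats of B, with both FMA units issuing an 8-float FMA every cycle.

```python
def kernel_read_rates(m, n, vec_floats=8, fma_units=2, elem_bytes=4):
    """Bytes read per cycle from packed A and buffered B for an m x n kernel.

    Assumption: one k-step loads m floats of A and n floats of B, issues
    m*n/vec_floats vector FMAs, and both FMA units are always busy.
    """
    fma_instructions = m * n / vec_floats      # 16x6 -> 12 ymm FMAs per k-step
    cycles = fma_instructions / fma_units      # 16x6 -> 6 cycles per k-step
    a_rate = m * elem_bytes / cycles           # bytes of A per cycle
    b_rate = n * elem_bytes / cycles           # bytes of B per cycle
    return a_rate, b_rate


a_rate, b_rate = kernel_read_rates(16, 6)
clock_ghz = 4.0  # bytes/cycle * GHz = GB/s
print(f"A: {a_rate:.1f} B/cycle = {a_rate * clock_ghz:.0f} GB/s")
print(f"B: {b_rate:.1f} B/cycle = {b_rate * clock_ghz:.0f} GB/s")
```

For KERNEL16x6 this gives about 10.7 bytes/cycle (43 GB/s) from A and 4 bytes/cycle (16 GB/s) from B, matching the numbers in the text.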

I previously found (via AIDA64) that the L3 read bandwidth of the 9900K is 315 GB/s under default settings. With 8 threads, the bandwidth required for reading packed A is 341 GB/s at a 4 GHz CPU clock, so the L3 cache becomes a bottleneck.

For serial execution, L3 bandwidth may not be a problem, but reading the larger array at the higher rate is a questionable design and makes cache and TLB misses more likely.

I suggest changing the compute kernel to KERNEL8x12 or KERNEL4x24 and modifying the other parts of the kernel code to fit it. (That is a huge amount of work, which I can't do all by myself, but it is the only way to slow down reads from A.) The icopy and ocopy routines need corresponding changes.
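To see how much the proposed shapes would relieve L3, here is a rough comparison of the aggregate read rate from packed A for 8 threads. It is a sketch under the same assumptions as the figures above (2 FMA units always busy, 8 floats per FMA, 4 GHz, 315 GB/s L3 read bandwidth); the helper names `a_read_gbps` and `compare` are hypothetical, and it ignores the correspondingly higher read rate from B.

```python
L3_READ_GBPS = 315.0   # AIDA64 figure quoted in this issue for the 9900K
CLOCK_GHZ = 4.0
THREADS = 8


def a_read_gbps(m, n, threads=THREADS, clock_ghz=CLOCK_GHZ):
    """Aggregate GB/s read from packed A by `threads` copies of an m x n kernel."""
    cycles_per_k_step = (m * n / 8) / 2           # 8 floats per ymm FMA, 2 FMA units
    bytes_per_cycle = m * 4 / cycles_per_k_step   # m floats of A per k-step
    return bytes_per_cycle * clock_ghz * threads


def compare(shapes=((16, 6), (8, 12), (4, 24))):
    """Return (m, n, demand in GB/s, fits within L3 read bandwidth) per shape."""
    rows = []
    for m, n in shapes:
        demand = a_read_gbps(m, n)
        rows.append((m, n, demand, demand <= L3_READ_GBPS))
    return rows


for m, n, demand, fits in compare():
    print(f"KERNEL{m}x{n}: {demand:6.1f} GB/s from A, within L3 budget: {fits}")
```

All three shapes issue 12 FMAs per k-step, so the only thing that changes is how the 6 cycles of loads are split between A and B: halving m halves the demand on A.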

For example:
.macro KERNEL4x24_SUB
/* load 24 floats of B into three vector registers, then advance BO */
vmovups -24 * SIZE(BO), %ymm1
vmovups -16 * SIZE(BO), %ymm2
vmovups -8 * SIZE(BO), %ymm3
addq $ 24 * SIZE, BO
/* row 0 of A: broadcast one float, accumulate into ymm4-ymm6 */
vbroadcastss -4 * SIZE(AO), %ymm0
vfmadd231ps %ymm0, %ymm1, %ymm4
vfmadd231ps %ymm0, %ymm2, %ymm5
vfmadd231ps %ymm0, %ymm3, %ymm6
/* row 1 of A: accumulate into ymm7-ymm9 */
vbroadcastss -3 * SIZE(AO), %ymm0
vfmadd231ps %ymm0, %ymm1, %ymm7
vfmadd231ps %ymm0, %ymm2, %ymm8
vfmadd231ps %ymm0, %ymm3, %ymm9
/* row 2 of A: accumulate into ymm10-ymm12 */
vbroadcastss -2 * SIZE(AO), %ymm0
vfmadd231ps %ymm0, %ymm1, %ymm10
vfmadd231ps %ymm0, %ymm2, %ymm11
vfmadd231ps %ymm0, %ymm3, %ymm12
/* row 3 of A: accumulate into ymm13-ymm15 */
vbroadcastss -1 * SIZE(AO), %ymm0
vfmadd231ps %ymm0, %ymm1, %ymm13
vfmadd231ps %ymm0, %ymm2, %ymm14
vfmadd231ps %ymm0, %ymm3, %ymm15
/* advance AO by 4 floats and decrement the k-loop counter */
addq $ 4 * SIZE, AO
decq %rax
.endm
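What the macro computes can be checked with a scalar model. The sketch below is not OpenBLAS code: `kernel4x24_sub` and `run_kernel` are illustrative names, each ymm register is modelled as a list of 8 floats, and the packed layouts (4 consecutive floats of A and 24 consecutive floats of B per k-step) are inferred from the pointer increments in the macro.

```python
import random

VEC = 8  # floats per ymm register


def kernel4x24_sub(a_col, b_row, acc):
    """One k-step of the 4x24 update: acc[i][j] += a_col[i] * b_row[8j:8j+8].

    acc[i][j] models register ymm(4 + 3*i + j), the three slices of b_row
    model ymm1..ymm3, and a_col[i] is the value broadcast into ymm0.
    """
    b_regs = [b_row[j * VEC:(j + 1) * VEC] for j in range(3)]  # three vmovups
    for i in range(4):                                          # four vbroadcastss
        a_bcast = [a_col[i]] * VEC
        for j in range(3):                                      # three vfmadd231ps
            acc[i][j] = [c + x * y
                         for c, x, y in zip(acc[i][j], a_bcast, b_regs[j])]


def run_kernel(a_packed, b_packed, k):
    """Run k steps over packed A (4 floats/step) and B (24 floats/step)."""
    acc = [[[0.0] * VEC for _ in range(3)] for _ in range(4)]
    for step in range(k):
        kernel4x24_sub(a_packed[4 * step:4 * step + 4],
                       b_packed[24 * step:24 * step + 24], acc)
    # Flatten the 12 accumulators into a 4 x 24 tile of C
    return [[v for j in range(3) for v in acc[i][j]] for i in range(4)]


# Compare against a naive triple loop, using small integers so sums are exact
random.seed(0)
k = 5
a = [float(random.randint(-3, 3)) for _ in range(4 * k)]
b = [float(random.randint(-3, 3)) for _ in range(24 * k)]
tile = run_kernel(a, b, k)
expected = [[sum(a[4 * s + i] * b[24 * s + n] for s in range(k))
             for n in range(24)] for i in range(4)]
print("matches naive product:", tile == expected)
```

Each k-step touches 4 floats of A and 24 floats of B for 12 FMAs, which is where the lower read rate on A comes from.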
