Some thoughts for improving Haswell sgemm performance

I compared the SGEMM performance of OpenBLAS (Haswell target) against Intel MKL (2018) using large matrices on an i9-9900K. Single-threaded, OpenBLAS reached 90% of MKL's speed; with 8 threads it reached 80%.

Using Linux perf, I found that the performance penalty was mainly due to private cache misses. However, adding prefetch instructions didn't help. The problem appears to originate from cache bandwidth.

The current SGEMM kernel for Haswell (KERNEL16x6) reads 10.7 bytes from packed matrix A and 4 bytes from buffered matrix B per CPU cycle, if the 2 FMA units are running all the time: each step along k loads 16 floats (64 bytes) of A and 6 floats (24 bytes) of B and issues 12 FMAs, which take 6 cycles. For the i9-9900K at 4 GHz, this means 43 GB/s read from packed A and 16 GB/s read from buffered B. On 64-bit Linux, the packed matrix A is 1152 kB, so it should sit in L3 cache; buffered matrix B occupies 9 kB, so it should stay in L1 cache.
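The figures above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name `kernel16x6_read_rates` is hypothetical, and the 768 x 384 blocking used for the size check is an assumption chosen because it matches the 1152 kB and 9 kB quoted above, not something stated in this issue.

```python
FLOAT_BYTES = 4      # single precision
YMM_FLOATS = 8       # floats per 256-bit register
FMA_PER_CYCLE = 2    # two FMA units, assumed busy every cycle

def kernel16x6_read_rates():
    """Bytes per cycle read from packed A and buffered B by KERNEL16x6."""
    m, n = 16, 6
    fmas_per_step = (m // YMM_FLOATS) * n          # 2 vectors of A x 6 broadcasts of B = 12
    cycles_per_step = fmas_per_step / FMA_PER_CYCLE  # 6 cycles per step along k
    a_rate = m * FLOAT_BYTES / cycles_per_step     # 64 bytes / 6 cycles
    b_rate = n * FLOAT_BYTES / cycles_per_step     # 24 bytes / 6 cycles
    return a_rate, b_rate

a_rate, b_rate = kernel16x6_read_rates()
clock_ghz = 4.0
print(f"A: {a_rate:.1f} B/cycle = {a_rate * clock_ghz:.0f} GB/s")
print(f"B: {b_rate:.1f} B/cycle = {b_rate * clock_ghz:.0f} GB/s")

# Size check under the ASSUMED blocking of 768 rows x 384 along k.
P, Q, NR = 768, 384, 6
packed_a_kib = P * Q * FLOAT_BYTES / 1024
buffered_b_kib = Q * NR * FLOAT_BYTES / 1024
print(f"packed A: {packed_a_kib:.0f} kB, buffered B: {buffered_b_kib:.0f} kB")
```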

I previously found (via AIDA64) that the L3 read bandwidth on the 9900K is 315 GB/s under default settings. With 8 threads, the required bandwidth for reading packed A is 341 GB/s at a 4 GHz CPU clock, at which point the L3 cache becomes a bottleneck.
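As a quick check of the 8-thread number, here is a sketch that scales the per-thread rate by the thread count. It assumes every thread streams its own packed A from L3 at the full rate with no overlap or reuse in the private caches; `required_l3_read_gbps` is a hypothetical helper name.

```python
def required_l3_read_gbps(bytes_per_cycle, clock_ghz, threads):
    """Aggregate read bandwidth if every thread streams packed A at full rate."""
    return bytes_per_cycle * clock_ghz * threads

a_bytes_per_cycle = 64 / 6     # KERNEL16x6: 64 bytes of A every 6 cycles
required = required_l3_read_gbps(a_bytes_per_cycle, clock_ghz=4.0, threads=8)
measured_l3 = 315.0            # GB/s, the AIDA64 figure quoted above

print(f"required: {required:.0f} GB/s, measured L3: {measured_l3:.0f} GB/s")
print("L3 is the bottleneck" if required > measured_l3 else "L3 keeps up")
```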

For serial execution, L3 bandwidth may not be a problem, but a design that streams the larger array at the higher rate is a poor fit, since it is more likely to run into cache and TLB misses.

I suggest changing the compute kernel to KERNEL8x12 or KERNEL4x24 and modifying the rest of the kernel code to match. (That's a huge amount of work, which I can't do all by myself... but it's the only way to slow down reads from A.) The icopy and ocopy routines would need corresponding changes.
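To see how the kernel shape shifts the traffic from A to B, the sketch below applies the same accounting to all three shapes. It assumes 8-float vectors, two FMAs per cycle with no stalls, and that each step along k reads m floats of A and n floats of B; `read_rates` is a hypothetical helper.

```python
FLOAT_BYTES = 4
YMM_FLOATS = 8
FMA_PER_CYCLE = 2

def read_rates(m, n):
    """Return (A bytes/cycle, B bytes/cycle) for an m x n kernel."""
    fmas_per_step = m * n / YMM_FLOATS           # one FMA per 8 output elements
    cycles_per_step = fmas_per_step / FMA_PER_CYCLE
    return (m * FLOAT_BYTES / cycles_per_step,
            n * FLOAT_BYTES / cycles_per_step)

for m, n in [(16, 6), (8, 12), (4, 24)]:
    a_rate, b_rate = read_rates(m, n)
    print(f"KERNEL{m}x{n}: A {a_rate:5.2f} B/cycle, B {b_rate:5.2f} B/cycle")
```

All three shapes cover 96 outputs, so each needs 12 accumulator registers and 12 FMAs per step; only the split of the load traffic between the L3-resident A and the L1-resident B changes.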

For example:
.macro KERNEL4x24_SUB
	/* load 24 floats of B as three 8-float vectors */
	vmovups 	-24 * SIZE(BO), %ymm1
	vmovups 	-16 * SIZE(BO), %ymm2
	vmovups 	 -8 * SIZE(BO), %ymm3
	addq		$ 24 * SIZE, BO
	/* A element 0: accumulate into ymm4-ymm6 */
	vbroadcastss	 -4 * SIZE(AO), %ymm0
	vfmadd231ps	%ymm0 ,%ymm1 ,%ymm4
	vfmadd231ps	%ymm0 ,%ymm2 ,%ymm5
	vfmadd231ps	%ymm0 ,%ymm3 ,%ymm6
	/* A element 1: accumulate into ymm7-ymm9 */
	vbroadcastss	 -3 * SIZE(AO), %ymm0
	vfmadd231ps	%ymm0 ,%ymm1 ,%ymm7
	vfmadd231ps	%ymm0 ,%ymm2 ,%ymm8
	vfmadd231ps	%ymm0 ,%ymm3 ,%ymm9
	/* A element 2: accumulate into ymm10-ymm12 */
	vbroadcastss	 -2 * SIZE(AO), %ymm0
	vfmadd231ps	%ymm0 ,%ymm1 ,%ymm10
	vfmadd231ps	%ymm0 ,%ymm2 ,%ymm11
	vfmadd231ps	%ymm0 ,%ymm3 ,%ymm12
	/* A element 3: accumulate into ymm13-ymm15 */
	vbroadcastss	 -1 * SIZE(AO), %ymm0
	vfmadd231ps	%ymm0 ,%ymm1 ,%ymm13
	vfmadd231ps	%ymm0 ,%ymm2 ,%ymm14
	vfmadd231ps	%ymm0 ,%ymm3 ,%ymm15
	/* advance A by 4 floats and count down the k loop */
	addq		$  4 * SIZE, AO
	decq		%rax
.endm
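For readers less used to the register naming, here is a plain Python model of what repeated invocations of the macro above accumulate. It is a sketch for checking a real implementation against, not a replacement for it. The packing layout (4 consecutive floats of A and 24 consecutive floats of B per step along k) is inferred from the pointer increments in the macro, and `kernel4x24_reference` is a hypothetical name.

```python
MR, NR, YMM_FLOATS = 4, 24, 8

def kernel4x24_reference(a_panel, b_panel, k):
    """Model k invocations of KERNEL4x24_SUB.

    a_panel holds k groups of 4 floats, b_panel holds k groups of 24 floats.
    Returns a dict mapping accumulator register number (4..15) to its 8 lanes.
    """
    acc = {reg: [0.0] * YMM_FLOATS for reg in range(4, 16)}
    for p in range(k):
        a = a_panel[p * MR:(p + 1) * MR]
        b = b_panel[p * NR:(p + 1) * NR]
        for i in range(MR):                      # one vbroadcastss per element of A
            for v in range(NR // YMM_FLOATS):    # ymm1, ymm2, ymm3
                reg = 4 + 3 * i + v              # ymm4..ymm15, as in the macro
                for lane in range(YMM_FLOATS):
                    acc[reg][lane] += a[i] * b[v * YMM_FLOATS + lane]
    return acc

# Small usage example with k = 2.
a_panel = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b_panel = [float(x) for x in range(2 * NR)]
acc = kernel4x24_reference(a_panel, b_panel, k=2)
print(acc[4][0], acc[15][7])
```

Register `ymm(4 + 3*i + v)` ends up holding row `i`, columns `8*v` to `8*v + 7` of the 4 x 24 tile of C.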

Some thoughts for improving Haswell sgemm performance #2210
