Description
The single-precision gemm kernel consistently crashes on an old x86_64 CPU, but only with large enough matrices, such as 1000*1000.
Here is the simplest program I could write that shows it.
#include <stdlib.h>
#include <cblas.h>
#define SIZE 1000
int main(void) {
float A[SIZE * SIZE];
float C[SIZE * SIZE];
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, SIZE, SIZE, SIZE, 1, A, SIZE, A, SIZE, 0, C, SIZE);
return 0;
}
Weirdly enough, it doesn't crash with a single thread.
$ gcc -o sgemm_test sgemm_test.c -lopenblas
$ ./sgemm_test
zsh: segmentation fault ./sgemm_test
$ OPENBLAS_NUM_THREADS=1 ./sgemm_test
$ OPENBLAS_NUM_THREADS=2 ./sgemm_test
zsh: segmentation fault OPENBLAS_NUM_THREADS=2 ./sgemm_test
And here is the cpuinfo.
$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 104
model name : AMD Athlon(tm) 64 X2 Dual-Core Processor TK-55
stepping : 1
cpu MHz : 1800.000
cache size : 256 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good nopl cpuid extd_apicid pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dnowprefetch vmmcall lbrv
bugs : apic_c1e fxsave_leak sysret_ss_attrs null_seg swapgs_fence amd_e400 spectre_v1 spectre_v2
bogomips : 3591.07
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc 100mhzsteps
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 104
model name : AMD Athlon(tm) 64 X2 Dual-Core Processor TK-55
stepping : 1
cpu MHz : 1800.000
cache size : 256 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good nopl cpuid extd_apicid pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dnowprefetch vmmcall lbrv
bugs : apic_c1e fxsave_leak sysret_ss_attrs null_seg swapgs_fence amd_e400 spectre_v1 spectre_v2
bogomips : 3591.07
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc 100mhzsteps
Here is my understanding of the bug.
The kernel sgemm_kernel_8x4_sse allocates additional stack space at some point.
https://github.com/xianyi/OpenBLAS/blob/4741ce803bd13acb4ff0ff1cf57f7a64cf7ef77c/kernel/x86_64/gemm_kernel_8x4_sse.S#L385-L387
The 128 bytes are used for some variables, while LOCAL_BUFFER_SIZE is just enough space for the buffer used in a loop later on.
https://github.com/xianyi/OpenBLAS/blob/4741ce803bd13acb4ff0ff1cf57f7a64cf7ef77c/kernel/x86_64/gemm_kernel_8x4_sse.S#L419-L478
But the beginning of the buffer used in the loop is defined to start 256 bytes into the stack.
https://github.com/xianyi/OpenBLAS/blob/4741ce803bd13acb4ff0ff1cf57f7a64cf7ef77c/kernel/x86_64/gemm_kernel_8x4_sse.S#L84
As a result, the loop overwrites the old stack frame, including the saved registers, which makes the program crash later on.
The fix should be pretty straightforward: allocate $256 + LOCAL_BUFFER_SIZE bytes on the stack. I tried it and it works.
I guess this bug has been hiding there for nearly a decade. I also think the very same bug is in gemm_kernel_4x8_nano.S, but I can't test it.