Description
Based on the data below, the multithreading threshold looks like it needs tuning. If you can point me at the right place in the code, I'm happy to experiment and help.
The baseline for performance is this:
TARGET=SKYLAKEX F_COMPILER=GFORTRAN SHARED=1 DYNAMIC_THREADS=1 USE_OPENMP=0 NUM_THREADS=1
Matrix  Float cycles  Float CPM  Float delta  Double cycles  Double CPM  Double delta
1 x 1 240.9 1.0000 0.0% 306.7 1.0000 0.0%
4 x 4 326.4 1.3510 0.0% 455.0 2.3328 0.0%
6 x 6 634.2 1.8255 0.0% 693.3 1.7942 0.0%
8 x 8 477.3 0.4636 0.0% 565.9 0.5081 0.0%
10 x 10 930.5 0.6906 0.0% 1365.5 1.0598 0.0%
16 x 16 1166.8 0.2263 0.0% 1438.2 0.2765 0.0%
20 x 20 2102.9 0.2329 0.0% 3411.6 0.3882 0.0%
32 x 32 4572.0 0.1322 0.0% 6164.2 0.1788 0.0%
40 x 40 7998.6 0.1212 0.0% 12998.0 0.1983 0.0%
64 x 64 20376.5 0.0768 0.0% 39989.2 0.1514 0.0%
80 x 80 37374.8 0.0725 0.0% 71594.1 0.1392 0.0%
100 x 100 75786.8 0.0755 0.0% 143355.2 0.1430 0.0%
128 x 128 127663.1 0.0608 0.0% 251965.2 0.1200 0.0%
150 x 150 258028.7 0.0764 0.0% 457672.6 0.1355 0.0%
200 x 200 477604.2 0.0597 0.0% 971440.2 0.1214 0.0%
256 x 256 977700.6 0.0583 0.0% 2259957.4 0.1347 0.0%
300 x 300 1720165.4 0.0637 0.0% 3282089.5 0.1215 0.0%
400 x 400 3911583.9 0.0611 0.0% 9495537.9 0.1484 0.0%
500 x 500 7490500.9 0.0599 0.0% 27368762.1 0.2189 0.0%
512 x 512 8151211.0 0.0607 0.0% 32727483.8 0.2438 0.0%
600 x 600 14568998.7 0.0674 0.0% 27482490.9 0.1272 0.0%
700 x 700 28441907.2 0.0829 0.0% 47884964.0 0.1396 0.0%
800 x 800 33472600.5 0.0654 0.0% 71845317.9 0.1403 0.0%
1000 x 1000 66664296.2 0.0667 0.0% 217093066.9 0.2171 0.0%
1024 x 1024 76171844.0 0.0709 0.0% 261327051.0 0.2434 0.0%
1200 x 1200 129456787.0 0.0749 0.0% 304937633.0 0.1765 0.0%
2000 x 2000 715247226.5 0.0894 0.0% 1823524984.8 0.2279 0.0%
The cycles-per-multiply (CPM) metric is likely the most useful performance metric. For Float, the hardware in question has a theoretical limit of 0.03125 CPM, and we reach close to half of that theoretical throughput as matrices get bigger.
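The 0.03125 figure is consistent with 32 single-precision multiplies per cycle per core. A minimal sketch of that arithmetic follows. It assumes two AVX-512 FMA units per core with 512-bit vectors, which is an assumption about this particular part rather than something stated in the measurements; `peak_cpm` and `fraction_of_peak` are illustrative names.

```python
# Assumption (not from the measurements): 2 AVX-512 FMA units per core,
# each operating on one 512-bit vector per cycle.
FMA_UNITS_PER_CORE = 2
VECTOR_BITS = 512


def peak_cpm(element_bits):
    """Theoretical single-core cycles per multiply for a given element width."""
    lanes = VECTOR_BITS // element_bits  # 16 lanes for float, 8 for double
    return 1.0 / (FMA_UNITS_PER_CORE * lanes)


def fraction_of_peak(measured_cpm, element_bits):
    """Share of the theoretical throughput reached by a measured CPM value."""
    return peak_cpm(element_bits) / measured_cpm


if __name__ == "__main__":
    print(peak_cpm(32))  # 0.03125, the Float limit quoted above
    # Baseline Float CPM at 256 x 256 is 0.0583 in the table above.
    print(round(fraction_of_peak(0.0583, 32), 2))
```

Under the same assumption the Double limit would be 0.0625 CPM, so the single-threaded Double numbers can be read against that figure in the same way.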
Once threading is enabled (20 logical CPUs, 10 physical cores on the system), things get interesting.
(The percentages are the performance delta relative to the baseline; negative numbers are a performance loss.)
TARGET=SKYLAKEX F_COMPILER=GFORTRAN SHARED=1 DYNAMIC_THREADS=1 USE_OPENMP=0 NUM_THREADS=20
Matrix  Float cycles  Float CPM  Float delta  Double cycles  Double CPM  Double delta
1 x 1 261.2 1.0000 -8.4% 319.2 1.0000 -4.1%
4 x 4 343.0 1.2943 -5.1% 464.8 2.2901 -2.1%
6 x 6 634.1 1.7314 0.0% 707.2 1.8010 -2.0%
8 x 8 475.2 0.4199 0.4% 576.0 0.5034 -1.8%
10 x 10 941.8 0.6817 -1.2% 1366.0 1.0478 -0.0%
16 x 16 1192.4 0.2276 -2.2% 1412.1 0.2671 1.8%
20 x 20 2132.0 0.2340 -1.4% 3446.1 0.3910 -1.0%
32 x 32 4554.3 0.1310 0.4% 6170.7 0.1786 -0.1%
40 x 40 8028.2 0.1214 -0.4% 12959.8 0.1975 0.3%
64 x 64 20327.9 0.0766 0.2% 40080.3 0.1517 -0.2%
80 x 80 156349.0 0.3049 -318.3% 180205.0 0.3513 -151.7%
100 x 100 202101.6 0.2018 -166.7% 220979.9 0.2207 -54.1%
128 x 128 306939.2 0.1462 -140.4% 312681.5 0.1489 -24.1%
150 x 150 311053.7 0.0921 -20.6% 361673.4 0.1071 21.0%
200 x 200 404916.4 0.0506 15.2% 486996.6 0.0608 49.9%
256 x 256 599879.4 0.0357 38.6% 743093.4 0.0443 67.1%
300 x 300 931914.4 0.0345 45.8% 1248691.5 0.0462 62.0%
400 x 400 930678.0 0.0145 76.2% 1645345.5 0.0257 82.7%
500 x 500 1740345.0 0.0139 76.8% 2897384.9 0.0232 89.4%
512 x 512 2012971.7 0.0150 75.3% 3223831.4 0.0240 90.1%
600 x 600 3428350.2 0.0159 76.5% 5333402.9 0.0247 80.6%
700 x 700 4427124.5 0.0129 84.4% 7267851.8 0.0212 84.8%
800 x 800 4461900.5 0.0087 86.7% 8944377.1 0.0175 87.6%
1000 x 1000 9285704.1 0.0093 86.1% 16669766.7 0.0167 92.3%
1024 x 1024 11111463.6 0.0103 85.4% 19204966.9 0.0179 92.7%
1200 x 1200 14541484.8 0.0084 88.8% 26442997.1 0.0153 91.3%
2000 x 2000 56051476.5 0.0070 92.2% 109212662.6 0.0137 94.0%
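For reference, the delta column can be reproduced from the cycle counts. This is a small sketch, assuming the delta is the cycle reduction relative to the single-threaded baseline; the function name is illustrative, and the sample values are Float cycles copied from the two tables above.

```python
def perf_delta_percent(baseline_cycles, measured_cycles):
    """Positive means faster than baseline, negative means slower."""
    return (baseline_cycles - measured_cycles) / baseline_cycles * 100.0


# (matrix size, baseline Float cycles, 20-thread Float cycles)
samples = [
    (1, 240.9, 261.2),
    (80, 37374.8, 156349.0),
    (2000, 715247226.5, 56051476.5),
]

if __name__ == "__main__":
    for n, base, threaded in samples:
        print(f"{n} x {n}: {perf_delta_percent(base, threaded):.1f}%")
    # prints -8.4%, -318.3% and 92.2%, matching the table
```

Note that with this definition a loss can exceed -100% (the run takes more than twice as long), while a gain can never reach 100%.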
For very small matrices there is a little bit of overhead, but thanks to @oon3m0oo and @sandwichmaker, this overhead is pretty tiny.
HOWEVER, once threading kicks in (just after 64x64), performance tanks compared to the baseline and does not recover until 200x200 for Float (150x150 for Double).
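The recovery point can be read off the tables mechanically. The sketch below uses the cycle counts for sizes 64 through 256 from the baseline and 20-thread tables; `first_winning_size` is an illustrative helper, not part of any library.

```python
# (size, baseline cycles, 20-thread cycles), copied from the tables above
float_rows = [
    (64, 20376.5, 20327.9),
    (80, 37374.8, 156349.0),
    (100, 75786.8, 202101.6),
    (128, 127663.1, 306939.2),
    (150, 258028.7, 311053.7),
    (200, 477604.2, 404916.4),
    (256, 977700.6, 599879.4),
]
double_rows = [
    (64, 39989.2, 40080.3),
    (80, 71594.1, 180205.0),
    (100, 143355.2, 220979.9),
    (128, 251965.2, 312681.5),
    (150, 457672.6, 361673.4),
    (200, 971440.2, 486996.6),
    (256, 2259957.4, 743093.4),
]


def first_winning_size(rows):
    """Smallest measured size from which threading wins at every larger measured size."""
    winner = None
    # Walk down from the largest size and stop at the first loss.
    for n, base, threaded in sorted(rows, reverse=True):
        if threaded < base:
            winner = n
        else:
            break
    return winner


if __name__ == "__main__":
    print("Float: ", first_winning_size(float_rows))
    print("Double:", first_winning_size(double_rows))
```

On these measurements the helper returns 200 for Float and 150 for Double. Sizes between the measured points were not benchmarked, so the true crossover may lie somewhere below those values.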
This is not an OpenMP versus native threads issue; with OpenMP the data looks like this:
Matrix  Float cycles  Float CPM  Float delta  Double cycles  Double CPM  Double delta
1 x 1 321.3 1.0000 -33.0% 367.6 1.0000 -20.2%
4 x 4 407.0 1.3538 -24.4% 533.1 2.6029 -18.0%
6 x 6 724.0 1.8690 -15.6% 749.1 1.7712 -9.4%
8 x 8 537.5 0.4243 -11.5% 645.7 0.5452 -12.8%
10 x 10 1010.8 0.6905 -8.2% 1442.2 1.0756 -5.6%
16 x 16 1234.5 0.2232 -4.8% 1510.9 0.2794 -3.8%
20 x 20 2187.3 0.2334 -2.4% 3481.5 0.3894 -1.6%
32 x 32 4552.2 0.1291 -0.5% 6221.6 0.1787 -1.8%
40 x 40 8160.9 0.1225 -1.5% 13212.9 0.2007 -1.3%
64 x 64 20592.2 0.0773 0.2% 39986.8 0.1511 0.2%
80 x 80 142826.6 0.2783 -275.9% 167659.7 0.3267 -133.4%
100 x 100 196696.8 0.1964 -162.9% 210191.3 0.2098 -50.4%
128 x 128 304810.0 0.1452 -132.4% 304392.2 0.1450 -19.4%
150 x 150 300171.4 0.0888 -14.0% 354805.6 0.1050 21.7%
200 x 200 391677.0 0.0489 20.3% 487251.6 0.0609 50.1%
256 x 256 599838.1 0.0357 37.7% 741835.3 0.0442 65.9%
300 x 300 930674.1 0.0345 45.6% 1253722.9 0.0464 63.8%
400 x 400 950285.4 0.0148 75.4% 1638163.7 0.0256 79.8%
500 x 500 1734750.9 0.0139 77.3% 2909310.0 0.0233 88.8%
600 x 600 3425653.7 0.0159 77.3% 5346582.5 0.0248 79.7%
700 x 700 4434540.4 0.0129 84.4% 7263461.4 0.0212 84.2%
800 x 800 4450149.4 0.0087 86.1% 8915261.2 0.0174 88.2%
1000 x 1000 9304762.8 0.0093 86.0% 16688999.4 0.0167 92.1%
1200 x 1200 14489066.5 0.0084 88.8% 26425386.2 0.0153 91.3%
2000 x 2000 55720002.7 0.0070 92.2% 108477700.1 0.0136 93.5%
This shows that OpenMP has a bit more overhead for small matrices, but otherwise the same problem between 64x64 and 200x200.
So my conclusion is that threading currently kicks in at matrix sizes that are too small. If it started at 200x200, there would be a win across the board.
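To make the proposal concrete, here is a rough sketch of a size gate. It assumes the threading decision is based on the total work m*n*k; that assumption, the function name, and the constant are all hypothetical and are not taken from the library source.

```python
# Hypothetical cutoff: the smallest amount of work that should use threads,
# set to the 200 x 200 x 200 case suggested by the measurements.
MIN_THREADED_WORK = 200 ** 3


def choose_threads(m, n, k, max_threads, min_threaded_work=MIN_THREADED_WORK):
    """Return 1 for small problems, max_threads once the work reaches the cutoff."""
    if m * n * k < min_threaded_work:
        return 1
    return max_threads


if __name__ == "__main__":
    for size in (64, 128, 150, 200, 2000):
        print(size, choose_threads(size, size, size, max_threads=20))
```

With this cutoff the 80x80 through 150x150 cases stay single-threaded and keep their baseline performance, while 200x200 and larger still get all 20 threads. A finer-grained gate could scale the thread count with the problem size, but that would need more measurements than are shown here.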