Threading threshold tuning needed for sgemm/dgemm

Based on the data below, the threading threshold might need some tuning. If you happen to have a good pointer, I'll play with it and help.

The baseline for performance is this:
```
TARGET=SKYLAKEX F_COMPILER=GFORTRAN  SHARED=1 DYNAMIC_THREADS=1 USE_OPENMP=0  NUM_THREADS=1

   Matrix       Float cycles            CPM                                     Double cycles           CPM
   1 x 1               240.9         1.0000          0.0%                              306.7         1.0000          0.0%
   4 x 4               326.4         1.3510          0.0%                              455.0         2.3328          0.0%
   6 x 6               634.2         1.8255          0.0%                              693.3         1.7942          0.0%
   8 x 8               477.3         0.4636          0.0%                              565.9         0.5081          0.0%
  10 x 10              930.5         0.6906          0.0%                             1365.5         1.0598          0.0%
  16 x 16             1166.8         0.2263          0.0%                             1438.2         0.2765          0.0%
  20 x 20             2102.9         0.2329          0.0%                             3411.6         0.3882          0.0%
  32 x 32             4572.0         0.1322          0.0%                             6164.2         0.1788          0.0%
  40 x 40             7998.6         0.1212          0.0%                            12998.0         0.1983          0.0%
  64 x 64            20376.5         0.0768          0.0%                            39989.2         0.1514          0.0%
  80 x 80            37374.8         0.0725          0.0%                            71594.1         0.1392          0.0%
 100 x 100           75786.8         0.0755          0.0%                           143355.2         0.1430          0.0%
 128 x 128          127663.1         0.0608          0.0%                           251965.2         0.1200          0.0%
 150 x 150          258028.7         0.0764          0.0%                           457672.6         0.1355          0.0%
 200 x 200          477604.2         0.0597          0.0%                           971440.2         0.1214          0.0%
 256 x 256          977700.6         0.0583          0.0%                          2259957.4         0.1347          0.0%
 300 x 300         1720165.4         0.0637          0.0%                          3282089.5         0.1215          0.0%
 400 x 400         3911583.9         0.0611          0.0%                          9495537.9         0.1484          0.0%
 500 x 500         7490500.9         0.0599          0.0%                         27368762.1         0.2189          0.0%
 512 x 512         8151211.0         0.0607          0.0%                         32727483.8         0.2438          0.0%
 600 x 600        14568998.7         0.0674          0.0%                         27482490.9         0.1272          0.0%
 700 x 700        28441907.2         0.0829          0.0%                         47884964.0         0.1396          0.0%
 800 x 800        33472600.5         0.0654          0.0%                         71845317.9         0.1403          0.0%
1000 x 1000         66664296.2       0.0667          0.0%                        217093066.9         0.2171          0.0%
1024 x 1024         76171844.0       0.0709          0.0%                        261327051.0         0.2434          0.0%
1200 x 1200        129456787.0       0.0749          0.0%                        304937633.0         0.1765          0.0%
2000 x 2000        715247226.5       0.0894          0.0%                       1823524984.8         0.2279          0.0%
```

Cycles per multiply (CPM) is likely the most useful performance metric. For Float, the hardware in question has a theoretical limit of 0.03125 CPM, and we get close to half of that as the matrices get bigger.
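To make the CPM arithmetic concrete, here is a minimal sketch. It assumes the 0.03125 figure is 1/32, i.e. 32 single-precision multiplies per cycle per core (two AVX-512 FMA units times 16 floats); that breakdown is my assumption, and the function names are just for illustration. The CPM column for small sizes appears to exclude a fixed call overhead, so the raw `cycles / n**3` formula only lines up with the table for the larger matrices.

```python
# Assumption: peak is 32 single-precision multiplies per cycle per core
# (2 AVX-512 FMA units x 16 floats), which gives the 0.03125 CPM quoted above.
PEAK_FLOAT_CPM = 1 / 32


def cycles_per_multiply(cycles: float, n: int) -> float:
    """Raw cycles per multiply for an n x n x n GEMM (n**3 multiplies)."""
    return cycles / n**3


def fraction_of_peak(cpm: float, peak_cpm: float = PEAK_FLOAT_CPM) -> float:
    """How close a measured CPM is to the theoretical limit (1.0 = at peak)."""
    return peak_cpm / cpm


# Float, 2000 x 2000, single-threaded baseline row.
print(round(cycles_per_multiply(715247226.5, 2000), 4))  # 0.0894

# Float, 256 x 256 baseline CPM taken straight from the table.
print(round(fraction_of_peak(0.0583), 2))  # 0.54
```

The 256x256 row lands at roughly 54% of peak, which is the "close to half of theoretical" observation above.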


Once threading is enabled (20 logical CPUs, 10 physical cores on the system), things get interesting.
(The percentages are the performance delta relative to the baseline; negative numbers are a performance loss.)
```
TARGET=SKYLAKEX F_COMPILER=GFORTRAN  SHARED=1 DYNAMIC_THREADS=1 USE_OPENMP=0  NUM_THREADS=20

   Matrix       Float cycles            CPM                                     Double cycles           CPM
   1 x 1               261.2         1.0000         -8.4%                              319.2         1.0000         -4.1%
   4 x 4               343.0         1.2943         -5.1%                              464.8         2.2901         -2.1%
   6 x 6               634.1         1.7314          0.0%                              707.2         1.8010         -2.0%
   8 x 8               475.2         0.4199          0.4%                              576.0         0.5034         -1.8%
  10 x 10              941.8         0.6817         -1.2%                             1366.0         1.0478         -0.0%
  16 x 16             1192.4         0.2276         -2.2%                             1412.1         0.2671          1.8%
  20 x 20             2132.0         0.2340         -1.4%                             3446.1         0.3910         -1.0%
  32 x 32             4554.3         0.1310          0.4%                             6170.7         0.1786         -0.1%
  40 x 40             8028.2         0.1214         -0.4%                            12959.8         0.1975          0.3%
  64 x 64            20327.9         0.0766          0.2%                            40080.3         0.1517         -0.2%
  80 x 80           156349.0         0.3049       -318.3%                           180205.0         0.3513       -151.7%
 100 x 100          202101.6         0.2018       -166.7%                           220979.9         0.2207        -54.1%
 128 x 128          306939.2         0.1462       -140.4%                           312681.5         0.1489        -24.1%
 150 x 150          311053.7         0.0921        -20.6%                           361673.4         0.1071         21.0%
 200 x 200          404916.4         0.0506         15.2%                           486996.6         0.0608         49.9%
 256 x 256          599879.4         0.0357         38.6%                           743093.4         0.0443         67.1%
 300 x 300          931914.4         0.0345         45.8%                          1248691.5         0.0462         62.0%
 400 x 400          930678.0         0.0145         76.2%                          1645345.5         0.0257         82.7%
 500 x 500         1740345.0         0.0139         76.8%                          2897384.9         0.0232         89.4%
 512 x 512         2012971.7         0.0150         75.3%                          3223831.4         0.0240         90.1%
 600 x 600         3428350.2         0.0159         76.5%                          5333402.9         0.0247         80.6%
 700 x 700         4427124.5         0.0129         84.4%                          7267851.8         0.0212         84.8%
 800 x 800         4461900.5         0.0087         86.7%                          8944377.1         0.0175         87.6%
1000 x 1000          9285704.1       0.0093         86.1%                         16669766.7         0.0167         92.3%
1024 x 1024         11111463.6       0.0103         85.4%                         19204966.9         0.0179         92.7%
1200 x 1200         14541484.8       0.0084         88.8%                         26442997.1         0.0153         91.3%
2000 x 2000         56051476.5       0.0070         92.2%                        109212662.6         0.0137         94.0%
```

For very small matrices there is a little bit of overhead, but thanks to @oon3m0oo and @sandwichmaker this overhead is pretty tiny.
HOWEVER, once threading kicks in (just after 64x64), performance tanks compared to the baseline and does not recover until the 200x200 size for Float (150x150 for Double).
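For anyone who wants to reproduce the percentage column, the sketch below recomputes it from the Float cycle counts in the two tables. The formula `(baseline - measured) / baseline` is inferred from the numbers (it reproduces the -318.3% at 80x80), and the helper names and the 5% tolerance are mine, not part of the benchmark.

```python
# Float cycle counts copied from the NUM_THREADS=1 and NUM_THREADS=20 tables.
BASELINE = {64: 20376.5, 80: 37374.8, 100: 75786.8,
            128: 127663.1, 150: 258028.7, 200: 477604.2}
THREADED = {64: 20327.9, 80: 156349.0, 100: 202101.6,
            128: 306939.2, 150: 311053.7, 200: 404916.4}


def delta_pct(baseline_cycles: float, measured_cycles: float) -> float:
    """Performance delta in percent; negative means slower than baseline."""
    return (baseline_cycles - measured_cycles) / baseline_cycles * 100.0


def regressions(baseline: dict, measured: dict, tolerance_pct: float = 5.0) -> list:
    """Sizes where the measured run is slower than baseline by more than the tolerance."""
    return [n for n in sorted(baseline)
            if delta_pct(baseline[n], measured[n]) < -tolerance_pct]


for n in sorted(BASELINE):
    print(f"{n:4d} x {n:<4d} {delta_pct(BASELINE[n], THREADED[n]):8.1f}%")

print(regressions(BASELINE, THREADED))  # [80, 100, 128, 150]
```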

This is not an OpenMP-versus-threads issue; with OpenMP the data looks like this:
```
   Matrix               Float cycles            CPM                                     Double cycles           CPM
   1 x 1                       321.3         1.0000        -33.0%                              367.6         1.0000        -20.2%
   4 x 4                       407.0         1.3538        -24.4%                              533.1         2.6029        -18.0%
   6 x 6                       724.0         1.8690        -15.6%                              749.1         1.7712         -9.4%
   8 x 8                       537.5         0.4243        -11.5%                              645.7         0.5452        -12.8%
  10 x 10                     1010.8         0.6905         -8.2%                             1442.2         1.0756         -5.6%
  16 x 16                     1234.5         0.2232         -4.8%                             1510.9         0.2794         -3.8%
  20 x 20                     2187.3         0.2334         -2.4%                             3481.5         0.3894         -1.6%
  32 x 32                     4552.2         0.1291         -0.5%                             6221.6         0.1787         -1.8%
  40 x 40                     8160.9         0.1225         -1.5%                            13212.9         0.2007         -1.3%
  64 x 64                    20592.2         0.0773          0.2%                            39986.8         0.1511          0.2%
  80 x 80                   142826.6         0.2783       -275.9%                           167659.7         0.3267       -133.4%
 100 x 100                  196696.8         0.1964       -162.9%                           210191.3         0.2098        -50.4%
 128 x 128                  304810.0         0.1452       -132.4%                           304392.2         0.1450        -19.4%
 150 x 150                  300171.4         0.0888        -14.0%                           354805.6         0.1050         21.7%
 200 x 200                  391677.0         0.0489         20.3%                           487251.6         0.0609         50.1%
 256 x 256                  599838.1         0.0357         37.7%                           741835.3         0.0442         65.9%
 300 x 300                  930674.1         0.0345         45.6%                          1253722.9         0.0464         63.8%
 400 x 400                  950285.4         0.0148         75.4%                          1638163.7         0.0256         79.8%
 500 x 500                 1734750.9         0.0139         77.3%                          2909310.0         0.0233         88.8%
 600 x 600                 3425653.7         0.0159         77.3%                          5346582.5         0.0248         79.7%
 700 x 700                 4434540.4         0.0129         84.4%                          7263461.4         0.0212         84.2%
 800 x 800                 4450149.4         0.0087         86.1%                          8915261.2         0.0174         88.2%
1000 x 1000                9304762.8         0.0093         86.0%                         16688999.4         0.0167         92.1%
1200 x 1200               14489066.5         0.0084         88.8%                         26425386.2         0.0153         91.3%
2000 x 2000               55720002.7         0.0070         92.2%                        108477700.1         0.0136         93.5%
```

This shows that OpenMP has a bit more baseline overhead, but otherwise has the same problem from just after 64x64 up to 200x200.



So my conclusion is that threading currently kicks in at matrices that are too small. If it started at 200x200, there would be a win across the board.
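As a rough illustration of what such a threshold would do, here is a sketch of a work-based gate applied to a few Float rows from the tables. This is not OpenBLAS's actual logic: `threads_for_gemm`, `gated_cycles`, and the `200**3` cutoff are hypothetical names and values for illustration only. The simulation also assumes the single-threaded path in a threaded build costs about the same as the baseline, which the small-matrix rows suggest is roughly true.

```python
def threads_for_gemm(m: int, n: int, k: int, max_threads: int,
                     min_work: int = 200**3) -> int:
    """Hypothetical gate: stay single-threaded below min_work multiplies."""
    return max_threads if m * n * k >= min_work else 1


# Float cycles from the tables: size -> (NUM_THREADS=1, NUM_THREADS=20)
FLOAT_CYCLES = {
    64: (20376.5, 20327.9),
    80: (37374.8, 156349.0),
    128: (127663.1, 306939.2),
    150: (258028.7, 311053.7),
    200: (477604.2, 404916.4),
    256: (977700.6, 599879.4),
    400: (3911583.9, 930678.0),
}


def gated_cycles(size: int, max_threads: int = 20) -> float:
    """Cycle count we would expect if threading only started at the gate."""
    single, threaded = FLOAT_CYCLES[size]
    if threads_for_gemm(size, size, size, max_threads) > 1:
        return threaded
    return single


for size, (single, _) in FLOAT_CYCLES.items():
    used = threads_for_gemm(size, size, size, 20)
    print(size, used, gated_cycles(size) <= single)
```

With the gate at 200x200, every row in this sample is at least as fast as the single-threaded baseline. A real fix would need the cutoff tuned per target and per precision, since the Double data already breaks even at 150x150.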

Threading threshold tuning needed for sgemm/dgemm #1622
