Description
The multigroup mode in OpenMC currently does not scale well with shared-memory threading (via OpenMP) beyond roughly 10 threads. For example, when I run this 2D C5G7 input deck on a dual-socket Intel Xeon Platinum 8180M node with 56 cores and 112 threads, I get the following behavior:
MPI Ranks | OpenMP Threads per Rank | Inactive [particles/sec] | Active [particles/sec] |
---|---|---|---|
1 | 1 | 39,510 | 23,165 |
1 | 4 | 115,423 | 64,612 |
1 | 8 | 154,411 | 67,412 |
1 | 16 | 103,526 | 59,432 |
1 | 56 | 96,408 | 56,838 |
1 | 112 | 134,937 | 80,139 |
56 | 1 | 868,351 | 498,800 |
56 | 2 | 1,102,130 | 754,118 |
28 | 4 | 1,270,910 | 673,048 |
2 | 56 | 233,411 | 142,513 |
These results were generated with GCC 10.2.0 and OpenMPI 2.1.6. I also tried other compilers (namely Intel and LLVM) and found the same trend, so I don't think this can be chalked up to a poor compiler implementation, especially since continuous-energy MC scales well with OpenMP on this node.
One possible culprit is that the tally space is fairly small for this problem (a 51 x 51 fission-rate mesh), which could lead to significant memory contention when scoring tallies. However, the poor scaling affects inactive batches, which do not accumulate these tallies, just as much as active batches, so there must be a more fundamental problem.
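For reference, here is a minimal, self-contained sketch of the contention pattern I had in mind (this is illustrative only, not OpenMC's actual tally code): many OpenMP threads scoring atomically into a shared 51 x 51 mesh, where the small number of cache lines backing the array becomes a serialization point as the thread count grows.

```cpp
// Hypothetical illustration of tally contention, not taken from the OpenMC source.
// Build with, e.g., g++ -O2 -fopenmp tally_contention.cpp
#include <cstdio>
#include <random>
#include <vector>
#include <omp.h>

int main() {
  constexpr int nx = 51, ny = 51;           // small fission-rate mesh
  std::vector<double> tally(nx * ny, 0.0);  // shared by every thread

  #pragma omp parallel
  {
    // Each thread scores into the same small array, so the atomic updates keep
    // hitting the same handful of cache lines and get serialized in hardware.
    std::mt19937 rng(42 + omp_get_thread_num());
    std::uniform_int_distribution<int> bin(0, nx * ny - 1);
    for (int i = 0; i < 1000000; ++i) {
      int b = bin(rng);
      #pragma omp atomic
      tally[b] += 1.0;
    }
  }

  double total = 0.0;
  for (double v : tally) total += v;
  std::printf("total scores: %.0f\n", total);
  return 0;
}
```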
My guess is that there is a false-sharing issue specific to the multigroup MC mode, which will hopefully be easy to fix once spotted. I may try to hunt this down at some point, but I wanted to open the issue now in the interim.
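To illustrate the kind of pattern I suspect, here is a small sketch (the structs and loop are hypothetical, not drawn from the OpenMC source): per-thread accumulators packed contiguously share 64-byte cache lines, so each thread's writes invalidate its neighbors' lines, while padding each slot to a full cache line removes the interference.

```cpp
// Hypothetical false-sharing demonstration, not OpenMC code.
// Build with, e.g., g++ -std=c++17 -O2 -fopenmp false_sharing.cpp
#include <cstdio>
#include <vector>
#include <omp.h>

// Packed: adjacent threads' accumulators share a 64-byte cache line.
struct PackedCounter { volatile double value; };
// Padded: alignas(64) gives each accumulator its own cache line.
struct alignas(64) PaddedCounter { volatile double value; };

template <typename Counter>
double run(int n_threads, long iterations) {
  std::vector<Counter> counters(n_threads);  // one slot per thread
  double t0 = omp_get_wtime();
  #pragma omp parallel num_threads(n_threads)
  {
    // volatile forces a real load/store every iteration so the compiler
    // cannot hoist the accumulation out of the loop.
    Counter& c = counters[omp_get_thread_num()];
    for (long i = 0; i < iterations; ++i) c.value = c.value + 1.0;
  }
  return omp_get_wtime() - t0;
}

int main() {
  int n = omp_get_max_threads();
  long iters = 20000000;
  std::printf("packed (false sharing): %.3f s\n", run<PackedCounter>(n, iters));
  std::printf("padded (isolated)     : %.3f s\n", run<PaddedCounter>(n, iters));
  return 0;
}
```

If something like this is going on in the multigroup scoring or particle data paths, padding or otherwise separating the per-thread data so that no two threads write to the same cache line should recover the scaling.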