Description
The CUDA PI appears to suffer performance issues with ranges whose second or third dimension is a large prime, for a non-ndrange `parallel_for`.
Concretely, a launch such as

```cpp
h.parallel_for(sycl::range<2>(1, someLargePrime), [=](auto id){ /* use id[1] */ });
```

is much slower than the equivalent:

```cpp
h.parallel_for(sycl::range<2>(someLargePrime, 1), [=](auto id){ /* use id[0] */ });
```
If we profile the kernels with something like Nsight Compute, we can see the slower kernels being launched with a block size of 1 in all dimensions, as opposed to a sensible number suggested by `cuOccupancyMaxPotentialBlockSize` (the CUDA PI calls this internally from `guessLocalWorkSize` as its block-size heuristic).
Note that enabling PI TRACE is not helpful in debugging this, as it only records the name of the CUDA API call along with the return value; critical information such as the kernel name, launch bounds, and memory sizes is not reported.
Here's a single-file reproducer:
```cpp
// size.cpp
#include <CL/sycl.hpp>
#include <chrono>
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <vector>

template <typename T> double run(size_t N) {
  std::vector<T> hostAs(N, T(0.1));
  std::vector<T> hostBs(N, T(-0.5));
  std::vector<T> hostCs(N, T(0.0));
  sycl::queue queue;
  sycl::buffer<T, 1> a(hostAs.data(), sycl::range<1>(N));
  sycl::buffer<T, 1> b(hostBs.data(), sycl::range<1>(N));
  sycl::buffer<T, 1> c(hostCs.data(), sycl::range<1>(N));
  auto start = std::chrono::high_resolution_clock::now();
  queue.submit([&](sycl::handler &h) {
    auto a_acc = a.template get_access<sycl::access::mode::read>(h);
    auto b_acc = b.template get_access<sycl::access::mode::read>(h);
    auto c_acc = c.template get_access<sycl::access::mode::write>(h);
    // The problematic shape: the (potentially prime) size is in the second dimension.
    h.parallel_for<class the_kernel>(sycl::range<2>(1, N), [=](sycl::id<2> id) {
      auto init = a_acc[id[1]];
      for (size_t n = 0; n < 1000; n++) init = sycl::pow(init, b_acc[id[1]] + n);
      c_acc[id[1]] = init;
    });
  });
  queue.wait_and_throw();
  auto end = std::chrono::high_resolution_clock::now();
  // Verify results on the host.
  auto c_acc = c.get_host_access();
  for (size_t j = 0; j < N; ++j) {
    auto expected = hostAs[j];
    for (size_t n = 0; n < 1000; n++) expected = sycl::pow(expected, hostBs[j] + n);
    auto actual = c_acc[j];
    if (std::abs(actual - expected) > 0.000001) {
      std::cerr << "Bad value at [" << j << "], expecting " << expected
                << " but was " << actual << std::endl;
    }
  }
  return std::chrono::duration<double, std::milli>(end - start).count();
}

template <typename T> double runN(size_t N, size_t times) {
  for (size_t i = 0; i < times; ++i) run<T>(N); // warm up
  double total{};
  for (size_t i = 0; i < times; ++i) total += run<T>(N);
  return total;
}

int main() {
  std::vector<size_t> primes{1009, 2003, 3001, 4001, 5003, 6007, 7001, 8009, 9001, 10007};
  std::vector<size_t> nonPrimes{1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000};
  using Element = float;
  sycl::queue queue;
  auto device = queue.get_device();
  std::cerr << "Device = " << device.get_info<sycl::info::device::name>() << std::endl;
  std::cout << "N,elapsed(ms)" << std::endl;
  for (auto x : primes) {
    auto elapsed = runN<Element>(x, 5);
    std::cout << x << "," << elapsed << std::endl;
  }
  for (auto x : nonPrimes) {
    auto elapsed = runN<Element>(x, 5);
    std::cout << x << "," << elapsed << std::endl;
  }
  std::cerr << "Done" << std::endl;
  return EXIT_SUCCESS;
}
```
Testing it on a V100 GPU gives:
```
> clang++ -v
clang version 16.0.0 (https://github.com/intel/llvm f865dbdce9581424f5f5599454de5e7b90116966)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/br-wlin/sycl_workspace/llvm/build/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/8
Selected GCC installation: /usr/lib/gcc/x86_64-redhat-linux/8
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
Found CUDA installation: /lustre/projects/bristol/modules-phase3/nvhpc/22.9/Linux_x86_64/22.9/cuda/11.7, version 11.7

> time clang++ -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_70 size.cpp
real 0m6.262s # this is very slow for a simple program like this
user 0m5.755s
sys 0m0.352s

> ./a.out
N,elapsed(ms)
1009,5.39113
2003,4.12835
3001,6.87786
4001,7.86372
5003,8.63055
6007,11.6794
7001,12.5063
8009,14.8147
9001,16.0661
10007,16.5786
1000,5.34469
2000,5.3888
3000,5.42235
4000,5.43256
5000,5.45543
6000,5.50536
7000,5.52733
8000,5.5306
9000,5.55627
10000,5.56491
```
There's a merged PR that rounds ranges up to the closest multiple of 32 to avoid bad block sizes. The PR wraps the kernel body to mask out any extra threads, a typical pattern seen in many CUDA kernels.
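For reference, this is roughly the familiar CUDA guard pattern the wrapping amounts to (an illustrative sketch only; the actual transformation happens on the SYCL kernel during codegen, and the names here are made up):

```cpp
// What the wrapped kernel body amounts to, expressed as a plain CUDA kernel:
// the launched range is rounded up to a multiple of the block size, and
// threads past the real range simply return.
__global__ void wrappedKernel(const float *in, float *out, size_t realRange) {
  size_t gid = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (gid >= realRange) return; // mask off the padding threads
  out[gid] = in[gid];           // ... original kernel body ...
}
```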
Unfortunately, the PR only applies this optimisation when the first dimension is not divisible by 32. No checks are performed on the second and third dimensions, so ranges like `range<2>(1, someLargePrime)` won't be considered for wrapping. At runtime, when a kernel with a prime in the range gets enqueued to the CUDA PI, the runtime tries to find a block size that evenly divides the prime, and thus ends up with a block size of 1.
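To see why the search degenerates, here is a simplified model of an exact-divisor block-size heuristic (illustrative only; `pickBlockSize` is a made-up name and the real `guessLocalWorkSize` logic differs in detail):

```cpp
#include <algorithm>
#include <cstddef>

// Start from the occupancy-recommended block size and shrink it until it
// evenly divides the global range, as required when no masking is emitted
// (every launched thread must map to a valid work-item).
size_t pickBlockSize(size_t globalSize, size_t recommended) {
  size_t block = std::min(globalSize, recommended);
  while (globalSize % block != 0) --block;
  return block; // for a prime globalSize > recommended, this is always 1
}
```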
There are a few critical points that I think may affect how we approach a solution:
- The 32 block-size constant defined in the SYCL headers by the PR seems to be a somewhat platform-dependent value. It should be up to the PI to decide at runtime whether a uniform range can be made from the given grid sizes (see the sketch after this list). Concretely, the generated kernel images should be robust in accepting different block and grid sizes as long as the global work range can be covered.
- There's a significant difference between the accepted ranges for `CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X` and `CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_Y/Z`: newer cards (>= Volta, possibly older) report 2147483647 for X but only 65535 for Y and Z. Having the Y/Z dimensions limited to 65K seems arbitrary and quite restrictive for HPC use cases.
- Supporting up to 3 dimensions is mostly inherited from computer graphics, and if SYCL-next were to add support for arbitrary dimensions (it's been talked about for a while now), we'll be forced to linearise the indices and encode them as 1D launches anyway.
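As a sketch of the first point, the PI could pick the launch configuration entirely at runtime, assuming the kernel image masks out-of-range threads (hypothetical helper; error handling omitted, and `kernelFunc`/`globalSize`/`stream`/`kernelArgs` are placeholders):

```cpp
#include <cuda.h>

// Hypothetical PI-side helper (illustrative): launch a 1D kernel by taking
// the occupancy recommendation as-is and rounding the grid up, instead of
// searching for an exact divisor of the global range.
CUresult launchRoundedUp(CUfunction kernelFunc, size_t globalSize,
                         CUstream stream, void **kernelArgs) {
  int minGridSize = 0, blockSize = 0;
  cuOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kernelFunc,
                                   /*blockSizeToDynamicSMemSize=*/nullptr,
                                   /*dynamicSMemSize=*/0, /*blockSizeLimit=*/0);
  unsigned grid = static_cast<unsigned>(
      (globalSize + blockSize - 1) / static_cast<size_t>(blockSize)); // ceil
  return cuLaunchKernel(kernelFunc, grid, 1, 1,
                        static_cast<unsigned>(blockSize), 1, 1,
                        /*sharedMemBytes=*/0, stream, kernelArgs,
                        /*extra=*/nullptr);
}
```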
Personally, I think selecting the optimal launch configuration should be done entirely in the PI; the compiler just needs to make sure extra threads are masked off. The associated risk is the potential (minor) branch overhead that every kernel now has to pay.
We should also consider linearising >1D launches, which simplifies the logic around `cuOccupancyMaxPotentialBlockSize` and avoids the bad block size issue entirely. Again, there may be a cost for doing this that needs further investigation.
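For instance, a 2D range could be flattened along these lines (a minimal sketch; `r0`/`r1` and the handler `h` are placeholders):

```cpp
// Sketch: flatten range<2>(r0, r1) into a single 1D launch of r0 * r1
// work-items and reconstruct the 2D index inside the kernel.
h.parallel_for(sycl::range<1>(r0 * r1), [=](sycl::id<1> lin) {
  size_t i0 = lin[0] / r1; // what id[0] would have been
  size_t i1 = lin[0] % r1; // what id[1] would have been
  // ... original kernel body, using i0/i1 in place of id[0]/id[1] ...
});
```

The integer division and modulo per work-item is the kind of cost that would need measuring.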
Note that the prime ranges discussed here aren't some contrived test case.
The performance issue was originally discovered while scaling the SYCL implementation of CloverLeaf across multiple MPI ranks.
With large enough sizes, we observed seemingly random jumps in the runtime for certain rank counts, presumably where the domain decomposition happened to hand a rank a prime-sized extent.