Description
The CUDA PI appears to suffer performance issues with ranges whose second or third dimension is a large prime, for a non-ndrange `parallel_for`.
Concretely, a launch such as

```cpp
h.parallel_for(sycl::range<2>(1, someLargePrime), [=](auto id){ /* use id[1] */ });
```

is much slower than the equivalent:

```cpp
h.parallel_for(sycl::range<2>(someLargePrime, 1), [=](auto id){ /* use id[0] */ });
```
If we profile the kernels with something like Nsight Compute, we can see the slower kernels being launched with a block size of 1 in all dimensions, as opposed to a sensible number suggested by `cuOccupancyMaxPotentialBlockSize` (the CUDA PI calls this internally from `guessLocalWorkSize` as its block-size heuristic).
Note that enabling PI TRACE is not helpful in debugging this, as it only records the name of the CUDA API call along with the return value; critical information such as the kernel name, launch bounds, and memory sizes is not reported.
Here's a single-file reproducer:
```cpp
// size.cpp
#include <CL/sycl.hpp>
#include <chrono>
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <vector>

template <typename T> double run(size_t N) {
  std::vector<T> hostAs(N, T(0.1));
  std::vector<T> hostBs(N, T(-0.5));
  std::vector<T> hostCs(N, T(0.0));
  sycl::queue queue;
  sycl::buffer<T, 1> a(hostAs.data(), sycl::range<1>(N));
  sycl::buffer<T, 1> b(hostBs.data(), sycl::range<1>(N));
  sycl::buffer<T, 1> c(hostCs.data(), sycl::range<1>(N));
  auto start = std::chrono::high_resolution_clock::now();
  queue.submit([&](sycl::handler &h) {
    auto a_acc = a.template get_access<sycl::access::mode::read>(h);
    auto b_acc = b.template get_access<sycl::access::mode::read>(h);
    auto c_acc = c.template get_access<sycl::access::mode::write>(h);
    // The problematic shape: the (potentially prime) size is in the second dimension.
    h.parallel_for<class the_kernel>(sycl::range<2>(1, N), [=](sycl::id<2> id) {
      auto init = a_acc[id[1]];
      for (size_t n = 0; n < 1000; n++) init = sycl::pow(init, b_acc[id[1]] + n);
      c_acc[id[1]] = init;
    });
  });
  queue.wait_and_throw();
  auto end = std::chrono::high_resolution_clock::now();
  // Verify results on the host.
  auto c_acc = c.get_host_access();
  for (size_t j = 0; j < N; ++j) {
    auto expected = hostAs[j];
    for (size_t n = 0; n < 1000; n++) expected = sycl::pow(expected, hostBs[j] + n);
    auto actual = c_acc[j];
    if (std::abs(actual - expected) > 0.000001) {
      std::cerr << "Bad value at [" << j << "], expecting " << expected
                << " but was " << actual << std::endl;
    }
  }
  return std::chrono::duration<double, std::milli>(end - start).count();
}

template <typename T> double runN(size_t N, size_t times) {
  for (size_t i = 0; i < times; ++i) run<T>(N); // warm up
  double total{};
  for (size_t i = 0; i < times; ++i) total += run<T>(N);
  return total;
}

int main() {
  std::vector<size_t> primes{1009, 2003, 3001, 4001, 5003, 6007, 7001, 8009, 9001, 10007};
  std::vector<size_t> nonPrimes{1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000};
  using Element = float;
  sycl::queue queue;
  auto device = queue.get_device();
  std::cerr << "Device = " << device.get_info<sycl::info::device::name>() << std::endl;
  std::cout << "N,elapsed(ms)" << std::endl;
  for (auto x : primes) {
    auto elapsed = runN<Element>(x, 5);
    std::cout << x << "," << elapsed << std::endl;
  }
  for (auto x : nonPrimes) {
    auto elapsed = runN<Element>(x, 5);
    std::cout << x << "," << elapsed << std::endl;
  }
  std::cerr << "Done" << std::endl;
  return EXIT_SUCCESS;
}
```
Testing it on a V100 GPU gives:
```
> clang++ -v
clang version 16.0.0 (https://github.com/intel/llvm f865dbdce9581424f5f5599454de5e7b90116966)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/br-wlin/sycl_workspace/llvm/build/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/8
Selected GCC installation: /usr/lib/gcc/x86_64-redhat-linux/8
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
Found CUDA installation: /lustre/projects/bristol/modules-phase3/nvhpc/22.9/Linux_x86_64/22.9/cuda/11.7, version 11.7

> time clang++ -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_70 size.cpp
real 0m6.262s # this is very slow for a simple program like this
user 0m5.755s
sys 0m0.352s

> ./a.out
N,elapsed(ms)
1009,5.39113
2003,4.12835
3001,6.87786
4001,7.86372
5003,8.63055
6007,11.6794
7001,12.5063
8009,14.8147
9001,16.0661
10007,16.5786
1000,5.34469
2000,5.3888
3000,5.42235
4000,5.43256
5000,5.45543
6000,5.50536
7000,5.52733
8000,5.5306
9000,5.55627
10000,5.56491
```
There's a merged PR that rounds ranges up to the closest multiple of 32 to avoid bad block sizes. The PR wraps the kernel body to mask out any extra threads, a typical pattern seen in many CUDA kernels.
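For reference, this is roughly the familiar CUDA guard pattern the wrapping amounts to (an illustrative sketch only; the actual transformation happens on the SYCL kernel during codegen, and the names here are made up):

```cpp
// What the wrapped kernel body amounts to, expressed as a plain CUDA kernel:
// the launched range is rounded up to a multiple of the block size, and
// threads past the real range simply return.
__global__ void wrappedKernel(const float *in, float *out, size_t realRange) {
  size_t gid = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (gid >= realRange) return; // mask off the padding threads
  out[gid] = in[gid];           // ... original kernel body ...
}
```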
Unfortunately, the PR only applies this optimisation when the first dimension is not divisible by 32. No checks are performed on the second and third dimensions, so ranges like `range<2>(1, someLargePrime)` won't be considered for wrapping. At runtime, when a kernel with a prime in the range gets enqueued to the CUDA PI, the runtime tries to find a block size that evenly divides the prime, and thus ends up with a block size of 1.
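To see why the search degenerates, here is a simplified model of an exact-divisor block-size heuristic (illustrative only; `pickBlockSize` is a made-up name and the real `guessLocalWorkSize` logic differs in detail):

```cpp
#include <algorithm>
#include <cstddef>

// Start from the occupancy-recommended block size and shrink it until it
// evenly divides the global range, as required when no masking is emitted
// (every launched thread must map to a valid work-item).
size_t pickBlockSize(size_t globalSize, size_t recommended) {
  size_t block = std::min(globalSize, recommended);
  while (globalSize % block != 0) --block;
  return block; // for a prime globalSize > recommended, this is always 1
}
```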
There are a few critical points that I think may affect how we approach a solution:
- The 32 block-size constant defined in the SYCL headers by the PR seems to be a somewhat platform-dependent value. It should be up to the PI to decide at runtime whether a uniform range can be made from the given grid sizes (see the sketch after this list). Concretely, the generated kernel images should be robust in accepting different block and grid sizes as long as the global work range can be covered.
- There's a significant difference between the accepted ranges for `CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X` and `CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_Y/Z`: newer cards (>= Volta, possibly older) report 2147483647 for X but only 65535 for Y and Z. Having the Y/Z dimensions limited to 65K seems arbitrary and quite restrictive for HPC use cases.
- Supporting up to 3 dimensions is mostly inherited from computer graphics, and if SYCL-next were to add support for arbitrary dimensions (it's been talked about for a while now), we'll be forced to linearise the indices and encode them as 1D launches anyway.
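As a sketch of the first point, the PI could pick the launch configuration entirely at runtime, assuming the kernel image masks out-of-range threads (hypothetical helper; error handling omitted, and `kernelFunc`/`globalSize`/`stream`/`kernelArgs` are placeholders):

```cpp
#include <cuda.h>

// Hypothetical PI-side helper (illustrative): launch a 1D kernel by taking
// the occupancy recommendation as-is and rounding the grid up, instead of
// searching for an exact divisor of the global range.
CUresult launchRoundedUp(CUfunction kernelFunc, size_t globalSize,
                         CUstream stream, void **kernelArgs) {
  int minGridSize = 0, blockSize = 0;
  cuOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kernelFunc,
                                   /*blockSizeToDynamicSMemSize=*/nullptr,
                                   /*dynamicSMemSize=*/0, /*blockSizeLimit=*/0);
  unsigned grid = static_cast<unsigned>(
      (globalSize + blockSize - 1) / static_cast<size_t>(blockSize)); // ceil
  return cuLaunchKernel(kernelFunc, grid, 1, 1,
                        static_cast<unsigned>(blockSize), 1, 1,
                        /*sharedMemBytes=*/0, stream, kernelArgs,
                        /*extra=*/nullptr);
}
```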
Personally, I think selecting the optimal launch configuration should be done entirely in the PI; the compiler just needs to make sure extra threads are masked off. The associated risk is the potential (minor) branch overhead that every kernel now has to pay.
We should also consider linearising >1D launches, which simplifies the logic around `cuOccupancyMaxPotentialBlockSize` and avoids the bad block size issue entirely. Again, there may be a cost for doing this that needs further investigation.
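For instance, a 2D range could be flattened along these lines (a minimal sketch; `r0`/`r1` and the handler `h` are placeholders):

```cpp
// Sketch: flatten range<2>(r0, r1) into a single 1D launch of r0 * r1
// work-items and reconstruct the 2D index inside the kernel.
h.parallel_for(sycl::range<1>(r0 * r1), [=](sycl::id<1> lin) {
  size_t i0 = lin[0] / r1; // what id[0] would have been
  size_t i1 = lin[0] % r1; // what id[1] would have been
  // ... original kernel body, using i0/i1 in place of id[0]/id[1] ...
});
```

The integer division and modulo per work-item is the kind of cost that would need measuring.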
Note that the prime ranges discussed here aren't some contrived test case.
The performance issue was originally discovered while scaling the SYCL implementation of CloverLeaf across multiple MPI ranks.
With large enough sizes, we observed seemingly random jumps in the runtime for certain rank counts, presumably where the domain decomposition happened to hand a rank a prime-sized extent.