Description
Is this a duplicate?
- I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct
Type of Bug
Runtime Error
Component
CUB
Describe the bug
Recently, our Spark application has been crashing consistently. We have an issue submitted here: NVIDIA/spark-rapids#12949. The crash shows up every time we run the Spark job, although it happens at a random point during the job.
When debugging it (a multi-threaded application) running on an H100 cluster, with stream sync added everywhere, all of our stack traces point to the same location: either inside a call to thrust::sort or right after it. When we tried to replace thrust::sort with cub::DeviceMergeSort, we still observed the same failure.
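For context, the sort calls look roughly like the sketch below. The sort_keys wrapper, key type, and allocation scheme are made up for illustration; the real code sorts buffers produced by the spark-rapids plugin.
#include <cstddef>
#include <cuda_runtime.h>
#include <cub/device/device_merge_sort.cuh>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/sort.h>
// Hypothetical wrapper just to show the two variants we tried.
void sort_keys(int* d_keys, int num_items, cudaStream_t stream)
{
  // Variant 1: the original call that the stack traces point into.
  thrust::sort(thrust::cuda::par.on(stream), d_keys, d_keys + num_items);
  // Variant 2: the cub::DeviceMergeSort replacement, which fails the same way.
  std::size_t temp_bytes = 0;
  cub::DeviceMergeSort::SortKeys(nullptr, temp_bytes, d_keys, num_items,
                                 thrust::less<int>{}, stream);
  void* d_temp = nullptr;
  cudaMallocAsync(&d_temp, temp_bytes, stream);
  cub::DeviceMergeSort::SortKeys(d_temp, temp_bytes, d_keys, num_items,
                                 thrust::less<int>{}, stream);
  cudaFreeAsync(d_temp, stream);
  // The "stream sync added everywhere" debugging mode synchronizes and checks
  // right after the sort; the failure always surfaces here or earlier.
  cudaStreamSynchronize(stream);
}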
I spent a huge amount of time on this, trying to produce a simple reproducer, but couldn't. Simple code doesn't reproduce the crash; it only shows up in the Spark application with at least 4 concurrent threads, and at random points: sometimes very early in the job, sometimes very late.
By digging into the CCCL code for a while, inserting stream syncs inside the cub merge sort code and examining the code, I (probably) finally found the root cause:
#if _CCCL_HAS_PDL
# define _CCCL_PDL_GRID_DEPENDENCY_SYNC() NV_IF_TARGET(NV_PROVIDES_SM_90, cudaGridDependencySynchronize();)
# define _CCCL_PDL_TRIGGER_NEXT_LAUNCH() NV_IF_TARGET(NV_PROVIDES_SM_90, cudaTriggerProgrammaticLaunchCompletion();)
Yes, this is the cause of our issue. Currently, only cub merge sort uses these hooks. After disabling them, our Spark job ran without any issue. I repeated the test 20 times and still saw no failure.
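For reference, the way I disabled it for the test was roughly the edit below: redefining the hooks as no-ops in the CCCL header where they are defined. This is a local hack for the experiment only, not a supported configuration switch, so treat it as a sketch.
#if _CCCL_HAS_PDL
// Experiment only: turn both PDL hooks into no-ops so that cub merge sort
// neither waits on the previous grid (cudaGridDependencySynchronize) nor
// signals programmatic launch completion
// (cudaTriggerProgrammaticLaunchCompletion) on SM90+.
#  define _CCCL_PDL_GRID_DEPENDENCY_SYNC()
#  define _CCCL_PDL_TRIGGER_NEXT_LAUNCH()
#endif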
How to Reproduce
There is no way to reproduce it with simple code; it only reproduces when running our Spark application.
Expected behavior
There should be no crash.
Reproduction link
No response
Operating System
Linux Ubuntu 22.04
nvidia-smi output
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S Off | 00000000:01:00.0 Off | 0 |
| N/A 33C P8 32W / 350W | 1MiB / 46068MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 NVL Off | 00000000:A1:00.0 Off | 0 |
| N/A 41C P0 63W / 400W | 1MiB / 95830MiB | 0% E. Process |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
Note that we are running on the H100 for our test.
NVCC version
12.9.1