[BUG]: cub::DeviceMergeSort intermittently causes cudaErrorIllegalInstruction likely due to PDL (Programmatic Dependent Launch) #5297

Description

@ttnghia

Type of Bug

Runtime Error

Component

CUB

Describe the bug

Recently, our Spark application has been crashing consistently. We have an issue filed here: NVIDIA/spark-rapids#12949. The crash shows up every time we run the Spark job, although it occurs at random points within the job.

When debugging it (a multi-threaded application) running on an H100 cluster, with stream synchronization added everywhere, all of our stack traces pointed to the same location: inside a call to thrust::sort, or right after it. When we replaced thrust::sort with cub::DeviceMergeSort, we still observed the same failure (see the sketch just below for the call pattern we used).
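
For reference, here is a minimal sketch of the replacement we tried, together with the kind of sync-and-check instrumentation mentioned above. The sort_and_check helper, the int keys, and the Less comparator are illustrative only, not our actual code; the two-phase call is CUB's standard temp-storage idiom:

#include <cub/device/device_merge_sort.cuh>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

struct Less
{
  __device__ bool operator()(int a, int b) const { return a < b; }
};

// Hypothetical helper: sort d_keys in place on `stream`, then synchronize
// and report any sticky error at this call site.
void sort_and_check(int* d_keys, int num_items, cudaStream_t stream)
{
  // First call only computes the required temporary storage size.
  std::size_t temp_bytes = 0;
  cub::DeviceMergeSort::SortKeys(nullptr, temp_bytes, d_keys, num_items, Less{}, stream);

  void* d_temp = nullptr;
  cudaMallocAsync(&d_temp, temp_bytes, stream);

  // Second call performs the actual sort on the given stream.
  cub::DeviceMergeSort::SortKeys(d_temp, temp_bytes, d_keys, num_items, Less{}, stream);
  cudaFreeAsync(d_temp, stream);

  // The "stream sync added everywhere" instrumentation: a sticky error such as
  // cudaErrorIllegalInstruction surfaces here rather than at a later, unrelated call.
  cudaError_t err = cudaStreamSynchronize(stream);
  if (err != cudaSuccess)
  {
    std::printf("CUDA error after sort: %s\n", cudaGetErrorString(err));
  }
}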

I spent a huge amount of time on this, trying to create a simple reproducer, but could not. Simple code does not reproduce the crash; it only shows up in the Spark application with at least 4 concurrent threads, and at random times. Sometimes it shows up very early, other times very late.

After digging into the CCCL code for a while, inserting stream syncs inside the CUB merge sort code and examining it closely, I (probably) finally found the root cause:

#if _CCCL_HAS_PDL
#  define _CCCL_PDL_GRID_DEPENDENCY_SYNC() NV_IF_TARGET(NV_PROVIDES_SM_90, cudaGridDependencySynchronize();)
#  define _CCCL_PDL_TRIGGER_NEXT_LAUNCH() NV_IF_TARGET(NV_PROVIDES_SM_90, cudaTriggerProgrammaticLaunchCompletion();)

Yes, this is the cause of our issue. Currently, only the CUB merge sort uses PDL. When we disabled it, our Spark job ran without any issue; I repeated the test 20 times and still saw no failure.
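
For context, on SM 90 and newer these macros expand to CUDA's Programmatic Dependent Launch primitives, which allow a dependent kernel to start launching before its predecessor has fully completed. Below is a minimal self-contained sketch of that pattern; producer and consumer are made-up names, not CUB's actual kernels, and it requires building with nvcc -arch=sm_90 or newer:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void producer(int* data)
{
  data[threadIdx.x] = (int)threadIdx.x;
  // Signal that the dependent grid may begin launching early.
  cudaTriggerProgrammaticLaunchCompletion();
}

__global__ void consumer(int* data)
{
  // Work that does not touch the producer's output could run here.
  // Block until the producer grid's writes are actually visible.
  cudaGridDependencySynchronize();
  data[threadIdx.x] += 1;
}

int main()
{
  int* d = nullptr;
  cudaMalloc(&d, 32 * sizeof(int));

  producer<<<1, 32>>>(d);

  // Opt the second launch into PDL on the same stream.
  cudaLaunchAttribute attr{};
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t cfg{};
  cfg.gridDim  = 1;
  cfg.blockDim = 32;
  cfg.attrs    = &attr;
  cfg.numAttrs = 1;

  cudaLaunchKernelEx(&cfg, consumer, d);

  std::printf("status: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
  cudaFree(d);
  return 0;
}

Everything the consumer does before cudaGridDependencySynchronize() may overlap with the tail of the producer, which is why a problem in this handshake could plausibly surface at points that look unrelated to the sort itself.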

How to Reproduce

We have found no way to reproduce it with simple code; it only manifests when running our Spark application.

Expected behavior

There should be no crash.

Reproduction link

No response

Operating System

Linux Ubuntu 22.04

nvidia-smi output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:01:00.0 Off |                    0 |
| N/A   33C    P8             32W /  350W |       1MiB /  46068MiB |      0%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 NVL                Off |   00000000:A1:00.0 Off |                    0 |
| N/A   41C    P0             63W /  400W |       1MiB /  95830MiB |      0%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

Note that we are running on the H100 for our tests.

NVCC version

12.9.1
