Performance issue when running OpenACC code #3681
laytonjbgmail started this conversation in General
Replies: 1 comment
-
I need to amend the original post. If I start the Singularity container and then run the code "by hand" (bypassing Slurm), I get the proper performance. So it appears to be an interaction between Slurm and SingularityCE? Thanks!
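One way to narrow this down might be to run the same checks both interactively and through Slurm and compare what the GPU topology, CPU count, and CPU affinity look like in each case. This is only a sketch under assumptions: `nvhpc.sif` is a hypothetical image path standing in for whatever image was built from the nvcr.io tag in the original post, and any GPU GRES flags the cluster may require are omitted to mirror the sbatch line below.

```bash
# Interactive run ("by hand", bypassing Slurm), which reportedly gives full performance.
# nvhpc.sif is a hypothetical image path built from the nvcr.io/nvidia/nvhpc image.
singularity exec --nv nvhpc.sif nvidia-smi
singularity exec --nv nvhpc.sif nvidia-smi topo -m

# The same checks launched through Slurm, to see whether GPU visibility,
# CPU count, or CPU affinity differ (taskset comes from util-linux inside the image).
srun --nodes=1 --ntasks-per-node=2 \
    singularity exec --nv nvhpc.sif nvidia-smi
srun --nodes=1 --ntasks-per-node=2 \
    singularity exec --nv nvhpc.sif bash -c 'nproc; taskset -cp $$'
```

If the Slurm-launched output shows fewer visible CPUs or a narrower affinity mask than the interactive run, that would point at Slurm's cgroup/binding settings rather than at SingularityCE itself.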
-
I'm having issues when I run an OpenACC code inside a SingularityCE container compared to bare metal or a Docker container.
When I run on bare metal or in a Docker container, going from the CPU-only version (no OpenACC) to the OpenACC version gives a speedup of a little over 100x. However, when I run the same two versions (one pure CPU with no OpenACC, the other OpenACC) inside a SingularityCE container, the speedup is only about 7x.
Here are some details:
Container image: nvcr.io/nvidia/nvhpc:25.1-devel-cuda_multi-ubuntu24.04
Docker: docker run --gpus device=0 --rm mpirun -np 2 -H localhost:2 --allow-run-as-root --map-by slot -mca coll_hcoll_enable 0 ./himeno-acc.exe > file.output
SingularityCE: singularity exec --nv --env NVIDIA_VISIBLE_DEVICES=all mpirun -np 2 -H localhost:2 --allow-run-as-root --map-by slot -mca coll_hcoll_enable 0 ./himeno-acc.exe > file.output
Slurm: sbatch -W node_name --ntasks-per-node=2 --nodes=1 ... (docker run or singularity exec; a sketch of what the batch script might look like is below)
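For reference, here is a minimal sketch of what the batch script behind that sbatch line might look like when it wraps the singularity exec command. The actual script was not posted, so the script contents and the image path `nvhpc.sif` are assumptions that simply mirror the commands above.

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2

# Hypothetical sketch only: the image path (nvhpc.sif) and binary location
# are assumptions mirroring the docker/singularity commands above.
singularity exec --nv --env NVIDIA_VISIBLE_DEVICES=all nvhpc.sif \
    mpirun -np 2 -H localhost:2 --allow-run-as-root --map-by slot \
    -mca coll_hcoll_enable 0 ./himeno-acc.exe > file.output
```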
I've done some Googling and asked a few people, but so far no suggestion has changed the performance.
If anyone has any ideas or pointers, I would really appreciate it. Thanks!