Hi, RCCL team. I am running the official rccl-tests, and here is the output of:
HSA_FORCE_FINE_GRAIN_PCIE=1 NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
#
# Using devices
# Rank 0 Pid 9290 on rocm-amd device 0 [0000:12:00.0] Radeon RX 7900 XTX
# Rank 1 Pid 9290 on rocm-amd device 1 [0000:23:00.0] Radeon RX 7900 XT
rocm-amd:9290:9290 [0] NCCL INFO Bootstrap : Using enp36s0:192.168.31.154<0>
rocm-amd:9290:9290 [0] NCCL INFO NET/Plugin : Plugin load (librccl-net.so) returned 2 : librccl-net.so: cannot open shared object file: No such file or directory
rocm-amd:9290:9290 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
rocm-amd:9290:9290 [0] NCCL INFO Kernel version: 6.2.0-36-generic
rocm-amd:9290:9290 [0] /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/init.cc:122 NCCL WARN Missing "amd_iommu=on" from kernel command line which can lead to system instablity or hang!
rocm-amd:9290:9290 [0] /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/init.cc:124 NCCL WARN Missing "iommu=pt" from kernel command line which can lead to system instablity or hang!
RCCL version 2.17.1+hip5.7 HEAD:cbbb3d8+
rocm-amd:9290:9295 [0] NCCL INFO Failed to open libibverbs.so[.1]
rocm-amd:9290:9295 [0] NCCL INFO NET/Socket : Using [0]enp36s0:192.168.31.154<0>
rocm-amd:9290:9295 [0] NCCL INFO Using network Socket
rocm-amd:9290:9296 [1] NCCL INFO Using network Socket
rocm-amd:9290:9296 [1] NCCL INFO rocm_smi_lib: version 5.0.0.0
rocm-amd:9290:9295 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
rocm-amd:9290:9296 [1] NCCL INFO Setting affinity for GPU 1 to 0fff
rocm-amd:9290:9296 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 comm 0xd18720 nRanks 02 busId 23000
rocm-amd:9290:9296 [1] NCCL INFO P2P Chunksize set to 131072
rocm-amd:9290:9295 [0] NCCL INFO Channel 00/02 : 0 1
rocm-amd:9290:9295 [0] NCCL INFO Channel 01/02 : 0 1
rocm-amd:9290:9295 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 comm 0x8a64c0 nRanks 02 busId 12000
rocm-amd:9290:9295 [0] NCCL INFO P2P Chunksize set to 131072
rocm-amd:9290:9295 [0] NCCL INFO Channel 00/0 : 0[12000] -> 1[23000] via P2P/direct pointer comm 0x8a64c0 nRanks 02
rocm-amd:9290:9296 [1] NCCL INFO Channel 00/0 : 1[23000] -> 0[12000] via P2P/direct pointer comm 0xd18720 nRanks 02
rocm-amd:9290:9295 [0] NCCL INFO Channel 01/0 : 0[12000] -> 1[23000] via P2P/direct pointer comm 0x8a64c0 nRanks 02
rocm-amd:9290:9296 [1] NCCL INFO Channel 01/0 : 1[23000] -> 0[12000] via P2P/direct pointer comm 0xd18720 nRanks 02
rocm-amd:9290:9295 [0] NCCL INFO Connected all rings comm 0x8a64c0 nRanks 02 busId 12000
rocm-amd:9290:9295 [0] NCCL INFO Connected all trees
rocm-amd:9290:9295 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 256 | 256
rocm-amd:9290:9295 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
rocm-amd:9290:9296 [1] NCCL INFO Connected all rings comm 0xd18720 nRanks 02 busId 23000
rocm-amd:9290:9296 [1] NCCL INFO Connected all trees
rocm-amd:9290:9296 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 256 | 256
rocm-amd:9290:9296 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
rocm-amd:9290:9296 [1] NCCL INFO MSCCL: No external scheduler found, using internal implementation
rocm-amd:9290:9296 [1] NCCL INFO Using MSCCL files from /opt/rocm-5.7.0/lib/../share/rccl/msccl-algorithms
rocm-amd:9290:9296 [1] NCCL INFO MSCCL: Initialization finished, localSize 448
rocm-amd:9290:9295 [0] NCCL INFO comm 0x8a64c0 rank 0 nranks 2 cudaDev 0 busId 12000 localSize 192 used 20985504 bytes - Init COMPLETE
rocm-amd:9290:9296 [1] NCCL INFO comm 0xd18720 rank 1 nranks 2 cudaDev 1 busId 23000 localSize 192 used 21018272 bytes - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
^C
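The two NCCL WARN lines in the log above refer to the kernel command line. As a minimal sketch for confirming what the running kernel was actually booted with, the snippet below reads /proc/cmdline and reports which of the two parameters named in those warnings are absent. The helper name `missing_iommu_params` is hypothetical, and the whitespace-token comparison is an assumption made for illustration; it is not necessarily how RCCL performs its own check.

```python
from pathlib import Path

# Parameters named in the two NCCL WARN lines of the log above.
REQUIRED_PARAMS = ("amd_iommu=on", "iommu=pt")


def missing_iommu_params(cmdline: str) -> list[str]:
    """Return the required parameters that are absent from a kernel command line.

    Hypothetical helper: compares whole whitespace-separated tokens, so
    "amd_iommu=on" is never mistaken for "iommu=pt" or vice versa.
    """
    tokens = set(cmdline.split())
    return [param for param in REQUIRED_PARAMS if param not in tokens]


if __name__ == "__main__":
    try:
        # /proc/cmdline holds the parameters the running Linux kernel booted with.
        cmdline = Path("/proc/cmdline").read_text()
    except OSError as exc:
        print(f"could not read /proc/cmdline: {exc}")
    else:
        missing = missing_iommu_params(cmdline)
        print("missing:", ", ".join(missing) if missing else "none")
```

If either parameter is reported missing, it would need to be added to the bootloader's kernel command line and the machine rebooted before the warnings go away.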
Context
I am trying to run inference on two AMD GPUs (gfx1100) with mlc-llm, following this guide: https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Inference-on-Multiple-NVDIA-AMD-GPUs#universal-deployment-support-for-multi-amd-gpu