Stuck on no stats #957

Closed
bojak83318 opened this issue Nov 8, 2023 · 0 comments

bojak83318 commented Nov 8, 2023

Context

I am trying to run inference on two AMD GPUs (gfx1100) with mlc-llm, following this guide: https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Inference-on-Multiple-NVDIA-AMD-GPUs#universal-deployment-support-for-multi-amd-gpu

Description

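Hi, RCCL team. I am running the official rccl-tests. The run gets stuck after printing the results table header and never prints any stats rows (I stopped it with Ctrl-C).
Here is the output of HSA_FORCE_FINE_GRAIN_PCIE=1 NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2: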

#
# Using devices
#   Rank  0 Pid   9290 on   rocm-amd device  0 [0000:12:00.0] Radeon RX 7900 XTX
#   Rank  1 Pid   9290 on   rocm-amd device  1 [0000:23:00.0] Radeon RX 7900 XT
rocm-amd:9290:9290 [0] NCCL INFO Bootstrap : Using enp36s0:192.168.31.154<0>
rocm-amd:9290:9290 [0] NCCL INFO NET/Plugin : Plugin load (librccl-net.so) returned 2 : librccl-net.so: cannot open shared object file: No such file or directory
rocm-amd:9290:9290 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
rocm-amd:9290:9290 [0] NCCL INFO Kernel version: 6.2.0-36-generic

rocm-amd:9290:9290 [0] /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/init.cc:122 NCCL WARN Missing "amd_iommu=on" from kernel command line which can lead to system instablity or hang!

rocm-amd:9290:9290 [0] /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/init.cc:124 NCCL WARN Missing "iommu=pt" from kernel command line which can lead to system instablity or hang!
RCCL version 2.17.1+hip5.7 HEAD:cbbb3d8+
rocm-amd:9290:9295 [0] NCCL INFO Failed to open libibverbs.so[.1]
rocm-amd:9290:9295 [0] NCCL INFO NET/Socket : Using [0]enp36s0:192.168.31.154<0>
rocm-amd:9290:9295 [0] NCCL INFO Using network Socket
rocm-amd:9290:9296 [1] NCCL INFO Using network Socket
rocm-amd:9290:9296 [1] NCCL INFO rocm_smi_lib: version 5.0.0.0
rocm-amd:9290:9295 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
rocm-amd:9290:9296 [1] NCCL INFO Setting affinity for GPU 1 to 0fff
rocm-amd:9290:9296 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 comm 0xd18720 nRanks 02 busId 23000
rocm-amd:9290:9296 [1] NCCL INFO P2P Chunksize set to 131072
rocm-amd:9290:9295 [0] NCCL INFO Channel 00/02 :    0   1
rocm-amd:9290:9295 [0] NCCL INFO Channel 01/02 :    0   1
rocm-amd:9290:9295 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 comm 0x8a64c0 nRanks 02 busId 12000
rocm-amd:9290:9295 [0] NCCL INFO P2P Chunksize set to 131072
rocm-amd:9290:9295 [0] NCCL INFO Channel 00/0 : 0[12000] -> 1[23000] via P2P/direct pointer comm 0x8a64c0 nRanks 02
rocm-amd:9290:9296 [1] NCCL INFO Channel 00/0 : 1[23000] -> 0[12000] via P2P/direct pointer comm 0xd18720 nRanks 02
rocm-amd:9290:9295 [0] NCCL INFO Channel 01/0 : 0[12000] -> 1[23000] via P2P/direct pointer comm 0x8a64c0 nRanks 02
rocm-amd:9290:9296 [1] NCCL INFO Channel 01/0 : 1[23000] -> 0[12000] via P2P/direct pointer comm 0xd18720 nRanks 02
rocm-amd:9290:9295 [0] NCCL INFO Connected all rings comm 0x8a64c0 nRanks 02 busId 12000
rocm-amd:9290:9295 [0] NCCL INFO Connected all trees
rocm-amd:9290:9295 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 256 | 256
rocm-amd:9290:9295 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
rocm-amd:9290:9296 [1] NCCL INFO Connected all rings comm 0xd18720 nRanks 02 busId 23000
rocm-amd:9290:9296 [1] NCCL INFO Connected all trees
rocm-amd:9290:9296 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 256 | 256
rocm-amd:9290:9296 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
rocm-amd:9290:9296 [1] NCCL INFO MSCCL: No external scheduler found, using internal implementation
rocm-amd:9290:9296 [1] NCCL INFO Using MSCCL files from /opt/rocm-5.7.0/lib/../share/rccl/msccl-algorithms
rocm-amd:9290:9296 [1] NCCL INFO MSCCL: Initialization finished, localSize 448
rocm-amd:9290:9295 [0] NCCL INFO comm 0x8a64c0 rank 0 nranks 2 cudaDev 0 busId 12000 localSize 192 used 20985504 bytes - Init COMPLETE
rocm-amd:9290:9296 [1] NCCL INFO comm 0xd18720 rank 1 nranks 2 cudaDev 1 busId 23000 localSize 192 used 21018272 bytes - Init COMPLETE
#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
^C
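
The two NCCL WARN lines in the log above say that "amd_iommu=on" and "iommu=pt" are missing from the kernel command line. As a minimal sketch, assuming a Linux system that exposes the boot parameters at /proc/cmdline, the check could be reproduced like this. The helper names `missing_kernel_params` and `read_cmdline` are hypothetical and not part of RCCL, and whether adding these parameters resolves the hang is not confirmed here.

```python
from pathlib import Path

# Parameters named in the two NCCL WARN lines of the log above.
REQUIRED_PARAMS = ("amd_iommu=on", "iommu=pt")


def missing_kernel_params(cmdline, required=REQUIRED_PARAMS):
    """Return the required parameters that are absent from a kernel command line.

    The command line is split on whitespace and compared token by token,
    so "amd_iommu=on" is not mistaken for "iommu=pt" or vice versa.
    """
    tokens = set(cmdline.split())
    return [param for param in required if param not in tokens]


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line; return "" if the file is absent."""
    p = Path(path)
    return p.read_text() if p.exists() else ""


if __name__ == "__main__":
    missing = missing_kernel_params(read_cmdline())
    if missing:
        print("Missing from kernel command line:", ", ".join(missing))
    else:
        print("Both IOMMU parameters are present.")
```

If the script reports missing parameters, they would normally be added through the bootloader configuration and take effect only after a reboot; the exact file and steps depend on the distribution.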