Stuck loading VRAM with ROCm multi-GPU #3991
Comments
My rig with 2× 7900 XTX:
|
I had the same kind of problems too. You have to build it with make; cmake caused the CUDA errors for me. |
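For reference, a minimal sketch of the make-based ROCm build described above, assuming ROCm is installed in its default /opt/rocm location (which the llama.cpp Makefile of this era expects):
# start from a clean tree so no CPU-only objects get reused
make clean
# build with hipBLAS/ROCm support (the same flag used later in this thread)
make LLAMA_HIPBLAS=1 -j$(nproc)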
Thanks @8XXD8. Loading is OK now, but I get an endless stream of '#' as output:
running:
|
Have you tried other models?
I had some proper responses too. |
Must I make with
OR
for multi GPU? |
Wow, you are so COOL! I got my first proper output with 2 GPUs:
Got this output:
|
Congrats, so happy for you.
|
BTW, for make, what commands did you run @wizd? |
You are right, GPU selection must be done at compile time. |
So it's |
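For illustration, compile-time GPU selection can look like the sketch below; the GPU_TARGETS variable name and the gfx value are assumptions based on the llama.cpp Makefile of this period, so check your own checkout (with CMake the equivalent variable is typically AMDGPU_TARGETS):
# both the 7900 XTX and the 7900 XT are Navi 31, i.e. gfx1100
make LLAMA_HIPBLAS=1 GPU_TARGETS=gfx1100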
If I use
I get
|
It's only possible with this make command,
and the output is
but I get gibberish
|
Sorry, my post was wrong, because this line compiles without GPU support and runs on the CPU:
So the bug is still there. When I turn on two GPUs:
|
Yes, when I use 2 GPUs it's a problem. |
I posted upstream on the RCCL GitHub: ROCm/rccl#957 |
So it's just this that works with GPU,
right? |
No, compiling it like this produces a build that only runs on the CPU. |
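A rough way to check whether a given binary was actually built with GPU offload; the exact log wording varies between llama.cpp versions, so treat the grep pattern and the model path as placeholders:
# a GPU-enabled build reports layers being offloaded to the GPU; a CPU-only build does not
./main -ngl 99 -m <model.gguf> -p "test" 2>&1 | grep -i offload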
I don't use HIP_VISIBLE_DEVICES, just
and it works for me. |
Thanks, it's compiled with GPU support now,
but...
still getting gibberish |
Can you perhaps outline it like this guide, from start to finish... |
Apparently NVIDIA also has this problem: #3772 |
HIP_VISIBLE_DEVICES is an environment variable read by ROCm at run time and has no reason to be part of the command line for compiling. Simply running "make LLAMA_HIPBLAS=1" is all that is required on my Ubuntu 22.04 server. |
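To make the compile-time vs. run-time distinction concrete (the model path is only a placeholder):
# compile time: build with ROCm/hipBLAS support once
make LLAMA_HIPBLAS=1
# run time: let ROCm see only the first GPU
HIP_VISIBLE_DEVICES=0 ./main -ngl 99 -m <model.gguf> -p "Hello"
# run time: expose both GPUs (also the behaviour when the variable is unset)
HIP_VISIBLE_DEVICES=0,1 ./main -ngl 99 -m <model.gguf> -p "Hello"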
Unclear if this is related, but I can't load any model at all on multi-GPU ROCm: segmentation fault after model load with ROCm multi-GPU, multi-gfx. As best I can remember it worked a couple of months ago, but it has now been broken for at least 2 weeks. Tested on: Arch Linux kernel 6.5.9, ROCm 5.7.1, llama.cpp 4a4fd3e.
rocminfo
make LLAMA_HIPBLAS=1
./main -ngl 99 -m ../koboldcpp/models/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/mistral-7b-instruct-v0.1.Q5_K_M.gguf -mg 0 -p "Write a function in TypeScript that sums numbers"
|
This could be RDNA3/gfx1100 specific. I cross-compiled from a Debian NVIDIA build for the gfx900 target and it worked fine. |
Well, I managed to run into the same problem. It's odd that main works with this:
but libllama.so produces garbage
|
Anyway, my lanes are not equal, which could be a problem, since in ROCm 5.7 multi-GPU is just a preview (a rough way to check the negotiated link width is sketched below):
PCIe® slots connected to the GPU must have identical PCIe lane width or bifurcation settings, and support PCIe 3.0 Atomics. Refer to "How ROCm uses PCIe Atomics" for more information. Example:
✓ GPU0 PCIe x16 connection + GPU1 PCIe x16 connection
✓ GPU0 PCIe x8 connection + GPU1 PCIe x8 connection
X GPU0 PCIe x16 connection + GPU1 PCIe x8 connection |
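A rough way to check the negotiated link width of each GPU; the bus ID below is an example and the exact flags are assumptions, so confirm against your rocm-smi and lspci versions:
# list each GPU's PCI bus ID
rocm-smi --showbus
# inspect the negotiated link state for one of those bus IDs, e.g. 03:00.0
sudo lspci -vv -s 03:00.0 | grep LnkSta
# both GPUs should report the same width, e.g. "LnkSta: Speed 16GT/s, Width x16"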
Have you tried removing one of the GPU-s? I mean physically, not just disabling it with HIP_VISIBLE_DEVICES. |
I believe the problem lies in how the initialization process in ROCm is bugged. It has been fixed, but I believe the fix will not be released until ROCm 6.0.0, unless you build ROCm with a self-compiled rocBLAS and Tensile after these commits: |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Context
Once it loads, it gets stuck at loading VRAM.
My computer is running dual AMD GPUs (7900 XTX and 7900 XT), Ubuntu 22.04, ROCm 5.7.
ROCM-SMI Output
$ python3 --version
Python 3.10.12
$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Steps to Reproduce
Failure Logs