b1428 OOM error on 3x P40 setup #3780
Comments
Still present at today's pull:
b1412 is the newest commit that works for me.
On commit 207b519 (HEAD -> master, tag: b1446, origin/master, origin/HEAD) it no longer OOMs, but performance is very poor.
I switched to origin/cuda-multi-gpu as a test and found that, essentially, I get the same speed whether I offload 83 layers or zero.
Is LLAMA_CUBLAS=1 no longer the correct flag for cuBLAS builds? It 'uses' the GPUs, but performance suggests otherwise.
The current 'cuda-cublas-opts' branch seems to fix both the OOM and the poor-performance issues.
I reverted to:
using "-ts 2,3,3" I can avoid the OOM, but processing is CPU-slow, like what you'd get if you gave it a single thread. |
This issue was closed because it has been inactive for 14 days since being marked as stale.
Expected Behavior
llama.cpp should be able to generate a reply without running out of GPU memory.
Current Behavior
./server -m ./models/llama-2-70b-chat/llama-2-70b-chat-q6k.gguf -ngl 83 --rope-freq-base 26000 -c 8192
With the command above, llama.cpp processes the prompt and eventually throws a CUDA out-of-memory error.
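For readers unfamiliar with the options, here is a commented version of the same command; the flag descriptions are the standard llama.cpp server options as I understand them, not something stated in the report.

```sh
# -m               : 70B chat model, Q6_K quantization
# -ngl             : number of layers to offload to the GPUs (83 here, per the report)
# --rope-freq-base : RoPE frequency base, raised to go with the larger context
# -c               : context size in tokens
./server -m ./models/llama-2-70b-chat/llama-2-70b-chat-q6k.gguf \
  -ngl 83 --rope-freq-base 26000 -c 8192
```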
Environment and Context
After reverting to b1407, the problem is not present.