Segmentation fault while generating. #218
Comments
I suspect you're hitting some internal memory buffer limit. Given that you only have 4GB of VRAM, are you setting your parameters accordingly? If you have a single stand-alone Python script that generates the error, I can try to reproduce it with my NVidia GPU. If I can't repro, that points to CLBlast as part of the issue. Finally, stupid question, but did you use the exact same params and prompt length with the llama.cpp ./main example?
Using […]
I specified 32.
https://gist.github.com/Firstbober/a08de9cf01ea90b6be8389be9a249857
Yes.
I modified your script to take the model from […]
I suspect the CLBlast support is very new and somewhat unstable, given that most devs (including @ggerganov) are using NVidia GPUs, sorry.
Well, I compiled libllama.so without the support for CLBlast and the segmentation fault still persists.
I pinned the problem down to the […]
Just spent a few hours debugging a related issue. This llama.cpp commit removes a field from the context-params struct, so this code in llama-cpp-python is now invalid when paired with llama.cpp mainline: llama-cpp-python/llama_cpp/llama_cpp.py, line 116 at 01a010b.
Deleting this line fixes the issue. For me, it manifested as […]
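To illustrate why one stale struct field in the bindings can crash the process, here is a small self-contained sketch. The field names are made up for the example and are not the actual llama_context_params fields; only the layout mismatch matters. When the Python-side ctypes definition keeps a field that the C library no longer has, every later field sits at the wrong offset, so the native side ends up reading garbage.

```python
# Sketch of the failure mode, assuming a ctypes struct layout mismatch.
# Field names are illustrative, not the real llama_context_params fields.
import ctypes

class ParamsAsBound(ctypes.Structure):      # what the Python binding still declares
    _fields_ = [
        ("n_ctx", ctypes.c_int),
        ("removed_field", ctypes.c_int),    # field that upstream llama.cpp deleted
        ("seed", ctypes.c_int),
        ("progress_callback", ctypes.c_void_p),
    ]

class ParamsExpected(ctypes.Structure):     # what the C library now expects
    _fields_ = [
        ("n_ctx", ctypes.c_int),
        ("seed", ctypes.c_int),
        ("progress_callback", ctypes.c_void_p),
    ]

# The sizes differ, and every field after the deleted one is shifted, so the
# C side sees a bogus seed and a garbage callback pointer -> crash/segfault.
print(ctypes.sizeof(ParamsAsBound), ctypes.sizeof(ParamsExpected))
```

Removing the extra field line from the Python struct definition restores the layout the library expects, which matches the fix described above.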
As a side note, I don't think token generation is actually accelerated with CLBlast yet? It's behind a PR in the llama.cpp repo, and my observation is that the GPU shows no load no matter what I set n_gpu_layers to... but maybe something was wrong with my quick CLBlast test. Point being that maybe this has nothing to do with the GPU.
ggml-org/llama.cpp#1459 adds fairly good OpenCL support and was merged 5 days ago. The readme now also says "The CLBlast build supports --gpu-layers|-ngl like the CUDA version does." I've tested the Windows CLBlast builds, and they work pretty well on my 3080: about 250 ms per token with some offloading, and 450 ms without. That said, I can't get it to work with llama-cpp-python; it seems to ignore GPU layers with CLBlast.
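For reference, a minimal sketch of how GPU offloading would be requested through the llama-cpp-python high-level API; the model path is a placeholder, and whether layers are actually offloaded depends on how the underlying libllama was built:

```python
# Sketch: requesting GPU offload through the high-level API.
# With a CPU-only build of libllama, n_gpu_layers has no effect.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/ggml-model-q4_0.bin",  # placeholder path
    n_ctx=512,
    n_gpu_layers=32,  # number of layers to offload to the GPU
)
out = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
print(out["choices"][0]["text"])
```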
I confirmed that the latest […]
Closing. Please update to the latest version.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Continue the generation and gracefully exit.
Current Behavior
Segmentation fault while generating tokens. It usually happens after generating ~121 tokens (I tried 4 different prompts, which crashed at tokens 122, 121, 118, and 124), and it doesn't seem to happen in the llama.cpp ./main example.
Environment and Context
I am using a context size of 512, prediction 256, and batch 1024. The rest of the settings are default. I am also using CLBlast, which on llama.cpp gives me a 2.5x boost in performance. I am also using libllama.so built from the latest llama.cpp source, so I can debug it with gdb.
Linux bober-desktop 6.3.1-x64v1-xanmod1-2 #1 SMP PREEMPT_DYNAMIC Sun, 07 May 2023 10:32:57 +0000 x86_64 GNU/Linux
Failure Information (for bugs)
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
1. llamaChat.load_context with some lengthy prompt (mine has 1300 characters)
2. llamaChat.generate: try to generate something. I used this piece of code:
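As a rough illustration only (not the reporter's original snippet or wrapper), here is a minimal generation loop using the llama-cpp-python high-level API with the settings described in this report (context 512, batch 1024, 256 predictions); the model path and prompt are placeholders:

```python
# Minimal sketch of a generation loop, assuming the high-level llama_cpp API.
# Model path and prompt are placeholders, not the originals from the gist.
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_ctx=512, n_batch=1024)

prompt = "A lengthy prompt of roughly 1300 characters goes here..."
tokens = llm.tokenize(prompt.encode("utf-8"))

generated = 0
for token in llm.generate(tokens, top_k=40, top_p=0.95, temp=0.8, repeat_penalty=1.1):
    if token == llm.token_eos() or generated >= 256:
        break
    # Print each token as it is produced; the crash reported above would
    # occur somewhere past the ~120th iteration of this loop.
    print(llm.detokenize([token]).decode("utf-8", errors="ignore"), end="", flush=True)
    generated += 1
```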