Misc. bug: GGML_ASSERT(n <= tokens.size()) failed - Memory in use ('/completion' endpoint and 'cache_prompt=false') #13484
It works with smaller inputs, for example 1386 tokens (possibly anything below 2048):
This bug is also reproducible on Nvidia cards and with different models, such as Qwen Coder.
Build options:
Log:
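The exact build options and log were attached to the original report and are not shown here; for context, a CUDA-enabled build along these lines (flags are illustrative, not the reporter's) is a typical setup on Nvidia hardware:

```sh
# Sketch of a CUDA-enabled llama.cpp build; the reporter's actual flags were
# attached to the issue and may differ (e.g. the RPC build mentioned below).
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```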
Can you try https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF

With the model above, it works well even when I enter 12000 tokens
@ngxson: Thanks for your support! I'm using the '/completion' endpoint, not '/v1/chat/completions'. Same results with this small model (with or without flash attention and GPU offloading):
Command:
I think it's likely that your source code is not clean; try … Also, maybe you are using a build from another source (indicated by the …)
I'm using the llama.cpp repo only; the last working release is b5329. Here's a complete test to reproduce:
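The attached prompt.json itself is not reproduced here; as a rough sketch (host, port, prompt, and token counts are placeholders), a request of this shape against a running llama-server exercises the '/completion' endpoint with prompt caching disabled:

```sh
# Illustrative request; the real reproduction sends a prompt of ~8241 tokens.
curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "<paste a prompt of several thousand tokens here>",
        "n_predict": 128,
        "cache_prompt": false
      }'
```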
Attachment: prompt-v1-chat-completions.json
The same llama.cpp build was tested against the '/v1/chat/completions' endpoint and it works fine.
Results:
Thus, the prompt cache seems to be the cause. If 'cache_prompt' is set to true, the '/completion' endpoint ('prompt.json') works again, but this is not a solution for this particular case:
Thanks for the info. Yes, this seems to be a valid bug in all versions (even the old build). The reason the old version doesn't crash is that std::vector allows resizing to an arbitrary number of elements. If we resize to a number larger than the number of tokens currently in the cache, std::vector fills the "added" elements with zeros, which is technically incorrect. For example, if the first request does NOT use the cache, this logic fills up cache_tokens with wrong values. If the second request then decides to use the cache, it gets an incorrect list of cache_tokens. The newer version crashes because I added the check to prevent such a case from happening. I'll push a fix for this, cc @ggerganov
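For illustration only (this is not the server code itself), a minimal C++ sketch of why the old behavior silently corrupted the cache: growing a std::vector with resize() value-initializes the new elements, so token ids that were never written end up as 0.

```cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> cache_tokens = {101, 202, 303};   // tokens actually cached

    // Old behavior: resizing past the number of cached tokens "succeeds" ...
    cache_tokens.resize(6);

    // ... but the tail is silently zero-filled, which a later request that
    // opts into prompt caching would treat as a valid token prefix.
    for (int id : cache_tokens) {
        std::printf("%d ", id);                         // prints: 101 202 303 0 0 0
    }
    std::printf("\n");
    return 0;
}
```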
@ngxson Thanks for your support!
Tested with release b5379, and it works like a charm. Thanks for your work, guys! :)
Name and Version
llama.cpp version: b5359 (compiled with -DGGML_RPC=ON)
Model: Mistral-Nemo-12B-Instruct-2407-Q8_0.gguf
Command line arguments:
Error: GGML_ASSERT(n <= tokens.size()) failed
Last working version: b5329
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
Problem description & steps to reproduce
Error
GGML_ASSERT(n <= tokens.size()) failed in slot update_slots
when the input text is long (8241 tokens with a 22000-token context size)

First Bad Commit
33eff40
Relevant log output