Eval bug: std::runtime_error Invalid diff: #13876
Comments
I'm getting the same thing on unsloth's Q8 quants of both Qwen3 32B and 30B-A3B. I've bisected it to e121edc from PR #13771. One weird thing I noticed is that it may be using the wrong chat template. I'm able to fairly consistently reproduce it with this prompt.
Have you tried commit 03f582a? Scratch that. It takes some time, but I'm able to reproduce it after taking quite a few turns. I don't get the diff error anymore, though. Trying to see if these are related to issue #13877.

00:26:47 | ~/.bin/cpp/llama.cpp
git:(master | θ) λ gdb --quiet -ex='break main' -ex=run --args llama-server --port 8080 --n-gpu-layers 99 --ctx-size 16384 --pooling mean --slots --jinja -fa
-m /mnt/valerie/models/Qwen/Qwen3-1.7B/ggml-model-f16.gguf
# ...
prompt eval time = 503.52 ms / 425 tokens ( 1.18 ms per token, 844.06 tokens per second)
eval time = 10701.92 ms / 477 tokens ( 22.44 ms per token, 44.57 tokens per second)
total time = 11205.44 ms / 902 tokens
terminate called after throwing an instance of 'std::runtime_error'
what(): </think>
Thread 1 "llama-server" received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
44 return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
(gdb) bt
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x00007ffff53af813 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:89
#2 0x00007ffff5355dc0 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007ffff533d57a in __GI_abort () at abort.c:73
#4 0x00007ffff5697bf8 in __gnu_cxx::__verbose_terminate_handler () at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#5 0x00007ffff56b1c1a in __cxxabiv1::__terminate (handler=<optimized out>) at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
#6 0x00007ffff56975db in std::terminate () at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58
#7 0x00007ffff56b1ed6 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x555555b1d260 <typeinfo for std::runtime_error@GLIBCXX_3.4>, dest=0x7ffff56c99b0 <std::runtime_error::~runtime_error()>) at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_throw.cc:98
#8 0x00005555558aabfa in common_chat_parse (input="<think>\nOkay, the user wants me to read the weather.py file. Let me think about how to approach this.\n\nFirst, I need to figure out the path structure. The user mentioned that the file is small and sel"..., is_partial=false, syntax=...)
at /home/austin/.bin/cpp/llama.cpp/common/chat.cpp:1923
#9 0x0000555555647b2f in server_slot::update_chat_msg (this=0x555555f91e70, diffs=std::vector of length 0, capacity 0) at /home/austin/.bin/cpp/llama.cpp/tools/server/server.cpp:1414
#10 0x00005555556532ff in server_context::send_final_response (this=0x7fffffffc110, slot=...) at /home/austin/.bin/cpp/llama.cpp/tools/server/server.cpp:2521
#11 0x0000555555659056 in server_context::update_slots (this=0x7fffffffc110) at /home/austin/.bin/cpp/llama.cpp/tools/server/server.cpp:3498
#12 0x00005555556003ff in operator() (__closure=0x7fffffffd6f0) at /home/austin/.bin/cpp/llama.cpp/tools/server/server.cpp:4929
#13 0x000055555560e1e8 in std::__invoke_impl<void, main(int, char**)::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...) at /usr/include/c++/15.1.1/bits/invoke.h:63
#14 0x000055555560c446 in std::__invoke_r<void, main(int, char**)::<lambda()>&>(struct {...} &) (__fn=...) at /usr/include/c++/15.1.1/bits/invoke.h:113
#15 0x0000555555608520 in std::_Function_handler<void(), main(int, char**)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/15.1.1/bits/std_function.h:292
#16 0x000055555565fa98 in std::function<void()>::operator() (this=0x7fffffffd6f0) at /usr/include/c++/15.1.1/bits/std_function.h:593
#17 0x000055555564a4ff in server_queue::start_loop (this=0x7fffffffd5d0) at /home/austin/.bin/cpp/llama.cpp/tools/server/server.cpp:1685
#18 0x0000555555602d6c in main (argc=14, argv=0x7fffffffd9a8) at /home/austin/.bin/cpp/llama.cpp/tools/server/server.cpp:4954
(gdb) Quit
(gdb) quit
A debugging session is active.
Inferior 1 [process 82107] will be killed.
Quit anyway? (y or n) y

Can you provide backtraces to reveal where the crash occurred? I think this might be related to chat and chat-parser in common.
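For what it's worth, frame #8 shows the throw originating in common_chat_parse (common/chat.cpp:1923), called from server_slot::update_chat_msg while the final response is being sent, with is_partial=false. Below is a minimal sketch — assumed behavior only, not the actual llama.cpp parser, and with hypothetical names like parse_reasoning and chat_msg — of how a final, non-partial parse of reasoning tags can end up throwing the raw leftover text, which would surface exactly as the `what(): </think>` seen above:

```cpp
// Hypothetical sketch — not the actual common/chat.cpp code — of a final,
// non-partial reasoning-tag parse that throws the unconsumed text as an
// exception, surfacing as `what(): </think>` in the crash above.
#include <stdexcept>
#include <string>

struct chat_msg {
    std::string reasoning_content;
    std::string content;
};

chat_msg parse_reasoning(const std::string & input, bool is_partial) {
    chat_msg msg;
    const std::string open_tag  = "<think>";
    const std::string close_tag = "</think>";

    const size_t open  = input.find(open_tag);
    const size_t close = input.find(close_tag);

    if (open != std::string::npos && close != std::string::npos && open < close) {
        // Well-formed reasoning block: split it off from the visible content.
        msg.reasoning_content = input.substr(open + open_tag.size(), close - open - open_tag.size());
        msg.content           = input.substr(close + close_tag.size());
        return msg;
    }

    if (!is_partial && close != std::string::npos) {
        // Final parse with a closing tag that was never opened (or never
        // consumed): reject it, with the leftover text as the message.
        throw std::runtime_error(input.substr(close));
    }

    // Partial output: pass everything through and wait for more tokens.
    msg.content = input;
    return msg;
}
```

If something like this is what's happening, the interesting question is why the closing tag is left unconsumed on the final pass for these Qwen3 templates.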
I get the same. Here's the trace on 03f582a.
I think this is an issue with the chat parser.
These are all related to commits 03f582a..e121edc, which were introduced in PR #12379 authored by @ochafik. PR #13786 was supposed to fix it, but I didn't have time to really test it, and I'm noticing regressions once again. All of these crashes point towards common/chat and common/chat-parser.

PR #13786 weakened the diff so it wouldn't be so strict, because strictness was the original issue with the first draft of PR #12379. This is why I asked if you tried commit 03f582a specifically: that way I can see whether the regression is really resolved or not. The backtrace here shows that the crash happens whenever computing the diff between the input and output streams yields anything, because the input must match the previous history, as it did before commit 03f582a. I'm wondering why it's popping up again after this commit, which is what the OP reports here on a682474. Just crossing the t's and dotting the i's, so to speak.

I have to work today, so I won't be able to dig into this in depth until later.
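To make the "Invalid diff" part concrete, here's a rough sketch of the strictness I mean. This is an assumption about the shape of the check, not the real code in common/chat.cpp, and compute_diffs/msg_diff are made-up names: the newly parsed message has to extend the previously streamed one, and anything else is rejected by throwing.

```cpp
// Assumed shape of the strict diff check (illustrative only, not the real
// implementation): a new message that does not start with the previous
// history cannot be expressed as an append-only delta, so it throws.
#include <stdexcept>
#include <string>
#include <vector>

struct msg_diff {
    std::string content_delta;
};

std::vector<msg_diff> compute_diffs(const std::string & prev, const std::string & next) {
    std::vector<msg_diff> diffs;

    if (next.compare(0, prev.size(), prev) != 0) {
        // Previous history is not a prefix of the new message.
        throw std::runtime_error("Invalid diff: '" + prev + "' is not a prefix of '" + next + "'");
    }
    if (next.size() > prev.size()) {
        // Only the appended tail needs to be streamed to the client.
        diffs.push_back({ next.substr(prev.size()) });
    }
    return diffs;
}
```

If that roughly matches the real behavior, relaxing the prefix requirement is what PR #13786 did, which is why a reappearance of the crash on newer commits is surprising.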
I have the same bug. I tried many versions and found the last working one: b5486.
Hi @stargate426, thanks for reporting this! Could you please provide a full log?

@teleprint-me The exception w/
I also started seeing hard "Invalid diff" server crashes at b5478 and above; b5477 was fine. I don't make use of the OpenAI endpoint. It is crashing in a beam-search inference algorithm I added to my downstream server. As I recall, it was a lot of work to rebase my downstream server on b5478, so I can't rule out a mistake on my part, but given that others are seeing this, I think there is a good chance one of the routines changed/added to common in b5478 is the root cause of the problem.
@ochafik here's a full log: log.txt.gz. I also discovered that while downgrading to version 5487 (2f099b5) fixes the issue with Qwen3, I'm getting the same
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
version: 5519 (a682474)
built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
Ryzen 5 3600 + RTX 5090
Models
Qwen3 32B q5
Problem description & steps to reproduce
./llama-server -m ~/llm/models/Qwen3-32B-Q5_K_S.gguf -c 16384 -ngl 999 --host 0.0.0.0 --port 5000 --jinja --api-key
This is how I run the program. The issue happens every so often, and I can't (in the limited attempts I tried) replicate it with llama-cli.
First Bad Commit
No response
Relevant log output