HTTP/1.1 200 OK
Access-Control-Allow-Origin:
Content-Type: text/event-stream
Server: llama.cpp
X-Accel-Buffering: no
Date: Thu, 18 Jun 2026 16:49:10 GMT
Transfer-Encoding: chunked
:
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1781801390,"id":"chatcmpl-BF4pR1cDP4CFarpRZr48WTSkw1MfKHlW","model":"unsloth/Qwen3.5-4B-GGUF:Q4_K_M","system_fingerprint":"b1-8086439","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"**Ohio (continued)**\n...cattle, which dominated the landscape until the mid-19th century. As the Industrial Revolution took hold, Ohio diversified its economy, with manufacturing, steel production, and innovation driving growth. The state is home to major companies like Procter & Gamble and Goodyear Tire, making it a hub for industry and technology.\n\n**Oklahoma**\nThe Sooner State, where cowboy culture meets energy boom. Founded in 1907 as one of the 46th states, it was previously part of the Oklahoma Territory. The early economy relied on agriculture, particularly wheat and cattle, which dominated the landscape until the mid-20th century. As the Industrial Revolution took hold, Oklahoma diversified its economy, with oil and gas production, mining, and tourism driving growth.\n\n**Oregon**\nThe Beaver State, where logging meets tech innovation. Founded in 1859 as one of the 33rd states, it was previously part of the Oregon Territory. The early economy relied on agriculture, particularly timber and fishing, which dominated the landscape until the mid-20th century. As the Industrial Revolution took hold, Oregon diversified its economy, with manufacturing, forestry, and technology driving growth.\n\n**Pennsylvania**\nThe Keystone State, where industrial history meets medical innovation. Founded in 1787 as one of the original colonies, it was originally known as the Province of Pennsylvania. The early economy relied on trade and commerce, with Philadelphia emerging as a major hub for industry and finance. As the Industrial Revolution took hold, Pennsylvania diversified its economy, with manufacturing, coal mining, and medical research driving growth.\n\n**Rhode Island**\nThe Ocean State, where coastal tourism meets financial hub. Founded in 1790 as one of the original colonies, it was originally known as the Colony of Rhode Island and Providence Plantations. 
The early economy relied on trade and commerce, with Newport emerging as a major hub for industry and finance. As the Indus"}}],"created":1781801390,"id":"chatcmpl-BF4pR1cDP4CFarpRZr48WTSkw1MfKHlW","model":"unsloth/Qwen3.5-4B-GGUF:Q4_K_M","system_fingerprint":"b1-8086439","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}],"created":1781801390,"id":"chatcmpl-BF4pR1cDP4CFarpRZr48WTSkw1MfKHlW","model":"unsloth/Qwen3.5-4B-GGUF:Q4_K_M","system_fingerprint":"b1-8086439","object":"chat.completion.chunk"}
data: {"choices":[],"created":1781801390,"id":"chatcmpl-BF4pR1cDP4CFarpRZr48WTSkw1MfKHlW","model":"unsloth/Qwen3.5-4B-GGUF:Q4_K_M","system_fingerprint":"b1-8086439","object":"chat.completion.chunk","usage":{"completion_tokens":1,"prompt_tokens":4557,"total_tokens":4558,"prompt_tokens_details":{"cached_tokens":0}},"timings":{"cache_n":0,"prompt_n":4557,"prompt_ms":40035.243,"prompt_per_token_ms":8.785438446346282,"prompt_per_second":113.82471189196977,"predicted_n":1,"predicted_ms":0.001,"predicted_per_token_ms":0.001,"predicted_per_second":1000000.0}}
data: [DONE]
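A minimal sketch of how the symptom in this stream could be flagged automatically: a large amount of streamed content combined with a reported `completion_tokens` of 1. It assumes the stream was saved as text with each `data:` event on a single line. The helper name `summarize_stream` and the 100-character threshold are hypothetical choices for illustration, not part of llama.cpp.

```python
import json


def summarize_stream(sse_text, min_chars=100):
    """Collect streamed content and reported usage from a saved SSE dump."""
    content_parts = []
    usage = None
    for line in sse_text.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skips blank lines and ':' comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore events that were split or truncated in the dump
        for choice in event.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                content_parts.append(text)
        if event.get("usage"):
            usage = event["usage"]
    content_chars = len("".join(content_parts))
    completion_tokens = usage.get("completion_tokens") if usage else None
    # Assumed heuristic: a lot of text cannot plausibly be a single token.
    suspicious = (
        completion_tokens is not None
        and completion_tokens <= 1
        and content_chars > min_chars
    )
    return {
        "content_chars": content_chars,
        "completion_tokens": completion_tokens,
        "suspicious": suspicious,
    }
```

Running this over a saved response where the last assistant message is echoed back and usage reports one completion token should return `suspicious` as true. Events that are broken across lines in a dump are skipped rather than repaired.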
Logs
0.00.084.700 W Setting 'enable_thinking' via --chat-template-kwargs is deprecated. Use --reasoning on / --reasoning off instead.
0.00.290.396 E get_repo_commit: error: HTTPLIB failed: SSL connection failed
0.00.290.405 W get_repo_files: failed to resolve commit for unsloth/Qwen3.5-4B-GGUF
0.00.298.335 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.298.338 I device_info:
0.00.443.807 I - CUDA0 : Quadro T1000 with Max-Q Design (3714 MiB, 3658 MiB free)
0.00.443.815 I - CPU : Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz (15648 MiB, 15648 MiB free)
0.00.443.859 I system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CUDA : ARCHS = 600,610,750,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.443.860 I srv llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.443.877 I srv init: running without SSL
0.00.443.907 I srv init: using 11 threads for HTTP server
0.00.444.292 I srv start: binding port with default address family
0.00.445.472 I srv llama_server: loading model
0.00.445.474 I srv load_model: loading model '/models/models--unsloth--Qwen3.5-4B-GGUF/snapshots/e87f176479d0855a907a41277aca2f8ee7a09523/Qwen3.5-4B-Q4_K_M.gguf'
0.01.077.890 I srv load_model: [mtmd] estimated worst-case memory usage of mmproj is 892.48 MiB
0.01.077.897 I common_init_result: fitting params to device memory ...
0.01.077.897 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.07.264.851 W llama_context: n_ctx_seq (32768) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.07.489.003 W sched_reserve: layer 0 is assigned to device CPU but the fused Gated Delta Net tensor is assigned to device CUDA0 (usually due to missing support)
0.07.489.008 W sched_reserve: fused Gated Delta Net (chunked) not supported, set to disabled
0.07.512.609 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.07.788.030 W load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
0.07.788.033 W load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
0.07.788.033 W load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842
0.08.295.264 I srv load_model: loaded multimodal model, '/models/models--unsloth--Qwen3.5-4B-GGUF/snapshots/e87f176479d0855a907a41277aca2f8ee7a09523/mmproj-BF16.gguf'
0.08.295.271 I srv load_model: initializing slots, n_slots = 4
0.08.563.834 W srv load_model: speculative decoding will use checkpoints
0.08.563.846 W common_speculative_init: no implementations specified for speculative decoding
0.08.563.848 I slot load_model: id 0 | task -1 | new slot, n_ctx = 32768
0.08.563.850 I slot load_model: id 1 | task -1 | new slot, n_ctx = 32768
0.08.563.850 I slot load_model: id 2 | task -1 | new slot, n_ctx = 32768
0.08.563.850 I slot load_model: id 3 | task -1 | new slot, n_ctx = 32768
0.08.564.058 I srv load_model: prompt cache is enabled, size limit: 8192 MiB
0.08.564.063 I srv load_model: use `--cache-ram 0` to disable the prompt cache
0.08.564.063 I srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.08.564.064 I srv load_model: context checkpoints enabled, max = 32, min spacing = 256
0.08.564.087 I srv init: idle slots will be saved to prompt cache and cleared upon starting a new task
0.08.592.811 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
'
0.08.612.703 I srv init: init: chat template, thinking = 1
0.08.612.715 I srv llama_server: model loaded
0.08.612.717 I srv llama_server: server is listening on http://0.0.0.0:5819
0.08.612.720 I srv update_slots: all slots are idle
0.09.259.354 I srv params_from_: Chat format: peg-native
0.09.269.930 I slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
0.09.269.932 I srv get_availabl: updating prompt cache
0.09.269.935 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.09.269.937 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 32768 tokens, 8589934592 est)
0.09.269.938 I srv get_availabl: prompt cache update took 0.01 ms
0.09.271.078 I slot launch_slot_: id 3 | task 0 | processing task, is_child = 0
0.09.271.080 I slot process_sing: id 0 | task -1 | saving idle slot to prompt cache
0.09.271.081 I slot prompt_clear: id 0 | task -1 | clearing prompt with 0 tokens
0.09.271.104 I slot process_sing: id 1 | task -1 | saving idle slot to prompt cache
0.09.271.104 I slot prompt_clear: id 1 | task -1 | clearing prompt with 0 tokens
0.09.271.124 I slot process_sing: id 2 | task -1 | saving idle slot to prompt cache
0.09.271.124 I slot prompt_clear: id 2 | task -1 | clearing prompt with 0 tokens
0.25.159.492 I slot print_timing: id 3 | task 0 | prompt processing, n_tokens = 2048, progress = 0.45, t = 15.89 s / 128.90 tokens per second
0.42.851.096 I slot print_timing: id 3 | task 0 | prompt processing, n_tokens = 4041, progress = 0.89, t = 33.58 s / 120.34 tokens per second
0.44.675.360 I slot print_timing: id 3 | task 0 | prompt processing, n_tokens = 4129, progress = 0.91, t = 35.40 s / 116.62 tokens per second
0.44.978.809 I slot create_check: id 3 | task 0 | created context checkpoint 1 of 32 (pos_min = 4128, pos_max = 4128, n_tokens = 4129, size = 50.251 MiB)
0.48.110.112 I slot print_timing: id 3 | task 0 | prompt processing, n_tokens = 4553, progress = 1.00, t = 38.84 s / 117.23 tokens per second
0.49.048.335 I slot create_check: id 3 | task 0 | created context checkpoint 2 of 32 (pos_min = 4552, pos_max = 4552, n_tokens = 4553, size = 50.251 MiB)
0.49.306.403 I slot print_timing: id 3 | task 0 | prompt eval time = 40035.24 ms / 4557 tokens ( 8.79 ms per token, 113.82 tokens per second)
0.49.306.406 I slot print_timing: id 3 | task 0 | eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, 1000000.00 tokens per second)
0.49.306.407 I slot print_timing: id 3 | task 0 | total time = 40035.24 ms / 4558 tokens
0.49.306.408 I slot print_timing: id 3 | task 0 | graphs reused = 1
0.49.306.610 I slot release: id 3 | task 0 | stop processing: n_tokens = 4557, truncated = 0
0.49.306.617 I srv update_slots: all slots are idle
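The rates in the timing lines are consistent with a plain tokens / milliseconds * 1000 calculation, which also accounts for the implausible 1000000 tokens per second: one token counted over a reported 0.001 ms. The short check below uses values copied from the `timings` object in the response. The function name is hypothetical, and the formula is an assumption inferred from the numbers, not taken from the llama.cpp source.

```python
def tokens_per_second(n_tokens, elapsed_ms):
    """Rate implied by a token count and an elapsed time in milliseconds."""
    return n_tokens / elapsed_ms * 1000.0


# Values copied from the "timings" object in the response.
prompt_rate = tokens_per_second(4557, 40035.243)  # prompt_n, prompt_ms
predicted_rate = tokens_per_second(1, 0.001)      # predicted_n, predicted_ms

print(f"prompt: {prompt_rate:.2f} tokens/s")
print(f"predicted: {predicted_rate:.0f} tokens/s")
```

The prompt rate comes out close to the logged 113.82 tokens per second, so the prompt was fully processed. The generation figures suggest that essentially no decoding took place.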
Name and Version
root@f04d38f89dd6:/models# llama-cli --version
version: 1 (8086439)
built with GNU 13.3.0 for Linux x86_64
root@f04d38f89dd6:/models# cat /versions.txt
llama.cpp: 8086439
Operating systems
Linux
GGML backends
CUDA
Hardware
0.00.430.444 I - CUDA0 : Quadro T1000 with Max-Q Design (3714 MiB, 3658 MiB free)
0.00.430.451 I - CPU : Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz (15648 MiB, 15648 MiB free)
Also happens on other hardware.
Models
Models Affected:
bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_K_M
bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M
bartowski/Qwen2.5-1.5B-Instruct-GGUF:Q4_K_M
bartowski/Qwen2.5-3B-Instruct-GGUF:Q4_K_M
unsloth/Qwen3-0.6B-GGUF:Q4_K_M - 22-204 tokens generated (incomplete)
unsloth/Qwen3.5-2B-GGUF:Q8_0
unsloth/Qwen3.5-4B-GGUF:Q4_K_M
unsloth/gemma-4-E2B-it-qat-GGUF:Q4_K_XL (claims to have generated 500 tokens but seems to be more like 40)
Not affected: completed 500 tokens (but still sends the last message as the first chunk)
bartowski/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M
unsloth/gemma-4-E2B-it-GGUF:Q4_K_M
unsloth/Qwen3-1.7B-GGUF:Q4_K_M
unsloth/Qwen3-4B-GGUF:Q4_K_M
unsloth/gemma-3-4B-it-GGUF:Q4_K_M
Problem description & steps to reproduce
When I send a large API request to /v1/chat/completions containing 10 long messages of about 500 tokens each, the API usually does not respond correctly. The issue seems to start at around 4096 tokens, though that is a guess. I first noticed it in ollama with llama3.2:3b, which was returning blank responses, so I tested llama.cpp directly. The issue occurs here as well, although the API does at least return the last assistant message, with no added tokens, and reports 1 token generated.
Sample JSON request from llama-swap to upstream llama-server:
curl -o api-response_output1.html -X POST "http://localhost:8080/upstream/qwen3.5-4b-q4_k_m/v1/chat/completions" -H "Content-Type: application/json" -d @api-issue.json -i
Attached request body: api-issue.json
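For readers without the attachment, a request of similar shape can be generated with a sketch like the one below. Everything about the body is an assumption: the real api-issue.json is not shown here, so the filler text, the alternating user/assistant roles, the words-per-message estimate, and the `stream` and `max_tokens` values are placeholders. Only the model name is taken from the URL in the curl command.

```python
import json


def build_payload(n_messages=10, words_per_message=400,
                  model="qwen3.5-4b-q4_k_m"):
    """Build a chat request with several long messages (placeholder text)."""
    # Assumption: a few hundred filler words yield roughly 500 tokens;
    # the real ratio depends on the model's tokenizer.
    filler = " ".join(["lorem"] * words_per_message)
    messages = []
    for i in range(n_messages):
        # Assumption: roles alternate, starting with the user.
        role = "user" if i % 2 == 0 else "assistant"
        messages.append({"role": role, "content": f"Message {i + 1}: {filler}"})
    return {
        "model": model,
        "messages": messages,
        "stream": True,     # assumed, since the response is text/event-stream
        "max_tokens": 500,  # assumed from the "completed 500 tokens" notes
    }


payload = build_payload()
body = json.dumps(payload)
print(len(payload["messages"]), "messages,", len(body), "bytes")
```

Saving `body` to a file and passing it to curl with `-d @<file>` would exercise the same endpoint. Whether it triggers the same behavior depends on how closely it matches the original attachment.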
Sample output:
First Bad Commit
No response
Relevant log output
Logs