Eval bug: API not returning response

### Name and Version

root@f04d38f89dd6:/models# llama-cli --version
version: 1 (8086439)
built with GNU 13.3.0 for Linux x86_64

root@f04d38f89dd6:/models# cat /versions.txt 
llama.cpp: 8086439a4cea94c71a5dfb8fe4ad1546aebd640f

### Operating systems

Linux

### GGML backends

CUDA

### Hardware

0.00.430.444 I   - CUDA0   : Quadro T1000 with Max-Q Design (3714 MiB, 3658 MiB free)
0.00.430.451 I   - CPU     : Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz (15648 MiB, 15648 MiB free)

Also happens on other hardware.

### Models

**Models Affected:**
bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_K_M
bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M
bartowski/Qwen2.5-1.5B-Instruct-GGUF:Q4_K_M
bartowski/Qwen2.5-3B-Instruct-GGUF:Q4_K_M
unsloth/Qwen3-0.6B-GGUF:Q4_K_M - 22-204 tokens generated (incomplete)
unsloth/Qwen3.5-2B-GGUF:Q8_0
unsloth/Qwen3.5-4B-GGUF:Q4_K_M
unsloth/gemma-4-E2B-it-qat-GGUF:Q4_K_XL (claims to have generated 500 tokens but seems to be more like 40)

**Not Affected** - Completed 500 tokens (but the last message is still sent in full as the first chunk)
bartowski/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M
unsloth/gemma-4-E2B-it-GGUF:Q4_K_M
unsloth/Qwen3-1.7B-GGUF:Q4_K_M
unsloth/Qwen3-4B-GGUF:Q4_K_M
unsloth/gemma-3-4B-it-GGUF:Q4_K_M


### Problem description & steps to reproduce

When I send a large API request to /v1/chat/completions with 10 long messages of about 500 tokens each, the API usually does not respond correctly. My guess is that the issue starts at around 4096 tokens. I first noticed it in ollama with llama3.2:3b, which was returning blank responses, so I tested llama.cpp directly. The issue is present here too, but the API at least returns the last assistant message, with no added tokens, and reports 1 token generated.

Sample JSON request from llama-swap to upstream llama-server:
`curl -o api-response_output1.html -X POST "http://localhost:8080/upstream/qwen3.5-4b-q4_k_m/v1/chat/completions" -H "Content-Type: application/json" -d @api-issue.json -i`

[api-issue.json](https://github.com/user-attachments/files/29103617/api-issue.json)
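
The attached file is the actual reproducer. For anyone who wants a request of a similar shape without downloading it, here is a rough sketch in Python. Everything in it is an assumption rather than a copy of the attachment: the `build_payload` helper is hypothetical, the filler text is a placeholder, and the message roles simply alternate so that the request ends on an assistant message, which is what the description above implies. `max_tokens` and `stream` are set to match the 500-token streamed runs reported in this issue.

```python
import json

def build_payload(model: str, n_messages: int = 10, words_per_message: int = 400) -> dict:
    """Build a chat request with alternating user/assistant turns of long filler text.

    Hypothetical helper: the real api-issue.json may differ in roles and content.
    """
    # Placeholder text only; word count is a crude stand-in for ~500 tokens.
    filler = " ".join(["lorem"] * words_per_message)
    messages = []
    for i in range(n_messages):
        # Even indices are user turns, odd indices are assistant turns,
        # so a 10-message request ends with an assistant message.
        role = "user" if i % 2 == 0 else "assistant"
        messages.append({"role": role, "content": f"Message {i + 1}: {filler}"})
    return {"model": model, "messages": messages, "max_tokens": 500, "stream": True}

payload = build_payload("qwen3.5-4b-q4_k_m")
body = json.dumps(payload)  # JSON text suitable for saving and passing to curl with -d @file
```

The real token count depends on the model's tokenizer, so `words_per_message` would need adjusting to land above the roughly 4096-token point where the problem seems to start.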

Sample output:
```console
HTTP/1.1 200 OK
Access-Control-Allow-Origin: 
Content-Type: text/event-stream
Server: llama.cpp
X-Accel-Buffering: no
Date: Thu, 18 Jun 2026 16:49:10 GMT
Transfer-Encoding: chunked

:

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1781801390,"id":"chatcmpl-BF4pR1cDP4CFarpRZr48WTSkw1MfKHlW","model":"unsloth/Qwen3.5-4B-GGUF:Q4_K_M","system_fingerprint":"b1-8086439","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"**Ohio (continued)**\n...cattle, which dominated the landscape until the mid-19th century. As the Industrial Revolution took hold, Ohio diversified its economy, with manufacturing, steel production, and innovation driving growth. The state is home to major companies like Procter & Gamble and Goodyear Tire, making it a hub for industry and technology.\n\n**Oklahoma**\nThe Sooner State, where cowboy culture meets energy boom. Founded in 1907 as one of the 46th states, it was previously part of the Oklahoma Territory. The early economy relied on agriculture, particularly wheat and cattle, which dominated the landscape until the mid-20th century. As the Industrial Revolution took hold, Oklahoma diversified its economy, with oil and gas production, mining, and tourism driving growth.\n\n**Oregon**\nThe Beaver State, where logging meets tech innovation. Founded in 1859 as one of the 33rd states, it was previously part of the Oregon Territory. The early economy relied on agriculture, particularly timber and fishing, which dominated the landscape until the mid-20th century. As the Industrial Revolution took hold, Oregon diversified its economy, with manufacturing, forestry, and technology driving growth.\n\n**Pennsylvania**\nThe Keystone State, where industrial history meets medical innovation. Founded in 1787 as one of the original colonies, it was originally known as the Province of Pennsylvania. The early economy relied on trade and commerce, with Philadelphia emerging as a major hub for industry and finance. As the Industrial Revolution took hold, Pennsylvania diversified its economy, with manufacturing, coal mining, and medical research driving growth.\n\n**Rhode Island**\nThe Ocean State, where coastal tourism meets financial hub. Founded in 1790 as one of the original colonies, it was originally known as the Colony of Rhode Island and Providence Plantations. 
The early economy relied on trade and commerce, with Newport emerging as a major hub for industry and finance. As the Indus"}}],"created":1781801390,"id":"chatcmpl-BF4pR1cDP4CFarpRZr48WTSkw1MfKHlW","model":"unsloth/Qwen3.5-4B-GGUF:Q4_K_M","system_fingerprint":"b1-8086439","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}],"created":1781801390,"id":"chatcmpl-BF4pR1cDP4CFarpRZr48WTSkw1MfKHlW","model":"unsloth/Qwen3.5-4B-GGUF:Q4_K_M","system_fingerprint":"b1-8086439","object":"chat.completion.chunk"}

data: {"choices":[],"created":1781801390,"id":"chatcmpl-BF4pR1cDP4CFarpRZr48WTSkw1MfKHlW","model":"unsloth/Qwen3.5-4B-GGUF:Q4_K_M","system_fingerprint":"b1-8086439","object":"chat.completion.chunk","usage":{"completion_tokens":1,"prompt_tokens":4557,"total_tokens":4558,"prompt_tokens_details":{"cached_tokens":0}},"timings":{"cache_n":0,"prompt_n":4557,"prompt_ms":40035.243,"prompt_per_token_ms":8.785438446346282,"prompt_per_second":113.82471189196977,"predicted_n":1,"predicted_ms":0.001,"predicted_per_token_ms":0.001,"predicted_per_second":1000000.0}}

data: [DONE]


```
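
To check other models for the same symptom without reading the stream by eye, something like the following Python sketch could summarize a saved response. It is only a heuristic under stated assumptions: `summarize_stream` and `looks_affected` are hypothetical helper names, each `data:` event is assumed to sit on a single line, and "affected" is taken to mean that the server reports at most one completion token, as in the output above.

```python
import json

def summarize_stream(raw: str) -> dict:
    """Collect streamed content and the reported usage from a saved SSE response."""
    content_parts = []
    usage = None
    finish_reason = None
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue  # skips HTTP headers (from curl -i), blank lines, and ":" comments
        data = line[len("data: "):]
        if data.strip() == "[DONE]":
            break
        chunk = json.loads(data)  # assumes one complete JSON object per data line
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                content_parts.append(delta["content"])
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
        if chunk.get("usage"):
            usage = chunk["usage"]
    return {
        "content_chunks": len(content_parts),
        "content_chars": sum(len(part) for part in content_parts),
        "finish_reason": finish_reason,
        "completion_tokens": usage["completion_tokens"] if usage else None,
    }

def looks_affected(summary: dict) -> bool:
    """Heuristic: the server finished while reporting at most one generated token."""
    tokens = summary["completion_tokens"]
    return tokens is not None and tokens <= 1
```

It would be run on the text of the file written by `curl -o`, for example `looks_affected(summarize_stream(open("api-response_output1.html").read()))`.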

### First Bad Commit

_No response_

### Relevant log output

<details>
<summary>Logs</summary>


```console
0.00.084.700 W Setting 'enable_thinking' via --chat-template-kwargs is deprecated. Use --reasoning on / --reasoning off instead.
0.00.290.396 E get_repo_commit: error: HTTPLIB failed: SSL connection failed
0.00.290.405 W get_repo_files: failed to resolve commit for unsloth/Qwen3.5-4B-GGUF
0.00.298.335 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.298.338 I device_info:
0.00.443.807 I   - CUDA0   : Quadro T1000 with Max-Q Design (3714 MiB, 3658 MiB free)
0.00.443.815 I   - CPU     : Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz (15648 MiB, 15648 MiB free)
0.00.443.859 I system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CUDA : ARCHS = 600,610,750,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
0.00.443.860 I srv  llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.443.877 I srv          init: running without SSL
0.00.443.907 I srv          init: using 11 threads for HTTP server
0.00.444.292 I srv         start: binding port with default address family
0.00.445.472 I srv  llama_server: loading model
0.00.445.474 I srv    load_model: loading model '/models/models--unsloth--Qwen3.5-4B-GGUF/snapshots/e87f176479d0855a907a41277aca2f8ee7a09523/Qwen3.5-4B-Q4_K_M.gguf'
0.01.077.890 I srv    load_model: [mtmd] estimated worst-case memory usage of mmproj is 892.48 MiB
0.01.077.897 I common_init_result: fitting params to device memory ...
0.01.077.897 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.07.264.851 W llama_context: n_ctx_seq (32768) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.07.489.003 W sched_reserve: layer 0 is assigned to device CPU but the fused Gated Delta Net tensor is assigned to device CUDA0 (usually due to missing support)
0.07.489.008 W sched_reserve: fused Gated Delta Net (chunked) not supported, set to disabled
0.07.512.609 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.07.788.030 W load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
0.07.788.033 W load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
0.07.788.033 W load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842

0.08.295.264 I srv    load_model: loaded multimodal model, '/models/models--unsloth--Qwen3.5-4B-GGUF/snapshots/e87f176479d0855a907a41277aca2f8ee7a09523/mmproj-BF16.gguf'
0.08.295.271 I srv    load_model: initializing slots, n_slots = 4
0.08.563.834 W srv    load_model: speculative decoding will use checkpoints
0.08.563.846 W common_speculative_init: no implementations specified for speculative decoding
0.08.563.848 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 32768
0.08.563.850 I slot   load_model: id  1 | task -1 | new slot, n_ctx = 32768
0.08.563.850 I slot   load_model: id  2 | task -1 | new slot, n_ctx = 32768
0.08.563.850 I slot   load_model: id  3 | task -1 | new slot, n_ctx = 32768
0.08.564.058 I srv    load_model: prompt cache is enabled, size limit: 8192 MiB
0.08.564.063 I srv    load_model: use `--cache-ram 0` to disable the prompt cache
0.08.564.063 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.08.564.064 I srv    load_model: context checkpoints enabled, max = 32, min spacing = 256
0.08.564.087 I srv          init: idle slots will be saved to prompt cache and cleared upon starting a new task
0.08.592.811 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
'
0.08.612.703 I srv          init: init: chat template, thinking = 1
0.08.612.715 I srv  llama_server: model loaded
0.08.612.717 I srv  llama_server: server is listening on http://0.0.0.0:5819
0.08.612.720 I srv  update_slots: all slots are idle
0.09.259.354 I srv  params_from_: Chat format: peg-native
0.09.269.930 I slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1
0.09.269.932 I srv  get_availabl: updating prompt cache
0.09.269.935 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.09.269.937 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 32768 tokens, 8589934592 est)
0.09.269.938 I srv  get_availabl: prompt cache update took 0.01 ms
0.09.271.078 I slot launch_slot_: id  3 | task 0 | processing task, is_child = 0
0.09.271.080 I slot process_sing: id  0 | task -1 | saving idle slot to prompt cache
0.09.271.081 I slot prompt_clear: id  0 | task -1 | clearing prompt with 0 tokens
0.09.271.104 I slot process_sing: id  1 | task -1 | saving idle slot to prompt cache
0.09.271.104 I slot prompt_clear: id  1 | task -1 | clearing prompt with 0 tokens
0.09.271.124 I slot process_sing: id  2 | task -1 | saving idle slot to prompt cache
0.09.271.124 I slot prompt_clear: id  2 | task -1 | clearing prompt with 0 tokens
0.25.159.492 I slot print_timing: id  3 | task 0 | prompt processing, n_tokens =   2048, progress = 0.45, t =  15.89 s / 128.90 tokens per second
0.42.851.096 I slot print_timing: id  3 | task 0 | prompt processing, n_tokens =   4041, progress = 0.89, t =  33.58 s / 120.34 tokens per second
0.44.675.360 I slot print_timing: id  3 | task 0 | prompt processing, n_tokens =   4129, progress = 0.91, t =  35.40 s / 116.62 tokens per second
0.44.978.809 I slot create_check: id  3 | task 0 | created context checkpoint 1 of 32 (pos_min = 4128, pos_max = 4128, n_tokens = 4129, size = 50.251 MiB)
0.48.110.112 I slot print_timing: id  3 | task 0 | prompt processing, n_tokens =   4553, progress = 1.00, t =  38.84 s / 117.23 tokens per second
0.49.048.335 I slot create_check: id  3 | task 0 | created context checkpoint 2 of 32 (pos_min = 4552, pos_max = 4552, n_tokens = 4553, size = 50.251 MiB)
0.49.306.403 I slot print_timing: id  3 | task 0 | prompt eval time =   40035.24 ms /  4557 tokens (    8.79 ms per token,   113.82 tokens per second)
0.49.306.406 I slot print_timing: id  3 | task 0 |        eval time =       0.00 ms /     1 tokens (    0.00 ms per token, 1000000.00 tokens per second)
0.49.306.407 I slot print_timing: id  3 | task 0 |       total time =   40035.24 ms /  4558 tokens
0.49.306.408 I slot print_timing: id  3 | task 0 |    graphs reused =          1
0.49.306.610 I slot      release: id  3 | task 0 | stop processing: n_tokens = 4557, truncated = 0
0.49.306.617 I srv  update_slots: all slots are idle
```
</details>
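
The same symptom is visible in the `print_timing` lines above (4557 prompt tokens evaluated, then 1 token in 0.00 ms). When comparing several runs, a small Python sketch like this could pull those numbers out of a server log. The `parse_timings` name and the regular expression are my own assumptions, written against the log format shown in this issue only; other builds may format these lines differently.

```python
import re

# Matches "prompt eval time = <ms> ms / <n> tokens" and "eval time = <ms> ms / <n> tokens"
# as they appear after the "| task N |" prefix in the log above.
TIMING_RE = re.compile(r"\|\s*(prompt eval|eval) time =\s*([\d.]+) ms /\s*(\d+) tokens")

def parse_timings(log_text: str) -> dict:
    """Return {"prompt eval": (ms, tokens), "eval": (ms, tokens)} for the last task seen."""
    timings = {}
    for line in log_text.splitlines():
        match = TIMING_RE.search(line)
        if match:
            kind, ms, tokens = match.groups()
            timings[kind] = (float(ms), int(tokens))
    return timings
```

A run that shows the problem would then appear as an `"eval"` entry with a token count of 1 next to a `"prompt eval"` count in the thousands.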
