
Eval bug: gpt-oss model reprocesses the entire prompt from the beginning #15894

@narendrachaudhary51

Description


Name and Version

I am running on an Emerald Rapids Xeon CPU socket with the following command. The commit id is c252ce6, which is recent.

numactl -m 0 -C 0-59 llama-server --cpu-range 0-59 -m /cold_storage/ml_datasets/narendra/llama.cpp_models/gpt-oss/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4.gguf -t 60 --port 10000 --cont-batching --parallel 8 --ctx-size 65536 --numa numactl --reasoning-format none --jinja --mlock -fa auto --no-mmap --cache-reuse 256 --swa-checkpoints 10

This is the server log for gpt-oss 120b. As you can see, n_prompt_tokens goes from 86 to 1916 to 5877 across three queries, and n_tokens equals n_prompt_tokens each time (also 86, 1916, 5877), so the entire prompt is reprocessed on every request.


srv update_slots: all slots are idle
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 3 | task 12914 | processing task
slot update_slots: id 3 | task 12914 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 86
slot update_slots: id 3 | task 12914 | kv cache rm [0, end)
slot update_slots: id 3 | task 12914 | prompt processing progress, n_past = 86, n_tokens = 86, progress = 1.000000
slot update_slots: id 3 | task 12914 | prompt done, n_past = 86, n_tokens = 86
slot update_slots: id 3 | task 12914 | SWA checkpoint create, pos_min = 0, pos_max = 85, size = 3.025 MiB, total = 1/10 (3.025 MiB)
slot release: id 3 | task 12914 | stop processing: n_past = 1967, truncated = 0
slot print_timing: id 3 | task 12914 |
prompt eval time = 474.41 ms / 86 tokens ( 5.52 ms per token, 181.28 tokens per second)
eval time = 48978.99 ms / 1882 tokens ( 26.02 ms per token, 38.42 tokens per second)
total time = 49453.39 ms / 1968 tokens
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv update_slots: all slots are idle
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 2 | task 14797 | processing task
slot update_slots: id 2 | task 14797 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 1916
slot update_slots: id 2 | task 14797 | kv cache rm [0, end)
slot update_slots: id 2 | task 14797 | prompt processing progress, n_past = 1916, n_tokens = 1916, progress = 1.000000
slot update_slots: id 2 | task 14797 | prompt done, n_past = 1916, n_tokens = 1916
slot update_slots: id 2 | task 14797 | SWA checkpoint create, pos_min = 1148, pos_max = 1915, size = 27.009 MiB, total = 1/10 (27.009 MiB)
slot release: id 2 | task 14797 | stop processing: n_past = 6010, truncated = 0
slot print_timing: id 2 | task 14797 |
prompt eval time = 10853.30 ms / 1916 tokens ( 5.66 ms per token, 176.54 tokens per second)
eval time = 125051.40 ms / 4095 tokens ( 30.54 ms per token, 32.75 tokens per second)
total time = 135904.70 ms / 6011 tokens
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv update_slots: all slots are idle
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 1 | task 18893 | processing task
slot update_slots: id 1 | task 18893 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 5877
slot update_slots: id 1 | task 18893 | kv cache rm [0, end)
slot update_slots: id 1 | task 18893 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.348477
slot update_slots: id 1 | task 18893 | kv cache rm [2048, end)
slot update_slots: id 1 | task 18893 | prompt processing progress, n_past = 4096, n_tokens = 2048, progress = 0.696954
slot update_slots: id 1 | task 18893 | kv cache rm [4096, end)
slot update_slots: id 1 | task 18893 | prompt processing progress, n_past = 5877, n_tokens = 1781, progress = 1.000000
slot update_slots: id 1 | task 18893 | prompt done, n_past = 5877, n_tokens = 1781
slot update_slots: id 1 | task 18893 | SWA checkpoint create, pos_min = 5109, pos_max = 5876, size = 27.009 MiB, total = 1/10 (27.009 MiB)
slot release: id 1 | task 18893 | stop processing: n_past = 8191, truncated = 0
slot print_timing: id 1 | task 18893 |
prompt eval time = 41512.46 ms / 5877 tokens ( 7.06 ms per token, 141.57 tokens per second)
eval time = 84743.58 ms / 2315 tokens ( 36.61 ms per token, 27.32 tokens per second)
total time = 126256.05 ms / 8192 tokens
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv update_slots: all slots are idle

In comparison, for the Qwen3 model (235B-A22B), n_tokens is not equal to n_prompt_tokens; the cached prefix is reused.

srv update_slots: all slots are idle
srv log_server_r: request: GET / 10.67.162.84 200
srv log_server_r: request: GET /props 10.67.162.84 200
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 7 | task 0 | processing task
slot update_slots: id 7 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 15
slot update_slots: id 7 | task 0 | kv cache rm [0, end)
slot update_slots: id 7 | task 0 | prompt processing progress, n_past = 15, n_tokens = 15, progress = 1.000000
slot update_slots: id 7 | task 0 | prompt done, n_past = 15, n_tokens = 15
slot release: id 7 | task 0 | stop processing: n_past = 851, truncated = 0
slot print_timing: id 7 | task 0 |
prompt eval time = 426.47 ms / 15 tokens ( 28.43 ms per token, 35.17 tokens per second)
eval time = 73571.96 ms / 837 tokens ( 87.90 ms per token, 11.38 tokens per second)
total time = 73998.42 ms / 852 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 10.67.162.84 200
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 7 | task 838 | processing task
slot update_slots: id 7 | task 838 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 867
slot update_slots: id 7 | task 838 | kv cache rm [577, end)
slot update_slots: id 7 | task 838 | prompt processing progress, n_past = 867, n_tokens = 290, progress = 0.334487
slot update_slots: id 7 | task 838 | prompt done, n_past = 867, n_tokens = 290
slot release: id 7 | task 838 | stop processing: n_past = 2053, truncated = 0
slot print_timing: id 7 | task 838 |
prompt eval time = 5860.13 ms / 290 tokens ( 20.21 ms per token, 49.49 tokens per second)
eval time = 122713.46 ms / 1187 tokens ( 103.38 ms per token, 9.67 tokens per second)
total time = 128573.58 ms / 1477 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 10.67.162.84 200
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 7 | task 2026 | processing task
slot update_slots: id 7 | task 2026 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 2070
slot update_slots: id 7 | task 2026 | kv cache rm [2053, end)
slot update_slots: id 7 | task 2026 | prompt processing progress, n_past = 2070, n_tokens = 17, progress = 0.008213
slot update_slots: id 7 | task 2026 | prompt done, n_past = 2070, n_tokens = 17
slot release: id 7 | task 2026 | stop processing: n_past = 3634, truncated = 0
slot print_timing: id 7 | task 2026 |
prompt eval time = 828.94 ms / 17 tokens ( 48.76 ms per token, 20.51 tokens per second)
eval time = 198837.40 ms / 1565 tokens ( 127.05 ms per token, 7.87 tokens per second)
total time = 199666.34 ms / 1582 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 10.67.162.84 200
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 7 | task 3592 | processing task
slot update_slots: id 7 | task 3592 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 3651
slot update_slots: id 7 | task 3592 | kv cache rm [3634, end)
slot update_slots: id 7 | task 3592 | prompt processing progress, n_past = 3651, n_tokens = 17, progress = 0.004656
slot update_slots: id 7 | task 3592 | prompt done, n_past = 3651, n_tokens = 17
slot release: id 7 | task 3592 | stop processing: n_past = 5185, truncated = 0
slot print_timing: id 7 | task 3592 |
prompt eval time = 945.71 ms / 17 tokens ( 55.63 ms per token, 17.98 tokens per second)
eval time = 230112.39 ms / 1535 tokens ( 149.91 ms per token, 6.67 tokens per second)
total time = 231058.10 ms / 1552 tokens
srv update_slots: all slots are idle
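
To quantify the difference, here is a minimal sketch (assuming only the log format shown above) that reports, per task, how many prompt tokens were served from the KV cache instead of being reprocessed. On the first "prompt processing progress" line of a task, n_past - n_tokens is the number of reused tokens (0 means a full reprocess):

```python
import re
import sys

# Matches lines like:
# slot update_slots: id 7 | task 838 | prompt processing progress, n_past = 867, n_tokens = 290, progress = 0.334487
PROGRESS = re.compile(
    r"task (\d+) \| prompt processing progress, n_past = (\d+), n_tokens = (\d+)"
)

def reuse_per_task(lines):
    """Return {task_id: reused_tokens} based on the first progress line of each task."""
    seen = {}
    for line in lines:
        m = PROGRESS.search(line)
        if m and m.group(1) not in seen:
            task, n_past, n_tokens = m.group(1), int(m.group(2)), int(m.group(3))
            # Tokens already in the KV cache are not re-evaluated in the first batch.
            seen[task] = n_past - n_tokens
    return seen

if __name__ == "__main__":
    for task, reused in reuse_per_task(sys.stdin).items():
        print(f"task {task}: {reused} prompt tokens reused from cache")
```

Run against the two logs above, every gpt-oss task reports 0 reused tokens, while the Qwen3 tasks report 0, 577, 2053, and 3634.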

This issue really hurts the user experience from the second or third query onwards.

Operating systems

Linux

GGML backends

CPU

Hardware

INTEL(R) XEON(R) PLATINUM 8580 (60 cores) (Emerald Rapids)

Models

gpt-oss 120b

Problem description & steps to reproduce

numactl -m 0 -C 0-59 llama-server --cpu-range 0-59 -m /cold_storage/ml_datasets/narendra/llama.cpp_models/gpt-oss/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4.gguf -t 60 --port 10000 --cont-batching --parallel 8 --ctx-size 65536 --numa numactl --reasoning-format none --jinja --mlock -fa auto --no-mmap --cache-reuse 256 --swa-checkpoints 10

Run three prompts. They can be anything; three examples are below, and a minimal client sketch follows the list.

  • Describe intel as a company
  • Describe AMD as a company
  • Describe Apple as a company
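
A minimal repro client, as a sketch: it assumes the server from the command above is listening on 127.0.0.1:10000 and uses its OpenAI-compatible /v1/chat/completions endpoint (seen in the logs); max_tokens is an arbitrary choice. It sends the three prompts as one growing conversation, which matches the growing n_prompt_tokens in the log above, so every request after the first shares a long prefix with the previous one and should hit the prompt cache:

```python
import json
import urllib.request

# Assumes llama-server is listening locally on the port from the command above.
URL = "http://127.0.0.1:10000/v1/chat/completions"

PROMPTS = [
    "Describe intel as a company",
    "Describe AMD as a company",
    "Describe Apple as a company",
]

def ask(messages):
    """POST one OpenAI-compatible chat completion request and return the reply text."""
    body = json.dumps({"messages": messages, "max_tokens": 512}).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Grow one multi-turn conversation so each request shares the previous requests
# as a prefix; watch the server log for "kv cache rm [0, end)" (full reprocess)
# versus a partial removal (prefix reused).
messages = []
for prompt in PROMPTS:
    messages.append({"role": "user", "content": prompt})
    reply = ask(messages)
    messages.append({"role": "assistant", "content": reply})
    print(f"> {prompt}\n{reply[:80]}...\n")
```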

First Bad Commit

No response

Relevant log output

(Same gpt-oss server log as shown in the description above.)
