
Eval bug: gpt-oss model reprocesses the entire prompt from the beginning #15894

@narendrachaudhary51

Description


Name and Version

I am running on an Emerald Rapids Xeon CPU socket with the following command. The commit id is c252ce6, which is recent.

numactl -m 0 -C 0-59 llama-server --cpu-range 0-59 -m /cold_storage/ml_datasets/narendra/llama.cpp_models/gpt-oss/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4.gguf -t 60 --port 10000 --cont-batching --parallel 8 --ctx-size 65536 --numa numactl --reasoning-format none --jinja --mlock -fa auto --no-mmap --cache-reuse 256 --swa-checkpoints 10

This is the server log for gpt-oss 120b. As you can see, n_prompt_tokens goes from 86 to 1916 to 5877 across three queries, and n_tokens equals n_prompt_tokens each time (also 86, 1916, 5877), so the entire prompt is reprocessed on every request.


srv update_slots: all slots are idle
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 3 | task 12914 | processing task
slot update_slots: id 3 | task 12914 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 86
slot update_slots: id 3 | task 12914 | kv cache rm [0, end)
slot update_slots: id 3 | task 12914 | prompt processing progress, n_past = 86, n_tokens = 86, progress = 1.000000
slot update_slots: id 3 | task 12914 | prompt done, n_past = 86, n_tokens = 86
slot update_slots: id 3 | task 12914 | SWA checkpoint create, pos_min = 0, pos_max = 85, size = 3.025 MiB, total = 1/10 (3.025 MiB)
slot release: id 3 | task 12914 | stop processing: n_past = 1967, truncated = 0
slot print_timing: id 3 | task 12914 |
prompt eval time = 474.41 ms / 86 tokens ( 5.52 ms per token, 181.28 tokens per second)
eval time = 48978.99 ms / 1882 tokens ( 26.02 ms per token, 38.42 tokens per second)
total time = 49453.39 ms / 1968 tokens
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv update_slots: all slots are idle
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 2 | task 14797 | processing task
slot update_slots: id 2 | task 14797 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 1916
slot update_slots: id 2 | task 14797 | kv cache rm [0, end)
slot update_slots: id 2 | task 14797 | prompt processing progress, n_past = 1916, n_tokens = 1916, progress = 1.000000
slot update_slots: id 2 | task 14797 | prompt done, n_past = 1916, n_tokens = 1916
slot update_slots: id 2 | task 14797 | SWA checkpoint create, pos_min = 1148, pos_max = 1915, size = 27.009 MiB, total = 1/10 (27.009 MiB)
slot release: id 2 | task 14797 | stop processing: n_past = 6010, truncated = 0
slot print_timing: id 2 | task 14797 |
prompt eval time = 10853.30 ms / 1916 tokens ( 5.66 ms per token, 176.54 tokens per second)
eval time = 125051.40 ms / 4095 tokens ( 30.54 ms per token, 32.75 tokens per second)
total time = 135904.70 ms / 6011 tokens
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv update_slots: all slots are idle
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 1 | task 18893 | processing task
slot update_slots: id 1 | task 18893 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 5877
slot update_slots: id 1 | task 18893 | kv cache rm [0, end)
slot update_slots: id 1 | task 18893 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.348477
slot update_slots: id 1 | task 18893 | kv cache rm [2048, end)
slot update_slots: id 1 | task 18893 | prompt processing progress, n_past = 4096, n_tokens = 2048, progress = 0.696954
slot update_slots: id 1 | task 18893 | kv cache rm [4096, end)
slot update_slots: id 1 | task 18893 | prompt processing progress, n_past = 5877, n_tokens = 1781, progress = 1.000000
slot update_slots: id 1 | task 18893 | prompt done, n_past = 5877, n_tokens = 1781
slot update_slots: id 1 | task 18893 | SWA checkpoint create, pos_min = 5109, pos_max = 5876, size = 27.009 MiB, total = 1/10 (27.009 MiB)
slot release: id 1 | task 18893 | stop processing: n_past = 8191, truncated = 0
slot print_timing: id 1 | task 18893 |
prompt eval time = 41512.46 ms / 5877 tokens ( 7.06 ms per token, 141.57 tokens per second)
eval time = 84743.58 ms / 2315 tokens ( 36.61 ms per token, 27.32 tokens per second)
total time = 126256.05 ms / 8192 tokens
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv update_slots: all slots are idle

In comparison, for the Qwen3 model (235B-A22B), n_tokens is not equal to n_prompt_tokens; the cached prefix is reused.

srv update_slots: all slots are idle
srv log_server_r: request: GET / 10.67.162.84 200
srv log_server_r: request: GET /props 10.67.162.84 200
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 7 | task 0 | processing task
slot update_slots: id 7 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 15
slot update_slots: id 7 | task 0 | kv cache rm [0, end)
slot update_slots: id 7 | task 0 | prompt processing progress, n_past = 15, n_tokens = 15, progress = 1.000000
slot update_slots: id 7 | task 0 | prompt done, n_past = 15, n_tokens = 15
slot release: id 7 | task 0 | stop processing: n_past = 851, truncated = 0
slot print_timing: id 7 | task 0 |
prompt eval time = 426.47 ms / 15 tokens ( 28.43 ms per token, 35.17 tokens per second)
eval time = 73571.96 ms / 837 tokens ( 87.90 ms per token, 11.38 tokens per second)
total time = 73998.42 ms / 852 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 10.67.162.84 200
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 7 | task 838 | processing task
slot update_slots: id 7 | task 838 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 867
slot update_slots: id 7 | task 838 | kv cache rm [577, end)
slot update_slots: id 7 | task 838 | prompt processing progress, n_past = 867, n_tokens = 290, progress = 0.334487
slot update_slots: id 7 | task 838 | prompt done, n_past = 867, n_tokens = 290
slot release: id 7 | task 838 | stop processing: n_past = 2053, truncated = 0
slot print_timing: id 7 | task 838 |
prompt eval time = 5860.13 ms / 290 tokens ( 20.21 ms per token, 49.49 tokens per second)
eval time = 122713.46 ms / 1187 tokens ( 103.38 ms per token, 9.67 tokens per second)
total time = 128573.58 ms / 1477 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 10.67.162.84 200
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 7 | task 2026 | processing task
slot update_slots: id 7 | task 2026 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 2070
slot update_slots: id 7 | task 2026 | kv cache rm [2053, end)
slot update_slots: id 7 | task 2026 | prompt processing progress, n_past = 2070, n_tokens = 17, progress = 0.008213
slot update_slots: id 7 | task 2026 | prompt done, n_past = 2070, n_tokens = 17
slot release: id 7 | task 2026 | stop processing: n_past = 3634, truncated = 0
slot print_timing: id 7 | task 2026 |
prompt eval time = 828.94 ms / 17 tokens ( 48.76 ms per token, 20.51 tokens per second)
eval time = 198837.40 ms / 1565 tokens ( 127.05 ms per token, 7.87 tokens per second)
total time = 199666.34 ms / 1582 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 10.67.162.84 200
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 7 | task 3592 | processing task
slot update_slots: id 7 | task 3592 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 3651
slot update_slots: id 7 | task 3592 | kv cache rm [3634, end)
slot update_slots: id 7 | task 3592 | prompt processing progress, n_past = 3651, n_tokens = 17, progress = 0.004656
slot update_slots: id 7 | task 3592 | prompt done, n_past = 3651, n_tokens = 17
slot release: id 7 | task 3592 | stop processing: n_past = 5185, truncated = 0
slot print_timing: id 7 | task 3592 |
prompt eval time = 945.71 ms / 17 tokens ( 55.63 ms per token, 17.98 tokens per second)
eval time = 230112.39 ms / 1535 tokens ( 149.91 ms per token, 6.67 tokens per second)
total time = 231058.10 ms / 1552 tokens
srv update_slots: all slots are idle
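
To quantify the difference, here is a minimal sketch (assuming only the log format shown above) that reports, per task, how many prompt tokens were served from the KV cache instead of being reprocessed. On the first "prompt processing progress" line of a task, n_past - n_tokens is the number of reused tokens (0 means a full reprocess):

```python
import re
import sys

# Matches lines like:
# slot update_slots: id 7 | task 838 | prompt processing progress, n_past = 867, n_tokens = 290, progress = 0.334487
PROGRESS = re.compile(
    r"task (\d+) \| prompt processing progress, n_past = (\d+), n_tokens = (\d+)"
)

def reuse_per_task(lines):
    """Return {task_id: reused_tokens} based on the first progress line of each task."""
    seen = {}
    for line in lines:
        m = PROGRESS.search(line)
        if m and m.group(1) not in seen:
            task, n_past, n_tokens = m.group(1), int(m.group(2)), int(m.group(3))
            # Tokens already in the KV cache are not re-evaluated in the first batch.
            seen[task] = n_past - n_tokens
    return seen

if __name__ == "__main__":
    for task, reused in reuse_per_task(sys.stdin).items():
        print(f"task {task}: {reused} prompt tokens reused from cache")
```

Run against the two logs above, every gpt-oss task reports 0 reused tokens, while the Qwen3 tasks report 0, 577, 2053, and 3634.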

This issue really hurts the user experience from the second or third query onwards.

Operating systems

Linux

GGML backends

CPU

Hardware

INTEL(R) XEON(R) PLATINUM 8580 (60 cores) (Emerald Rapids)

Models

gpt-oss 120b

Problem description & steps to reproduce

numactl -m 0 -C 0-59 llama-server --cpu-range 0-59 -m /cold_storage/ml_datasets/narendra/llama.cpp_models/gpt-oss/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4.gguf -t 60 --port 10000 --cont-batching --parallel 8 --ctx-size 65536 --numa numactl --reasoning-format none --jinja --mlock -fa auto --no-mmap --cache-reuse 256 --swa-checkpoints 10

Run three prompts. They can be anything; three examples are below, and a minimal client sketch follows the list.

  • Describe intel as a company
  • Describe AMD as a company
  • Describe Apple as a company
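
A minimal repro client, as a sketch: it assumes the server from the command above is listening on 127.0.0.1:10000 and uses its OpenAI-compatible /v1/chat/completions endpoint (seen in the logs); max_tokens is an arbitrary choice. It sends the three prompts as one growing conversation, which matches the growing n_prompt_tokens in the log above, so every request after the first shares a long prefix with the previous one and should hit the prompt cache:

```python
import json
import urllib.request

# Assumes llama-server is listening locally on the port from the command above.
URL = "http://127.0.0.1:10000/v1/chat/completions"

PROMPTS = [
    "Describe intel as a company",
    "Describe AMD as a company",
    "Describe Apple as a company",
]

def ask(messages):
    """POST one OpenAI-compatible chat completion request and return the reply text."""
    body = json.dumps({"messages": messages, "max_tokens": 512}).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Grow one multi-turn conversation so each request shares the previous requests
# as a prefix; watch the server log for "kv cache rm [0, end)" (full reprocess)
# versus a partial removal (prefix reused).
messages = []
for prompt in PROMPTS:
    messages.append({"role": "user", "content": prompt})
    reply = ask(messages)
    messages.append({"role": "assistant", "content": reply})
    print(f"> {prompt}\n{reply[:80]}...\n")
```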

First Bad Commit

No response

Relevant log output

(Same gpt-oss server log as shown in the description above.)
