
Conversation

ngxson
Collaborator

@ngxson ngxson commented Sep 6, 2025

Supersedes #14731

If the request includes "return_progress": true and "stream": true, the server will return a prompt progress object in the stream.
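For reference, here is a minimal Python sketch of a streaming client that picks these progress chunks out of the SSE stream. It assumes a local server at http://localhost:8080, the OpenAI-compatible /v1/chat/completions endpoint, the `requests` package, and that `prompt_progress` appears at the top level of a chunk; treat it as an illustration rather than canonical client code:

```python
import json
import requests  # third-party HTTP client, assumed to be installed

# Hypothetical local endpoint; adjust host, port and payload to your setup.
url = "http://localhost:8080/v1/chat/completions"
payload = {
    "messages": [{"role": "user", "content": "Explain KV cache reuse in one paragraph."}],
    "stream": True,
    "return_progress": True,
}

with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # SSE frames look like `data: {...}`; skip keep-alives and other lines.
        if not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        progress = chunk.get("prompt_progress")
        if progress is not None:
            print(f"prompt: {progress['processed']}/{progress['total']} tokens, "
                  f"{progress['cache']} cached, {progress['time_ms']} ms")
```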

The progress object will look like this:

"prompt_progress":{"total":237,"cache":0,"processed":128,"time_ms":181}

If part of the message is cached:

"prompt_progress":{"total":237,"cache":230,"processed":237,"time_ms":30}

For convenience, the number of cached tokens is also added to the timings object. This is useful for calculating the context usage after a message is generated: the number of used context tokens is equal to prompt_n + cache_n + predicted_n (see the sketch after the example below).

{
  "choices": [],
  "created": 1757141666,
  "id": "chatcmpl-ecQULm0WqPrftUqjPZO1CFYeDjGZNbDu",
  ...
  "timings": {
    "cache_n": 236,
    "prompt_n": 1,
    "prompt_ms": 30.958,
    "prompt_per_token_ms": 30.958,
    "prompt_per_second": 32.301828283480845,
    "predicted_n": 35,
    "predicted_ms": 661.064,
    "predicted_per_token_ms": 18.887542857142858,
    "predicted_per_second": 52.94494935437416
  }
}
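Applied to the timings above, that formula gives 1 + 236 + 35 = 272 used context tokens. A small helper for the arithmetic, using only the field names shown in this PR (a sketch, not part of the server API):

```python
def used_context_tokens(timings: dict) -> int:
    """Context tokens consumed by this exchange: new prompt + cached prompt + generated."""
    return timings["prompt_n"] + timings["cache_n"] + timings["predicted_n"]

timings = {"cache_n": 236, "prompt_n": 1, "predicted_n": 35}
print(used_context_tokens(timings))  # 272
```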

@github-actions github-actions bot added the python (python script changes) label Sep 6, 2025
@ngxson ngxson marked this pull request as ready for review September 6, 2025 10:28
@ngxson ngxson requested review from ggerganov and allozaur September 6, 2025 10:36
Co-authored-by: Georgi Gerganov <[email protected]>
@ngxson ngxson merged commit 61bdfd5 into master Sep 6, 2025
49 of 50 checks passed
@ExtReMLapin
Contributor

Well that was fast

@BradHutchings

Wow, thank you! I look forward to trying this out!

@BradHutchings

This works great. It was easy enough to update my client code from how the previous PR worked. Thank you again @ngxson!

walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025
server : implement `return_progress` (ggml-org#15827)

* server : implement `return_progress`

* add timings.cache_n

* add progress.time_ms

* add test

* fix test for chat/completions

* readme: add docs on timings

* use ggml_time_us

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
@narendrachaudhary51

One issue is that I can no longer see real-time tokens/s in the server web UI. After this change, I only see tokens/s at the end of generation. The earlier behavior was particularly useful.

@ngxson
Collaborator Author

ngxson commented Sep 8, 2025

One issue is that I can no longer see real-time tokens/s in the server web UI. After this change, I only see tokens/s at the end of generation. The earlier behavior was particularly useful.

Hmm, ok, I accidentally removed one line of code, fixing it now

njsyw1997 pushed a commit to aizip/llama.cpp that referenced this pull request Sep 10, 2025
server : implement `return_progress` (ggml-org#15827)