server : implement prompt processing progress report in stream mode #15827
Conversation
Well that was fast

Wow, thank you! I look forward to trying this out!

This works great. It was easy enough to update my client code from how the previous PR worked. Thank you again @ngxson!
server : implement prompt processing progress report in stream mode (…gml-org#15827)

* server : implement `return_progress`
* add timings.cache_n
* add progress.time_ms
* add test
* fix test for chat/completions
* readme: add docs on timings
* use ggml_time_us

Co-authored-by: Georgi Gerganov <[email protected]>
One issue: I can no longer see real-time tokens/s for a user in the server's web UI. After this change, I only see tokens/s at the end of generation. The earlier behavior was particularly useful.
Hmm ok, I accidentally removed one line of code, fixing it now
Supersedes #14731
By including `"return_progress": true` and `"stream": true` in the request, the server will return a prompt processing progress object in the stream. The progress object will look like this:
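(A reconstructed sketch: `time_ms` is named in this PR's commit list, while the `total`, `cache`, and `processed` field names are assumptions added for illustration; all values are made up, and the other fields of the stream chunk are omitted.)

```json
{
  "prompt_progress": {
    "total": 1024,
    "cache": 0,
    "processed": 512,
    "time_ms": 350
  }
}
```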
If part of the message is cached:
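(A sketch of the same object when a prefix of the prompt was found in the cache; assuming `processed` counts only the newly processed, non-cached tokens, this chunk would indicate 512 of the 768 non-cached tokens are done. Field names and values are illustrative, as above.)

```json
{
  "prompt_progress": {
    "total": 1024,
    "cache": 256,
    "processed": 512,
    "time_ms": 210
  }
}
```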
For convenience, the number of cached tokens is also added to the
`timings`
object. This is useful for calculating the context usage after a message is generated: the number of used context tokens equals `prompt_n + cache_n + predicted_n`.
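For example, with the illustrative `timings` below (other timings fields omitted, values made up), the context usage after the message would be 512 + 256 + 128 = 896 tokens:

```json
{
  "timings": {
    "prompt_n": 512,
    "cache_n": 256,
    "predicted_n": 128
  }
}
```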