server : implement prompt processing progress report in stream mode (#15827)
* server : implement `return_progress`
* add timings.cache_n
* add progress.time_ms
* add test
* fix test for chat/completions
* readme: add docs on timings
* use ggml_time_us
Co-authored-by: Georgi Gerganov <[email protected]>
tools/server/README.md: 30 additions & 0 deletions
@@ -512,6 +512,8 @@ These words will not be included in the completion, so make sure to add them to
`timings_per_token`: Include prompt processing and text generation speed information in each response. Default: `false`
`return_progress`: Include prompt processing progress in `stream` mode. The progress will be contained inside `prompt_progress` with 3 values: `total`, `cache` and `processed`. The overall progress is `processed/total`, while the actual timed progress is `(processed-cache)/(total-cache)`. Default: `false`
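The two progress figures described above can be computed from a streamed chunk as follows. This is a minimal sketch, assuming `prompt_progress` has already been parsed from a stream chunk into a dict; the helper name is ours, not part of the server API:

```python
def progress_fractions(prompt_progress):
    """Compute (overall, timed) progress from a `prompt_progress` object.

    `prompt_progress` is the dict sent in stream chunks when
    `return_progress` is enabled, e.g.:
        {"total": 1000, "cache": 200, "processed": 600}
    """
    total = prompt_progress["total"]
    cache = prompt_progress["cache"]
    processed = prompt_progress["processed"]

    # Overall progress counts every prompt token, cached or not.
    overall = processed / total if total else 1.0

    # Timed progress only counts tokens that actually had to be evaluated;
    # the cached prefix is reused, not re-processed.
    denom = total - cache
    timed = (processed - cache) / denom if denom else 1.0
    return overall, timed
```

With the example values above, 600 of 1000 tokens are done overall, but only 400 of the 800 non-cached tokens have been timed, so the two fractions differ.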
`post_sampling_probs`: Returns the probabilities of the top `n_probs` tokens after applying the sampling chain.
`response_fields`: A list of response fields, for example: `"response_fields": ["content", "generation_settings/n_predict"]`. If the specified field is missing, it will simply be omitted from the response without triggering an error. Note that fields with a slash will be unnested; for example, `generation_settings/n_predict` will move the field `n_predict` from the `generation_settings` object to the root of the response and give it a new name.
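The selection and unnesting behavior described above can be illustrated client-side. This is an illustrative re-implementation of the rule, not the server's actual code, and it only handles a single slash as in the example:

```python
def apply_response_fields(response, fields):
    """Sketch of `response_fields` semantics: keep only the listed fields,
    moving `obj/leaf` entries from the nested object to the root."""
    out = {}
    for field in fields:
        if "/" in field:
            obj_key, leaf = field.split("/", 1)
            try:
                # Unnest: `generation_settings/n_predict` becomes `n_predict`.
                out[leaf] = response[obj_key][leaf]
            except (KeyError, TypeError):
                continue  # missing fields are silently omitted, no error
        elif field in response:
            out[field] = response[field]
    return out
```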
**See our [Function calling](../../docs/function-calling.md) docs** for more details, the supported native tool-call styles (the generic style is used as a fallback), and usage examples.
*Timings and context usage*
The response contains a `timings` object, for example:
```js
{
  "choices": [],
  "created": 1757141666,
  "id": "chatcmpl-ecQULm0WqPrftUqjPZO1CFYeDjGZNbDu",
  // ...
  "timings": {
    "cache_n": 236,  // number of prompt tokens reused from the cache
    "prompt_n": 1,   // number of prompt tokens being processed
    "prompt_ms": 30.958,
    "prompt_per_token_ms": 30.958,
    "prompt_per_second": 32.301828283480845,
    "predicted_n": 35,  // number of predicted tokens
    "predicted_ms": 661.064,
    "predicted_per_token_ms": 18.887542857142858,
    "predicted_per_second": 52.94494935437416
  }
}
```
This provides information on the performance of the server. It also allows calculating the current context usage.
The total number of tokens in the context is equal to `prompt_n + cache_n + predicted_n`.
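Applying that formula to the `timings` object is straightforward. A minimal sketch, assuming `timings` has been parsed from the response JSON into a dict (the helper name is ours):

```python
def context_tokens_used(timings):
    """Total tokens currently occupying the context, per the formula above:
    prompt_n + cache_n + predicted_n."""
    return timings["prompt_n"] + timings["cache_n"] + timings["predicted_n"]

# With the example response above: 1 + 236 + 35 tokens are in the context.
```

Comparing this value against the configured context size tells a client how close it is to filling the context window.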
### POST `/v1/embeddings`: OpenAI-compatible embeddings API
This endpoint requires that the model uses a pooling type different from `none`. The embeddings are normalized using the Euclidean norm.