If you want a high-level endpoint that can automatically handle chat templates, consider using the OpenAI-compatible API instead.
The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](./native_api.ipynb).

| Argument | Type / Default | Description |
| --- | --- | --- |
| text | `Optional[Union[List[str], str]] = None` | The input prompt. Can be a single prompt or a batch of prompts. |
| input_ids | `Optional[Union[List[List[int]], List[int]]] = None` | Alternative to `text`. Specify the input as token IDs instead of text. |
| sampling_params | `Optional[Union[List[Dict], Dict]] = None` | The sampling parameters as described in the sections below. |
| return_logprob | `Optional[Union[List[bool], bool]] = None` | Whether to return log probabilities for tokens. |
| logprob_start_len | `Optional[Union[List[int], int]] = None` | If returning log probabilities, specifies the start position in the prompt. Default is `-1`, which returns logprobs only for output tokens. |
| top_logprobs_num | `Optional[Union[List[int], int]] = None` | If returning log probabilities, specifies the number of top logprobs to return at each position. |
| stream | `bool = False` | Whether to stream the output. |
| lora_path | `Optional[Union[List[Optional[str]], Optional[str]]] = None` | Path to LoRA weights. |
| custom_logit_processor | `Optional[Union[List[Optional[str]], str]] = None` | Custom logit processor for advanced sampling control. For usage, see below. |
| return_hidden_states | `bool = False` | Whether to return hidden states of the model. Note that each time it changes, the CUDA graph will be recaptured, which might lead to a performance hit. See the [examples](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states) for more information. |
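
As a minimal sketch of how these arguments fit together, the request below sends a prompt and a `sampling_params` dict to `/generate`. The host and port are assumptions (a locally running server on its default address); the prompt and values are arbitrary.

```python
import requests

# Minimal sketch of a /generate request. Assumes an SGLang server is already
# running; the address below is an assumption, not part of this document.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 32, "temperature": 0},
        "return_logprob": True,   # also return log probabilities
        "top_logprobs_num": 2,    # top-2 logprobs at each output position
    },
)
print(response.json())
```
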
## Sampling parameters
### Core parameters

| Argument | Type / Default | Description |
| --- | --- | --- |
| max_new_tokens | `int = 128` | The maximum output length measured in tokens. |
| stop | `Optional[Union[str, List[str]]] = None` | One or multiple [stop words](https://platform.openai.com/docs/api-reference/chat/create#chat-create-stop). Generation will stop if one of these words is sampled. |
| stop_token_ids | `Optional[List[int]] = None` | Provide stop words in the form of token IDs. Generation will stop if one of these token IDs is sampled. |
| temperature | `float = 1.0` | [Temperature](https://platform.openai.com/docs/api-reference/chat/create#chat-create-temperature) when sampling the next token. `temperature = 0` corresponds to greedy sampling; a higher temperature leads to more diversity. |
| top_p | `float = 1.0` | [Top-p](https://platform.openai.com/docs/api-reference/chat/create#chat-create-top_p) selects tokens from the smallest sorted set whose cumulative probability exceeds `top_p`. When `top_p = 1`, this reduces to unrestricted sampling from all tokens. |
| top_k | `int = -1` | [Top-k](https://developer.nvidia.com/blog/how-to-get-better-outputs-from-your-large-language-model/#predictability_vs_creativity) randomly selects from the `k` highest-probability tokens. |
| min_p | `float = 0.0` | [Min-p](https://github.com/huggingface/transformers/issues/27670) samples from tokens with probability larger than `min_p * highest_token_probability`. |
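
For illustration, the core parameters are passed together inside `sampling_params`; the values below are arbitrary examples, not recommendations.

```python
# Illustrative values only; every field is optional and falls back to its default.
sampling_params = {
    "max_new_tokens": 64,   # cap the output at 64 tokens
    "temperature": 0.8,     # 0 = greedy decoding, higher = more diverse
    "top_p": 0.95,          # nucleus sampling over the top 95% probability mass
    "top_k": 50,            # consider only the 50 most likely tokens
    "min_p": 0.05,          # drop tokens below 5% of the best token's probability
    "stop": ["\n\n"],       # stop at the first blank line
}
```
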
### Penalizers

| Argument | Type / Default | Description |
| --- | --- | --- |
| frequency_penalty | `float = 0.0` | Penalizes tokens based on their frequency in the generation so far. Must be between `-2` and `2`, where negative numbers encourage repetition of tokens and positive numbers encourage sampling of new tokens. The penalty grows linearly with each appearance of a token. |
| presence_penalty | `float = 0.0` | Penalizes tokens if they have appeared in the generation so far. Must be between `-2` and `2`, where negative numbers encourage repetition of tokens and positive numbers encourage sampling of new tokens. The penalty is constant once a token has occurred. |
| min_new_tokens | `int = 0` | Forces the model to generate at least `min_new_tokens` tokens until a stop word or EOS token is sampled. Note that this might lead to unintended behavior, for example, if the distribution is highly skewed towards these tokens. |
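
A small sketch of how the penalizers combine with the core parameters; the values are arbitrary and only meant to show the field names.

```python
# Arbitrary example: mildly discourage repetition while forcing a minimum
# response length.
sampling_params = {
    "max_new_tokens": 128,
    "frequency_penalty": 0.5,  # penalty grows with each repeated occurrence
    "presence_penalty": 0.3,   # flat penalty once a token has appeared at all
    "min_new_tokens": 16,      # generate at least 16 tokens before stopping
}
```
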
### Constrained decoding
Please refer to our dedicated guide on [constrained decoding](./structured_outputs.ipynb) for the following parameters.

| Argument | Type / Default | Description |
| --- | --- | --- |
| json_schema | `Optional[str] = None` | JSON schema for structured outputs. |
| regex | `Optional[str] = None` | Regex for structured outputs. |
| ebnf | `Optional[str] = None` | EBNF for structured outputs. |
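
For orientation only, a constraint is passed inside `sampling_params` like any other field; see the dedicated guide linked above for the full treatment. The server address below is an assumption, and only one of `json_schema`, `regex`, or `ebnf` should be set in a single request.

```python
import requests

# Sketch of a regex-constrained request; the address is an assumption.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Give me a 5-digit ZIP code: ",
        "sampling_params": {
            "max_new_tokens": 16,
            "temperature": 0,
            "regex": r"\d{5}",  # the output must match exactly five digits
        },
    },
)
print(response.json()["text"])
```
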
### Other options

| Argument | Type / Default | Description |
| --- | --- | --- |
| n | `int = 1` | Specifies the number of output sequences to generate per request. Generating multiple outputs in one request (`n > 1`) is discouraged; repeating the same prompt several times offers better control and efficiency. |
| spaces_between_special_tokens | `bool = True` | Whether or not to add spaces between special tokens during detokenization. |
| no_stop_trim | `bool = False` | Don't trim stop words or the EOS token from the generated text. |
| continue_final_message | `bool = False` | When enabled, the final assistant message is removed and its content is used as a prefill so that the model continues that message instead of starting a new turn. See [openai_chat_with_response_prefill.py](https://github.com/sgl-project/sglang/blob/main/examples/runtime/openai_chat_with_response_prefill.py) for examples. |
| ignore_eos | `bool = False` | Don't stop generation when the EOS token is sampled. |
| skip_special_tokens | `bool = True` | Remove special tokens during decoding. |
| custom_params | `Optional[List[Optional[Dict[str, Any]]]] = None` | Used when employing `CustomLogitProcessor`. For usage, see below. |
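
To round off, a sketch combining some of the options above; the values are arbitrary and only meant to show the field names.

```python
# Arbitrary example: keep generating past EOS, keep special tokens in the
# decoded text, and do not trim the stop word from the output.
sampling_params = {
    "max_new_tokens": 256,
    "ignore_eos": True,
    "skip_special_tokens": False,
    "no_stop_trim": True,
}
```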