
[fix] Skip special tokens for sglang when decoding token ids#210

Merged
tyler-griggs merged 1 commit into NovaSky-AI:main from CharlieFRuan:fix-0826-skip-special
Aug 26, 2025
Conversation

Collaborator

@CharlieFRuan CharlieFRuan commented Aug 26, 2025

Background

Prior PR #192 changed skip_special_token to True in sampling params for vLLM and SGLang.

Taking Qwen as an example, the implications are:

  • For string output, i.e. InferenceEngineOutput.response, <|im_end|> will be excluded
  • For token IDs output, i.e. InferenceEngineOutput.response_ids, the corresponding token will NOT be excluded, which is what we want for token-in-token-out
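To make that split concrete, here is a minimal toy decoder (purely illustrative; not the actual vLLM/SGLang tokenizer, and the vocabulary is invented) showing how skipping special tokens affects the string output while the token IDs keep the EOS token:

```python
# Toy stand-in for a tokenizer vocabulary; IDs and strings are invented.
SPECIAL_TOKENS = {"<|im_end|>"}
VOCAB = {0: "Hello", 1: " world", 2: "<|im_end|>"}

def decode(token_ids, skip_special_tokens):
    """Decode token IDs to a string, optionally dropping special tokens."""
    pieces = (VOCAB[t] for t in token_ids)
    if skip_special_tokens:
        pieces = (p for p in pieces if p not in SPECIAL_TOKENS)
    return "".join(pieces)

response_ids = [0, 1, 2]  # token-in-token-out output: EOS token kept
response = decode(response_ids, skip_special_tokens=True)
print(response)           # "Hello world" -- <|im_end|> excluded from the string
```

This mirrors the contract above: `response` drops `<|im_end|>`, while `response_ids` still ends with the EOS token ID.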

The goal is twofold:

  • In SkyRLGymGenerator, env.step(action)'s action is InferenceEngineOutput.response, in which we do not want EOS tokens. This is more consistent with most inference engines' output conventions
  • If users re-tokenize with chat template (e.g. current Qwen3 codepath), the Jinja template applies <|im_end|> after message.content, meaning that the string content does not expect to include <|im_end|>
    • Qwen has Jinja template: '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n'
    • For SkyRLGymGenerator we have the following guard, so a double EOS <|im_end|><|im_end|> wouldn't happen (but for a custom generator, this would be more subtle for users to notice).
```python
  # remove eos token from end of output if it exists, since it will be reapplied by the chat template
  if output.endswith(self.tokenizer.eos_token):
      output = output[: -len(self.tokenizer.eos_token)]
```
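A small self-contained sketch of why that guard matters, using a hand-written stand-in for Qwen's Jinja template (not the real tokenizer or template engine):

```python
EOS = "<|im_end|>"

def apply_qwen_template(role, content):
    # Mirrors the Jinja fragment quoted above:
    # '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n'
    return "<|im_start|>" + role + "\n" + content + EOS + "\n"

output = "The answer is 42." + EOS  # string output that still ends with EOS

# Without stripping, re-applying the template yields a double EOS:
assert apply_qwen_template("assistant", output).count(EOS) == 2

# The generator's guard removes the trailing EOS first:
if output.endswith(EOS):
    output = output[: -len(EOS)]
assert apply_qwen_template("assistant", output).count(EOS) == 1
```

With skip_special_token=True on the string output, the content handed to the template never carries the trailing EOS in the first place, so the guard becomes a no-op.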

This PR's fix

This PR extends the fix to SGLang. We run SGLang token-in-token-out by passing skip_tokenizer_init=True, which means SGLang does not return string output at all (unlike vLLM, which returns both token IDs and string output). As a result, sampling_params.skip_special_token has no effect on SGLang. This PR fixes that by adding skip_special_token=True when we convert SGLang's output token IDs to string output, so its behavior matches the vLLM backend.
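Conceptually the fix is a one-argument change at the decode site. A hedged sketch with a stub tokenizer (ToyTokenizer, its vocabulary, and the surrounding names are invented for illustration; the real code decodes with the model's tokenizer):

```python
class ToyTokenizer:
    """Minimal stand-in for a HuggingFace-style tokenizer's decode()."""
    SPECIAL_IDS = {2}
    VOCAB = {0: "Hello", 1: " world", 2: "<|im_end|>"}

    def decode(self, ids, skip_special_tokens=False):
        if skip_special_tokens:
            ids = [i for i in ids if i not in self.SPECIAL_IDS]
        return "".join(self.VOCAB[i] for i in ids)

tokenizer = ToyTokenizer()
response_ids = [0, 1, 2]  # raw SGLang output: token IDs only, EOS included

# Before the fix: the decoded string kept <|im_end|>, unlike vLLM.
# After the fix: special tokens are skipped when producing the string,
# while response_ids still carries the EOS token for token-in-token-out.
response = tokenizer.decode(response_ids, skip_special_tokens=True)
assert response == "Hello world"
assert response_ids[-1] in ToyTokenizer.SPECIAL_IDS
```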

Verified with test_policy_local_engines_e2e.py.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses an inconsistency in how special tokens are handled between the vLLM and SGLang inference backends. The changes ensure that when decoding token IDs from SGLang, special tokens are skipped, aligning its behavior with vLLM and preventing issues like duplicate end-of-sequence tokens. The modifications are targeted, correct, and consistent across both the local and remote SGLang engine implementations. The accompanying update to the documentation in base.py clearly reflects this new behavior. Overall, this is a solid fix that improves the consistency and correctness of the inference pipeline.

@CharlieFRuan CharlieFRuan marked this pull request as draft August 26, 2025 22:30
@CharlieFRuan CharlieFRuan marked this pull request as ready for review August 26, 2025 22:40
Member

@tyler-griggs tyler-griggs left a comment


Great, thanks!

@tyler-griggs tyler-griggs merged commit 71c954b into NovaSky-AI:main Aug 26, 2025
3 checks passed
dzorlu referenced this pull request in fleet-ai/SkyRL Feb 4, 2026
