[fix] Skip special tokens for sglang when decoding token ids#210
Merged
tyler-griggs merged 1 commit into NovaSky-AI:main on Aug 26, 2025
Conversation
Contributor
Code Review
This pull request addresses an inconsistency in how special tokens are handled between the vLLM and SGLang inference backends. The changes ensure that when decoding token IDs from SGLang, special tokens are skipped, aligning its behavior with vLLM and preventing issues like duplicate end-of-sequence tokens. The modifications are targeted, correct, and consistent across both the local and remote SGLang engine implementations. The accompanying update to the documentation in base.py clearly reflects this new behavior. Overall, this is a solid fix that improves the consistency and correctness of the inference pipeline.
dzorlu referenced this pull request in fleet-ai/SkyRL on Feb 4, 2026
### Background

Prior PR NovaSky-AI#192 changed `skip_special_tokens` to True in sampling params for vLLM and SGLang. Take Qwen as an example; the implication is that:

- For string output, i.e. `InferenceEngineOutput.response`, `<|im_end|>` will be excluded.
- For token IDs output, i.e. `InferenceEngineOutput.response_ids`, the corresponding token **will NOT be excluded**, which is what we want for token-in-token-out.

The goal is twofold:

- In SkyRLGymGenerator, `env.step(action)`'s `action` is `InferenceEngineOutput.response`, in which we do not want EOS tokens. This is more consistent with most inference engines' output conventions.
- If users re-tokenize with a chat template (e.g. the current Qwen3 codepath), the Jinja template applies `<|im_end|>` after `message.content`, meaning the string content is not expected to include `<|im_end|>`.
  - Qwen's Jinja template: `'<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n'`
  - For SkyRLGymGenerator, we have the following code, so a double EOS `<|im_end|><|im_end|>` wouldn't happen (but with a custom generator, this would be more subtle for users to notice).

```python
# remove eos token from end of output if it exists, since it will be reapplied by the chat template
if output.endswith(self.tokenizer.eos_token):
    output = output[: -len(self.tokenizer.eos_token)]
```

### This PR's fix

This PR extends the fix to SGLang. We use token-in-token-out for SGLang by passing `skip_tokenizer_init=True`, which means SGLang does not return string output at all (unlike vLLM, which returns both token IDs and string output). Thus, `sampling_params.skip_special_tokens` has no effect on SGLang. This PR fixes that by passing `skip_special_tokens=True` when we convert SGLang's output token IDs to string output, so its behavior matches the vLLM backend.

Verified with `test_policy_local_engines_e2e.py`.
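To make the before/after behavior concrete, here is a minimal, self-contained sketch of the decode-time difference the fix targets. The `decode` function and `SPECIAL_TOKENS` set below are stand-ins for illustration only (the real codepath calls the HF tokenizer's `decode` on SGLang's output token IDs); they are not the actual SkyRL or Qwen implementations.

```python
# Stand-in special-token set; the real set comes from the model's tokenizer.
SPECIAL_TOKENS = {"<|im_end|>"}

def decode(token_strs, skip_special_tokens=True):
    """Toy stand-in for tokenizer.decode: join token strings,
    optionally dropping special tokens."""
    if skip_special_tokens:
        token_strs = [t for t in token_strs if t not in SPECIAL_TOKENS]
    return "".join(token_strs)

# A sampled response whose final token is the EOS special token.
response = ["Hello", ",", " world", "<|im_end|>"]

# Before this PR: SGLang's token IDs were decoded with special tokens kept,
# so re-applying a chat template that appends <|im_end|> could double it.
kept = decode(response, skip_special_tokens=False)      # "Hello, world<|im_end|>"

# After this PR: special tokens are skipped, matching vLLM's string output.
skipped = decode(response, skip_special_tokens=True)    # "Hello, world"

print(kept)
print(skipped)
```

Note that `response_ids` is unaffected either way: only the string conversion skips special tokens, preserving token-in-token-out semantics.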