[InfEngineClient] Extract out routing logic to a helper #267
Merged: CharlieFRuan merged 3 commits into main on Sep 9, 2025
Conversation
Contributor
Code Review
This pull request effectively refactors the prompt routing logic by extracting it into a new helper function, route_prompts_to_engines. This is a solid improvement that unifies two previously separate code paths, reducing duplication and enhancing maintainability. The switch to hash_with_sha256 for deterministic routing is a crucial fix for correctness in a distributed environment. The changes are well-supported by a new, comprehensive unit test file. My feedback includes a couple of suggestions to further refine the code for type correctness and conciseness.
skyrl-train/skyrl_train/inference_engines/inference_engine_client.py (review thread, outdated, resolved)
…ent.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Force-pushed from fa03ddc to 3bb851d
CharlieFRuan added a commit that referenced this pull request on Sep 12, 2025
…#260) This PR implements the `/completions` endpoint for our HTTP inference engine client, as a follow-up to `/chat/completions`. This is a necessity if a user wants to do token-in-token-out while using an HTTP endpoint. The main tricky part of `/completions` compared to `/chat/completions` is that it can be either batched or single-request. We handle routing using the util method introduced in #267, making `handle_completions()` very similar to `generate()`.

### Tested
- Added a `_generate_with_chat_completions_http_endpoint()` to the dummy `examples/inference_http_endpoint/skyrl_gym_http_generator.py`, which offers a good sanity check. The GSM8K results (with 4xA100 SXM) are as follows:
  <img width="1262" height="664" alt="image" src="https://github.com/user-attachments/assets/f8a1748d-e6f6-4a22-a82a-7d9426eafd1d" />
  - Legend:
    - `charlie-pr260-...`: this PR running `skyrl-train/examples/inference_http_endpoint/skyrl_gym_http_generator.py` with `/completions`
    - `charlie-pr267-4xa100`: regular `run_gsm8k.py` with Python `InferenceEngineClient.generate()` at commit 4750548
    - `charlie-pr238-http`: `skyrl-train/examples/inference_http_endpoint/skyrl_gym_http_generator.py` with `/chat/completions`
  - Correctness is good. Performance-wise, it is better than `/chat/completions` (since `/completions` is batched, so there is only one request per engine), but still slower than the Python counterpart.
- Added a GPU unit test (did not parameterize the test, but used a for loop within the test, saving time by avoiding spinning up the engine multiple times; tried using a fixture, but it makes server/Ray shutdown tricky). Verified locally.
ztcanddota added a commit to ztcanddota/skyagent that referenced this pull request on Sep 28, 2025
SungjunlaLee added a commit to SungjunlaLee/SkyRL that referenced this pull request on Jan 3, 2026
dzorlu referenced this pull request in fleet-ai/SkyRL on Feb 4, 2026
This PR removes `_generate_batched()` and
`_generate_with_trajectory_routing()` in `InferenceEngineClient` by
adding a helper method that:
```python
def route_prompts_to_engines(
    num_prompts: int,
    num_inference_engines: int,
    trajectory_ids: Optional[Union[list[int], list[str]]],
) -> dict[int, list[int]]:
    ...
```
returning a mapping from engine ID to the prompt IDs that engine works on.
This removes the duplicated code when gathering the output.
In addition, this will be highly beneficial for the `/completions` PR: without this helper, we would need to duplicate `_generate_with_trajectory_routing()` and `_generate_batched()` (which are for `InferenceEngineInput`) to create another pair for `CompletionRequest`.
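A minimal sketch of how such a helper might be implemented (assuming contiguous chunking for the batched path and SHA-256-based trajectory affinity; the actual implementation in the PR may differ):

```python
import hashlib
from collections import defaultdict
from typing import Optional, Union


def _hash_with_sha256(x) -> int:
    # Deterministic across processes, unlike the built-in hash().
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big")


def route_prompts_to_engines(
    num_prompts: int,
    num_inference_engines: int,
    trajectory_ids: Optional[Union[list[int], list[str]]] = None,
) -> dict[int, list[int]]:
    """Map each engine index to the list of prompt indices it should serve."""
    mapping: dict[int, list[int]] = defaultdict(list)
    if trajectory_ids is None:
        # Batched path: split prompts into contiguous, near-even chunks.
        base, extra = divmod(num_prompts, num_inference_engines)
        start = 0
        for engine_idx in range(num_inference_engines):
            size = base + (1 if engine_idx < extra else 0)
            if size:
                mapping[engine_idx] = list(range(start, start + size))
            start += size
    else:
        # Trajectory-routing path: the same trajectory always hits the
        # same engine, so multi-turn sessions keep their KV-cache affinity.
        assert len(trajectory_ids) == num_prompts
        for prompt_idx, tid in enumerate(trajectory_ids):
            engine_idx = _hash_with_sha256(tid) % num_inference_engines
            mapping[engine_idx].append(prompt_idx)
    return dict(mapping)
```

With this single entry point, both `generate()` and a future `handle_completions()` can share one dispatch-and-gather path instead of maintaining two.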
We also change `hash()` to `int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big")`, since the former is seeded by a random per-Python-process seed, while the latter is deterministic across processes.
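To illustrate the difference, a standalone sketch (`hash_with_sha256` is the helper name mentioned in the review above):

```python
import hashlib


def hash_with_sha256(x) -> int:
    """Deterministic 256-bit integer hash of a value's string form."""
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big")


# Built-in hash() on strings is salted per process (PYTHONHASHSEED), so two
# workers can route the same trajectory to different engines. The SHA-256
# variant yields the same engine index in every process.
engine_idx = hash_with_sha256("trajectory-42") % 4
```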
### Testing
- Added a new unit test
- GPU CI passed
- GSM8K, i.e. the single-turn code path (ran both batched and non-batched). Same generation time, reward, and eval. Compared against PR #238
<img width="1209" height="610" alt="image"
src="https://github.com/user-attachments/assets/974d7266-738b-4d69-9c56-28d9546d14fb"
/>
- search-r1 (compared against the TITO PR, with single-turn and 4 turns). Reward and eval are the same, meaning that keeping the original order worked as expected; timing is roughly 7% slower for some reason, which we will assume to be variance.
<img width="1414" height="627" alt="image"
src="https://github.com/user-attachments/assets/958d5cfe-d384-43a4-a4ed-43ca06e3515d"
/>
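The order-preservation property checked above can be sketched as a scatter/gather pattern (illustrative only; `gather_in_original_order` is a hypothetical name, not the actual SkyRL code):

```python
def gather_in_original_order(
    mapping: dict[int, list[int]],
    per_engine_outputs: dict[int, list[str]],
    num_prompts: int,
) -> list[str]:
    """Invert an engine->prompt-indices routing map to restore prompt order."""
    ordered: list = [None] * num_prompts
    for engine_idx, prompt_ids in mapping.items():
        # Each engine returns outputs in the order its prompts were sent.
        for local_pos, prompt_idx in enumerate(prompt_ids):
            ordered[prompt_idx] = per_engine_outputs[engine_idx][local_pos]
    return ordered


# Example: prompts 0 and 2 went to engine 0; prompts 1 and 3 to engine 1.
outputs = gather_in_original_order(
    {0: [0, 2], 1: [1, 3]},
    {0: ["out-0", "out-2"], 1: ["out-1", "out-3"]},
    num_prompts=4,
)
```

Because the mapping records original prompt indices, no per-path gathering logic is needed: the same inversion works for both the batched and the trajectory-routed case.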
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request on Feb 4, 2026