
[InfEngineClient] Extract out routing logic to a helper #267

Merged
CharlieFRuan merged 3 commits into main from pr-0908-route
Sep 9, 2025
Conversation


@CharlieFRuan CharlieFRuan commented Sep 9, 2025

This PR removes `_generate_batched()` and `_generate_with_trajectory_routing()` in `InferenceEngineClient` by adding a helper function:

```python
def route_prompts_to_engines(
    num_prompts: int, num_inference_engines: int, trajectory_ids: Optional[Union[list[int], list[str]]]
) -> dict[int, list[int]]:
```

which returns a mapping from engine ID to the prompt IDs that engine works on.

This removes the duplicated code when gathering the output.
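A minimal sketch of what such a helper could look like. The hash-based routing for `trajectory_ids` follows the deterministic SHA-256 scheme described below; the round-robin split for the batched path is an assumption about the actual implementation, which may chunk prompts differently.

```python
import hashlib
from typing import Optional, Union


def route_prompts_to_engines(
    num_prompts: int,
    num_inference_engines: int,
    trajectory_ids: Optional[Union[list[int], list[str]]] = None,
) -> dict[int, list[int]]:
    """Map each engine index to the prompt indices it should serve.

    With trajectory_ids, every prompt of a given trajectory hashes to the
    same engine (session affinity); without them, prompts are spread
    across engines (round-robin here, as an illustrative choice).
    """
    engine_to_prompts: dict[int, list[int]] = {i: [] for i in range(num_inference_engines)}
    if trajectory_ids is not None:
        assert len(trajectory_ids) == num_prompts
        for prompt_idx, traj_id in enumerate(trajectory_ids):
            # SHA-256 is deterministic across processes, unlike builtin hash().
            digest = int.from_bytes(hashlib.sha256(str(traj_id).encode()).digest(), "big")
            engine_to_prompts[digest % num_inference_engines].append(prompt_idx)
    else:
        for prompt_idx in range(num_prompts):
            engine_to_prompts[prompt_idx % num_inference_engines].append(prompt_idx)
    return engine_to_prompts
```

Because the mapping keeps prompt indices, the caller can reassemble engine outputs back into the original prompt order, which is what removes the duplicated gathering code.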

In addition, this will be highly beneficial for the `/completions` PR. Without this PR we would need to duplicate `_generate_with_trajectory_routing()` and `_generate_batched()`, which are for `InferenceEngineInput`, and make another pair for `CompletionRequest`.

We also change `hash()` to `int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big")`: the former is salted with a random seed that differs per Python process, while the latter is deterministic.
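The hashing change can be sketched as follows (`hash_with_sha256` is the name the review comment below refers to):

```python
import hashlib


def hash_with_sha256(x) -> int:
    """Deterministic replacement for the builtin hash().

    str/bytes results of hash() are salted per Python process
    (PYTHONHASHSEED), so they cannot be used for routing decisions
    that must agree across processes; a SHA-256 digest is stable
    everywhere.
    """
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big")
```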

Testing

- Added a new unit test; GPU CI passed.
- GSM8K, i.e. the single-turn code path (ran both batched and non-batched): same generation time, reward, and eval. Compared against PR #238.
- search-r1 (compared against the TITO PR, with single-turn and 4 turns): reward and eval are the same, meaning keeping the original order worked as expected. Timing is roughly 7% slower for some reason; we assume this is variance.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively refactors the prompt routing logic by extracting it into a new helper function, route_prompts_to_engines. This is a solid improvement that unifies two previously separate code paths, reducing duplication and enhancing maintainability. The switch to hash_with_sha256 for deterministic routing is a crucial fix for correctness in a distributed environment. The changes are well-supported by a new, comprehensive unit test file. My feedback includes a couple of suggestions to further refine the code for type correctness and conciseness.


@SumanthRH SumanthRH left a comment


Thanks


@tyler-griggs tyler-griggs left a comment


Much better, thank you

CharlieFRuan and others added 3 commits September 9, 2025 03:55
…ent.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@CharlieFRuan CharlieFRuan merged commit 4750548 into main Sep 9, 2025
3 checks passed
@CharlieFRuan CharlieFRuan deleted the pr-0908-route branch September 9, 2025 07:35
CharlieFRuan added a commit that referenced this pull request Sep 12, 2025
…#260)

This PR implements the `/completions` endpoint for our HTTP inference
engine client, as a follow-up to `/chat/completions`. This is a
necessity if a user wants to do token-in-token-out while using an
HTTP endpoint.

The main tricky part of `/completions` compared to `/chat/completions`
is that it can be either batched or single-request. We handle routing
using the utility method introduced in
#267, making
`handle_completions()` very similar to `generate()`.
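A sketch of how routing plus order-preserving gathering can look. The names here and the stand-in round-robin router are illustrative, not the actual SkyRL API; the real code uses `route_prompts_to_engines()` from #267.

```python
import asyncio


def _route(num_prompts: int, num_engines: int) -> dict[int, list[int]]:
    # Illustrative stand-in for route_prompts_to_engines() from #267.
    mapping: dict[int, list[int]] = {i: [] for i in range(num_engines)}
    for p in range(num_prompts):
        mapping[p % num_engines].append(p)
    return mapping


async def handle_batched(prompts: list[str], engines: list) -> list:
    """Fan a batch out with one request per engine, then restore the
    caller's original prompt order when gathering."""
    mapping = _route(len(prompts), len(engines))
    index_groups, tasks = [], []
    for engine_idx, prompt_idxs in mapping.items():
        if not prompt_idxs:
            continue
        index_groups.append(prompt_idxs)
        tasks.append(engines[engine_idx].generate([prompts[i] for i in prompt_idxs]))
    per_engine_outputs = await asyncio.gather(*tasks)
    # Scatter each engine's results back to the original positions.
    outputs: list = [None] * len(prompts)
    for prompt_idxs, engine_outputs in zip(index_groups, per_engine_outputs):
        for prompt_idx, out in zip(prompt_idxs, engine_outputs):
            outputs[prompt_idx] = out
    return outputs
```

Because the single-request case is just a batch of size one, the same path serves both shapes of `/completions`.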

### Tested
- Added a `_generate_with_chat_completions_http_endpoint()` to the dummy
`examples/inference_http_endpoint/skyrl_gym_http_generator.py`, which
offers a good sanity check. The GSM8K result (with 4x A100 SXM) is as
follows:
<img width="1262" height="664" alt="image"
src="https://github.com/user-attachments/assets/f8a1748d-e6f6-4a22-a82a-7d9426eafd1d"
/>


- Legend:
  - `charlie-pr260-...`: this PR running
    `skyrl-train/examples/inference_http_endpoint/skyrl_gym_http_generator.py`
    with `/completions`
  - `charlie-pr267-4xa100`: regular `run_gsm8k.py` with the Python
    `InferenceEngineClient.generate()` at commit
    4750548
  - `charlie-pr238-http`:
    `skyrl-train/examples/inference_http_endpoint/skyrl_gym_http_generator.py`
    with `/chat/completions`
- Correctness is good. Performance-wise it is better than
`/chat/completions` (since `/completions` is batched, so there is only one
request per engine), but still slower than the Python counterpart.

- Added a GPU unit test. Did not parameterize the test, but used a for loop
within the test, saving time by avoiding spinning up the engine
multiple times (tried using a fixture, but it makes server/Ray shutdown
tricky). Verified locally.
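The loop-instead-of-parametrize pattern described above can be sketched like this; `FakeEngine` is an illustrative stand-in for the real inference engine, whose startup is the expensive part the test wants to amortize.

```python
class FakeEngine:
    """Illustrative stand-in for the real inference engine."""

    def complete(self, prompts: list[str]) -> list[str]:
        return [f"out:{p}" for p in prompts]

    def shutdown(self) -> None:
        pass


def test_completions_endpoint():
    # Spin the engine up once and reuse it for every case below;
    # pytest.mark.parametrize would redo this setup (and the tricky
    # server/Ray teardown) for each parameter combination.
    engine = FakeEngine()
    try:
        for n_prompts in (1, 4, 8):
            outputs = engine.complete([f"p{i}" for i in range(n_prompts)])
            assert len(outputs) == n_prompts
    finally:
        engine.shutdown()
```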
ztcanddota added a commit to ztcanddota/skyagent that referenced this pull request Sep 28, 2025
SungjunlaLee added a commit to SungjunlaLee/SkyRL that referenced this pull request Jan 3, 2026
dzorlu referenced this pull request in fleet-ai/SkyRL Feb 4, 2026
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026