
[InfEngineClient] Extract out routing logic to a helper #267

Merged
CharlieFRuan merged 3 commits into main from pr-0908-route
Sep 9, 2025
Conversation


@CharlieFRuan CharlieFRuan commented Sep 9, 2025

This PR removes `_generate_batched()` and `_generate_with_trajectory_routing()` in `InferenceEngineClient` by adding a helper function:

```python
def route_prompts_to_engines(
    num_prompts: int, num_inference_engines: int, trajectory_ids: Optional[Union[list[int], list[str]]]
) -> dict[int, list[int]]:
```

which returns a mapping from engine ID to the prompt IDs that engine works on.

This removes the duplicated code when gathering the output.
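A minimal sketch of what such a helper could look like. The hash-based routing for `trajectory_ids` follows the deterministic SHA-256 scheme described below; the round-robin split for the batched path is an assumption about the actual implementation, which may chunk prompts differently.

```python
import hashlib
from typing import Optional, Union


def route_prompts_to_engines(
    num_prompts: int,
    num_inference_engines: int,
    trajectory_ids: Optional[Union[list[int], list[str]]] = None,
) -> dict[int, list[int]]:
    """Map each engine index to the prompt indices it should serve.

    With trajectory_ids, every prompt of a given trajectory hashes to the
    same engine (session affinity); without them, prompts are spread
    across engines (round-robin here, as an illustrative choice).
    """
    engine_to_prompts: dict[int, list[int]] = {i: [] for i in range(num_inference_engines)}
    if trajectory_ids is not None:
        assert len(trajectory_ids) == num_prompts
        for prompt_idx, traj_id in enumerate(trajectory_ids):
            # SHA-256 is deterministic across processes, unlike builtin hash().
            digest = int.from_bytes(hashlib.sha256(str(traj_id).encode()).digest(), "big")
            engine_to_prompts[digest % num_inference_engines].append(prompt_idx)
    else:
        for prompt_idx in range(num_prompts):
            engine_to_prompts[prompt_idx % num_inference_engines].append(prompt_idx)
    return engine_to_prompts
```

Because the mapping keeps prompt indices, the caller can reassemble engine outputs back into the original prompt order, which is what removes the duplicated gathering code.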

In addition, this will be highly beneficial for the `/completions` PR. Without this PR we would need to duplicate `_generate_with_trajectory_routing()` and `_generate_batched()`, which are for `InferenceEngineInput`, and make another pair for `CompletionRequest`.

We also change `hash()` to `int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big")`: the former is salted with a random seed that differs per Python process, while the latter is deterministic.
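The hashing change can be sketched as follows (`hash_with_sha256` is the name the review comment below refers to):

```python
import hashlib


def hash_with_sha256(x) -> int:
    """Deterministic replacement for the builtin hash().

    str/bytes results of hash() are salted per Python process
    (PYTHONHASHSEED), so they cannot be used for routing decisions
    that must agree across processes; a SHA-256 digest is stable
    everywhere.
    """
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big")
```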

Testing

- Added a new unit test; GPU CI passed.
- GSM8K, i.e. the single-turn code path (ran both batched and non-batched): same generation time, reward, and eval. Compared against PR #238.
- search-r1 (compared against the TITO PR, with single-turn and 4 turns): reward and eval are the same, meaning keeping the original order worked as expected. Timing is roughly 7% slower for some reason; we assume this is variance.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively refactors the prompt routing logic by extracting it into a new helper function, route_prompts_to_engines. This is a solid improvement that unifies two previously separate code paths, reducing duplication and enhancing maintainability. The switch to hash_with_sha256 for deterministic routing is a crucial fix for correctness in a distributed environment. The changes are well-supported by a new, comprehensive unit test file. My feedback includes a couple of suggestions to further refine the code for type correctness and conciseness.


@SumanthRH SumanthRH left a comment


Thanks


@tyler-griggs tyler-griggs left a comment


Much better, thank you

CharlieFRuan and others added 3 commits September 9, 2025 03:55
…ent.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@CharlieFRuan CharlieFRuan merged commit 4750548 into main Sep 9, 2025
3 checks passed
@CharlieFRuan CharlieFRuan deleted the pr-0908-route branch September 9, 2025 07:35
CharlieFRuan added a commit that referenced this pull request Sep 12, 2025
…#260)

This PR implements the `/completions` endpoint for our HTTP inference
engine client, as a follow-up to `/chat/completions`. This is a
necessity if a user wants to do token-in-token-out while using an
HTTP endpoint.

The main tricky part of `/completions` compared to `/chat/completions`
is that it can be either batched or single-request. We handle routing
using the utility method introduced in
#267, making
`handle_completions()` very similar to `generate()`.
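A sketch of how routing plus order-preserving gathering can look. The names here and the stand-in round-robin router are illustrative, not the actual SkyRL API; the real code uses `route_prompts_to_engines()` from #267.

```python
import asyncio


def _route(num_prompts: int, num_engines: int) -> dict[int, list[int]]:
    # Illustrative stand-in for route_prompts_to_engines() from #267.
    mapping: dict[int, list[int]] = {i: [] for i in range(num_engines)}
    for p in range(num_prompts):
        mapping[p % num_engines].append(p)
    return mapping


async def handle_batched(prompts: list[str], engines: list) -> list:
    """Fan a batch out with one request per engine, then restore the
    caller's original prompt order when gathering."""
    mapping = _route(len(prompts), len(engines))
    index_groups, tasks = [], []
    for engine_idx, prompt_idxs in mapping.items():
        if not prompt_idxs:
            continue
        index_groups.append(prompt_idxs)
        tasks.append(engines[engine_idx].generate([prompts[i] for i in prompt_idxs]))
    per_engine_outputs = await asyncio.gather(*tasks)
    # Scatter each engine's results back to the original positions.
    outputs: list = [None] * len(prompts)
    for prompt_idxs, engine_outputs in zip(index_groups, per_engine_outputs):
        for prompt_idx, out in zip(prompt_idxs, engine_outputs):
            outputs[prompt_idx] = out
    return outputs
```

Because the single-request case is just a batch of size one, the same path serves both shapes of `/completions`.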

### Tested
- Added a `_generate_with_chat_completions_http_endpoint()` to the dummy
`examples/inference_http_endpoint/skyrl_gym_http_generator.py`, which
offers a good sanity check. The GSM8K result (with 4x A100 SXM) is as
follows:
<img width="1262" height="664" alt="image"
src="https://github.com/user-attachments/assets/f8a1748d-e6f6-4a22-a82a-7d9426eafd1d"
/>


- Legend:
  - `charlie-pr260-...`: this PR running
    `skyrl-train/examples/inference_http_endpoint/skyrl_gym_http_generator.py`
    with `/completions`
  - `charlie-pr267-4xa100`: regular `run_gsm8k.py` with the Python
    `InferenceEngineClient.generate()` at commit
    4750548
  - `charlie-pr238-http`:
    `skyrl-train/examples/inference_http_endpoint/skyrl_gym_http_generator.py`
    with `/chat/completions`
- Correctness is good. Performance-wise it is better than
`/chat/completions` (since `/completions` is batched, so there is only one
request per engine), but still slower than the Python counterpart.

- Added a GPU unit test. Did not parameterize the test, but used a for loop
within the test, saving time by avoiding spinning up the engine
multiple times (tried using a fixture, but it makes server/Ray shutdown
tricky). Verified locally.
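The loop-instead-of-parametrize pattern described above can be sketched like this; `FakeEngine` is an illustrative stand-in for the real inference engine, whose startup is the expensive part the test wants to amortize.

```python
class FakeEngine:
    """Illustrative stand-in for the real inference engine."""

    def complete(self, prompts: list[str]) -> list[str]:
        return [f"out:{p}" for p in prompts]

    def shutdown(self) -> None:
        pass


def test_completions_endpoint():
    # Spin the engine up once and reuse it for every case below;
    # pytest.mark.parametrize would redo this setup (and the tricky
    # server/Ray teardown) for each parameter combination.
    engine = FakeEngine()
    try:
        for n_prompts in (1, 4, 8):
            outputs = engine.complete([f"p{i}" for i in range(n_prompts)])
            assert len(outputs) == n_prompts
    finally:
        engine.shutdown()
```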
ztcanddota added a commit to ztcanddota/skyagent that referenced this pull request Sep 28, 2025
SungjunlaLee added a commit to SungjunlaLee/SkyRL that referenced this pull request Jan 3, 2026
dzorlu referenced this pull request in fleet-ai/SkyRL Feb 4, 2026
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026