[fix] Fix the multi-engine launch script #157
Code Review
This pull request successfully fixes the multi-engine launch script by adding a missing configuration parameter and refactoring the Ray initialization logic. The refactoring separates environment preparation into its own function, which improves code clarity and robustness, especially in how it determines the maximum number of GPUs per node. The changes look good, but I have one suggestion to improve consistency in how timeouts are handled.
placement_groups = [placement_group(bundle) for bundle in bundles]
for pg in placement_groups:
    print(f"Waiting for Ray placement group to be ready...")
    get_ray_pg_ready_with_timeout(pg, timeout=180)
The timeout for waiting for the Ray placement group is hardcoded to 180. It would be more consistent and flexible to use the value from the --timeout command-line argument, which is stored in args.timeout. This allows users to control all related timeouts with a single parameter.
Suggested change
- get_ray_pg_ready_with_timeout(pg, timeout=180)
+ get_ray_pg_ready_with_timeout(pg, timeout=args.timeout)
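To illustrate the suggestion, here is a minimal sketch of how a single --timeout value could be parsed once and passed to every placement-group wait. It does not reproduce the actual script: build_parser and wait_for_placement_groups are hypothetical names, the default of 180 is only assumed to match the previously hardcoded value, and the wait function is passed in as a parameter so the sketch runs without Ray installed.

```python
import argparse
from typing import Any, Callable, Iterable


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; the real script's arguments may differ."""
    parser = argparse.ArgumentParser(description="multi-engine launch sketch")
    # Assumption: default of 180 seconds mirrors the previously hardcoded value.
    parser.add_argument(
        "--timeout",
        type=int,
        default=180,
        help="seconds to wait for each Ray placement group to become ready",
    )
    return parser


def wait_for_placement_groups(
    placement_groups: Iterable[Any],
    wait_fn: Callable[..., None],
    timeout: int,
) -> None:
    """Wait on each placement group, using one shared timeout value."""
    for pg in placement_groups:
        print("Waiting for Ray placement group to be ready...")
        # In the PR, wait_fn would be get_ray_pg_ready_with_timeout.
        wait_fn(pg, timeout=timeout)


# Demo with a stand-in wait function that only records its arguments.
recorded = []
args = build_parser().parse_args(["--timeout", "60"])
wait_for_placement_groups(
    ["pg-0", "pg-1"],
    lambda pg, timeout: recorded.append((pg, timeout)),
    args.timeout,
)
```

Passing the timeout explicitly keeps the waiting loop independent of the global args object, which also makes it easier to test in isolation.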
What does this PR do?
Refactor initialize_ray to separate out env preparation (e.g., determining environment variables).
Testing
Ran launch_multiple_remote_servers.py with 4 inference engines.