[fix] Fix the pretty trainer logging#270

Merged
tyler-griggs merged 4 commits into main from tgriggs/log_fix_again
Sep 11, 2025

Conversation

@tyler-griggs
Member

Re-implement the logging fix of #250 that was reverted in #261

The issue was that using the worker_process_setup_hook to set logging behavior interfered with vLLM using Ray as its tensor-parallel backend and raised an error. vLLM apparently needs this hook to be unset.

Moved the logging configuration into the RayPPOTrainer `__init__`.
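As a rough illustration of the kind of helper being moved (a minimal stdlib-only sketch; the function name comes from the review below, but its exact format string and behavior in SkyRL are assumptions):

```python
import logging
import sys

def configure_ray_worker_logging(level=logging.INFO):
    """Sketch: route the process's root logger to stdout with a compact,
    readable format. Assumed signature; the real SkyRL helper may differ."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("[%(levelname)s %(asctime)s] %(name)s: %(message)s")
    )
    root = logging.getLogger()
    root.handlers.clear()  # avoid duplicated log lines if called twice
    root.addHandler(handler)
    root.setLevel(level)
```

Because this only touches the calling process's root logger, it must run inside each process whose output should be pretty-printed.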

@tyler-griggs tyler-griggs changed the title from "[fix]" to "[fix] Fix the pretty trainer logging" on Sep 9, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to fix logging for Ray workers by moving the configuration logic out of the problematic worker_process_setup_hook. A new function, configure_ray_worker_logging, has been added to handle log formatting and routing.

However, there is a critical issue with the current implementation. The logging configuration function is called from RayPPOTrainer.__init__, which executes on the driver process. This means the logging for the actual Ray workers (the remote actors) will not be configured, and the fix will not have the intended effect. I've provided a critical review comment to address this by moving the function call to the worker initialization logic, which is essential for this fix to work as described.
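The driver-versus-worker distinction raised here can be demonstrated without Ray: with the `spawn` start method, a child process gets a fresh interpreter, so logging handlers attached in the parent (the "driver") are absent in the child (a stand-in for a remote Ray actor). A minimal sketch:

```python
import logging
import multiprocessing as mp

def report_handler_count(queue):
    # Runs in the child: a freshly spawned process starts with an
    # unconfigured root logger, so driver-side setup is invisible here.
    queue.put(len(logging.getLogger().handlers))

def demo():
    # "Driver-side" configuration: attach a handler in this process only.
    logging.getLogger().addHandler(logging.StreamHandler())
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    proc = ctx.Process(target=report_handler_count, args=(queue,))
    proc.start()
    child_count = queue.get()
    proc.join()
    return len(logging.getLogger().handlers), child_count

if __name__ == "__main__":
    parent_count, child_count = demo()
    print(f"parent handlers: {parent_count}, child handlers: {child_count}")
```

The parent reports at least one handler while the child reports zero, which is why configuration done in `RayPPOTrainer.__init__` on the driver would not, by itself, reformat worker-side logs.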

Member

@SumanthRH SumanthRH left a comment

LGTM, let's just verify with the GSM8K example

@tyler-griggs
Member Author

Yes, done. I also ran the test suite that failed previously.

@tyler-griggs tyler-griggs merged commit fd8275f into main Sep 11, 2025
3 checks passed
@tyler-griggs tyler-griggs deleted the tgriggs/log_fix_again branch September 12, 2025 20:19
dzorlu referenced this pull request in fleet-ai/SkyRL Feb 4, 2026