[fix] Fix the pretty trainer logging #270
Conversation
Code Review
This pull request aims to fix logging for Ray workers by moving the configuration logic out of the problematic worker_process_setup_hook. A new function, configure_ray_worker_logging, has been added to handle log formatting and routing.
However, there is a critical issue with the current implementation. The logging configuration function is called from RayPPOTrainer.__init__, which executes on the driver process. This means the logging for the actual Ray workers (the remote actors) will not be configured, and the fix will not have the intended effect. I've provided a critical review comment to address this by moving the function call to the worker initialization logic, which is essential for this fix to work as described.
SumanthRH
left a comment
LGTM, let's just verify with the GSM8K example
Yes, done. I also ran the test suite that failed previously.
Re-implement the logging fix of #250 that was reverted in #261. The issue was that using the `worker_process_setup_hook` to set logging behavior interfered with vLLM using Ray as its tensor-parallel backend and threw an error; vLLM apparently needs this hook to be unset. The logging configuration was moved into `RayPPOTrainer.__init__`.
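For context on why #250 was reverted: the original fix registered the logging setup cluster-wide through Ray's `runtime_env`. A hedged sketch of that reverted pattern (the module path in the string is a placeholder, not the project's real path):

```python
# The reverted approach from #250: register a setup hook that Ray runs in every
# worker process it starts. Ray accepts either a callable or a module-path string.
runtime_env = {
    "worker_process_setup_hook": "my_pkg.logging_utils.configure_ray_worker_logging",
}

# ray.init(runtime_env=runtime_env)  # left commented: registering this hook
# cluster-wide is what reportedly broke vLLM's Ray tensor-parallel backend,
# which needs the hook to be unset.
print(sorted(runtime_env))
```

This is why the fix has to configure logging from inside each worker's own initialization instead of through a global hook.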