fix MultiGradientMachine train and infer #2595
Merged
fix: #2534 #2565
Problem:
When training or inferring with the Python v2 API, if trainer_count > 1 and trainer.train or inferer.infer is called multiple times, the process hangs.
Reason:
When trainer_count > 1, Paddle uses MultiGradientMachine, which starts multiple worker threads to do the forward/backward work (the number of threads equals trainer_count).
In the v2 Python API, the trainer or inferer calls gradient_machine.finish() after train/infer, which stops those worker threads. When trainer.train or inferer.infer is called a second time, there are no worker threads left to handle the task, so the call hangs.
Fix:
Do not stop the worker threads when gradientMachine.finish() is called; stop them in the destructor of gradientMachine instead, so the machine can be reused across repeated train/infer calls.
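Below is a minimal, self-contained sketch of the lifecycle change, not the actual MultiGradientMachine code. The class and method names (WorkerPool, submitJob) are hypothetical; it only illustrates the pattern assumed by this fix: finish() merely waits for in-flight work to drain, while the threads are joined exactly once, in the destructor.

```cpp
// Hypothetical worker pool illustrating the fix. Before the fix, finish()
// joined the worker threads, so a second round of train/infer had no threads
// left to pick up jobs and the caller blocked forever. After the fix, the
// threads stay alive until the object is destroyed.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class WorkerPool {
public:
  explicit WorkerPool(int numWorkers) {
    for (int i = 0; i < numWorkers; ++i) {
      workers_.emplace_back([this] { workerLoop(); });
    }
  }

  // finish(): only wait until all submitted jobs are done.
  // Worker threads are NOT stopped here, so the pool remains usable.
  void finish() {
    std::unique_lock<std::mutex> lock(mutex_);
    idleCv_.wait(lock, [this] { return jobs_.empty() && busy_ == 0; });
  }

  // Threads are stopped exactly once, when the pool itself is destroyed.
  ~WorkerPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    jobCv_.notify_all();
    for (auto& t : workers_) t.join();
  }

  void submitJob(std::function<void()> job) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      jobs_.push(std::move(job));
    }
    jobCv_.notify_one();
  }

private:
  void workerLoop() {
    for (;;) {
      std::function<void()> job;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        jobCv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
        if (stopping_ && jobs_.empty()) return;
        job = std::move(jobs_.front());
        jobs_.pop();
        ++busy_;
      }
      job();
      {
        std::lock_guard<std::mutex> lock(mutex_);
        --busy_;
      }
      idleCv_.notify_all();
    }
  }

  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> jobs_;
  std::mutex mutex_;
  std::condition_variable jobCv_;
  std::condition_variable idleCv_;
  int busy_ = 0;
  bool stopping_ = false;
};
```

With this split, a caller can run submitJob(...) then finish(), and later submit more jobs on the same pool without hanging, which mirrors calling trainer.train or inferer.infer more than once on the same gradient machine.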