fix MultiGradientMachine train and infer #2595
Merged
fix: #2534 #2565
Problem:
When training or inferring with the Python v2 API, if trainer_count > 1 and trainer.train or inferer.infer is called multiple times, the process hangs.
Reason:
When trainer_count > 1, Paddle uses MultiGradientMachine, which starts multiple worker threads to do the forward/backward work (the number of threads equals trainer_count).
In the v2 Python API, the trainer or inferer calls gradient_machine.finish() after train/infer, which stops those worker threads. When trainer.train or inferer.infer is called a second time, there are no worker threads left to handle the task, so the call hangs.
Fix:
Do not stop the worker threads when gradientMachine.finish() is called; stop them in the destructor of gradientMachine instead, so the machine can be reused across repeated train/infer calls.
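Below is a minimal, self-contained sketch of the lifecycle change, not the actual MultiGradientMachine code. The class and method names (WorkerPool, submitJob) are hypothetical; it only illustrates the pattern assumed by this fix: finish() merely waits for in-flight work to drain, while the threads are joined exactly once, in the destructor.

```cpp
// Hypothetical worker pool illustrating the fix. Before the fix, finish()
// joined the worker threads, so a second round of train/infer had no threads
// left to pick up jobs and the caller blocked forever. After the fix, the
// threads stay alive until the object is destroyed.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class WorkerPool {
public:
  explicit WorkerPool(int numWorkers) {
    for (int i = 0; i < numWorkers; ++i) {
      workers_.emplace_back([this] { workerLoop(); });
    }
  }

  // finish(): only wait until all submitted jobs are done.
  // Worker threads are NOT stopped here, so the pool remains usable.
  void finish() {
    std::unique_lock<std::mutex> lock(mutex_);
    idleCv_.wait(lock, [this] { return jobs_.empty() && busy_ == 0; });
  }

  // Threads are stopped exactly once, when the pool itself is destroyed.
  ~WorkerPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    jobCv_.notify_all();
    for (auto& t : workers_) t.join();
  }

  void submitJob(std::function<void()> job) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      jobs_.push(std::move(job));
    }
    jobCv_.notify_one();
  }

private:
  void workerLoop() {
    for (;;) {
      std::function<void()> job;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        jobCv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
        if (stopping_ && jobs_.empty()) return;
        job = std::move(jobs_.front());
        jobs_.pop();
        ++busy_;
      }
      job();
      {
        std::lock_guard<std::mutex> lock(mutex_);
        --busy_;
      }
      idleCv_.notify_all();
    }
  }

  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> jobs_;
  std::mutex mutex_;
  std::condition_variable jobCv_;
  std::condition_variable idleCv_;
  int busy_ = 0;
  bool stopping_ = false;
};
```

With this split, a caller can run submitJob(...) then finish(), and later submit more jobs on the same pool without hanging, which mirrors calling trainer.train or inferer.infer more than once on the same gradient machine.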