Add ExecutionPlan design. #6078
Changes from 3 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -143,3 +143,13 @@ message BlockDesc { | |
| // https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/program.md | ||
| // for more details. | ||
| message ProgramDesc { repeated BlockDesc blocks = 1; } | ||
|
|
||
| message OpPlacement { | ||
| optional string name = 1; | ||
| optional string device = 2; | ||
|
Member
I am also wondering whether device info for the Operator alone is enough.
Member
So this is the only case in which we should set the device for a variable. In other cases, the variable's device can be determined from the operator's device info.
Contributor
Author
@QiJune thanks, great question, I guess we need
Contributor
Author
Isn't the data initially on the CPU and copied to the GPU implicitly when needed? Since we don't do explicit copies, maybe we don't need
Contributor
Why not put
Contributor
Maybe we also need to allow users to specify the device information by two approaches:
Contributor
Author
The In the future, when we have that API, we can add it to
Contributor
It's fine to add it to
Contributor
Author
@typhoonzero I think Since there are two entities: In the future, when we want to enable the user to configure which device an OP runs on, we can put the field indicating the device in Maybe I need to change
message ExecutionPlan {
  repeated BlockDesc blocks = 1;
  repeated OpPlacement op_placement = 2;
}
What do you think? |
||
| } | ||
|
|
||
|
Contributor
Maybe we can add a detailed example in a comment:
message OpPlacement {
  // pserver:gpu0
  optional string name = 1;
  optional string device = 2;
}
Contributor
Author
I think the "pserver" in "pserver:gpu0" is not necessary; the executor does not need to know what role (e.g., pserver) it takes. Maybe "gpu0" alone is sufficient.
Contributor
bit confused how would
Contributor
Author
|
||
| message ExecutionPlan { | ||
| optional ProgramDesc program = 1; | ||
|
||
| repeated OpPlacement op_placement = 2; | ||
| } | ||
|
Member
So how do we find the correspondence between an OpPlacement in ExecutionPlan and an OpDesc in ProgramDesc?
Contributor
Author
The number will be the same: each OP will have one placement. The order does not have to be the same, otherwise the "name" field in
Collaborator
Program{Block{Op}}. A Program has many blocks, and a block has many ops. However, the placements are stored as one flat list on the Program, so we cannot get a one-to-one mapping from this data structure.
Contributor
Author
Sorry, I don't fully get this point. I thought different OPs have different names? |
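To make the name-based lookup in this thread concrete, here is a minimal sketch in Python rather than protobuf. It assumes every op carries a name that is unique across all blocks, which the thread has not settled. The dataclasses are hypothetical stand-ins for the messages in the diff, and `resolve_placements` is an illustrative helper, not part of Paddle.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical stand-ins for the protobuf messages in this diff.
@dataclass
class OpDesc:
    name: str  # assumption: each op has a name unique across all blocks


@dataclass
class BlockDesc:
    ops: List[OpDesc] = field(default_factory=list)


@dataclass
class ProgramDesc:
    blocks: List[BlockDesc] = field(default_factory=list)


@dataclass
class OpPlacement:
    name: str
    device: str


@dataclass
class ExecutionPlan:
    program: ProgramDesc
    op_placement: List[OpPlacement] = field(default_factory=list)


def resolve_placements(plan: ExecutionPlan) -> Dict[str, str]:
    """Map each op name to a device; raise if the mapping is not one-to-one."""
    op_names = [op.name for block in plan.program.blocks for op in block.ops]
    if len(set(op_names)) != len(op_names):
        raise ValueError("op names are not unique across blocks")

    devices: Dict[str, str] = {}
    for placement in plan.op_placement:
        if placement.name in devices:
            raise ValueError(f"duplicate placement for op {placement.name!r}")
        devices[placement.name] = placement.device

    # Every op needs exactly one placement, and no placement may be orphaned.
    missing = set(op_names) - set(devices)
    extra = set(devices) - set(op_names)
    if missing or extra:
        raise ValueError(
            f"unmatched placements: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    return devices
```

Under that assumption the order of `op_placement` does not matter, because the lookup is keyed by name. If op names can repeat across blocks, the check at the top fails, which is the case the one-to-one concern above describes.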
||
The available devices for distributed training are dynamic. Should this plan be regenerated every time the available devices change (device added/removed/updated)? How are we going to deploy it efficiently?
Yes, this should be regenerated every time the available devices change. Currently, in distributed training we can have a constant number of trainers/pservers; I think that's a good starting point.
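As a rough illustration of "regenerate when the available devices change", the sketch below rebuilds the op-to-device mapping only when the set of device names differs from the last one seen. Everything here is hypothetical: `PlanCache` and `assign_round_robin` are illustrative names, and round-robin is a placeholder policy, not one proposed in this PR.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional


def assign_round_robin(op_names: List[str], devices: Iterable[str]) -> Dict[str, str]:
    """Placeholder policy (an assumption): spread ops over the sorted devices."""
    ordered = sorted(devices)
    if not ordered:
        raise ValueError("no devices available")
    return {name: ordered[i % len(ordered)] for i, name in enumerate(op_names)}


class PlanCache:
    """Rebuilds the op-to-device mapping only when the device set changes."""

    def __init__(self, op_names: List[str]) -> None:
        self._op_names = list(op_names)
        self._devices: Optional[FrozenSet[str]] = None
        self._placement: Dict[str, str] = {}
        self.generations = 0  # number of times a mapping was (re)built

    def placement_for(self, devices: Iterable[str]) -> Dict[str, str]:
        current = frozenset(devices)
        if current != self._devices:
            # A device was added or removed: regenerate the mapping.
            self._placement = assign_round_robin(self._op_names, current)
            self._devices = current
            self.generations += 1
        return dict(self._placement)
```

With a constant number of trainers/pservers the mapping is built once and reused. Comparing device names only detects additions and removals; a device that is updated in place would need some extra version or capability field in the comparison.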