Update graph construction design doc #3862
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,4 @@ | ||
| # Design Doc: Computations as Graphs | ||
| # Design Doc: Computations as a Graph | ||
|
|
||
| A primary goal of the refactorization of PaddlePaddle is a more flexible representation of deep learning computation, in particular, a graph of operators and variables, instead of sequences of layers as before. | ||
|
|
||
|
|
@@ -8,6 +8,8 @@ This document explains that the construction of a graph as three steps: | |
| - construct the backward part | ||
| - construct the optimization part | ||
|
|
||
| ## The Construction of a Graph | ||
|
|
||
| Let us take the problem of image classification as a simple example. The application program that trains the model looks like: | ||
|
|
||
| ```python | ||
|
|
@@ -25,7 +27,9 @@ The first four lines of above program build the forward part of the graph. | |
|
|
||
|  | ||
|
|
||
| In particular, the first line `x = layer.data("images")` creates variable x and a Feed operator that copies a column from the minibatch to x. `y = layer.fc(x)` creates not only the FC operator and output variable y, but also two parameters, W and b. | ||
| In particular, the first line `x = layer.data("images")` creates variable x and a Feed operator that copies a column from the minibatch to x. `y = layer.fc(x)` creates not only the FC operator and output variable y, but also two parameters, W and b, and the initialization operators. | ||
|
|
||
| Initialization operators are a kind of "run-once" operator -- the `Run` method increments a class data member counter so that it runs at most once. This way, a parameter won't be initialized repeatedly, say, in every minibatch. | |
|
Member

"run-once" may not be a very good choice, because sometimes the user may want to reinitialize the params. Maybe we should think of a better way to do it.
Collaborator
Author

Agreed. It would be great if we could find another solution. How about we keep the run-once operator as a viable solution for now, and update it later once we have a better idea?
Contributor

#3862 (comment) could solve it :) |
||
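To make the "run-once" behavior concrete, here is a minimal Python sketch (not the actual C++ operator classes; `GaussianInitOp`, `scope`, and the member names are made up for illustration) of an initialization operator whose `run` method uses a counter so the parameter is filled at most once:

```python
import numpy as np


class GaussianInitOp:
    """Toy "run-once" operator: fills a parameter the first time, then does nothing."""

    def __init__(self, output_var, shape, std=0.01):
        self.output_var = output_var   # name of the parameter to initialize
        self.shape = shape
        self.std = std
        self.run_count = 0             # the class data member counter

    def run(self, scope):
        if self.run_count > 0:         # already initialized -- skip
            return
        scope[self.output_var] = np.random.normal(0.0, self.std, self.shape)
        self.run_count += 1


scope = {}
init_w = GaussianInitOp("W", shape=(784, 10))
for _ in range(3):                     # even if scheduled every minibatch, it runs only once
    init_w.run(scope)
print(scope["W"].shape)                # (784, 10)
```

As the thread above notes, reinitializing the parameters would require resetting this counter or a different mechanism altogether.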
|
|
||
| In this example, all operators are created as `OpDesc` protobuf messages, and all variables are `VarDesc`. These protobuf messages are saved in a `BlockDesc` protobuf message. | ||
|
|
||
|
|
@@ -49,3 +53,18 @@ According to the chain rule of gradient computation, `ConstructBackwardGraph` wo | |
| For each parameter, like W and b created by `layer.fc`, marked as double circles in the above graphs, `ConstructOptimizationGraph` creates an optimization operator to apply its gradient. This results in the complete graph: | |
|
|
||
|  | ||
|
Contributor

I think we should say A depends on B only if, in every step, running A requires running B. For example, we probably should not say "MSE" depends on "init" (though, according to the dependency chain, "MSE" currently depends on "init" in the graph). Otherwise we need to come up with a way to let "init" run only once during training. In my opinion, we need two kinds of directed edges: one for dependency, one for data flow. And maybe for discussion we don't need to draw the intermediate variables. In the graph below, the dotted line is data flow and the solid line is dependency. In this representation there is no cycle in the graph, and "MSE" no longer depends on "init". The user can call "init all" to do initialization, and call training later (which does not do init again, since there is no dependency).
Collaborator
Author

Got it. I love this idea and the figure! I agree that there are two kinds of dependencies -- the data dependency and the execution dependency. Currently, we treat them as the same and represent them by the order of operators in the array.

I am not sure whether it is necessary to explicitly describe these two kinds of dependencies in our protobuf messages. One reason is that I am not sure what InitAll is -- is it a Var like those returned by operator binding functions, or is it an operator?
Contributor

Sorry, I should have made "init all" clearer. It's an OP that joins/merges all the dependencies: it runs when all its dependencies are done, and it does nothing itself (it is only used to join the dependencies). Maybe we can call it join or merge.

The reason why we need to explicitly describe these two kinds of dependencies is that the PaddlePaddle scheduler only needs to schedule OPs to run according to the dependency constraints (data flow is no longer a scheduling constraint). For example, in this case, even though "init" writes to var "B" (data flows from "init" to "B"), var "B" no longer depends on "init", so when doing optimization, "init" will not be scheduled.
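A minimal runnable sketch of this proposal (the `Op` class, the `join` op, and the scheduling logic are hypothetical, not an existing PaddlePaddle API): the scheduler follows only dependency edges, so data-flow edges such as init → B never force "init" to be scheduled during training.

```python
class Op:
    """An op with two kinds of edges: `deps` (scheduling) and `reads` (data flow)."""

    def __init__(self, name, deps=(), reads=(), action=None):
        self.name = name
        self.deps = list(deps)     # solid edges: scheduling constraints
        self.reads = list(reads)   # dotted edges: data flow, ignored by the scheduler
        self.action = action or (lambda: None)

    def run(self, done):
        for dep in self.deps:      # run dependencies first, each at most once
            if dep.name not in done:
                dep.run(done)
        self.action()
        done.add(self.name)


init_W = Op("init_W", action=lambda: print("init W"))
init_b = Op("init_b", action=lambda: print("init b"))
init_all = Op("init_all", deps=[init_W, init_b])   # join/merge op: no action of its own

fc = Op("fc", reads=["W", "b"], action=lambda: print("fc"))   # data flows in from W and b,
mse = Op("mse", deps=[fc], action=lambda: print("mse"))       # but no dependency edge to init

init_all.run(set())   # the user initializes explicitly, once
for _ in range(2):    # training never schedules the init ops again
    mse.run(set())
```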
Collaborator
Author

Recording the temporary conclusions from the offline discussions:

The explicit specification of dependencies looks ugly, so let's follow our current design of using variables and operators. |
||
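To make this step concrete, here is a rough Python sketch of appending one optimization operator per parameter; the descriptor objects, the `sgd` op layout, and the `@GRAD` naming are simplifications for illustration, not the actual `ConstructOptimizationGraph` code.

```python
from types import SimpleNamespace


def construct_optimization_graph(block, learning_rate=0.01):
    """Append one SGD-style update op per variable marked as a parameter.

    Assumes `block.vars` holds descriptors with `.name` and `.is_parameter`,
    `block.ops` is the ordered op list, and the gradient of a parameter W is
    named "W@GRAD" -- all simplifications for this sketch.
    """
    for var in block.vars:
        if not var.is_parameter:
            continue
        block.ops.append({
            "type": "sgd",
            "inputs": {"Param": var.name, "Grad": var.name + "@GRAD"},
            "outputs": {"ParamOut": var.name},
            "attrs": {"learning_rate": learning_rate},
        })


# W and b were marked as parameters by layer.fc; x is an ordinary variable.
block = SimpleNamespace(
    vars=[SimpleNamespace(name="W", is_parameter=True),
          SimpleNamespace(name="b", is_parameter=True),
          SimpleNamespace(name="x", is_parameter=False)],
    ops=[],
)
construct_optimization_graph(block)
print([op["inputs"]["Param"] for op in block.ops])   # ['W', 'b']
```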
|
|
||
| ## Block and Graph | ||
|
|
||
| The words block and graph are interchangeable in the design of PaddlePaddle. A [Block](https://github.com/PaddlePaddle/Paddle/pull/3708) is a metaphor for the code and local variables in a pair of curly braces in programming languages, where operators are like statements or instructions. A graph of operators and variables is a representation of the block. | |
|
|
||
| A Block keeps operators in an array `BlockDesc::ops` | ||
|
|
||
| ```protobuf | ||
| message BlockDesc { | ||
| repeated OpDesc ops = 1; | ||
| repeated VarDesc vars = 2; | ||
| } | ||
| ``` | ||
|
|
||
| in the order in which they appear in user programs, like the Python program at the beginning of this article. We can imagine that in `ops`, we have some forward operators, followed by some gradient operators, and then some optimization operators. | |
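To make the ordering concrete, here is a small sketch using stand-in Python dataclasses that mirror the protobuf messages above (this is not the generated protobuf API; fields and op type names are simplified for illustration):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OpDesc:
    type: str


@dataclass
class VarDesc:
    name: str


@dataclass
class BlockDesc:
    ops: List[OpDesc] = field(default_factory=list)    # repeated OpDesc ops = 1;
    vars: List[VarDesc] = field(default_factory=list)  # repeated VarDesc vars = 2;


block = BlockDesc()
block.vars += [VarDesc("x"), VarDesc("W"), VarDesc("b"), VarDesc("y"), VarDesc("cost")]

# Forward operators come first, in the order the user program created them ...
block.ops += [OpDesc("feed"), OpDesc("fc"), OpDesc("mse")]
# ... then the gradient operators appended by ConstructBackwardGraph ...
block.ops += [OpDesc("mse_grad"), OpDesc("fc_grad")]
# ... and finally the optimization operators appended by ConstructOptimizationGraph.
block.ops += [OpDesc("sgd"), OpDesc("sgd")]
```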


There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, have some comment for the part that is not from this PR:
I think train should use the var returned by optimizer as argument, not cost. For example if two optimizer is connected with the cost, only specifying the cost the engine would have confusion of with optimizer to run.
I think the training needs 1) the cost, and 2) the parameters to be optimized to minimize the cost.

The cost is specified in the invocation to `train`. Parameters could be created by a layer function like `layer.fc`, or by the user via `W = paddle.Var(type=parameter, ...)`. Either way, they are marked as parameters and can be updated. So both the cost and the parameters are known prior to training. What do you think about this approach?
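A minimal runnable sketch of this viewpoint, with stand-in definitions in place of the real `layer.fc` / `train` API (the names and bookkeeping here are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class Var:
    name: str
    is_parameter: bool = False


_graph_vars = []          # stand-in for the block's variable list


def fc(input_var, name):
    """Stand-in for layer.fc: creates W and b and marks them as parameters."""
    w = Var(name + "_W", is_parameter=True)
    b = Var(name + "_b", is_parameter=True)
    _graph_vars.extend([w, b])
    return Var(name + "_out")


def train(cost):
    # The cost is passed explicitly; the parameters are recovered by scanning
    # the graph for variables marked as parameters, so no optimizer handle is needed.
    params = [v.name for v in _graph_vars if v.is_parameter]
    print("minimize", cost.name, "w.r.t.", params)


x = Var("images")
y = fc(x, "fc1")
cost = Var("mse")
train(cost)               # prints: minimize mse w.r.t. ['fc1_W', 'fc1_b']
```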
I think it needs the optimizer as well (Adam or Adagrad). For example, if the user sets up more than one optimizer for the same cost, what optimizer will Paddle use for training? Maybe passing the optimizer's output to `train` is more concise (see the sketch below).

However, I just realized that the Python code you wrote is perhaps the V2 API, which may only allow one optimizer to be connected with the cost.
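A hypothetical sketch of the ambiguity (toy classes, not a reconstruction of the original snippets): when two optimizers are attached to the same cost, passing only the cost leaves the choice of update rule unspecified, while passing the optimizer's output does not.

```python
class Optimizer:
    """Toy optimizer whose minimize() returns a named update target."""

    def __init__(self, kind):
        self.kind = kind

    def minimize(self, cost):
        return f"{self.kind}_update({cost})"


def train(target):
    # The engine just runs whatever target it is handed.
    print("running:", target)


cost = "mse"
adam_step = Optimizer("Adam").minimize(cost)        # first optimizer on the cost
adagrad_step = Optimizer("Adagrad").minimize(cost)  # second optimizer on the same cost

# train(cost) would be ambiguous here: Adam or Adagrad?
train(adam_step)   # passing the optimizer's output makes the choice explicit
```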
Yes. What I mean is that we can have two forms of `Block::Eval`.

One form accepts targets of type `Variable` and is used to do forward computation. It traces only the operators in `BlockDesc::ops` before `targets`.

- Forward computation: Because our Python API doesn't expose gradient variables to users, `targets` have to be forward variables, so this form of `Block::Eval` works only with forward computation.
- Backward computation: In the C++ world, `Block::Eval` can accept gradient variables as its targets. We can create a Python API function, say `backward`, which calls `Block::Eval` with gradient variables to do the backward computation.

The other form of `Block::Eval` accepts targets as operators. Somewhere in the C++ world, we can enumerate all optimization operators and use them as the targets, so we could run the optimization step.
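A rough Python sketch of the two forms being discussed (a toy `Block`, not the real C++ `Block::Eval`; the op descriptors and the `@GRAD` names are invented for illustration):

```python
class Block:
    """Toy block: `ops` is an ordered list of dicts with "type" and "outputs"."""

    def __init__(self, ops):
        self.ops = ops

    def eval_vars(self, targets):
        """Form 1: targets are variable names.

        Run ops in order, stopping once every target has been produced,
        i.e. trace only the operators in `ops` before the targets.
        """
        pending = set(targets)
        for op in self.ops:
            if not pending:
                break
            print("run", op["type"])
            pending -= set(op["outputs"])

    def eval_ops(self, target_types):
        """Form 2: targets are operators (e.g. all optimization ops)."""
        for op in self.ops:
            if op["type"] in target_types:
                print("run", op["type"])


block = Block([
    {"type": "feed", "outputs": ["x"]},
    {"type": "fc", "outputs": ["y"]},
    {"type": "mse", "outputs": ["cost"]},
    {"type": "mse_grad", "outputs": ["y@GRAD"]},
    {"type": "fc_grad", "outputs": ["W@GRAD", "b@GRAD"]},
    {"type": "sgd", "outputs": ["W"]},
])

block.eval_vars(["cost"])     # forward computation only
block.eval_vars(["W@GRAD"])   # backward computation (callable from C++ or a `backward` API)
block.eval_ops({"sgd"})       # run the optimization step
```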