Merged
23 changes: 21 additions & 2 deletions doc/design/graph.md
@@ -1,4 +1,4 @@
# Design Doc: Computations as Graphs
# Design Doc: Computations as a Graph

A primary goal of the refactoring of PaddlePaddle is a more flexible representation of deep learning computations: in particular, a graph of operators and variables, instead of a sequence of layers as before.

@@ -8,6 +8,8 @@ This document explains the construction of a graph as three steps:
- construct the forward part
- construct the backward part
- construct the optimization part

## The Construction of a Graph
Contributor:

Sorry, I have some comments for a part that is not from this PR:

```python
optimize(cost)
train(cost, reader=mnist.train())
```

I think `train` should take the variable returned by the optimizer as its argument, not `cost`. For example, if two optimizers are connected to the cost, then specifying only the cost leaves the engine confused about which optimizer to run.

Collaborator Author @wangkuiyi, Sep 5, 2017:

I think the training needs 1) the cost, and 2) the parameter to be optimized to minimize the cost.

The cost is specified in the invocation to train.

Parameters could be created by a layer function like `layer.fc`, or by the user via `W = paddle.Var(type=parameter, ...)`. Either way, they are marked as parameters and can be updated.

So both the cost and the parameters are known prior to training. What do you think about this approach?

Contributor @helinwang, Sep 5, 2017:

> the training needs 1) the cost, and 2) the parameter to be optimized to minimize the cost.

I think it needs the optimizer as well (Adam or Adagrad).
For example, if the user does something like:

```python
opt0 = pd.Adam(cost)
opt1 = pd.Adagrad(cost)
train(cost, reader=mnist.train())
```

What optimizer will Paddle use for training? Maybe the code below is more concise:

```python
opt0 = pd.Adam(cost)
opt1 = pd.Adagrad(cost)
train(opt1, reader=mnist.train())
```

However, I just realized that the Python code you wrote is perhaps the V2 API, which may allow only one optimizer to be connected to the cost.

Collaborator Author:

Yes. What I mean is that we can have two forms of `Block::Eval` (see the sketch after this list):

1. One accepts targets of type `Variable`:

    ```c++
    void Block::Eval(vector<Variable*> targets);
    ```

    which is used to do forward computation. It traces only the operators in `BlockDesc::ops` that appear before the targets.

    1. Forward computation: Because our Python API doesn't expose gradient variables to users, the targets have to be forward variables, so this form of `Block::Eval` works only for forward computation.

    2. Backward computation: In the C++ world, `Block::Eval` can accept gradient variables as its targets. We can create a Python API function, say `backward`, that calls `Block::Eval` with gradient variables to do the backward computation.

2. The other form of `Block::Eval` accepts targets that are operators:

    ```c++
    void Block::Eval(vector<Operator*> targets);
    ```

    Somewhere in the C++ world, we can enumerate all optimization operators and use them as the targets, so that we can run the optimization step.
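
Below is a minimal C++ sketch of how these two overloads might behave, using simplified stand-in types rather than the real PaddlePaddle classes; all names here (`Variable`, `Operator`, `ops_`, `trace`) are assumptions for illustration only.

```c++
#include <string>
#include <unordered_set>
#include <vector>

// Simplified stand-ins for the real classes -- illustration only.
struct Variable {
  std::string name;
};

struct Operator {
  std::vector<Variable*> inputs;
  std::vector<Variable*> outputs;
  void Run() { /* launch this operator's kernel */ }
};

class Block {
 public:
  // Form 1: targets are variables.  Runs, in program order, every operator
  // in ops_ that some target (transitively) depends on.
  void Eval(const std::vector<Variable*>& targets) {
    std::unordered_set<Variable*> needed(targets.begin(), targets.end());
    std::vector<Operator*> trace;
    // Walk ops_ backwards: an op is needed if it produces a needed variable.
    for (auto it = ops_.rbegin(); it != ops_.rend(); ++it) {
      Operator* op = *it;
      bool produces_needed = false;
      for (Variable* out : op->outputs) {
        if (needed.count(out) > 0) produces_needed = true;
      }
      if (produces_needed) {
        trace.push_back(op);
        for (Variable* in : op->inputs) needed.insert(in);
      }
    }
    // trace is in reverse program order; run it forwards.
    for (auto it = trace.rbegin(); it != trace.rend(); ++it) (*it)->Run();
  }

  // Form 2: targets are operators, e.g. all optimization operators.
  // Evaluate their input variables first, then run the targets themselves.
  void Eval(const std::vector<Operator*>& targets) {
    std::vector<Variable*> inputs;
    for (Operator* op : targets) {
      inputs.insert(inputs.end(), op->inputs.begin(), op->inputs.end());
    }
    Eval(inputs);
    for (Operator* op : targets) op->Run();
  }

 private:
  std::vector<Operator*> ops_;  // operators in program order (BlockDesc::ops)
};
```

Both overloads execute only the portion of `ops_` needed for their targets, which matches the forward/backward/optimization split discussed above.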


Let us take the problem of image classification as a simple example. The application program that trains the model looks like:

```python
x = layer.data("images")
l = layer.data("label")
y = layer.fc(x)
cost = layer.mse(y, l)

optimize(cost)
train(cost, reader=mnist.train())
```

@@ -25,7 +27,9 @@

The first four lines of the above program build the forward part of the graph.

![](images/graph_construction_example_forward_only.png)

In particular, the first line `x = layer.data("images")` creates variable x and a Feed operator that copies a column from the minibatch to x. `y = layer.fc(x)` creates not only the FC operator and output variable y, but also two parameters, W and b.
In particular, the first line `x = layer.data("images")` creates variable x and a Feed operator that copies a column from the minibatch to x. `y = layer.fc(x)` creates not only the FC operator and output variable y, but also two parameters, W and b, and the initialization operators.

Initialization operators are a kind of "run-once" operator -- the `Run` method increments a class data member counter so that it runs at most once. By doing so, a parameter wouldn't be initialized repeatedly, say, in every minibatch.
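
A rough C++ sketch of the run-once idea follows; the class and member names are hypothetical, not the actual operator:

```c++
// Hypothetical run-once initialization operator: Run() does the work only
// the first time it is called, so a parameter is not re-initialized on
// every minibatch.
class InitOp {
 public:
  void Run() {
    if (run_count_ > 0) return;  // already initialized -- do nothing
    ++run_count_;
    // ... fill the parameter tensor, e.g. with random values ...
  }

 private:
  int run_count_ = 0;  // the "class data member counter" mentioned above
};
```
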
Member:

Run-once may not be a very good choice, because sometimes the user may want to reinitialize the params. Maybe we should think of a better way to do it.

Collaborator Author:

Agreed. It would be great if we could find another solution. How about we keep the run-once operator as a viable solution for now, and update it later once we have a better idea?

Contributor:

#3862 (comment) could solve it :)


In this example, all operators are created as `OpDesc` protobuf messages, and all variables are `VarDesc`. These protobuf messages are saved in a `BlockDesc` protobuf message.
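
As a rough illustration of how a layer function might populate these messages, here is a C++ sketch using plain structs in place of the generated protobuf classes; the struct layouts and the `AppendFC` helper are assumptions, not the real API:

```c++
#include <string>
#include <vector>

// Simplified stand-ins for the OpDesc / VarDesc / BlockDesc protobuf messages.
struct VarDesc {
  std::string name;
};
struct OpDesc {
  std::string type;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};
struct BlockDesc {
  std::vector<OpDesc> ops;
  std::vector<VarDesc> vars;
};

// What a layer function like layer.fc conceptually does: append the output
// variable, the parameters W and b, their Init operators, and the FC
// operator itself to the block, then return the name of the output.
std::string AppendFC(BlockDesc* block, const std::string& x) {
  block->vars.push_back({"W"});
  block->vars.push_back({"b"});
  block->vars.push_back({"y"});
  block->ops.push_back({"init", {}, {"W"}});
  block->ops.push_back({"init", {}, {"b"}});
  block->ops.push_back({"fc", {x, "W", "b"}, {"y"}});
  return "y";
}
```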

@@ -49,3 +53,18 @@ According to the chain rule of gradient computation, `ConstructBackwardGraph` wo
For each parameter, like W and b created by `layer.fc` and marked as double circles in the above graphs, `ConstructOptimizationGraph` creates an optimization operator to apply its gradient. This results in the complete graph:

![](images/graph_construction_example_all.png)
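
A possible C++ sketch of what `ConstructOptimizationGraph` might do, over simplified stand-ins for the protobuf messages; the `is_parameter` flag, the `@GRAD` naming convention, and the plain `sgd` operator are assumptions for illustration, not the actual implementation:

```c++
#include <string>
#include <vector>

// Simplified stand-ins for the protobuf messages (illustration only).
struct VarDesc {
  std::string name;
  bool is_parameter = false;  // assumed flag marking W, b, etc.
};
struct OpDesc {
  std::string type;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};
struct BlockDesc {
  std::vector<OpDesc> ops;
  std::vector<VarDesc> vars;
};

// Assumed naming convention for the gradient of a variable.
std::string GradientName(const std::string& var) { return var + "@GRAD"; }

// For every variable marked as a parameter (W and b in the example), append
// one optimization operator that reads the parameter and its gradient and
// writes the updated parameter back.  A real op would also carry attributes
// such as the learning rate.
void ConstructOptimizationGraph(BlockDesc* block) {
  for (const VarDesc& var : block->vars) {
    if (!var.is_parameter) continue;
    block->ops.push_back(
        {"sgd", {var.name, GradientName(var.name)}, {var.name}});
  }
}
```
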
Contributor @helinwang, Sep 4, 2017:

I think we should say A depends on B only if, in every step, running A requires running B. For example, we probably should not say "MSE" depends on "init" (although, according to the dependency chain, "MSE" currently does depend on "init" in the graph). Otherwise, we need to come up with a way to let "init" run only once during training.

In my opinion, we need two kinds of directed edges: one for dependency and one for data flow. And maybe for this discussion we don't need to draw the intermediate variables. In the graph below, the dotted line is data flow and the solid line is dependency. In this representation there is no cycle in the graph, and "MSE" no longer depends on "init".

The user can call "init all" to do initialization, and call training later (which does not do init again, since there is no dependency).

(screenshot: the proposed graph, with dotted edges for data flow and solid edges for dependency)

Collaborator Author:

Got it. I love this idea and the figure! I agree that there are two kinds of dependencies -- the data dependency and the execution dependency. Currently, we treat them as the same and represent both by the order of operators in the array `repeated OpDesc ops` in the protobuf message `BlockDesc`.

I am not sure whether it is necessary to explicitly describe these two kinds of dependencies in our protobuf messages. One reason is that I am not sure what `InitAll` is -- is it a Var like those returned by operator binding functions, or is it an operator?

Contributor @helinwang, Sep 5, 2017:

Sorry, I should have made "init all" clearer. It's an OP that joins/merges all the dependencies: it runs when all of its dependencies are done, and it does nothing itself (it is only used to join the dependencies). Maybe we can call it join or merge.

The reason we need to explicitly describe these two kinds of dependencies is that the PaddlePaddle scheduler only needs to schedule OPs according to the dependency constraint (data flow is no longer a scheduling constraint). For example, in this case, even though "init" writes to var "B" (data flows from "init" to "B"), var "B" no longer depends on "init", so when doing optimization, "init" will not be scheduled.
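
A tiny C++ sketch of this scheduling idea, with the two kinds of edges kept as separate fields (all names here are hypothetical):

```c++
#include <unordered_set>
#include <vector>

// An operator node that keeps the two kinds of edges separate: `deps` are
// execution dependencies (scheduling constraints), while `data_inputs` only
// records where data flows from and is NOT consulted by the scheduler.
struct OpNode {
  std::vector<OpNode*> deps;
  std::vector<OpNode*> data_inputs;
  void Run() { /* launch this operator's kernel */ }
};

// Run `target` after recursively running everything it depends on.  An op
// such as "init" that merely writes data read by `target` is not scheduled
// unless it is also listed as a dependency.
void Schedule(OpNode* target, std::unordered_set<OpNode*>* done) {
  if (done->count(target) > 0) return;
  for (OpNode* dep : target->deps) Schedule(dep, done);
  target->Run();
  done->insert(target);
}
```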

Contributor @helinwang, Sep 5, 2017:

Another solution is TF's solution: there is no var type in the graph. A graph has only OPs, and every directed edge is a tensor rather than a var. A var is represented by a "var OP", which has only outputs (it outputs the handle for read/write) but no inputs:
(screenshot: a TensorFlow-style graph in which a variable appears as a "var OP" with outputs only)

Collaborator Author:

To record the temporary conclusions from our offline discussions:

1. TensorFlow's graph representation embeds variables into operators, and
2. it requires users to specify the input, output, and dependent operators for each operator.

The specification of dependencies looks ugly. So let's follow our current design of using variables and operators.


## Block and Graph

The words block and graph are interchangeable in the design of PaddlePaddle. A [Block](https://github.com/PaddlePaddle/Paddle/pull/3708) is a metaphor for the code and local variables in a pair of curly braces in programming languages, where operators are like statements or instructions. A graph of operators and variables is a representation of the block.

A Block keeps operators in an array `BlockDesc::ops`

```protobuf
message BlockDesc {
  repeated OpDesc ops = 1;
  repeated VarDesc vars = 2;
}
```

in the order in which they appear in user programs, like the Python program at the beginning of this article. We can imagine that in `ops` we have some forward operators, followed by some gradient operators, and then some optimization operators.
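
One consequence of this ordering is that an executor could run just a prefix of `ops` for forward (inference) computation and the whole array for a training step. A minimal sketch, assuming hypothetical `OpDesc`/`RunOp` stand-ins rather than the real API:

```c++
#include <cstddef>
#include <string>
#include <vector>

struct OpDesc {
  std::string type;  // simplified stand-in for the protobuf message
};

void RunOp(const OpDesc& op) { /* look up and launch the kernel for op.type */ }

// Forward operators come first in `ops`, so inference can run just that
// prefix, while a training step runs the whole array: forward, then
// gradient, then optimization operators.
void RunForward(const std::vector<OpDesc>& ops, std::size_t num_forward_ops) {
  for (std::size_t i = 0; i < num_forward_ops && i < ops.size(); ++i) RunOp(ops[i]);
}

void RunTrainStep(const std::vector<OpDesc>& ops) {
  for (const OpDesc& op : ops) RunOp(op);
}
```
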
4 changes: 4 additions & 0 deletions doc/design/images/graph_construction_example.dot
@@ -2,6 +2,8 @@ digraph ImageClassificationGraph {
///////// The forward part /////////
FeedX [label="Feed", color=blue, shape=box];
FeedY [label="Feed", color=blue, shape=box];
InitW [label="Init", color=blue, shape=diamond];
Initb [label="Init", color=blue, shape=diamond];
FC [label="FC", color=blue, shape=box];
MSE [label="MSE", color=blue, shape=box];

@@ -14,6 +16,8 @@

FeedX -> x -> FC -> y -> MSE -> cost [color=blue];
FeedY -> l [color=blue];
InitW -> W [color=blue];
Initb -> b [color=blue];
W -> FC [color=blue];
b -> FC [color=blue];
l -> MSE [color=blue];
Binary file modified doc/design/images/graph_construction_example_all.png
Binary file modified doc/design/images/graph_construction_example_forward_only.png