Closed

Commits (changes shown from 7 of 37 commits)
3331bda
submit job
Yancey0623 Apr 11, 2017
7faa331
small png
Yancey0623 Apr 11, 2017
718f901
update
Yancey0623 Apr 15, 2017
fd9e1c2
add paddlepaddle commands
Yancey0623 Apr 15, 2017
c095a93
update png
Yancey0623 Apr 15, 2017
d74d9ba
udpate png
Yancey0623 Apr 15, 2017
f4c7bd2
adjust sytle
Yancey0623 Apr 15, 2017
bb7263f
update submit-job
Yancey0623 Apr 18, 2017
005c3e1
update
Yancey0623 Apr 18, 2017
0f113d3
update
Yancey0623 Apr 20, 2017
b6969f9
update paddle server
Yancey0623 Apr 21, 2017
1771707
update
Yancey0623 Apr 25, 2017
a21743a
update
Yancey0623 Apr 25, 2017
d643295
resize image
Yancey0623 Apr 25, 2017
5827dc1
update
Yancey0623 Apr 25, 2017
02d18b2
update
Yancey0623 Apr 27, 2017
68ff895
update
Yancey0623 Apr 27, 2017
1fd8900
udpate image location
Yancey0623 Apr 27, 2017
a57bb04
update
Yancey0623 Apr 28, 2017
1987b45
rename direcotry
Yancey0623 Apr 28, 2017
6f097a3
update
Yancey0623 Apr 29, 2017
bfdd1a3
trainer use replicaset instead of statefulset
Yancey0623 May 4, 2017
b56e7e7
update
Yancey0623 May 4, 2017
b8e63d9
update
Yancey0623 May 5, 2017
6cbf80d
update
Yancey0623 May 5, 2017
cb39a81
update
Yancey0623 May 6, 2017
063805d
update
Yancey0623 May 6, 2017
e2e6875
update
Yancey0623 May 6, 2017
5ec1deb
update
Yancey0623 May 6, 2017
53b5afa
update
Yancey0623 May 6, 2017
198d0d1
update
Yancey0623 May 7, 2017
838509b
update
Yancey0623 May 9, 2017
259731a
update
Yancey0623 May 11, 2017
a5a0aeb
update
Yancey0623 May 12, 2017
080e633
trainer function
Yancey0623 May 12, 2017
05d6e00
delete specify resource paragraph
Yancey0623 May 12, 2017
8486227
paramter image instead of base_image and runtime_image
Yancey0623 May 12, 2017
137 changes: 137 additions & 0 deletions doc/design/dist/submit-job.md
@@ -0,0 +1,137 @@

# PaddlePaddle Client
> **Contributor:** Since you are describing the PaddlePaddle client, do you need to describe its full feature set, like local training, showing the version, etc.?

The PaddlePaddle client is a command-line tool; you can use it to start a local training job or to submit a distributed training job to a Kubernetes cluster.

The relation of PaddlePaddle, kubernetes and docker:
> **Contributor:** relation -> relationship. Also, does the line right below jump straight to a first-level heading?

> **Contributor (author):** Done, changed it to a second-level heading. Thanks for the correction.

> **Contributor:** "PaddlePaddle,Kubernetes and docker" -> "PaddlePaddle, Kubernetes, and Docker". Kubernetes and Docker are proper nouns and need capitalization; Grammarly is handy for catching grammar issues like this.


<img src="./submit-job.png" width="500">
> **Collaborator:** Questions about this figure:
>
> 1. I am not sure that pservers and trainers should be in two jobs. In our current configuration, each trainer has a pserver running on the same physical node to optimally overlay networking and computing.
> 2. `paddle` must communicate with Kubernetes' API server to start the job. It might communicate with the master process of the job too.

> **Contributor (author):**
>
> > I am not sure that pservers and trainers should be in two jobs
>
> The current configuration has two problems:
>
> 1. It is bad for large-scale training: since trainer count == pserver count, when the trainer count is large we do not actually need the same number of pservers, and having that many only raises the probability of pserver failure and increases the network load.
> 2. Since pserver and trainer start in the same container, one of them has to run in the background, which does not fit the container design principle and hurts failure detection and recovery.
>
> > each trainer has a pserver running on the same physical node to optimally overlay networking and computing
>
> Controlling the number of pservers looks like a better way to optimize the network, and since a trainer has to communicate with all pservers anyway, having just one pserver start locally does not seem to help much.
>
> > paddle must communicate with Kubernetes' API server to start the job. It might communicate to the master process of the job too
>
> I think @helinwang's comment makes sense; `paddle` can communicate only with the master, which exists as a service. Note that this master is not the same master as the one in https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/dist#master-process. I will revise this part of the description in the next update.

> **Contributor (author):** Done.

> **Contributor:** Does "start up a local training job" in the figure really need the PaddlePaddle client? It feels like a plain local `python train.py` would do.

> **Contributor (author):** Local training currently requires `docker run ... python train.py`; having the PaddlePaddle client support local training also spares users from having to learn Docker operations.

> **Contributor:** I see. So besides downloading the Docker image, does the user also need to download a separate launch script?

> **Contributor (author):** Let's discuss this offline, along with whether the Queue implementation should use etcd :)



# Running your training locally
Execute `paddle local train` to run your training locally.
```bash
paddle local train
> **Collaborator:** The first argument after `paddle` should be a command, and `local` isn't even a verb. It seems it could simply be `paddle train --locally`, or just `paddle train` without Kubernetes-related arguments.

> **Contributor (author):** Done.

--pcakage-path=./demo
> **Collaborator:** pcakage ==> package

--module=demo.train
--input=<input_dir>
--output=<output_dir>
--image=<paddle_image>
--e=NUM_PASS=4
> **Collaborator:** --env == -e

> **Contributor (author):** Done.

```
- package-path: the Python package containing your trainer code
- module: the trainer entry point; it must include a main function
- input: input directory; for local training this is a host path
- output: output directory; for local training this is a host path
- image: the PaddlePaddle production image
- e: an environment variable

When you start the local train, the client starts a docker container like:
> **Collaborator:** ==> When users start a local training job ...

```bash
docker run --rm
-v <input_dir>:/train/input
> **Collaborator:** How about changing `/train/{input,output,package}` into `/{input,output,package}`?

> **Contributor (author):** Done.

-v <output_dir>:/train/output
-v <package-path>:/train/package
-e NUM_PASS=4 <paddle_image>
> **Collaborator:** I guess we also need `-e PYTHONPATH=/train/package` so that Python can find imported packages there?

> **Contributor (author):** Done.

python /train/package/train.py
```
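A hedged sketch of the command above with the review suggestions applied (shorter mount points and `PYTHONPATH` set). The paths and image name are illustrative, and the composed command is echoed rather than executed, so the sketch is safe to run without Docker installed:

```bash
# Illustrative values only -- substitute your own paths and image.
INPUT=./data/input
OUTPUT=./data/output
PACKAGE=./demo
IMAGE=paddlepaddle/paddle:0.10.0rc2

# Compose the docker invocation the client might run; echo it as a dry run.
cmd="docker run --rm \
  -v $INPUT:/input \
  -v $OUTPUT:/output \
  -v $PACKAGE:/package \
  -e PYTHONPATH=/package \
  -e NUM_PASS=4 \
  $IMAGE python /package/train.py"
echo "$cmd"
```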


# Submit distributed training job
You can use `paddle job submit train <job-name>` to submit a distributed training job.

```bash
paddle job submit train <job-name>
> **Collaborator:** `paddle job submit train` ==> `paddle train`

> **Contributor (author):** Done.

--package-path=/train/quick_start
--module=quick_start.train
--input=<input_dir>
--output=<output_dir>
--trainers=4
--pservers=2
--image=<your image>
-e=NUM_PASS=5
```

- job-name: a unique name for the job
- package-path: your Python package files
- module: the trainer entry point; it must include the main function
- input: input directory on the distributed file system
- output: output directory on the distributed file system
- trainers: trainer process count
- pservers: parameter server process count
- image: your trainer Docker image, including your trainer files and dependencies
- e: an environment variable

## Build your docker image
Before submitting a distributed training job, you should build your Docker image; here
> **Collaborator:** What I expected is that `paddle train` packs everything into a Docker image, rather than users packing it themselves?

> **Contributor (author):** Done.

is a simple example project:
```bash
paddle_example
|-Dockerfile
`-quick_start
|-trainer.py
`-dataset.py
```
Execute `docker build -t <your repo>/paddle_dist_example .` in the `paddle_example` directory, and then
push the image with `docker push <your repo>/paddle_dist_example`.
The `Dockerfile` should add your Python package to the image:
```bash
FROM paddlepaddle/paddle:0.10.0rc2
ADD ./quick_start /train/quick_start
CMD ["python", "/train/quick_start/train.py"]
```
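The build-and-push steps above can be sketched as a dry run; `myrepo` is an illustrative repository name, and the commands are echoed rather than executed so the sketch does not require Docker:

```bash
REPO=myrepo                      # illustrative; substitute your registry/repo
IMAGE="$REPO/paddle_dist_example"

# Echo the commands instead of running them (dry run).
echo "docker build -t $IMAGE ."  # run inside the paddle_example/ directory
echo "docker push $IMAGE"
```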

## Master process
The master process is the bootstrap and manager process for a distributed job: it deploys the parameter server and trainer processes and dispatches tasks to the trainers. It is implemented in Go.

- Setup master process

When a user submits a distributed training job, the PaddlePaddle client deploys a master process, which is a Job resource named `<job-name>-master` on Kubernetes.

- Startup pservers and trainers

The master process deploys pservers and trainers on Kubernetes; they are also Job resources, named `<job-name>-pserver` and `<job-name>-trainer`. Because the trainers need the pservers' IPs, the startup order matters:
  - Deploy the pserver job and wait for its status to become `RUNNING`.
  - Fetch all pservers' IPs and pass them as trainer parameters.
  - Deploy the trainer job.
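The startup ordering above can be sketched in shell. `all_running` is a hypothetical helper, not part of the design, and the `kubectl` invocations in the comments (labels, jsonpath expressions) are illustrative:

```bash
# Hypothetical helper: succeeds only when every supplied pod phase is "Running".
all_running() {
  for phase in "$@"; do
    [ "$phase" = "Running" ] || return 1
  done
  return 0
}

# Sketch of the master's ordering (kubectl calls illustrative, not verified):
#   kubectl create -f pserver-job.yaml
#   until all_running $(kubectl get pods -l paddle-job=<job-name>-pserver \
#       -o 'jsonpath={.items[*].status.phase}'); do sleep 5; done
#   IPS=$(kubectl get pods -l paddle-job=<job-name>-pserver \
#       -o 'jsonpath={.items[*].status.podIP}')
#   kubectl create -f trainer-job.yaml   # pass $IPS to the trainers

all_running Running Running && echo "pservers ready"
all_running Running Pending || echo "still waiting"
```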

- Dispatch tasks to trainers

A detailed description is [here](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/dist#master-process).

## Data source
- Distributed file system

You can upload your training data to a distributed file system such as GlusterFS;
PaddlePaddle provides a default reader for reading data from a distributed file system.
- HTTP server

TODO
> **Contributor:** I think it's already mentioned in #1696; maybe we can give a general introduction (as you already did) and reference it there after that PR is merged.

> **Contributor (author):** Delete this section :)

- Real-time data

TODO

## PaddlePaddle client commands:
> **Contributor:** I think for running locally, we could just let the user do `python train.py`.


- `paddle local`

- `paddle local train`

  Start a local training job.
- `paddle job`

  Start a distributed job.
- `paddle job submit`

  Submit a PaddlePaddle distributed training job using kubectl.
- `paddle job status`

  Check the job status.
- `paddle job list`

  List existing PaddlePaddle distributed jobs on Kubernetes.
- `paddle job cancel`

  Cancel a running PaddlePaddle distributed job.
- `paddle version`

  Show the PaddlePaddle client version.
Binary file added doc/design/dist/submit-job.png