Closed
Changes from 15 commits
37 commits
3331bda
submit job
Yancey0623 Apr 11, 2017
7faa331
small png
Yancey0623 Apr 11, 2017
718f901
update
Yancey0623 Apr 15, 2017
fd9e1c2
add paddlepaddle commands
Yancey0623 Apr 15, 2017
c095a93
update png
Yancey0623 Apr 15, 2017
d74d9ba
udpate png
Yancey0623 Apr 15, 2017
f4c7bd2
adjust sytle
Yancey0623 Apr 15, 2017
bb7263f
update submit-job
Yancey0623 Apr 18, 2017
005c3e1
update
Yancey0623 Apr 18, 2017
0f113d3
update
Yancey0623 Apr 20, 2017
b6969f9
update paddle server
Yancey0623 Apr 21, 2017
1771707
update
Yancey0623 Apr 25, 2017
a21743a
update
Yancey0623 Apr 25, 2017
d643295
resize image
Yancey0623 Apr 25, 2017
5827dc1
update
Yancey0623 Apr 25, 2017
02d18b2
update
Yancey0623 Apr 27, 2017
68ff895
update
Yancey0623 Apr 27, 2017
1fd8900
udpate image location
Yancey0623 Apr 27, 2017
a57bb04
update
Yancey0623 Apr 28, 2017
1987b45
rename direcotry
Yancey0623 Apr 28, 2017
6f097a3
update
Yancey0623 Apr 29, 2017
bfdd1a3
trainer use replicaset instead of statefulset
Yancey0623 May 4, 2017
b56e7e7
update
Yancey0623 May 4, 2017
b8e63d9
update
Yancey0623 May 5, 2017
6cbf80d
update
Yancey0623 May 5, 2017
cb39a81
update
Yancey0623 May 6, 2017
063805d
update
Yancey0623 May 6, 2017
e2e6875
update
Yancey0623 May 6, 2017
5ec1deb
update
Yancey0623 May 6, 2017
53b5afa
update
Yancey0623 May 6, 2017
198d0d1
update
Yancey0623 May 7, 2017
838509b
update
Yancey0623 May 9, 2017
259731a
update
Yancey0623 May 11, 2017
a5a0aeb
update
Yancey0623 May 12, 2017
080e633
trainer function
Yancey0623 May 12, 2017
05d6e00
delete specify resource paragraph
Yancey0623 May 12, 2017
8486227
paramter image instead of base_image and runtime_image
Yancey0623 May 12, 2017
Binary file added doc/design/dist/submit-job-command-line.png
Binary file added doc/design/dist/submit-job-python.png
157 changes: 157 additions & 0 deletions doc/design/dist/submit-job.md
@@ -0,0 +1,157 @@
# Submit a Distributed Training Job

If a user wants to start up a local train, he will start up a PaddlePaddle product Docker container firstly, and then
Contributor

product -> production

execute `python train.py` in the Docker container. Details about the PaddlePaddle Docker image are [here](../../../paddle/scripts/docker/README.md).

If a user wants to start up a distributed training job, they will submit the distributed training job in Python code, or use a command line tool.
Contributor

Put submitting a distributed job first and submitting a local one after it? The focus of this document is describing how to submit a distributed job.

Contributor Author

Done.

Contributor

"or use a command line tool": my understanding is that for now we only plan to support starting distributed training from Python code?


The relation of PaddlePaddle, kubernetes and docker:
Contributor

relation -> relationship

The very next line is a first-level heading?

Contributor Author

Done, changed it to a second-level heading. Thanks for the correction.

Contributor

PaddlePaddle,Kubernetes and docker -> PaddlePaddle, kubernetes, and Docker

Proper nouns such as Kubernetes and Docker need an initial capital. Grammarly is handy for checking grammar.



# Runtime Environment On kubernetes
Contributor

Should this be a second-level heading?


For a distributed training job, there is two docker image called `runtime docker image` and `base docker image`, the `runtime docker image` is actually running in kubernetes.
Contributor

, the runtime docker image is actually running in kubernetes. -> . The runtime Docker image is the Docker image that gets scheduled by Kubernetes to run during training. The base image is for building the runtime image.

Contributor Author

done.

Contributor

In English, a name should be emphasized with italics rather than a code block, and only when it is first introduced; after that, no italics are needed.
For example, here

`runtime docker image`

should be

**runtime docker image**

and later occurrences of runtime docker image need no styling at all.

Contributor Author

Done, thanks for the correction.


- Base Docker Image

Usually, the `base docker image` is the PaddlePaddle production Docker image, which includes the paddle binary files and the trainer startup script. Users can also specify any image hosted on any Docker registry that they have access to.

- Runtime Docker Image

The Job Server automatically packages the trainer package that the user uploads, together with its Python dependencies, into a `runtime docker image` based on the `base docker image`.

- Python Dependencies

Users will provide a `requirments.txt` file in trainer packages, to list python dependencies packages, such as:
Contributor

Users will provide a requirments.txt file in trainer packages, to list python dependencies packages, such as:

You need to provide requirments.txt file in your "trainer" package. Example:

Contributor Author

Done

```txt
pillow
protobuf==3.1.0
```
some other details about `requirements` is [here](https://pip.readthedocs.io/en/1.1/requirements.html).
Contributor

some other details about requirements is here.

More details about requirements.

Contributor Author

done.


Here is an example project:
Contributor

An example project looks like:

```bash
paddle_example
  |-quick_start
    |-trainer.py
    |-dataset.py
    |-requirments.txt
```
When you execute the command `paddle train --trainer-package=./paddle_example/quick_start ...`, the PaddlePaddle client uploads the trainer package (quick_start) and the setup parameters to the [Job Server](#job-server).
Contributor

At the last meeting we seemed to agree to support only calls from Python at first. Since the Python code is covered below, is this still needed here?

Contributor Author

Done, removed the command-line submission part.
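
The runtime image described above (base image plus the uploaded trainer package plus its Python dependencies) could be produced from a generated Dockerfile. The following is only a sketch under assumptions: the function name, the `/workspace` directory, and the `requirements.txt` default are hypothetical (the draft itself spells the file `requirments.txt`), and the design does not say how the Job Server actually builds the image.

```python
def render_runtime_dockerfile(base_image,
                              package_dir="quick_start",
                              requirements_file="requirements.txt",
                              workdir="/workspace"):
    """Return Dockerfile text that layers a trainer package on a base image.

    All parameter names and defaults are assumptions made for illustration.
    """
    lines = [
        # Start from the base image chosen by the user.
        "FROM {}".format(base_image),
        "WORKDIR {}".format(workdir),
        # Copy the uploaded trainer package into the image.
        "COPY {}/ {}/".format(package_dir, workdir),
        # Install the Python dependencies listed by the user.
        "RUN pip install -r {}".format(requirements_file),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_runtime_dockerfile("paddlepaddle/paddle:0.10.rc2"))
```

Generating the Dockerfile on the server side keeps the user's machine free of any Docker tooling, which matches the benefit the design claims for building images on the cluster.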


## Submit a Distributed Training Job In Python Code
Contributor

Submit a Distributed Training Job In Python Code -> Submit Distributed Training Job With Python Code

<img src="./submit-job-python.png" width="800">
Contributor

Is the job server the same instance as the paddle cloud dashboard website?


Users will call `paddle.dist_train` and provide distributed training configuration as the parameters.
Contributor

Users will -> You can

```python
paddle.dist_train(
Contributor

For the parameters of both the Python and the command-line submission, please add each parameter's default value and whether it is optional or required.

Contributor Author

done.

Contributor

Should we introduce a separate struct for this? It would look more intuitive and make it easier to switch between remote and local training.
This should be changed to Python code.
My understanding is:

dist_train_setting = DistTrainSetting()
dist_train_setting.keyPath = "./key.pem"
dist_train_setting.trainers = 4
dist_train_setting.pservers=4
dist_train_setting.base_image="paddlepaddle/paddle:0.10.rc2"
dist_train_setting.use_gpu=False
dist_train_setting.job_name="quickstart"
paddle.train(..., dist_train_setting=dist_train_setting)

Contributor Author

Done, demo PR: #1918

    model,
    trainer=paddle.trainer.SGD(...,
                               paddle.updater.Adam(...)),
    reader=reader,
    job_name="quickstart",
    trainers=8,
Contributor Author

@helinwang @typhoonzero We discussed earlier that in GPU mode the user can simply specify gpu_num and let the platform optimize scheduling. In CPU mode, is it better to specify trainers or cpu-num? With the latter, we can optimize the trainer concurrency and the memory limit based on cpu-num.

Contributor Author

I think the initial design could look like this:

if use_gpu:
    # Initially, each trainer Pod uses one GPU card; the optimization goal is to let one trainer process use as many GPUs as possible
    trainer_count = gpu_num
else:
    # Initially, each trainer process uses one CPU; the optimization goal is to let one Pod use as many CPUs as possible, based on the CPUs currently idle in the cluster
    trainer_count = cpu_num
trainer_memory = memory / trainer_count
trainer_cpu = cpu_num / trainer_count

Because scheduling optimization needs to know the cluster's current state at submission time, implementing it in the JobServer seems more appropriate.

    pservers=4,
    input="/quickstart/input",
Contributor

We automatically mount the user's remote root directory at a fixed location, so the input and output directories don't need to be specified, right?

Contributor Author

Here, input is the directory the user uploads data to before training, and output is the directory the user downloads data from after training. I think users still need to specify an output directory: several jobs may be submitted at the same time, and things could get messy without one.

    output="/quickstart/output",
    base_image="paddlepaddle/paddle:0.10.rc2",
    use_gpu=False)
```
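
The allocation rule proposed in the review thread above (one trainer per GPU in GPU mode, one trainer per CPU otherwise, with memory and CPU split evenly) can be written as a small function. This is a sketch of that proposal only: the function name, the returned keys, and the error handling are assumptions, not part of `paddle.dist_train`.

```python
def plan_trainer_resources(use_gpu, gpu_num, cpu_num, memory):
    """Split the requested resources evenly across trainers.

    Hypothetical helper following the rule proposed in the review thread.
    `memory` may be in any unit; the result uses the same unit.
    """
    # GPU mode: one trainer per GPU card. CPU mode: one trainer per CPU.
    trainer_count = gpu_num if use_gpu else cpu_num
    if trainer_count <= 0:
        raise ValueError("at least one GPU (GPU mode) or CPU (CPU mode) is required")
    return {
        "trainer_count": trainer_count,
        "trainer_memory": memory / trainer_count,
        "trainer_cpu": cpu_num / trainer_count,
    }


if __name__ == "__main__":
    print(plan_trainer_resources(use_gpu=False, gpu_num=0, cpu_num=8, memory=16.0))
```

As the thread notes, a rule like this would most naturally live in the Job Server, because it is the component that can see the cluster's state at submission time.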

- Build Runtime Docker Image on Kubernetes

  `paddle.dist_train` deploys a Kubernetes Job that builds and pushes the runtime Docker image inside its pod. The parameter server and trainer pods then use this runtime Docker image.

  Building the Docker image on Kubernetes has some benefits:
  - `Docker in Docker` must mount `docker.sock` in the container and set `--privileged`, which is not safe when the code runs in a Kubernetes pod.
  - Users only need to upload the training package files; they do not depend on a Docker engine or a Docker registry.
  - If we want to switch to another image type, such as RKT, users do not need to care about it.

- Start Up Parameter Server and Trainer Job
  - Deploy the parameter server job, which is a Kubernetes StatefulSet.
  - Deploy the trainer job, which is a Kubernetes Job.
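
To make the two workloads above concrete, the sketch below builds minimal manifests as plain Python dictionaries. The field names follow the Kubernetes StatefulSet and Job schemas; the resource names, label key, and container names are hypothetical, and the `apiVersion` values are those of current Kubernetes releases rather than necessarily the cluster version this design targets.

```python
def pserver_statefulset(job_name, image, pservers):
    """Minimal StatefulSet manifest for the parameter servers (illustrative)."""
    name = "{}-pserver".format(job_name)
    labels = {"paddle-job-pserver": job_name}  # hypothetical label key
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,
            "replicas": pservers,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": "pserver", "image": image}]},
            },
        },
    }


def trainer_job(job_name, image, trainers):
    """Minimal Job manifest that runs `trainers` trainer pods in parallel."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "{}-trainer".format(job_name)},
        "spec": {
            "parallelism": trainers,
            "completions": trainers,
            "template": {
                "spec": {
                    "containers": [{"name": "trainer", "image": image}],
                    # A Job's pod template must not use the default "Always" policy.
                    "restartPolicy": "Never",
                }
            },
        },
    }
```

Both manifests point at the same runtime image, which is what lets the parameter server and trainer pods share the trainer package and its dependencies.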

## Submit a Distributed Training Job With a Command Line Tool
<img src="./submit-job-command-line.png" width="800">

- Configurate PaddlePaddle Client

Users should first configure the PaddlePaddle client through its configuration file. The default path is
`$HOME/.paddle/config`.

```yaml
apiVersion: v1
dockerRegistry:
  domain: domain.com  # default is docker.io
  username: <username>
  password: <password>
jobServer: http://<job server domain>:<job server port>
```

- Submit a Distributed Training Job
Users execute the command `paddle job submit` and provide the distributed training configuration as parameters.
```bash
paddle job submit \
  --job-name=cluster-quickstart \
  --trainer-package=$PWD/quick_start \
  --entry-point="python train.py" \
  --input=<input-dir> \
  --output=<output-dir> \
  --trainers=4 \
Contributor

This content duplicates the Python invocation section. Should we leave launching PaddlePaddle via the command line out of this version? The first release definitely won't support it, and it may be too far off to think about now.

Contributor Author

Done.

  --pservers=2 \
  --base-image=<paddle-image> \
  --use-gpu=true \
  --trainer-gpu-num=1 \
  --env="NUM_PASS=5"
```
- `job-name`: a unique name for the job
- `trainer-package`: the Python package files on your host
- `entry-point`: the entry point that starts the trainer process
- `input`: the input directory on the distributed file system
- `output`: the output directory on the distributed file system
- `trainers`: the number of trainer processes; users should set it if `use-gpu=false`
- `pservers`: the number of parameter server processes
- `base-image`: your trainer Docker image, which includes your trainer files and dependencies
- `use-gpu`: whether to train on GPUs
- `trainer-gpu-num`: the number of GPU cards for each paddle trainer process; required only if `use-gpu=true`
- `env`: an environment variable

The command `paddle train` packages the trainer package into a `trainer.tar.gz` file, calls `POST /v1/package` to upload it, and then calls `POST /v1/trainer/job` to start the distributed job.

- PaddlePaddle Client Commands:
  The command line tool also supports the following subcommands:
  - `paddle train`: start a training job.
  - `paddle list`: list all PaddlePaddle jobs in the current namespace.
  - `paddle cancel`: cancel a running job.
  - `paddle status`: show the status of a PaddlePaddle job.
  - `paddle version`: show job client and job server version info.
  - `paddle upload`: upload training data to distributed storage.
  - `paddle download`: download training data from distributed storage.
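
The packaging and upload step described above can be sketched with the Python standard library. This is a minimal illustration under assumptions: the helper names and the `Content-Type` header are hypothetical, and the draft does not specify the request body format that `POST /v1/package` expects. The request is only constructed here, not sent.

```python
import os
import tarfile
import urllib.request


def package_trainer(package_dir, out_path="trainer.tar.gz"):
    """Pack the trainer package directory into a gzip-compressed tarball."""
    # Store files under the package's own directory name, e.g. quick_start/...
    arcname = os.path.basename(os.path.normpath(package_dir))
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(package_dir, arcname=arcname)
    return out_path


def build_upload_request(job_server, tarball_path):
    """Build (but do not send) the POST /v1/package request for a tarball."""
    with open(tarball_path, "rb") as f:
        body = f.read()
    return urllib.request.Request(
        url=job_server.rstrip("/") + "/v1/package",
        data=body,
        method="POST",
        # Assumed header; the design does not specify the upload encoding.
        headers={"Content-Type": "application/gzip"},
    )
```

A client would pass the returned request to `urllib.request.urlopen` and then call `POST /v1/trainer/job` with the job parameters.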


# Job Server
The Job Server runs on Kubernetes. Users configure the server address in the [PaddlePaddle client configuration file](#configurate-paddlepaddle-client).

- RESTful API

  The Job Server provides a RESTful HTTP server that receives trainer packages, lists PaddlePaddle jobs, and so on.
  - `POST /v1/package`: receives the trainer package and saves it on GlusterFS
  - `POST /v1/trainer/job`: submits a trainer job
  - `GET /v1/jobs/`: lists all jobs
  - `GET /v1/jobs/<job-name>`: returns the status of a job
  - `DELETE /v1/jobs/<job-name>`: cancels a job
  - `GET /v1/version`: returns the job server version

- Build Runtime Docker Image On Kubernetes

  The Job Server deploys a Kubernetes Job that builds the runtime Docker image in its pod. The pserver and trainer pods use this runtime Docker image to start the pserver and trainer processes.

- Start Up PServers and Trainers Job
  - Deploy the pserver job, which is a Kubernetes StatefulSet.
  - Deploy the trainer job, which is a Kubernetes Job.
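
The RESTful endpoints listed in this section can be summarized as a route table that a client maps its actions onto. The action names and the `endpoint` helper below are hypothetical; only the HTTP methods and paths come from the list above.

```python
from urllib.parse import quote

# Action name -> (HTTP method, path template). Action names are illustrative.
ROUTES = {
    "upload_package": ("POST", "/v1/package"),
    "submit_job": ("POST", "/v1/trainer/job"),
    "list_jobs": ("GET", "/v1/jobs/"),
    "job_status": ("GET", "/v1/jobs/{job_name}"),
    "cancel_job": ("DELETE", "/v1/jobs/{job_name}"),
    "version": ("GET", "/v1/version"),
}


def endpoint(job_server, action, **params):
    """Return (method, url) for an action against the given job server."""
    method, path = ROUTES[action]
    # Percent-encode path parameters so unusual job names stay in one segment.
    safe_params = {key: quote(str(value), safe="") for key, value in params.items()}
    return method, job_server.rstrip("/") + path.format(**safe_params)
```

For example, `endpoint("http://jobserver:8080", "cancel_job", job_name="quickstart")` yields the `DELETE` request target for cancelling the `quickstart` job.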

# Work Feature
- V1
  - Submit a distributed training job in Python code.
- V2
  - Submit a distributed training job with a command line tool; build/push the Docker image and deploy the pserver/trainer jobs in the Job Server.