Commits
3331bda
submit job
Yancey0623 Apr 11, 2017
7faa331
small png
Yancey0623 Apr 11, 2017
718f901
update
Yancey0623 Apr 15, 2017
fd9e1c2
add paddlepaddle commands
Yancey0623 Apr 15, 2017
c095a93
update png
Yancey0623 Apr 15, 2017
d74d9ba
udpate png
Yancey0623 Apr 15, 2017
f4c7bd2
adjust sytle
Yancey0623 Apr 15, 2017
bb7263f
update submit-job
Yancey0623 Apr 18, 2017
005c3e1
update
Yancey0623 Apr 18, 2017
0f113d3
update
Yancey0623 Apr 20, 2017
b6969f9
update paddle server
Yancey0623 Apr 21, 2017
1771707
update
Yancey0623 Apr 25, 2017
a21743a
update
Yancey0623 Apr 25, 2017
d643295
resize image
Yancey0623 Apr 25, 2017
5827dc1
update
Yancey0623 Apr 25, 2017
02d18b2
update
Yancey0623 Apr 27, 2017
68ff895
update
Yancey0623 Apr 27, 2017
1fd8900
udpate image location
Yancey0623 Apr 27, 2017
a57bb04
update
Yancey0623 Apr 28, 2017
1987b45
rename direcotry
Yancey0623 Apr 28, 2017
6f097a3
update
Yancey0623 Apr 29, 2017
bfdd1a3
trainer use replicaset instead of statefulset
Yancey0623 May 4, 2017
b56e7e7
update
Yancey0623 May 4, 2017
b8e63d9
update
Yancey0623 May 5, 2017
6cbf80d
update
Yancey0623 May 5, 2017
cb39a81
update
Yancey0623 May 6, 2017
063805d
update
Yancey0623 May 6, 2017
e2e6875
update
Yancey0623 May 6, 2017
5ec1deb
update
Yancey0623 May 6, 2017
53b5afa
update
Yancey0623 May 6, 2017
198d0d1
update
Yancey0623 May 7, 2017
838509b
update
Yancey0623 May 9, 2017
259731a
update
Yancey0623 May 11, 2017
a5a0aeb
update
Yancey0623 May 12, 2017
080e633
trainer function
Yancey0623 May 12, 2017
05d6e00
delete specify resource paragraph
Yancey0623 May 12, 2017
8486227
paramter image instead of base_image and runtime_image
Yancey0623 May 12, 2017
125 changes: 125 additions & 0 deletions doc/design/cluster_train/submit-job.md
@@ -0,0 +1,125 @@
# Submit a Distributed Training Job

The user can submit a distributed training job with Python code, rather than with a command-line interface.

## Runtime Environment On Kubernetes

For a distributed training job, there are two Docker images: the *runtime Docker image* and the *base Docker image*. The runtime Docker image is the image that Kubernetes schedules and runs during training. The base Docker image is used to build the runtime Docker image.

### Base Docker Image

Usually, the base Docker image is the PaddlePaddle production Docker image, which includes the paddle binaries and the Python package. Of course, users can specify any image name hosted on any Docker registry to which they have access rights.

### Runtime Docker Image

The trainer package uploaded by the user and its Python dependencies are packaged into a runtime Docker image based on the base Docker image.

- Handle Python Dependencies

You need to provide a `requirements.txt` file in your `trainer-package` folder. Example:

```txt
pillow
protobuf==3.1.0
```
More [details](https://pip.readthedocs.io/en/1.1/requirements.html) about requirements files are in the pip documentation. An example project looks like:
```bash
paddle_example
  |-quick_start
    |-trainer.py
    |-dataset.py
    |-requirements.txt
```
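
To make the relationship between the two images concrete, here is a minimal, hypothetical sketch of how a runtime image could be assembled from the base image and the uploaded trainer package; the function name, paths, and image tags are illustrative assumptions, not the actual Paddle Cloud implementation.

```python
# Hypothetical sketch: assemble a runtime image from the base image and the
# trainer package. Paths, names, and tags are assumptions for illustration.
import os
import subprocess


def build_runtime_image(base_image, trainer_package_dir, runtime_image_tag):
    """Generate a Dockerfile on top of `base_image` and build the runtime image."""
    dockerfile = "\n".join([
        "FROM %s" % base_image,
        # Copy the uploaded trainer package (trainer.py, dataset.py, requirements.txt, ...).
        "COPY . /root/trainer-package",
        # Install the user's Python dependencies declared in requirements.txt.
        "RUN pip install -r /root/trainer-package/requirements.txt",
    ])
    with open(os.path.join(trainer_package_dir, "Dockerfile"), "w") as f:
        f.write(dockerfile)
    # Build with the trainer package directory as the Docker build context.
    subprocess.check_call(
        ["docker", "build", "-t", runtime_image_tag, trainer_package_dir])


# Example call with hypothetical names:
# build_runtime_image("paddlepaddle/paddle", "./quick_start", "my-registry/paddle-runtime:job-1")
```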

## Submit Distributed Training Job With Python Code
<img src="./src/submit-job-python.png" width="800">
Contributor

Need a paragraph describing the flow as a big picture. The paragraph below goes directly into the details of paddle.dist_train; the reader needs the big picture to follow the concept.

Contributor Author

Done.

Contributor

Docker build / Docker push should go on the arrow lines, shouldn't they? They are actions, so they belong on the edges rather than in the boxes (the boxes in this diagram are basically nouns / concepts).

Contributor Author

Done.


- `paddle.job.dist_train()` will call the Job Server API `/v1/packages` to upload the trainer package and save it on CephFS, and then call `/v1/trainer/job` to submit the PaddlePaddle distributed job (a rough client-side sketch of these calls follows this list).
Contributor

Could you look into why `paddle.job.dist_train()` and similar identifiers are not getting syntax highlighted properly? They currently render like this:
(screenshot: "screen shot 2017-05-05 at 11 40 19 am")

Contributor Author

A blank line was missing... Done.

Contributor
@helinwang May 5, 2017

To make our first version concrete, could we change "The first version, we only implement submit the PaddlePaddle job in paddle.job.dist_train()" to "For the first version, we will not prepare the runtime docker image, instead, we will mount the trainer package in a temporary folder into the base docker image. We will not support custom Python dependencies in the first version as well."?

Contributor Author

Done.

- `/v1/trainer/job` will start a build job that prepares the runtime Docker image. When the build job finishes, the Job Server will submit the PaddlePaddle distributed job to Kubernetes.
Contributor

For the first version, is the job submitted through the Job Server, or directly from local Python?

Contributor Author
@Yancey0623 May 11, 2017

The first version submits the job directly from the local machine; local submission needs to build the runtime Docker image. I added a sentence about this here.
Done.

- *NOTE*: For the first version, we will not prepare the runtime Docker image; instead, we will mount the trainer package from a temporary folder into the base Docker image. We will also not support custom Python dependencies in the first version.
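
To make the flow above concrete, here is a rough, hypothetical client-side sketch of the two Job Server calls made by `paddle.job.dist_train()`, written with the `requests` library. Only the endpoint paths come from this document; the Job Server address, field names, and payload layout are assumptions.

```python
# Hypothetical sketch of the client-side submission flow; only the endpoint
# paths (/v1/packages, /v1/trainer/job) come from this design, everything else
# (host, field names, payload) is assumed for illustration.
import requests

JOB_SERVER = "http://job-server:8080"  # assumed address


def submit_distributed_job(trainer_package_tar, job_config):
    # 1. Upload the trainer package; the Job Server saves it on CephFS.
    with open(trainer_package_tar, "rb") as package:
        resp = requests.post(JOB_SERVER + "/v1/packages",
                             files={"package": package})
    resp.raise_for_status()

    # 2. Submit the PaddlePaddle distributed job with its configuration.
    resp = requests.post(JOB_SERVER + "/v1/trainer/job", json=job_config)
    resp.raise_for_status()
    return resp.json()
```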

You can call `paddle.job.dist_train` and provide the distributed training configuration as parameters:
```python
paddle.job.dist_train(
    trainer=paddle.trainer.SGD(...,
                               paddle.updater.Adam(...)),
    reader=reader,
    paddle_job=PaddleJob(
        pserver_bucket="standard",
        base_image="yancey1989/paddle-cloud",
        job_name="paddle-job",
        namespace="paddle-cloud",
        cpu_num=3,
        memory="1G",
        trainer_package="/example/word2vec",
        entry_point="python %s" % __file__))
```
Contributor

I don't understand why namespace is needed; I thought job_name could serve as the namespace?

Contributor Author

The current approach on the Kubernetes cluster is to create one namespace per user, and multiple jobs run inside a namespace. However, a local submission can read the namespace from ~/.kube/config, and PaddleCloud can also get it from an environment variable, so this parameter can be removed :)
Done.

Contributor Author

Also, while implementing PaddleJob I found that dist_train may only need two parameters, trainer and paddle_job, where trainer is a function; this is explained in the PR.
Done.

The pseudo code of `paddle.job.dist_train` is as follows:
```python
import os

def dist_train(trainer, reader, num_passes=1, event_handler=None, feeding=None, paddle_job=None):
    # PADDLE_ON_CLOUD is set to YES when the code is already running on the cloud.
    if os.getenv("PADDLE_ON_CLOUD", "NO") == "NO":
        # Running locally: submit the PaddlePaddle job to the cluster.
        paddle_job.submit()
    else:
        # Running on the cloud: start the training.
        trainer.train(reader, num_passes, event_handler, feeding)
```

parameter | required | default | explanation
--- | --- | --- | ---
job_name|YES||the unique name of the training job
entry_point|YES||entry point for starting the trainer process
trainer_package|YES||path to the trainer package, which the user must have access rights to
base_image|YES||the [base image](#base-docker-image) used to build the [runtime image](#runtime-docker-image)
memory|YES||memory allocated for each trainer
cpu_num|YES|1|total CPU count used by all trainers
Contributor

The cpu_num parameter alone doesn't seem to be enough: for example, if it is set to 5, does that mean one pod with 5 CPUs, or 5 pods with 1 CPU each? Changing it to something like this might be better (I'm not familiar with Python idioms, so I'm using Go for the example):

```go
type CPUTrainConfig struct {
  totalTrainer int
  totalCPU int
}

type GPUTrainConfig struct {
  totalGPU int
  totalCPU int
}
```

Splitting this into two structs keeps users from setting totalTrainer, totalGPU, and totalCPU at the same time; setting all three together makes no sense. For GPU training, the total trainer count is not something the user can know: for example, totalGPU=8 could mean 8 GPUs with 2 trainers, or 8 GPUs with 8 trainers; that is optimized by our scheduling system.

Contributor Author

> The cpu_num parameter alone doesn't seem to be enough: for example, if it is set to 5, does that mean one pod with 5 CPUs, or 5 pods with 1 CPU each?

The total trainer count for CPU training could be optimized by our scheduling system just like for GPU; if that feels too complex, we can have them specified separately in this version.

In CPU mode, if the total trainer count is specified, it is easier to understand if the CPU count per trainer is specified as well; otherwise the user has to divide the total CPU count by the total trainer count to know how many CPUs each trainer gets. So CPU mode needs two parameters, `trainers` and `cpu_num`, while GPU mode needs a single parameter, `total_gpu`:

- CPU mode
  - You must specify `trainers`, the trainer count.
  - You must specify `cpu_num`, the CPU count for each trainer.
- GPU mode
  - You must specify `total_gpu`, the total GPU count used by all trainers.
  - You do not specify `trainers`; this will be allocated by the PaddlePaddle Job.

@typhoonzero please also follow this discussion.

Contributor
@helinwang May 9, 2017

Sounds good 👍

Contributor

@Yancey1989 @helinwang I wrote down some of my thoughts at #2047

The required arguments are:

- parallelism: equals the number of trainers; the number of pservers is calculated from parallelism.
- num_gpus: GPU resources needed; if num_gpus == 0 while the env "PADDLE_USE_GPU" is set to True, or the opposite, paddle will throw a warning message when submitting the job.
- num_cpus: CPU resources.

The function call dist_train() will generate Kubernetes YAML files according to these three arguments. Users need to specify "parallelism".

Contributor Author

Hi @typhoonzero, I just made some comments on #2047.

Contributor Author

@helinwang I just discussed this offline with @typhoonzero and updated #2047; please take a look as well :)

Contributor Author

Done. Both ways are supported: specifying the resources for the whole job, or specifying the trainer and pserver resources separately.

gpu_num|NO|0|total GPU count used by all trainers
pserver_bucket|NO|mini|the [pserver-bucket](#pserver-resource) to use for the PServer resources

### Specify Resources for a Distributed Training Job
- PServer Resource
  - Specify `pserver_bucket`:
    - `pserver_bucket=single`: a single PServer process; suitable for learning how to use Paddle Cloud.
Member

> how to Paddle Cloud

how to use Paddle Cloud?

Contributor Author

Removed this parameter; the wording wasn't clear. It means that Paddle Cloud automatically calculates the number of PServers based on the cluster architecture and resource situation.

    - `pserver_bucket=medium`: a moderate number of PServer processes.
    - `pserver_bucket=large`: a large number of PServer processes.

- Trainer Resource (see the configuration sketch below)
  - You *may* specify `gpu_num`, the total GPU count used by all trainers. By default, the trainer count equals the GPU count.
  - You *must* specify `cpu_num`, the total CPU count used by all trainers. If `gpu_num=0`, the trainer count equals the CPU count.
  - You *must* specify `memory`, the memory allocated for each trainer. You can express memory as a plain integer with one of these suffixes: E, P, T, G, M, K.
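
For illustration only, the two configurations below follow the rules above; the image name, package path, and resource values are placeholders, not recommended settings.

```python
# Hypothetical CPU-only configuration: gpu_num defaults to 0, so the trainer
# count follows cpu_num.
cpu_job = PaddleJob(
    job_name="word2vec-cpu",
    base_image="yancey1989/paddle-cloud",
    trainer_package="/example/word2vec",
    entry_point="python trainer.py",
    cpu_num=4,          # 4 trainers in CPU mode
    memory="1G")

# Hypothetical GPU configuration: the trainer count follows gpu_num by default.
gpu_job = PaddleJob(
    job_name="word2vec-gpu",
    base_image="yancey1989/paddle-cloud",
    trainer_package="/example/word2vec",
    entry_point="python trainer.py",
    cpu_num=4,
    gpu_num=4,          # 4 trainers, each with one GPU, by default
    memory="2G")
```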

### Deploy Parameter Server, Trainer and Master Processes
- Deploy the PaddlePaddle Parameter Server processes as a Kubernetes ReplicaSet.
- Deploy the PaddlePaddle Trainer processes as a Kubernetes Job.
- Deploy the PaddlePaddle Master processes as a Kubernetes ReplicaSet.
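
The following is a minimal, hypothetical sketch of the Kubernetes objects that could be generated for a job named `paddle-job`: a ReplicaSet for the parameter servers and a Job for the trainers (the Master ReplicaSet would look like the PServer one). The API versions, labels, commands, image name, and counts are illustrative assumptions, not the actual manifests.

```python
# Hypothetical sketch of generated Kubernetes manifests; labels, commands,
# image names, and counts are assumptions for illustration.
import yaml


def pserver_replicaset(job_name, image, pserver_num):
    return {
        "apiVersion": "extensions/v1beta1",
        "kind": "ReplicaSet",
        "metadata": {"name": "%s-pserver" % job_name},
        "spec": {
            "replicas": pserver_num,
            "template": {
                "metadata": {"labels": {"paddle-job": job_name, "role": "pserver"}},
                "spec": {"containers": [{
                    "name": "pserver",
                    "image": image,
                    "command": ["paddle", "pserver"],  # assumed start command
                }]},
            },
        },
    }


def trainer_job(job_name, image, trainer_num, entry_point):
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "%s-trainer" % job_name},
        "spec": {
            "parallelism": trainer_num,
            "completions": trainer_num,
            "template": {
                "metadata": {"labels": {"paddle-job": job_name, "role": "trainer"}},
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        "command": ["sh", "-c", entry_point],
                    }],
                },
            },
        },
    }


print(yaml.safe_dump(pserver_replicaset("paddle-job", "yancey1989/paddle-cloud", 2)))
print(yaml.safe_dump(trainer_job("paddle-job", "yancey1989/paddle-cloud", 3, "python trainer.py")))
```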

## Job Server
Contributor

Does the first version need to implement the Job Server? https://github.com/PaddlePaddle/cloud/wiki/2017-05 mentions "check the training status" — does that mean the website fetches information through the Job Server API and shows the user the job list and job logs?

Contributor Author

For this version, the plan is to implement part of the JobServer interfaces directly in the website backend, for example checking the training status and the job list.


- RESTful API

The Job Server provides a RESTful HTTP API for receiving the trainer package and displaying PaddlePaddle job related information (a hypothetical client sketch follows this list):
- `POST /v1/package` receives the trainer package and saves it on CephFS
- `POST /v1/trainer/job` submits a trainer job
- `GET /v1/jobs/` lists all jobs
- `GET /v1/jobs/<job-name>` returns the status of a job
- `DELETE /v1/jobs/<job-name>` deletes a job
- `GET /v1/version` returns the Job Server version
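
As an illustration, a hypothetical Python client for the endpoints listed above might look like the sketch below; the Job Server address is an assumption and the response layout is not specified by this design.

```python
# Hypothetical client for the Job Server status endpoints; the address and
# response handling are assumptions.
import requests

JOB_SERVER = "http://job-server:8080"  # assumed address


def list_jobs():
    return requests.get(JOB_SERVER + "/v1/jobs/").json()


def job_status(job_name):
    return requests.get(JOB_SERVER + "/v1/jobs/%s" % job_name).json()


def delete_job(job_name):
    resp = requests.delete(JOB_SERVER + "/v1/jobs/%s" % job_name)
    resp.raise_for_status()


def job_server_version():
    return requests.get(JOB_SERVER + "/v1/version").json()
```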

- Build Runtime Docker Image on Kubernetes

`paddle.job.dist_train` will upload the trainer package to the Job Server, save it on the distributed filesystem, and then start a job that builds the runtime Docker image, which Kubernetes schedules to run during training.

There are several benefits to building the runtime Docker image on the Job Server:
Contributor

Did we finally decide to build the image on the JobServer or to build it locally?

Contributor Author

For this iteration, since all of the code lives on cloud storage, we won't build an image; directly copying the trainer_package is simpler. Based on the earlier discussion, building the runtime Docker image on the JobServer later still seems more appropriate.

Contributor Author

Done.

- On Paddle Cloud, users run the trainer code in a Jupyter Notebook, which is a Kubernetes Pod. If we wanted to execute `docker build` in that Pod, we would have to mount the host's `docker.sock` into the Pod, and the user's code could then connect to the host's Docker Engine directly, which is not safe.
- Users only need to upload the trainer package files; they do not need to install Docker Engine or a Docker registry as dependencies.
- If we switch to another image type, such as rkt, users do not need to care about it.

- Deploy Parameter Server, Trainer and Master Processes

`POST /v1/trainer/job` receives the distributed training parameters and deploys the job as follows:
- Deploy the PaddlePaddle Parameter Server processes as a Kubernetes ReplicaSet.
- Deploy the PaddlePaddle Trainer processes as a Kubernetes Job.
- Deploy the PaddlePaddle Master processes as a Kubernetes ReplicaSet.
Contributor

There is only one master; do we need a ReplicaSet (or is something else more suitable)? I'm not entirely sure which is better for deployment, so I'm asking for advice.

Contributor Author

A ReplicaSet is the simpler choice; the main reason is that when the Master node goes down, Kubernetes will reschedule the Master, whereas a plain Pod would lose this property.

Contributor

OK, understood!