Skip to content
Closed
Show file tree
Hide file tree
Changes from 33 commits
Commits
Show all changes
37 commits
Select commit Hold shift + click to select a range
3331bda
submit job
Yancey0623 Apr 11, 2017
7faa331
small png
Yancey0623 Apr 11, 2017
718f901
update
Yancey0623 Apr 15, 2017
fd9e1c2
add paddlepaddle commands
Yancey0623 Apr 15, 2017
c095a93
update png
Yancey0623 Apr 15, 2017
d74d9ba
udpate png
Yancey0623 Apr 15, 2017
f4c7bd2
adjust sytle
Yancey0623 Apr 15, 2017
bb7263f
update submit-job
Yancey0623 Apr 18, 2017
005c3e1
update
Yancey0623 Apr 18, 2017
0f113d3
update
Yancey0623 Apr 20, 2017
b6969f9
update paddle server
Yancey0623 Apr 21, 2017
1771707
update
Yancey0623 Apr 25, 2017
a21743a
update
Yancey0623 Apr 25, 2017
d643295
resize image
Yancey0623 Apr 25, 2017
5827dc1
update
Yancey0623 Apr 25, 2017
02d18b2
update
Yancey0623 Apr 27, 2017
68ff895
update
Yancey0623 Apr 27, 2017
1fd8900
udpate image location
Yancey0623 Apr 27, 2017
a57bb04
update
Yancey0623 Apr 28, 2017
1987b45
rename direcotry
Yancey0623 Apr 28, 2017
6f097a3
update
Yancey0623 Apr 29, 2017
bfdd1a3
trainer use replicaset instead of statefulset
Yancey0623 May 4, 2017
b56e7e7
update
Yancey0623 May 4, 2017
b8e63d9
update
Yancey0623 May 5, 2017
6cbf80d
update
Yancey0623 May 5, 2017
cb39a81
update
Yancey0623 May 6, 2017
063805d
update
Yancey0623 May 6, 2017
e2e6875
update
Yancey0623 May 6, 2017
5ec1deb
update
Yancey0623 May 6, 2017
53b5afa
update
Yancey0623 May 6, 2017
198d0d1
update
Yancey0623 May 7, 2017
838509b
update
Yancey0623 May 9, 2017
259731a
update
Yancey0623 May 11, 2017
a5a0aeb
update
Yancey0623 May 12, 2017
080e633
trainer function
Yancey0623 May 12, 2017
05d6e00
delete specify resource paragraph
Yancey0623 May 12, 2017
8486227
paramter image instead of base_image and runtime_image
Yancey0623 May 12, 2017
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Binary file not shown.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
161 changes: 161 additions & 0 deletions doc/design/cluster_train/submit-job.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,161 @@
# Submit a Distributed Training Job

The user can submit a distributed training job with Python code, rather than with a command-line interface.

## Runtime Environment On Kubernetes

For a distributed training job, there is two Docker image called *runtime Docker image* and *base Docker image*. The runtime Docker image is the Docker image that gets scheduled by Kubernetes to run during training. The base Docker image is for building the runtime Docker image.

### Base Docker Image

Usually, the base Docker image is PaddlePaddle product Docker image including paddle binary files and python package. And of course, users can specify any image name hosted on any docker registry which users have the access right.

### Runtime Docker Image

The trainer package which user upload and some Python dependencies are packaged into a runtime Docker image based on base Docker image.

- Handle Python Dependencies

You need to provide requirements.txt file in your `trainer-package` folder. Example:

```txt
pillow
protobuf==3.1.0
```
More [details](https://pip.readthedocs.io/en/1.1/requirements.html) about requirements, an example project looks like:
```bash
paddle_example
|-quick_start
|-trainer.py
|-dataset.py
|-requirements.txt
```

## Submit Distributed Training Job With Python Code
<img src="./src/submit-job-python.png" width="800">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a paragraph to describe the flow in a big picture. The paragraph below goes directly to the detail of paddle.dist_train, the reader needs a big picture to follow the concept.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docker build / Docker push应该是箭头那根线上的东西吧,这些都是动作,感觉应该放在线上,而不是图上(这里的图基本都是名词/一个概念)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


- `paddle.job.dist_train()` will call the Job Server API `/v1/packages` to upload the trainer package and save them on CephFS, and then call `/v1/trainer/job` to submit the PaddlePaddle distributed job.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

能不能查一下为什么paddle.job.dist_train()这些都没被正常的语法高亮,显示出来的是这样的:
screen shot 2017-05-05 at 11 40 19 am

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Miss a blank line...Done.

Copy link
Contributor

@helinwang helinwang May 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为了把我们第一个版本说得具体,能不能这样改:change "The first version, we only implement submit the PaddlePaddle job in paddle.job.dist_train()" to "For the first version, we will not prepare the runtime docker image, instead, we will mount the trainer package in a temporary folder into the base docker image. We will not support custom Python dependencies in the first version as well."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

- `/v1/trainer/job` will start a building job for preparing the runtime Docker image. When the building job is finished, Job Server will submit the PaddlePaddle distributed job to Kubernetes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

第一版本是通过JobServer来submit job,还是本地Python直接来?

Copy link
Contributor Author

@Yancey0623 Yancey0623 May 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

第一版就本地直接提交了,本地提交需要build runtime Docker image,这里补充了一句。
Done.

- *NOTE*: For the first version, we will not prepare the runtime Docker image on JobServer, instead, we will build the runtime Docker image on our host. If the code is running on PaddleCloud, we will mount the trainer package in a temporary folder into the base Docker image. We will not support custom Python dependencies in the first version as well.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"instead, we will build the runtime Docker image on our host"和后面的“we will mount the trainer package in a temporary folder into the base Docker image”有冲突,前面是说要build runtime image,后面说是直接用base image就好。
建议改成

NOTE: For the first version, we will not prepare the runtime Docker image, instead, the package is uploaded to Paddle cloud, and Paddle cloud will mount the package in a temporary folder into the base Docker image. We will not support custom Python dependencies in the first version as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Paddle cloud => Paddle Cloud
Done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

另外貌似用户不需要指定runtime Docker image, 所以在参数中也只保留了image,去掉了base_imageruntime_image


You can call `paddle.job.dist_train` and provide distributed training configuration as the parameters:
```python
paddle.job.dist_train(
trainer=paddle.trainer.SGD(...,
paddle.updater.Adam(...)),
reader=reader,
paddle_job=PaddleJob(
runtime_image = "yancey1989/paddle-job",
job_name="paddle-job",
namespace="paddle-cloud",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不理解为什么需要namespace,我以为job_name就可以成为namespace了?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

现在Kubernetes集群上的做法是为每个用户创建一个namespace,一个namespace下会跑多个Job。 不过本地提交可以在~/.kube/config中获取namespace,PaddleCloud也可以通过环境变量来获取,所以这个参数可以去掉了:)
Done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

另外我在实现PaddleJob的时候发现dist_train可能只需要trainer, paddle_job两个参数即可,其中tariner是一个的function, PR中做了说明。
Done.

cpu_nums=3,
memory="1G"
trainer_package="/example/word2vec",
entry_point="python %s" % __file__))
```

The pseudo code of `paddle.job.dist_train` is as follows:
```python
def dist_train(trainer, reader, num_passes=1, event_handler=None, feeding=None, paddle_job=None):
# if the code is running on cloud, set PADDLE_ON_CLOUD=YES
if os.getenv("PADDLE_ON_CLOUD", "NO") == "NO":
#submit the paddle job
paddle_job.submit()
else:
#start the training
trainer.train(reader, num_passes, event_handler, feeding)
```
### PaddleJob Parameters
parameter | type | explanation
--- | --- | ---
job_name | str | the unique name for the training job
entry_point | str | entry point for startup trainer process
trainer_package | str | trainer package file path which user have the access right
base_image|str|the [base image](#base-docker-image) for building the [runtime image](#runtime-docker-image)
runtime_image|str| [runtime image](#runtime-docker-image)
memory|str| memory allocated for the job, a plain integer using one of these suffixes: E, P, T, G, M, K
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已经有trainer_mem,memroy是不是可以删了。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
去掉了第一种配置资源的方式。

cpu_nums|int| CPU count for the job
gpu_nums|int| GPU count for the job
pservers|int| Parameter Server process count
trainers|int| Trainer process count
pserver_cpu|int| CPU count for each Parameter Server process
pserver_mem|str| memory allocated for each Parameter Server process, a plain integer using one of these suffixes: E, P, T, G, M, K
trainer_cpu|int| CPU count for each Trainer process
trainer_mem|str| memory allocated for each Trainer process, a plain integer using one of these suffixes: E, P, T, G, M, K

### Specify Resource for a Distributed Training Job
- Specify Job Resource
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里讲了两种方法,我感觉我们只想清楚了第二种,也就准备支持第二种。那么第一种我们先删掉吧:)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

You can specify the resource for a job totally used, and the trainer count and pserver count will be calculated according to the job resources and Kubernetes physical architecture.
- you *may* specify `gpu_nums`, GPU count for the job totally used.
- you *must* specify `cpu_nums`, CPU count for the job totally used.
- you *must* specify `memory`, memory allocated for the job.
Example:
```python
paddle_job = PaddleJob(
job_name = "paddle-cloud",
entry_point = "python train.py",
trainer_package = "/example/word2vec",
runtime_image = "yancey1989/paddle-job",
memory = "10G",
cpu_nums = 3,
gpu_nums = 3
)
```
- Specify Trainer and Parameter Server Resource
- you *must* specify `trainers`, Trainer count for the job.
- you *must* specify `pservers`, Parameter Server count for the job.
- you *must* specify `trainer_cpu`, CPU count for each Trainer process.
- you *must* specify `trainer_mem`, memory allocated for each Trainer process.
- you *must* specify `pserver_cpu`, CPU count for each Parameter Server process.
- you *must* specify `pserver_mem`, memory allocated for each Parameter Server process.
- you *may* specify `trainer_gpu`, GPU count for each Trainer process.
Example:
```python
paddle_job = PaddleJob(
job_name = "paddle-cloud",
entry_point = "python train.py",
trainer_package = "/example/word2vec",
runtime_image = "yancey1989/paddle-job",
trainers = 10,
pservers = 3,
trainer_cpu = 1,
trainer_gpu = 1,
trainer_mem = "10G",
pserver_cpu = 1,
pserver_mem = "2G"
)
```

### Deploy Parameter Server, Trainer and Master Process
- Deploy PaddlePaddle Parameter Server processes, it's a Kubernetes ReplicaSet.
- Deploy PaddlePaddle Trainer processes, it's a Kubernetes Job.
- Deploy PaddlePaddle Master processes, it's a Kubernetes ReplicaSet.

## Job Server
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

第一版本需要实现job server吗?https://github.com/PaddlePaddle/cloud/wiki/2017-05 中提到了“查看训练状态”是指网站通过job server的API取得信息,显示给用户看任务列表和任务log吗?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这一版是想在网站后台直接实现JobServer的部分接口,例如查看训练状态和任务列表。


- RESTful API

Job server provides RESTful HTTP API for receiving the trainer package and displaying
PaddlePaddle job related informations.
- `POST /v1/package` receive the trainer package and save them on CephFS
- `POST /v1/trainer/job` submit a trainer job
- `GET /v1/jobs/` list all jobs
- `GET /v1/jobs/<job-name>` the status of a job
- `DELETE /v1/jobs/<job-name>` delete a job
- `GET /v1/version` job server version

- Build Runtime Docker Image on Kubernetes

`paddle.job.dist_train` will upload the trainer package to Job Server, save them on the distributed filesystem, and then start up a job for building the runtime Docker image that gets scheduled by Kubernetes to run during training.

There are some benefits for building runtime Docker image on JobServer:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我们最后觉得是在JobServer上build image还是在本地build image?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这一期因为所有代码都是在云端存储上,所以就不build image了直接copy trainer_package比较简单。根据之前的讨论后续还是在JobServer上来build runtime Docker image感觉比较合适。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

- On Paddle Cloud, users will run the trainer code in a Jupyter Notebook which is a Kubernetes Pod, if we want to execute `docker build` in the Pod, we should mount the host's `docker.sock` to the Pod, user's code will connect the host's Docker Engine directly, it's not safe.
- Users only need to upload the training package files, does not need to install docker engine, docker registry as dependencies.
- If we want to change another image type, such as RKT, users do not need to care about it.

- Deploy Parameter Server, Trainer and Master Processes

`POST /v1/trainer/job` receives the distributed training parameters, and deploy the job as follows:
- Deploy PaddlePaddle Parameter Server processes, it's a Kubernetes ReplicaSet.
- Deploy PaddlePaddle Trainer processes, it's a Kubernetes Job.
- Deploy PaddlePaddle Master processes, it's a Kubernetes ReplicaSet.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

只有一个master,需要ReplicaSet吗?(或者什么别的更合适一点),我不是特清楚部署的时候用哪个比较好,请教一下。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReplicaSe比较简单了,主要是为了当Master节点挂掉时,Kubernetes会重新将Master调度起来,用Pod将会失去这个特性。

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

好的,明白了!