Closed
Changes from 20 commits
Commits
37 commits
3331bda
submit job
Yancey0623 Apr 11, 2017
7faa331
small png
Yancey0623 Apr 11, 2017
718f901
update
Yancey0623 Apr 15, 2017
fd9e1c2
add paddlepaddle commands
Yancey0623 Apr 15, 2017
c095a93
update png
Yancey0623 Apr 15, 2017
d74d9ba
udpate png
Yancey0623 Apr 15, 2017
f4c7bd2
adjust sytle
Yancey0623 Apr 15, 2017
bb7263f
update submit-job
Yancey0623 Apr 18, 2017
005c3e1
update
Yancey0623 Apr 18, 2017
0f113d3
update
Yancey0623 Apr 20, 2017
b6969f9
update paddle server
Yancey0623 Apr 21, 2017
1771707
update
Yancey0623 Apr 25, 2017
a21743a
update
Yancey0623 Apr 25, 2017
d643295
resize image
Yancey0623 Apr 25, 2017
5827dc1
update
Yancey0623 Apr 25, 2017
02d18b2
update
Yancey0623 Apr 27, 2017
68ff895
update
Yancey0623 Apr 27, 2017
1fd8900
udpate image location
Yancey0623 Apr 27, 2017
a57bb04
update
Yancey0623 Apr 28, 2017
1987b45
rename direcotry
Yancey0623 Apr 28, 2017
6f097a3
update
Yancey0623 Apr 29, 2017
bfdd1a3
trainer use replicaset instead of statefulset
Yancey0623 May 4, 2017
b56e7e7
update
Yancey0623 May 4, 2017
b8e63d9
update
Yancey0623 May 5, 2017
6cbf80d
update
Yancey0623 May 5, 2017
cb39a81
update
Yancey0623 May 6, 2017
063805d
update
Yancey0623 May 6, 2017
e2e6875
update
Yancey0623 May 6, 2017
5ec1deb
update
Yancey0623 May 6, 2017
53b5afa
update
Yancey0623 May 6, 2017
198d0d1
update
Yancey0623 May 7, 2017
838509b
update
Yancey0623 May 9, 2017
259731a
update
Yancey0623 May 11, 2017
a5a0aeb
update
Yancey0623 May 12, 2017
080e633
trainer function
Yancey0623 May 12, 2017
05d6e00
delete specify resource paragraph
Yancey0623 May 12, 2017
8486227
paramter image instead of base_image and runtime_image
Yancey0623 May 12, 2017
111 changes: 111 additions & 0 deletions doc/design/cluster_train/submit-job.md
@@ -0,0 +1,111 @@
# Submit a Distributed Training Job

If a user wants to start up a distributed training job, he will submit the distributed training job with python code.
Contributor

"Python" should be capitalized.

Contributor Author

Done


If a user wants to start up a local train, he will start up a PaddlePaddle production Docker container firstly, and then
execute `python train.py` in the Docker container.The details about PaddlePaddle Docker image is [here](../../../paddle/scripts/docker/README.md)
Contributor

A space is missing after the period.

Contributor Author

Done.

Contributor

The sentence should end with a period (".").

Contributor Author

Done.


## Runtime Environment On Kubernetes

For a distributed training job, there are two Docker images, called **runtime Docker image** and **base Docker image**. The runtime Docker image is the Docker image that gets scheduled by Kubernetes to run during training. The base Docker image is used to build the runtime Docker image.
Contributor

Sorry, I misspoke earlier: my comment said italics, but the Markdown I gave was actually bold. New terms should be in italics:

*runtime Docker image*
*base Docker image*

Contributor Author

Done.


- Base Docker Image

Usually, the base Docker image is PaddlePaddle product Docker image including paddle binary files and trainer startup script file. And of course, users can specify any image name hosted on any docker registry which users have the right access.
Contributor

have the right access -> have the access right.

Contributor Author

Done.


- Runtime Docker Image

Package the trainer package which user upload and some python dependencies into a runtime Docker image base on base docker image, this is done automatically by Job Server.
Contributor

The sentence seems to be missing a subject. How about changing it to:

Package the trainer package which user upload and some python dependencies into a runtime Docker image base on base docker image -> The trainer package which user upload and some python dependencies are packaged into a runtime Docker image based on base docker image

Contributor Author

Done.


- Python Dependencies
Contributor

Here, Python Dependencies is not parallel to Base / Runtime Docker image. Could it become a subheading instead?

### Handle Python Dependencies

Contributor Author

Done. Wouldn't `-` be more suitable, though? A subheading would have to be `#### Handler Python Dependencies`, which is too deep a level.

Contributor

@Yancey1989 The main issue is that the two neighboring items starting with `-` are parallel to each other, so adding another item starting with `-` here doesn't feel right. Anything else would be fine with me.

Contributor

I've seen the updated version; it looks fine.


You need to provide requirments.txt file in your "trainer" package. Example:
Contributor

-> requirements.txt

Contributor Author

Done.

```txt
pillow
protobuf==3.1.0
```
More [details](https://pip.readthedocs.io/en/1.1/requirements.html) about requirements.

An example project looks like:
```bash
paddle_example
|-quick_start
|-trainer.py
|-dataset.py
|-requirments.txt
Contributor

-> requirements.txt

Contributor Author

Done.

```

## Submit Distributed Training Job With Python Code
<img src="./src/submit-job-python.png" width="800">
Contributor

Need a paragraph to describe the flow in a big picture. The paragraph below goes directly to the detail of paddle.dist_train, the reader needs a big picture to follow the concept.

Contributor Author

Done.

Contributor

Docker build / Docker push should be labels on the arrows, right? They are actions, so they belong on the lines rather than in the boxes (the boxes in this diagram are basically all nouns or concepts).

Contributor Author

Done.


You can call `paddle.dist_train` and provide distributed training configuration as the parameters.
```python
paddle.dist_train(
    trainer=paddle.trainer.SGD(...,
                               paddle.updater.Adam(...)),
    reader=reader,
    paddle_job=PaddleJob(
        job_name="quickstart",
        pservers=4,
Contributor

pservers=4 together with cpu=1 doesn't seem very reasonable... With cpu=1 or gpu=1, no pserver is needed.

Contributor Author

Wouldn't the case without a pserver count as single-node training on the cluster? Maybe that needs a separate API, because even with CPU=10 the user may only want single-node training with CPU=10. Also, I opened an issue, #2019, to discuss how to specify resources; I feel pservers could be dropped as well.

Contributor Author

Done, added a description of resource usage.

        volume="quickstart",
        input="/quickstart/input",
Contributor

My understanding is that we will automatically mount the user's root directory at a designated location (e.g. /home/), so input and output are no longer needed here, right?

Contributor Author

Done.

        output="/quickstart/output",
        base_image="paddlepaddle/paddle:0.10.rc2",
        use_gpu=False,
        memory="512M")
)
```

The pseudo code of `paddle.dist_train` is as follows:
```python
import os

def dist_train(trainer, reader, num_passes=1, event_handler=None, feeding=None, paddle_job=None):
    if os.getenv("PADDLE_NOTEBOOK", "NO") == "YES":
        # submit the paddle job
        paddle_job.submit()
    else:
        # start the training
        trainer.train(reader, num_passes, event_handler, feeding)
```
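
To make the branching in the pseudocode concrete, here is a minimal runnable sketch. It is not the PaddlePaddle implementation: `FakeTrainer` and `FakeJob` are hypothetical stand-ins for `paddle.trainer.SGD` and `PaddleJob`, and the string return values are added only so the chosen branch can be observed.

```python
import os


class FakeTrainer:
    """Hypothetical stand-in for a trainer; records calls to train()."""

    def __init__(self):
        self.calls = []

    def train(self, reader, num_passes, event_handler, feeding):
        self.calls.append((reader, num_passes, event_handler, feeding))


class FakeJob:
    """Hypothetical stand-in for PaddleJob; records whether submit() ran."""

    def __init__(self):
        self.submitted = False

    def submit(self):
        self.submitted = True


def dist_train(trainer, reader, num_passes=1, event_handler=None,
               feeding=None, paddle_job=None):
    # Same branching as the pseudocode: the PADDLE_NOTEBOOK environment
    # variable decides between submitting the job and training in place.
    if os.getenv("PADDLE_NOTEBOOK", "NO") == "YES":
        paddle_job.submit()
        return "submitted"  # return value is an illustration-only addition
    trainer.train(reader, num_passes, event_handler, feeding)
    return "trained"
```

With `PADDLE_NOTEBOOK=YES` only `submit()` is called; with any other value the same script runs `trainer.train(...)` directly, which is what allows one training script to serve both roles.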

parameter | required | default | explain
Contributor

explain -> explanation

Contributor Author

Done.

--- | --- | --- | ---
job_name|YES||a unique job name within the namespace
trainer_package|YES|| entry point for startup trainer process
Contributor

Does trainer_package refer to which folder to upload? That isn't an entrypoint, is it? As I understand it, an entrypoint is a command.
What I have in mind is something like this:

(..., trainer_package="/path/to/folder", entrypoint="python train.py")

Actually, I'm not sure whether this is needed at all. Could trainer_package simply be the directory of the current Python file, and entrypoint be `python` plus the trainer Python file?

Contributor Author

Sorry, a line was missing here. The Kubernetes Pod must be able to access the directory that trainer_package points to, so trainer_package should be a directory on CephFS or inside the Docker image; using the directory of the current Python file probably won't work.

entrypoint can simply be "python trainer %s" % __file__

Contributor Author

Done.

input| YES || input directory on distributed file system
output|YES|| output directory on distributed file system
pservers|YES|| parameter server process count
Contributor

Is pservers still needed? I don't see it in the code example above.

Contributor Author

@Yancey0623 Yancey0623 May 6, 2017

It can be an optional parameter; I added the Resource section.
Done

base_image|YES||PaddlePaddle production Docker image
memory|YES|| memory limit
use_gpu|NO|false| whether to use GPU
cpu_num|NO|1| CPU count; takes effect when `use_gpu=false`
gpu_num|NO|1| GPU count; takes effect when `use_gpu=true`
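
The parameter table can be read as a small typed record. The following is only a sketch under assumptions: `PaddleJobSpec` and its `resources()` method are hypothetical names, the non-empty `job_name` check is an illustrative validation, and the shape of the returned resource dictionary is not defined by this design.

```python
from dataclasses import dataclass


@dataclass
class PaddleJobSpec:
    # Required parameters, in the order listed in the table.
    job_name: str
    trainer_package: str
    input: str
    output: str
    pservers: int
    base_image: str
    memory: str
    # Optional parameters with the defaults from the table.
    use_gpu: bool = False
    cpu_num: int = 1
    gpu_num: int = 1

    def __post_init__(self):
        # Illustrative check only; the design does not specify validation.
        if not self.job_name:
            raise ValueError("job_name must be non-empty")

    def resources(self):
        """Pick cpu_num or gpu_num depending on use_gpu (hypothetical shape)."""
        if self.use_gpu:
            return {"gpu": self.gpu_num, "memory": self.memory}
        return {"cpu": self.cpu_num, "memory": self.memory}
```

The point of the sketch is the `use_gpu` switch: only one of `cpu_num` and `gpu_num` is relevant for a given job, which matches the last two rows of the table.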

- Startup Parameter Server and Trainer Jobs
- Deploy parameter server job, it's a Kubernetes StatefulSet.
Contributor

Would a ReplicaSet work for the parameter server? Why is a StatefulSet needed? If a simple approach works, let's not use a complex one.

Contributor Author

The benefit of a StatefulSet is that when a Parameter Server Pod dies, the Pod that Kubernetes starts in its place keeps the same hostname as the previous Pod, so the trainer does not need to fetch the address of the new Parameter Server Pod.

Contributor

The design doc now describes how service discovery is done, and it does not require the hostname or IP to stay the same.

Contributor Author

Done. Since service discovery uses IP addresses, keeping the hostname unchanged is no longer necessary, so I replaced the StatefulSet with a simple ReplicaSet.

- Deploy trainer job, it's a Kubernetes Job.

# Job Server

- RESTful API

Job server provides a RESTful HTTP server receives the trainer packages, list PaddlePaddle job etc...
Contributor

Should "list PaddlePaddle job etc..." be changed to something more general, such as "display job related informations"? Writing "etc..." is a bit vague and doesn't seem appropriate in a design doc.

Contributor Author

Done.

- `POST /v1/package` receive the trainer package and save them on GlustereFS
Contributor

GlustereFS -> CephFS

Contributor Author

Done.

- `POST /v1/trainer/job` submit a trainer job
- `GET /v1/jobs/` list all job
Contributor

list all job -> list all jobs

Contributor Author

Done.

- `GET /v1/jobs/<job-name>` the status of a job
- `DELETE /v1/jobs/<job-name>` cancel a job
Contributor

Should the logs of a cancelled job still be visible to the user? If so, Cancel seems to differ from Delete: Delete removes the job, while Cancel changes its state to cancelled. I'm not sure HTTP DELETE is still appropriate in that case.

Contributor Author

I think we can consider only Delete Job in the early stage; the logs can be stored on GlusterFS, or written to Elasticsearch through Logstash and kept for a period of time.

- `GET /v1/version` job server version
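
As a rough illustration of how a client might address these endpoints, the sketch below builds (but does not send) HTTP requests with Python's standard library. The paths and methods come from the list above; the function name `build_request`, the action names, the base URL, and the JSON request body are all assumptions, since this design does not specify a payload format.

```python
import json
import urllib.request
from urllib.parse import quote

# Method and path for each Job Server endpoint listed above.
ROUTES = {
    "upload_package": ("POST", "/v1/package"),
    "submit_job": ("POST", "/v1/trainer/job"),
    "list_jobs": ("GET", "/v1/jobs/"),
    "job_status": ("GET", "/v1/jobs/{job_name}"),
    "cancel_job": ("DELETE", "/v1/jobs/{job_name}"),
    "version": ("GET", "/v1/version"),
}


def build_request(base_url, action, job_name=None, payload=None):
    """Return an unsent urllib Request for the given action (hypothetical helper)."""
    method, path = ROUTES[action]
    if "{job_name}" in path:
        if not job_name:
            raise ValueError("job_name is required for " + action)
        # Percent-encode the job name so it is safe inside a URL path.
        path = path.format(job_name=quote(job_name, safe=""))
    data = None
    headers = {}
    if payload is not None:
        # JSON encoding is an assumption made for this sketch.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)
```

For example, `build_request("http://jobserver.example", "cancel_job", job_name="quickstart")` yields a `DELETE` request for `/v1/jobs/quickstart`; passing the result to `urllib.request.urlopen` would actually send it.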

- Build Runtime Docker Image on Kubernetes

`paddle.dist_train` uploads the trainer package to the Job Server, which saves it on the distributed filesystem and then starts a job that builds the runtime Docker image. The Parameter Server and Trainer both use this runtime Docker image.

Building the runtime Docker image on the JobServer has several benefits:
Contributor

Did we finally decide to build the image on the JobServer or to build it locally?

Contributor Author

In this phase all the code lives on cloud storage, so we skip building an image and simply copy trainer_package, which is simpler. Based on the earlier discussion, building the runtime Docker image on the JobServer still seems the better fit later on.

Contributor Author

Done.

- **Docker in Docker** should mount `docker.sock` in the container and set `--privileged`, if the code running in a kubernetes pod, it's not safety.
Contributor

it's not safety -> it's not safe

Contributor Author

Done.

- Users only need to upload the training package files, does not dependency docker engine, docker registry.
Contributor

does not dependency docker engine, docker registry. -> does not need to install docker engine, docker registry as dependencies.

Contributor Author

Done.

- If we switch to another container image type, such as rkt, users do not need to care about it.
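
One way to picture the image build step is as generating a Dockerfile from the base image and the uploaded trainer package. The sketch below is an assumption about how that could look, not the Job Server's actual build logic: the function name, the `trainer` directory name, and the `/workspace` path are hypothetical.

```python
def render_runtime_dockerfile(base_image, package_dir="trainer",
                              workdir="/workspace"):
    """Render Dockerfile text for a runtime image (hypothetical helper).

    The image starts from the base image, copies the trainer package in,
    and installs the Python dependencies listed in its requirements.txt.
    """
    lines = [
        f"FROM {base_image}",
        f"COPY {package_dir}/ {workdir}/",
        f"WORKDIR {workdir}",
        "RUN pip install -r requirements.txt",
    ]
    return "\n".join(lines) + "\n"
```

Because the file is generated and built on the server side, the user only supplies the package and a base image name, which is the second benefit in the list above.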

- Start Up Parameter Server and Trainer Jobs
`POST /v1/trainer/job` receives the distributed training parameters and deploys the job as follows:
- Deploy the pserver job, which is a Kubernetes StatefulSet.
- Deploy the trainer job, which is a Kubernetes Job.
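
The two deployments can be sketched as Kubernetes manifests built as plain Python dictionaries. This is only an illustration under assumptions: the function names, resource names, labels, and the `trainers` parameter are hypothetical, and `apps/v1` is the current StatefulSet API group rather than the one available when this document was written. Following the review discussion, the pserver workload could equally be a ReplicaSet.

```python
def pserver_manifest(job_name, image, pservers):
    """StatefulSet manifest for the parameter servers (hypothetical layout)."""
    labels = {"paddle-job": job_name, "role": "pserver"}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": f"{job_name}-pserver"},
        "spec": {
            "serviceName": f"{job_name}-pserver",
            "replicas": pservers,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": "pserver", "image": image}]},
            },
        },
    }


def trainer_manifest(job_name, image, trainers):
    """Job manifest for the trainers (hypothetical layout)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"{job_name}-trainer"},
        "spec": {
            "parallelism": trainers,
            "completions": trainers,
            "template": {
                "spec": {
                    "containers": [{"name": "trainer", "image": image}],
                    # Jobs require restartPolicy to be Never or OnFailure.
                    "restartPolicy": "Never",
                },
            },
        },
    }
```

Both manifests reference the same runtime Docker image, which matches the statement that the Parameter Server and Trainer share one image.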