Closed
Changes from 23 commits
37 commits
3331bda
submit job
Yancey0623 Apr 11, 2017
7faa331
small png
Yancey0623 Apr 11, 2017
718f901
update
Yancey0623 Apr 15, 2017
fd9e1c2
add paddlepaddle commands
Yancey0623 Apr 15, 2017
c095a93
update png
Yancey0623 Apr 15, 2017
d74d9ba
udpate png
Yancey0623 Apr 15, 2017
f4c7bd2
adjust sytle
Yancey0623 Apr 15, 2017
bb7263f
update submit-job
Yancey0623 Apr 18, 2017
005c3e1
update
Yancey0623 Apr 18, 2017
0f113d3
update
Yancey0623 Apr 20, 2017
b6969f9
update paddle server
Yancey0623 Apr 21, 2017
1771707
update
Yancey0623 Apr 25, 2017
a21743a
update
Yancey0623 Apr 25, 2017
d643295
resize image
Yancey0623 Apr 25, 2017
5827dc1
update
Yancey0623 Apr 25, 2017
02d18b2
update
Yancey0623 Apr 27, 2017
68ff895
update
Yancey0623 Apr 27, 2017
1fd8900
udpate image location
Yancey0623 Apr 27, 2017
a57bb04
update
Yancey0623 Apr 28, 2017
1987b45
rename direcotry
Yancey0623 Apr 28, 2017
6f097a3
update
Yancey0623 Apr 29, 2017
bfdd1a3
trainer use replicaset instead of statefulset
Yancey0623 May 4, 2017
b56e7e7
update
Yancey0623 May 4, 2017
b8e63d9
update
Yancey0623 May 5, 2017
6cbf80d
update
Yancey0623 May 5, 2017
cb39a81
update
Yancey0623 May 6, 2017
063805d
update
Yancey0623 May 6, 2017
e2e6875
update
Yancey0623 May 6, 2017
5ec1deb
update
Yancey0623 May 6, 2017
53b5afa
update
Yancey0623 May 6, 2017
198d0d1
update
Yancey0623 May 7, 2017
838509b
update
Yancey0623 May 9, 2017
259731a
update
Yancey0623 May 11, 2017
a5a0aeb
update
Yancey0623 May 12, 2017
080e633
trainer function
Yancey0623 May 12, 2017
05d6e00
delete specify resource paragraph
Yancey0623 May 12, 2017
8486227
paramter image instead of base_image and runtime_image
Yancey0623 May 12, 2017
111 changes: 111 additions & 0 deletions doc/design/cluster_train/submit-job.md
@@ -0,0 +1,111 @@
# Submit a Distributed Training Job

If a user wants to start up a distributed training job, he will submit the distributed training job with Python code.
Contributor

The title is "Submit a Distributed Training Job", so local training may not need to be covered here (or at least not in detail; the introduction currently spends more words on local training than on remote training).
Could this be changed to:
The user can submit a distributed training job with Python code, rather than with a command-line interface.

Contributor Author

Done.


To train locally, the user first starts a PaddlePaddle production Docker container and then
executes `python train.py` inside it. Details about the PaddlePaddle Docker image are [here](../../../paddle/scripts/docker/README.md).

## Runtime Environment On Kubernetes

A distributed training job involves two Docker images, the *runtime Docker image* and the *base Docker image*. The runtime Docker image is the image that Kubernetes schedules to run during training. The base Docker image is the image from which the runtime Docker image is built.

- Base Docker Image

Usually, the base Docker image is the PaddlePaddle production Docker image, which includes the paddle binary files and the trainer startup script. Users can also specify any image hosted on any Docker registry they have access to.

- Runtime Docker Image

The trainer package which user upload and some Python dependencies are packaged into a runtime Docker image based on base Docker image
Contributor

Add "."

Contributor Author

Done.


- Python Dependencies
Contributor

"Python Dependencies" is not parallel to the Base / Runtime Docker image items here. Could it be made a subheading instead?

### Handle Python Dependencies

Contributor Author

Done. Wouldn't "-" be more suitable, though? As a subheading it would have to be `#### Handler Python Dependencies`, which is too deep a level.

Contributor

@Yancey1989 The main issue is that the two neighbouring items starting with "-" are parallel to each other, so adding another item starting with "-" here doesn't feel right. Otherwise I think either way is fine.

Contributor

I've seen the updated version; it looks fine.


You need to provide requirments.txt file in your "trainer" package. Example:
Contributor

-> requirements.txt

Contributor Author

Done.

```txt
pillow
protobuf==3.1.0
```
More [details](https://pip.readthedocs.io/en/1.1/requirements.html) about requirements.

An example project looks like:
```bash
paddle_example
|-quick_start
|-trainer.py
|-dataset.py
|-requirments.txt
Contributor

-> requirements.txt

Contributor Author

Done.

```

## Submit Distributed Training Job With Python Code
<img src="./src/submit-job-python.png" width="800">
Contributor

Need a paragraph to describe the flow in a big picture. The paragraph below goes directly to the detail of paddle.dist_train, the reader needs a big picture to follow the concept.

Contributor Author

Done.

Contributor

Docker build / Docker push should go on the arrow lines, shouldn't they? They are actions, so they belong on the lines rather than in the boxes (the boxes in this diagram are basically all nouns or concepts).

Contributor Author

Done.


You can call `paddle.dist_train` and provide distributed training configuration as the parameters.
```python
paddle.dist_train(
    trainer=paddle.trainer.SGD(...,
                               paddle.updater.Adam(...)),
    reader=reader,
    paddle_job=PaddleJob(
        job_name="quickstart",
        pservers=4,
Contributor

pservers=4 together with cpu=1 doesn't seem reasonable. A job with cpu=1 or gpu=1 doesn't need a pserver.

Contributor Author

Wouldn't the case without a pserver count as single-node training run on the cluster? Perhaps that needs a separate API, because even with CPU=10 the user may just want single-node training with 10 CPUs. I also opened an issue, #2019, to discuss how to specify resources; it seems pservers could be dropped as well.

Contributor Author

Done, added a description of resource usage.

        volume="quickstart",
        input="/quickstart/input",
Contributor

My understanding is that we will automatically mount the user's root directory at a specified location (e.g. /home/), so input and output are no longer needed here, right?

Contributor Author

Done.

        output="/quickstart/output",
        base_image="paddlepaddle/paddle:0.10.rc2",
        use_gpu=False,
        memory="512M")
)
```

The pseudo code of `paddle.dist_train` is as follows:
```python
def dist_train(trainer, reader, num_passes=1, event_handler=None, feeding=None, paddle_job=None):
    if os.getenv("PADDLE_NOTEBOOK", "NO") == "YES":
        # submit the paddle job
        paddle_job.submit()
    else:
        # start the training
        trainer.train(reader, num_passes, event_handler, feeding)
```
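
The branch in the pseudo code can be exercised without PaddlePaddle installed. The following is a minimal sketch, not the actual implementation: `StubTrainer` and `StubJob` are hypothetical stand-ins for `paddle.trainer.SGD` and `PaddleJob`, and the string return values are added only so the chosen path can be observed.

```python
import os


class StubTrainer:
    """Hypothetical stand-in for paddle.trainer.SGD; records calls instead of training."""

    def __init__(self):
        self.calls = []

    def train(self, reader, num_passes, event_handler, feeding):
        self.calls.append(("train", num_passes))


class StubJob:
    """Hypothetical stand-in for PaddleJob; records calls instead of contacting a job server."""

    def __init__(self):
        self.calls = []

    def submit(self):
        self.calls.append("submit")


def dist_train(trainer, reader, num_passes=1, event_handler=None,
               feeding=None, paddle_job=None):
    # Same branch as the pseudo code: the PADDLE_NOTEBOOK variable decides
    # whether this process submits the job or runs the training itself.
    if os.getenv("PADDLE_NOTEBOOK", "NO") == "YES":
        paddle_job.submit()
        return "submitted"  # return values are an addition for illustration
    trainer.train(reader, num_passes, event_handler, feeding)
    return "trained"
```

With `PADDLE_NOTEBOOK=YES` the same script submits the job. Without that variable (presumably inside the trainer container) it runs the training loop, so one `train.py` can serve both roles.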

parameter | required | default | explain
Contributor

explain -> explanation

Contributor Author

Done.

--- | --- | --- | ---
job_name|YES||a job name that is unique within the namespace
trainer_package|YES|| entry point for startup trainer process
Contributor

Does trainer_package refer to the folder that gets uploaded? That isn't an entrypoint either; as I understand it, an entrypoint is a command.
What I had in mind is something like this:

(..., trainer_package="/path/to/folder", entrypoint="python train.py")

Actually, I'm not sure whether this is needed. Could trainer_package just be the directory of the current Python file, and entrypoint just be python plus the trainer Python file?

Contributor Author

Sorry, a line was missing here. The Kubernetes Pod must be able to access the directory that trainer_package points to, so trainer_package should be a directory on CephFS or inside the Docker image; using the directory of the current Python file probably won't work.

entrypoint can simply be "python trainer %s" % __file__

Contributor Author

Done.

input| YES || input directory on the distributed file system
output|YES|| output directory on the distributed file system
pservers|YES|| parameter server process count
Contributor

Is pservers still needed? I don't see it in the code example above.

Contributor Author

@Yancey0623 May 6, 2017

It can be optional; I've added the Resource section.
Done

base_image|YES||PaddlePaddle production Docker image
memory|YES|| memory limit
use_gpu|NO|false| whether to use GPU
cpu_num|NO|1| required when `use_gpu=false`
gpu_num|NO|1| required when `use_gpu=true`
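
To make the table concrete, here is a minimal sketch of how the parameters could be held and checked on the client side. `PaddleJobSpec` and `validate` are hypothetical names, not part of PaddlePaddle. The field names follow the example call and the table, and the "at least 1" checks are assumptions.

```python
from dataclasses import dataclass


@dataclass
class PaddleJobSpec:
    """Hypothetical container for the parameters listed in the table."""
    job_name: str
    trainer_package: str
    input: str
    output: str
    pservers: int
    base_image: str
    memory: str
    use_gpu: bool = False
    cpu_num: int = 1
    gpu_num: int = 1

    def validate(self):
        """Return a list of problems; an empty list means the spec looks usable."""
        problems = []
        for name in ("job_name", "trainer_package", "input", "output",
                     "base_image", "memory"):
            if not getattr(self, name):
                problems.append(f"{name} is required")
        if self.pservers < 1:
            problems.append("pservers must be at least 1")
        # Only the count that matches use_gpu is checked, mirroring the table.
        count_name = "gpu_num" if self.use_gpu else "cpu_num"
        if getattr(self, count_name) < 1:
            problems.append(f"{count_name} must be at least 1")
        return problems
```

For example, `PaddleJobSpec("quickstart", "/pkg", "/in", "/out", 4, "paddlepaddle/paddle:0.10.rc2", "512M").validate()` returns an empty list.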

- Start Up Parameter Server and Trainer Jobs
- Deploy the parameter server job, which is a Kubernetes ReplicaSet.
- Deploy the trainer job, which is a Kubernetes Job.

# Job Server

- RESTful API

Job server provides a RESTful HTTP server receives the trainer packages, list PaddlePaddle job etc...
Contributor

Should "list PaddlePaddle job etc..." be replaced with something more general, for example "display job related informations"? Writing "etc..." is a bit vague and doesn't feel appropriate in a design doc.

Contributor Author

Done.

- `POST /v1/package` receives the trainer package and saves it on CephFS
- `POST /v1/trainer/job` submits a trainer job
- `GET /v1/jobs/` lists all jobs
- `GET /v1/jobs/<job-name>` returns the status of a job
- `DELETE /v1/jobs/<job-name>` deletes a job
- `GET /v1/version` returns the job server version
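
As an illustration of how a client might address the endpoints listed above, the sketch below builds (but does not send) the HTTP requests using only the Python standard library. `JobServerClient` is a hypothetical name, and the JSON request body is an assumption, since the design does not fix a wire format. `POST /v1/package` is left out because the upload encoding is not specified.

```python
import json
import urllib.request
from urllib.parse import quote


class JobServerClient:
    """Hypothetical client that builds requests for the job server endpoints."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def _request(self, method, path, payload=None):
        data = None
        headers = {}
        if payload is not None:
            # JSON bodies are an assumption made for this sketch.
            data = json.dumps(payload).encode("utf-8")
            headers["Content-Type"] = "application/json"
        return urllib.request.Request(self.base_url + path, data=data,
                                      headers=headers, method=method)

    def submit_job(self, job_config):
        return self._request("POST", "/v1/trainer/job", job_config)

    def list_jobs(self):
        return self._request("GET", "/v1/jobs/")

    def job_status(self, job_name):
        return self._request("GET", "/v1/jobs/" + quote(job_name, safe=""))

    def delete_job(self, job_name):
        return self._request("DELETE", "/v1/jobs/" + quote(job_name, safe=""))

    def version(self):
        return self._request("GET", "/v1/version")
```

Each method returns a `urllib.request.Request`; passing it to `urllib.request.urlopen` would perform the call against a running job server.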

- Build Runtime Docker Image on Kubernetes

`paddle.dist_train` uploads the trainer package to the Job Server, which saves it on the distributed filesystem and then starts a job that builds the runtime Docker image. The Parameter Server and Trainer both use this runtime Docker image.
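
One way to picture the build step is as a generated Dockerfile that layers the uploaded trainer package and its `requirements.txt` on top of the base image. The helper below is a minimal sketch under that assumption; `render_runtime_dockerfile`, the `trainer` directory name, and the `/workspace` path are all hypothetical, not taken from the design.

```python
def render_runtime_dockerfile(base_image, package_dir="trainer",
                              workdir="/workspace"):
    """Return Dockerfile text that layers a trainer package on a base image."""
    lines = [
        f"FROM {base_image}",
        f"COPY {package_dir}/ {workdir}/",
        f"WORKDIR {workdir}",
        # Install the dependencies declared in the package's requirements.txt.
        "RUN pip install -r requirements.txt",
    ]
    return "\n".join(lines) + "\n"
```

Calling `render_runtime_dockerfile("paddlepaddle/paddle:0.10.rc2")` yields a four-line Dockerfile whose first line is `FROM paddlepaddle/paddle:0.10.rc2`.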

There are some benefits for building runtime Docker image on JobServer:
Contributor

In the end, did we decide to build the image on JobServer or locally?

Contributor Author

For this phase, all the code lives on cloud storage, so we don't build an image and simply copy trainer_package, which is simpler. Based on the earlier discussion, building the runtime Docker image on JobServer still seems the more appropriate approach later on.

Contributor Author

Done.

- **Docker in Docker** should mount `docker.sock` in the container and set `--privileged`, if the code running in a kubernetes pod, it's not safe.
Contributor

Bolding Docker in Docker isn't quite appropriate: readers may not even know what Docker in Docker is.
Also, I don't understand why using Docker in Docker is unsafe in k8s but safe on JobServer.

Contributor Author

If the code runs on Paddle Cloud, each Jupyter Notebook is a Kubernetes Pod. To run Docker build inside that Pod, the host's docker.sock has to be mounted into it, and then users could write code in the Notebook that calls Docker's REST API to access the host's Docker Engine.
Doing the Docker build on JobServer is safer because JobServer runs a bash script we wrote that only performs the Docker build and never executes user code directly.

Contributor Author

Done.

- Users only need to upload the training package files, does not dependency docker engine, docker registry.
Contributor

does not dependency docker engine, docker registry. -> does not need to install docker engine, docker registry as dependencies.

Contributor Author

Done.

- If we later switch to another container image type, such as rkt, users do not need to care about it.

- Start Up Parameter Server and Trainer Jobs
`POST /v1/trainer/job` receives the distributed training parameters and deploys the job as follows:
- Deploy the pserver job, which is a Kubernetes ReplicaSet.
- Deploy the trainer job, which is a Kubernetes Job.
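
The two deployments above can be sketched as Kubernetes manifests built as plain Python dictionaries. This is an illustration under assumptions rather than the job server's actual output: the function names, the `paddle-job` and `role` labels, and the container names are hypothetical, and the `apps/v1` and `batch/v1` API versions are the ones current Kubernetes releases use, which may differ from what was available when this design was written.

```python
def pserver_replicaset(job_name, image, replicas):
    """Build a ReplicaSet manifest for the parameter servers (labels are hypothetical)."""
    labels = {"paddle-job": job_name, "role": "pserver"}
    return {
        "apiVersion": "apps/v1",
        "kind": "ReplicaSet",
        "metadata": {"name": f"{job_name}-pserver"},
        "spec": {
            "replicas": replicas,
            # apps/v1 requires a selector that matches the template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": "pserver", "image": image}]},
            },
        },
    }


def trainer_job(job_name, image, parallelism):
    """Build a Job manifest for the trainers (labels are hypothetical)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"{job_name}-trainer"},
        "spec": {
            "parallelism": parallelism,
            "template": {
                "metadata": {"labels": {"paddle-job": job_name, "role": "trainer"}},
                "spec": {
                    "containers": [{"name": "trainer", "image": image}],
                    # Pods owned by a Job must use Never or OnFailure.
                    "restartPolicy": "Never",
                },
            },
        },
    }
```

Serializing either dictionary with `json.dumps` gives a document that `kubectl apply -f` accepts, since Kubernetes manifests may be written in JSON as well as YAML.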