Commits (37; the diff below shows changes from 9 commits)
3331bda
submit job
Yancey0623 Apr 11, 2017
7faa331
small png
Yancey0623 Apr 11, 2017
718f901
update
Yancey0623 Apr 15, 2017
fd9e1c2
add paddlepaddle commands
Yancey0623 Apr 15, 2017
c095a93
update png
Yancey0623 Apr 15, 2017
d74d9ba
udpate png
Yancey0623 Apr 15, 2017
f4c7bd2
adjust sytle
Yancey0623 Apr 15, 2017
bb7263f
update submit-job
Yancey0623 Apr 18, 2017
005c3e1
update
Yancey0623 Apr 18, 2017
0f113d3
update
Yancey0623 Apr 20, 2017
b6969f9
update paddle server
Yancey0623 Apr 21, 2017
1771707
update
Yancey0623 Apr 25, 2017
a21743a
update
Yancey0623 Apr 25, 2017
d643295
resize image
Yancey0623 Apr 25, 2017
5827dc1
update
Yancey0623 Apr 25, 2017
02d18b2
update
Yancey0623 Apr 27, 2017
68ff895
update
Yancey0623 Apr 27, 2017
1fd8900
udpate image location
Yancey0623 Apr 27, 2017
a57bb04
update
Yancey0623 Apr 28, 2017
1987b45
rename direcotry
Yancey0623 Apr 28, 2017
6f097a3
update
Yancey0623 Apr 29, 2017
bfdd1a3
trainer use replicaset instead of statefulset
Yancey0623 May 4, 2017
b56e7e7
update
Yancey0623 May 4, 2017
b8e63d9
update
Yancey0623 May 5, 2017
6cbf80d
update
Yancey0623 May 5, 2017
cb39a81
update
Yancey0623 May 6, 2017
063805d
update
Yancey0623 May 6, 2017
e2e6875
update
Yancey0623 May 6, 2017
5ec1deb
update
Yancey0623 May 6, 2017
53b5afa
update
Yancey0623 May 6, 2017
198d0d1
update
Yancey0623 May 7, 2017
838509b
update
Yancey0623 May 9, 2017
259731a
update
Yancey0623 May 11, 2017
a5a0aeb
update
Yancey0623 May 12, 2017
080e633
trainer function
Yancey0623 May 12, 2017
05d6e00
delete specify resource paragraph
Yancey0623 May 12, 2017
8486227
paramter image instead of base_image and runtime_image
Yancey0623 May 12, 2017
161 changes: 161 additions & 0 deletions doc/design/dist/submit-job.md
@@ -0,0 +1,161 @@

# PaddlePaddle Client
Contributor:

Since you are describing the PaddlePaddle client, do you need to describe its full feature set, like local training, showing the version, etc.?

The PaddlePaddle client is a command line interface for running a PaddlePaddle distributed training job or starting a local training job.

The relationship among PaddlePaddle, Kubernetes, and Docker:
Contributor:

relation -> relationship

And the very next line jumps straight to a level-1 heading?

Contributor Author:

Done, changed it to a level-2 heading; thanks for pointing it out.

Contributor:

PaddlePaddle,Kubernetes and docker -> PaddlePaddle, kubernetes, and Docker

Kubernetes and Docker are proper nouns and need to be capitalized; Grammarly is handy for catching grammar issues.


<img src="./submit-job.png" width="500">
Collaborator:

Questions with this figure:

  1. I am not sure that pservers and trainers should be in two jobs. In our current configuration, each trainer has a pserver running on the same physical node to optimally overlay networking and computing.

  2. paddle must communicate with Kubernetes' API server to start the job. It might communicate to the master process of the job too.

Contributor Author:

  1. I am not sure that pservers and trainers should be in two jobs

The current configuration has two problems:

  1. It does not work well for large-scale training. Since trainer count == pserver count, when the trainer count is very large we do not need the same number of pservers; that would only increase the probability of pserver failures and add network load.
  2. Because they are started inside the same container, one of the pserver and the trainer has to run in the background, which violates the container design principle and makes failure detection and recovery harder.

each trainer has a pserver running on the same physical node to optimally overlay networking and computing

It looks like controlling the number of pservers is a better way to optimize the network; besides, a trainer has to communicate with all pservers, so having only one pserver started locally does not seem to help much.

paddle must communicate with Kubernetes' API server to start the job. It might communicate to the master process of the job too

I think @helinwang's comment makes sense: paddle can communicate only with the master, which exists as a service; note that this master is not the same master as the one in https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/dist#master-process. I will revise this part of the description in the next update.

Contributor Author:

Done.

Contributor:

Does the "start up a local training job" in the figure need the PaddlePaddle Client at all? It feels like running python train.py locally would be enough.

Contributor Author:

Right now local training requires docker run ... python train.py. Designing the PaddlePaddle Client to also support local training is meant to spare users from having to learn Docker operations.

Contributor:

Got it. In that case, besides downloading the Docker image, does the user still have to download a separate run script?

Contributor Author:

Let's discuss this offline; we also need to discuss whether the Queue implementation should use etcd :)



# Running local training job
Contributor:

For a local training job, just running python train.py seems enough; CLI support is not needed.

You can execute the command `paddle train` with the flag `--locally` to start a local training job.
```bash
paddle train \
--locally \
--job-name=quickstart \
--package-path=./demo \
--module=demo.train \
--input=<input_dir> \
--output=<output_dir> \
--image=<paddle_image> \
--env=NUM_PASS=4
```
- `job-name`: your local training job name
- `package-path`: path to your trainer code (a Python package)
- `module`: the module containing the main function, i.e. the trainer entry point
- `input`: input directory; for local training it is a host path
- `output`: output directory; for local training it is a host path
- `image`: the PaddlePaddle production image
- `env`: environment variables

When a user starts a local training job, the PaddlePaddle client starts a Docker container like:
```bash
docker run --rm \
--name quickstart \
-v <host input dir>:/input \
-v <host output dir>:/output \
-v <package files>:/package \
-e NUM_PASS=4 \
-e PYTHONPATH=/package \
paddlepaddle/paddle:0.10.0rc3 \
python /package/train.py
```


# Running distributed training job

## Configure PaddlePaddle client

First, configure the PaddlePaddle client through its configuration file; the default path is
`$HOME/.paddle/config`.

```yaml
apiVersion: v1
dockerRegistry:
  domain: domain.com  # default is docker.io
  username: <username>
  password: <password>
paddleServer: http://<paddle server domain>:<paddle server port>
```


## Submit a distributed training job
Users submit a distributed training job with the command `paddle train`, without the flag `--locally`.

```bash
paddle train \
--job-name=cluster-quickstart \
--package-path=$PWD/quick_start \
Contributor:

This means uploading a directory to be used as the working root at execution time, which doesn't seem to match what I understand as a "package". Calling it --env-path might be easier to understand.

Contributor Author:

--env-path sounds more like setting an environment variable? This package-path is a local directory containing the network configuration. @helinwang, what do you mean by "package"?

Contributor:

I understood "package" as a package in the programming-language sense, which in my mind is not the same as a directory (e.g., a C++ package is headers plus a dynamic/static library, not a directory).

--module=quick_start.train \
Contributor:

Suggest changing "--module=quick_start.train" to "--entry-point=python train.py"; it could also be a script written by the user, so there is no need to restrict execution to a Python module.

Contributor Author:

Agree with --entry-point instead of --module

--input=<input-dir> \
Contributor:

Based on last night's discussion about datasets and readers, --input does not seem to be needed anymore. Please refer to: #1696 (comment)

Contributor Author:

I think the --input parameter is still needed. In distributed training, --input points to a GlusterFS path; the flow for submitting data and submitting a job is as follows:

  1. paddle upload <host path> <clusterfs path>
  2. paddle train --input=<clusterfs path>
  3. When describing the Job that starts the trainers, the GlusterFS volume has to be mounted into the Pod, e.g. https://github.com/k8sp/tutorials/blob/develop/quickstart/paddle_dist_train/quickstart.yaml#L57
  4. In the reader:

```python
import os

# Sketch from the discussion: fetch_trainerid_from_todo_queue() is pseudocode
# for getting this trainer's id from the task queue, and the create_reader()
# wrapper is added here only so the fragment parses as a function.
def create_reader():
    trainerid = fetch_trainerid_from_todo_queue()
    fp = open(os.path.join("/mnt/glusterfs", os.getenv("INPUT"), trainerid, ".list"))
    def reader():
        for l in fp:
            yield ...  # parse one training sample from the line
    return reader
```

Contributor:

@Yancey1989 Wu Yi has updated his PR on uploading data and reader usage; please see: https://github.com/typhoonzero/Paddle/blob/clusterdesign/doc/design/cluster_train/data_dispatch.md

--output=<output-dir> \
Contributor:

Should the output path be assigned automatically by the paddle server?

Contributor Author:

It is better to let the user specify the output path; it can carry some business-specific meaning, which makes it easier to identify.

Contributor:

Could you give an example of "business-specific meaning"?
If we let users specify the path, we also have to handle invalid paths; would that be too complex?
Would it be simpler to just default the output to userfolder/job-UUID/?

--trainers=4 \
--pservers=2 \
--base-image=<paddle-k8s-image> \
--use-gpu=1 \
Contributor:

If I understand correctly, this should be --use-gpu=true? --use-gpu=1 is a bit ambiguous: does "1" mean true, or the number of GPUs?

I saw the documentation further down, but maybe change it anyway? It is better if users can understand it without reading the docs.

Contributor Author:

--use-gpu=true is great, thanks!

--gpu-num=1 \
Contributor:

--gpu-num seems redundant with --trainers. (For now we can consider only the case of one GPU per trainer pod.)

Contributor Author:

It looks like --trainers is the number of pods actually started, while --gpu-num is the thread_num of each trainer process, i.e. how many GPU cards each trainer process actually uses; this also works when tested on Kubernetes, and it does not look very complex to implement?
Also, the name --gpu-num is indeed not quite appropriate; how about changing it to --trainer-gpu-num?

Contributor:

We discussed this in the group chat; writing it down here again so others can see it:
How many GPUs each trainer runs on should be handled by Kubernetes scheduling itself, transparently to the user.

--env="NUM_PASS=5"
Contributor:

Since a Python file can already be specified, do we still need the ability to specify environment variables? (I haven't thought it through; just raising it.)

Contributor Author:

Let's keep this interface.

```

- `job-name`: a unique job name that you must specify
- `package-path`: the Python package files on your host
- `module`: the module containing the main function, i.e. the trainer entry point
- `input`: input directory on the distributed file system
- `output`: output directory on the distributed file system
- `trainers`: trainer process count
- `pservers`: parameter server process count
- `base-image`: the base Docker image used to build your runtime image (see below)
- `use-gpu`: whether this is a GPU training job
- `gpu-num`: how many GPU cards each paddle trainer process uses; only required if `use-gpu=1`
- `env`: environment variables

## Package into a docker image

- `Runtime docker image` and `base docker image`

For a distributed training job, there are two Docker images, called the `runtime docker image` and the `base docker image`; the `runtime docker image` is the one that actually runs on Kubernetes.

- The `runtime docker image` includes the user's package files and all dependencies.
- The `base docker image` is usually a PaddlePaddle production Docker image, including the paddle binary files and some scripts used for starting up the trainer process and fetching information about the pod. Of course, users can also build their own paddle binary files into a custom `base docker image` following [this doc](../../../paddle/scripts/docker/README.md).
Contributor:

My understanding is that the base image is simply a released paddlepaddle tag name, and does not include "and some scripts used for starting up the trainer process and fetch some information of pod"?

Could "users can also build their own paddle binary files into the custom base docker image" be changed to "users can specify any image name hosted on any public docker registry"?

Contributor Author:

public docker registry => a docker registry the user has access to? Because Kubernetes can use an uploaded secret to log in to a private docker registry.

Contributor:

Good point, that was my oversight. Nice!


The `runtime docker image` will be built automatically by the PaddlePaddle client; here is a simple example project:
Contributor:

automatic -> automatically.

Contributor Author:

Done.

```bash
paddle_example
|-quick_start
| |-trainer.py
| `-dataset.py
`-requirements.txt
```
- `requirements.txt` lists the Python dependency packages; you can create it like:

```txt
pillow
protobuf==3.1.0
```
Some more details are [here](https://pip.readthedocs.io/en/1.1/requirements.html).
- The `quick_start` directory includes the trainer package files.
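
To make the packaging step concrete, the sketch below shows one way the runtime image could be assembled from the base image plus these files. It is only an illustration under assumptions not fixed by this design: the Dockerfile contents, the `/package` layout, and the image tag are hypothetical, and the actual client or server may build the image differently.

```bash
# Hypothetical sketch of building the runtime docker image; the Dockerfile
# contents, paths, and image tag below are illustrative assumptions.
cat > Dockerfile <<'EOF'
# Start from the base docker image (a PaddlePaddle production image).
FROM paddlepaddle/paddle:0.10.0rc3
# Copy the uploaded package files and the dependency list.
COPY quick_start /package/quick_start
COPY requirements.txt /package/requirements.txt
# Install the user's Python dependencies.
RUN pip install -r /package/requirements.txt
ENV PYTHONPATH=/package
EOF

# Build and push the runtime image to the registry configured in $HOME/.paddle/config.
docker build -t <registry domain>/<username>/cluster-quickstart:runtime .
docker push <registry domain>/<username>/cluster-quickstart:runtime
```

Pushing the runtime image to the registry from the client configuration is what would let Kubernetes pull it when the job starts.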

Execute the command `paddle train ...`; the PaddlePaddle client will upload the trainer package files and the setup parameters to the [Paddle Server](#paddle-server).
Contributor:

paddle train... -> paddle train --package-path ./paddle_example/quick_start (or: paddle train --env-path ./paddle_example/quick_start)

Contributor Author:

Done.


## Paddle Server
The Paddle Server runs on Kubernetes; users configure the server address in the [PaddlePaddle client configuration file](#configure-paddlepaddle-client).

- HTTP server
Contributor:

HTTP server -> Paddle Server

Btw, titles need to be in title case: HTTP Server, not HTTP server. You can convert a string to title case at: http://www.titlecase.com/

Contributor Author:

Thanks a lot!! Done.


The Paddle Server is an HTTP server; it receives the trainer files and saves them on GlusterFS.

- Start up pservers and trainers

The Paddle Server deploys the pservers and trainers on Kubernetes; they are also job resources, named `<job-name>-pserver` and `<job-name>-trainer`.
- Set up the pserver job.
- Set up the trainer job; the trainer processes start only after all pserver pods reach the `RUNNING` status, and all pservers' IPs are fetched as trainer startup parameters (a sketch of this step follows the list below).
Contributor:

I don't think this is required, according to design doc https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/dist . If one thing is not required, we should in general not do it.

Contributor Author:

It seems that we can implement service discovery via the Kubernetes API; the startup steps for the pservers and trainers are as follows:

  • Start up the pserver and trainer jobs
    • Deploy the pserver job; it is a Kubernetes StatefulSet.
    • Deploy the trainer job; it is a Kubernetes Job.
      • Wait until all pserver pods are running.
      • Fetch all pserver addresses using the Kubernetes API and set an environment variable: export PSERVER_ADDR=10.17.0.1:7651,10.17.0.2:7651
      • Start the trainer process with the entry point.

We should have a bash script for starting the trainer process:

```bash
# fetch all pserver addresses
/usr/bin/paddle_k8s sync_wait_pserver
export PSERVER_ADDR=$(/usr/bin/paddle_k8s fetch_pserver_addr)
${ENTRY_POINT}   # the command passed via --entry-point
```

trainer.py:

```python
if os.getenv("DISTRIBUTED_TRAIN"):
    paddle.init(pserver=os.getenv("PSERVER_ADDR"), ...)
```

Contributor (@helinwang, Apr 21, 2017):

The main problem with doing it this way seems to be that if a pserver dies, its IP may change after the restart. It seems k8s can give the restarted pod the same IP? Are there any downsides to that?
Also, if we do it this way, how do we support scaling out the parameter servers? V1 does not need dynamic pserver scaling, but we will need it later (if we want to support scaling the number of trainers): once the trainer count grows a lot, the pserver count actually needs to change accordingly, otherwise too few parameter servers become a network bottleneck.

- Set up the trainer process in the trainer pod.
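
The sketch below illustrates the "wait for pservers, then fetch their IPs" step mentioned above. The label selector, the port, and the use of plain `kubectl` are assumptions made only for illustration; the design delegates this step to the Paddle Server and the startup scripts shipped in the base image.

```bash
# Hypothetical illustration only: the label name, port, and use of kubectl
# are assumptions, not part of this design.
JOB=cluster-quickstart   # the --job-name
PSERVERS=2               # the --pservers count
PORT=7164                # assumed pserver port

# Wait until every pserver pod of this job reports phase Running.
ready=0
while [ "$ready" -lt "$PSERVERS" ]; do
  sleep 5
  ready=$(kubectl get pods -l paddle-job="${JOB}-pserver" \
            -o jsonpath='{.items[*].status.phase}' \
          | tr ' ' '\n' | grep -c '^Running$')
done

# Collect the pod IPs and expose them to the trainer as PSERVER_ADDR.
ips=$(kubectl get pods -l paddle-job="${JOB}-pserver" \
        -o jsonpath='{.items[*].status.podIP}')
export PSERVER_ADDR=$(echo "$ips" | tr ' ' '\n' | sed "s/$/:${PORT}/" | paste -sd, -)
```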

## Data source
- Distributed file system

You can upload your training data to a distributed file system such as GlusterFS;
PaddlePaddle supports a default reader for reading data from the distributed file system (see the sketch after this list).
- HTTP server

TODO
Contributor:

I think it's already mentioned in #1696 , maybe we can give a general introduction (as you already did) and reference there after that PR is merged.

Contributor Author:

Delete this section :)

- Real-time data

TODO
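
As a concrete illustration of the distributed-file-system option, the flow sketched in the review discussion above might look like the following; `paddle upload` and the GlusterFS paths come from that discussion and are not a finalized interface.

```bash
# Hypothetical flow from the review discussion; command names and paths are not final.
paddle upload ./quickstart_data /glusterfs/quickstart/data   # stage data on the DFS
paddle train \
  --job-name=cluster-quickstart \
  --input=/glusterfs/quickstart/data \
  --output=/glusterfs/quickstart/output \
  --trainers=4 \
  --pservers=2 \
  --base-image=<paddle-k8s-image>
```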

## PaddlePaddle client commands:
Contributor:

I think for running locally we should just let the user do python train.py.

Contributor Author:

Run a local training job with the flag `--locally`, and a distributed training job without it.
- `paddle train`: start a training job
- `paddle prediction`: start a prediction job
- `paddle list`: list all PaddlePaddle jobs in the current namespace
- `paddle cancel`: cancel a running job
- `paddle status`: show the status of a PaddlePaddle job
- `paddle version`: show the PaddlePaddle client and PaddlePaddle server version info
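
A hypothetical session with these commands might look like this; the exact argument forms are not specified in this design, so the job-name arguments below are assumptions.

```bash
# Hypothetical usage; argument forms are assumptions, not a finalized interface.
paddle train --job-name=cluster-quickstart ...   # submit a job (flags as shown above)
paddle list                                      # list jobs in the current namespace
paddle status cluster-quickstart                 # show the status of one job
paddle cancel cluster-quickstart                 # cancel a running job
paddle version                                   # show client and server versions
```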

## Work feature
- V1
  - The Paddle Server is a local version; it builds the `runtime docker image` and deploys the trainer and pserver jobs on the user's host.
  - Implement `paddle train`, `paddle list`, and `paddle cancel`.
Contributor:

If we let the user do python train.py for running locally, we can change paddle train to something like paddle submit.

Contributor Author:

It seems that we still need paddle train for a local train, as https://github.com/PaddlePaddle/Paddle/pull/1770/files#r112356575

- V2
  - The Paddle Server runs on Kubernetes; users only upload the package files and some setup parameters, and the `runtime docker image` is built on Kubernetes.
Contributor:

Please list the benefits of building the docker image in the cloud; otherwise it is not convincing that we should do this step.

Contributor Author:

Done.

  - Implement `paddle prediction` and other features.
Contributor:

I think this is too far away and has too much uncertainty: do we need load balancing for paddle prediction, etc.? Maybe we can just remove it and add it back later.

Contributor Author:

Delete paddle prediction, done.

Binary file added doc/design/dist/submit-job.png