Design doc: submit a distributed job #1770
# Submit a Distributed Training Job

The user can submit a distributed training job with Python code, rather than with a command-line interface.

## Runtime Environment On Kubernetes

For a distributed training job, there are two Docker images: the *runtime Docker image* and the *base Docker image*. The runtime Docker image is the image that Kubernetes schedules to run during training. The base Docker image is used to build the runtime Docker image.

### Base Docker Image

Usually, the base Docker image is the PaddlePaddle production Docker image, which includes the Paddle binaries and the Python package. Users can also specify any image name hosted on any Docker registry to which they have access.

### Runtime Docker Image

The trainer package uploaded by the user, together with its Python dependencies, is packaged into a runtime Docker image based on the base Docker image.

- Handle Python Dependencies

You need to provide a `requirements.txt` file in your trainer package folder. Example:

```txt
pillow
protobuf==3.1.0
```
See more [details](https://pip.readthedocs.io/en/1.1/requirements.html) about requirements files. An example project looks like this:
```bash
paddle_example
  |-quick_start
    |-trainer.py
    |-dataset.py
    |-requirements.txt
```

## Submit Distributed Training Job With Python Code
<img src="./src/submit-job-python.png" width="800">

> **Contributor:** Docker build / Docker push are actions, so they should be labeled on the arrow lines rather than drawn as boxes in the diagram (the boxes here are basically all nouns/concepts).
>
> **Author:** Done.

- `paddle.job.dist_train()` will call the Job Server API `/v1/packages` to upload the trainer package and save it on CephFS, and then call `/v1/trainer/job` to submit the PaddlePaddle distributed training job (a minimal client-side sketch follows this list).

> **Author:** Missed a blank line... Done.
>
> **Contributor:** To make our first version concrete, can we change it like this: change "The first version, we only implement submit the PaddlePaddle job in …"
>
> **Author:** Done.

- `/v1/trainer/job` will start a build job to prepare the runtime Docker image. When the build job finishes, the Job Server will submit the PaddlePaddle distributed training job to Kubernetes.

> **Contributor:** In the first version, is the job submitted through the JobServer, or directly from local Python?
>
> **Author:** The first version submits directly from the local machine; local submission requires building the runtime Docker image. I added a sentence about this here.

- *NOTE*: For the first version, we will not prepare the runtime Docker image on the Job Server; instead, we will build the runtime Docker image on the host. If the code is running on PaddleCloud, we will mount the trainer package from a temporary folder into the base Docker image. We will not support custom Python dependencies in the first version either.

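The sketch below illustrates what this client-side flow could look like. It is a minimal sketch only: the `requests` library, the Job Server address, the payload fields, and the `submit_distributed_job` helper are assumptions for illustration; only the endpoint paths come from this document.

```python
# Hedged sketch of the client-side submission flow; server address and
# payload fields are hypothetical, endpoint paths are from this design doc.
import requests

JOB_SERVER = "http://job-server.paddle-cloud:8080"  # hypothetical address


def submit_distributed_job(package_path, job_spec):
    # 1. Upload the trainer package; the Job Server stores it on CephFS.
    with open(package_path, "rb") as pkg:
        resp = requests.post("%s/v1/packages" % JOB_SERVER, files={"package": pkg})
    resp.raise_for_status()

    # 2. Submit the distributed training job with its resource configuration.
    resp = requests.post("%s/v1/trainer/job" % JOB_SERVER, json=job_spec)
    resp.raise_for_status()
    return resp.json()


job = submit_distributed_job(
    "/example/word2vec.tar.gz",  # hypothetical packaged trainer
    {"job_name": "paddle-job", "trainers": 10, "pservers": 3})
```
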
You can call `paddle.job.dist_train` and provide the distributed training configuration as parameters:
```python
paddle.job.dist_train(
    trainer=paddle.trainer.SGD(...,
        paddle.updater.Adam(...)),
    reader=reader,
    paddle_job=PaddleJob(
        runtime_image="yancey1989/paddle-job",
        job_name="paddle-job",
        namespace="paddle-cloud",
        cpu_nums=3,
        memory="1G",
        trainer_package="/example/word2vec",
        entry_point="python %s" % __file__))
```

The pseudo code of `paddle.job.dist_train` is as follows:
```python
def dist_train(trainer, reader, num_passes=1, event_handler=None, feeding=None, paddle_job=None):
    # PADDLE_ON_CLOUD is set to "YES" when the code runs on the cloud.
    if os.getenv("PADDLE_ON_CLOUD", "NO") == "NO":
        # Running locally: submit the PaddlePaddle job to the cluster.
        paddle_job.submit()
    else:
        # Running on the cloud: start the training.
        trainer.train(reader, num_passes, event_handler, feeding)
```
### PaddleJob Parameters

parameter | type | explanation
--- | --- | ---
job_name | str | the unique name of the training job
entry_point | str | entry point for starting the trainer process
trainer_package | str | trainer package file path to which the user has access
base_image | str | the [base image](#base-docker-image) used to build the [runtime image](#runtime-docker-image)
runtime_image | str | the [runtime image](#runtime-docker-image)
memory | str | memory allocated for the job, a plain integer using one of these suffixes: E, P, T, G, M, K
cpu_nums | int | CPU count for the job
gpu_nums | int | GPU count for the job
pservers | int | Parameter Server process count
trainers | int | Trainer process count
pserver_cpu | int | CPU count for each Parameter Server process
pserver_mem | str | memory allocated for each Parameter Server process, a plain integer using one of these suffixes: E, P, T, G, M, K
trainer_cpu | int | CPU count for each Trainer process
trainer_gpu | int | GPU count for each Trainer process
trainer_mem | str | memory allocated for each Trainer process, a plain integer using one of these suffixes: E, P, T, G, M, K

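For illustration only, the parameters above could map onto a constructor like the following minimal sketch; the defaults and the class layout are assumptions, not part of this design:

```python
class PaddleJob(object):
    """Minimal sketch holding the distributed-job configuration listed in the table above."""

    def __init__(self, job_name, entry_point, trainer_package,
                 base_image=None, runtime_image=None, namespace="default",
                 memory=None, cpu_nums=None, gpu_nums=0,
                 pservers=None, trainers=None,
                 pserver_cpu=None, pserver_mem=None,
                 trainer_cpu=None, trainer_gpu=0, trainer_mem=None):
        self.job_name = job_name
        self.entry_point = entry_point
        self.trainer_package = trainer_package
        self.base_image = base_image
        self.runtime_image = runtime_image
        self.namespace = namespace
        self.memory = memory
        self.cpu_nums = cpu_nums
        self.gpu_nums = gpu_nums
        self.pservers = pservers
        self.trainers = trainers
        self.pserver_cpu = pserver_cpu
        self.pserver_mem = pserver_mem
        self.trainer_cpu = trainer_cpu
        self.trainer_gpu = trainer_gpu
        self.trainer_mem = trainer_mem
```
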
### Specify Resources for a Distributed Training Job

- Specify Job Resources

  You can specify the total resources used by a job, and the trainer count and pserver count will be calculated from the job resources and the Kubernetes cluster's physical architecture.
  - You *may* specify `gpu_nums`, the total GPU count used by the job.
  - You *must* specify `cpu_nums`, the total CPU count used by the job.
  - You *must* specify `memory`, the memory allocated for the job.

  Example:
  ```python
  paddle_job = PaddleJob(
      job_name="paddle-cloud",
      entry_point="python train.py",
      trainer_package="/example/word2vec",
      runtime_image="yancey1989/paddle-job",
      memory="10G",
      cpu_nums=3,
      gpu_nums=3)
  ```
- Specify Trainer and Parameter Server Resources
  - You *must* specify `trainers`, the Trainer count for the job.
  - You *must* specify `pservers`, the Parameter Server count for the job.
  - You *must* specify `trainer_cpu`, the CPU count for each Trainer process.
  - You *must* specify `trainer_mem`, the memory allocated for each Trainer process.
  - You *must* specify `pserver_cpu`, the CPU count for each Parameter Server process.
  - You *must* specify `pserver_mem`, the memory allocated for each Parameter Server process.
  - You *may* specify `trainer_gpu`, the GPU count for each Trainer process.

  Example:
  ```python
  paddle_job = PaddleJob(
      job_name="paddle-cloud",
      entry_point="python train.py",
      trainer_package="/example/word2vec",
      runtime_image="yancey1989/paddle-job",
      trainers=10,
      pservers=3,
      trainer_cpu=1,
      trainer_gpu=1,
      trainer_mem="10G",
      pserver_cpu=1,
      pserver_mem="2G")
  ```

### Deploy Parameter Server, Trainer and Master Processes
- Deploy the PaddlePaddle Parameter Server processes as a Kubernetes ReplicaSet.
- Deploy the PaddlePaddle Trainer processes as a Kubernetes Job.
- Deploy the PaddlePaddle Master process as a Kubernetes ReplicaSet.

## Job Server

> **Contributor:** Does the first version need to implement the job server? https://github.com/PaddlePaddle/cloud/wiki/2017-05 mentions "view training status" — does that mean the website fetches information through the job server's API and shows the user the job list and the job logs?
>
> **Author:** In this version we plan to implement part of the JobServer interface directly in the website backend, for example viewing the training status and the job list.


- RESTful API

The Job Server provides a RESTful HTTP API for receiving the trainer package and displaying PaddlePaddle job related information (a usage sketch follows the list).
- `POST /v1/package` receives the trainer package and saves it on CephFS
- `POST /v1/trainer/job` submits a trainer job
- `GET /v1/jobs/` lists all jobs
- `GET /v1/jobs/<job-name>` gets the status of a job
- `DELETE /v1/jobs/<job-name>` deletes a job
- `GET /v1/version` gets the Job Server version

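As a usage illustration, a client could poll the job status endpoint like the minimal sketch below; the server address, the response schema, and the terminal state names are assumptions, only the endpoint path comes from the list above.

```python
# Hedged sketch: poll a job's status through the Job Server API.
import time
import requests

JOB_SERVER = "http://job-server.paddle-cloud:8080"  # hypothetical address


def wait_for_job(job_name, poll_interval=10):
    while True:
        resp = requests.get("%s/v1/jobs/%s" % (JOB_SERVER, job_name))
        resp.raise_for_status()
        status = resp.json()  # the exact response schema is an assumption
        print("job %s: %s" % (job_name, status))
        if status.get("state") in ("Succeeded", "Failed"):  # assumed terminal states
            return status
        time.sleep(poll_interval)
```
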
- Build the Runtime Docker Image on Kubernetes

`paddle.job.dist_train` will upload the trainer package to the Job Server, save it on the distributed filesystem, and then start a job that builds the runtime Docker image which Kubernetes schedules to run during training (a hedged sketch of such a build step follows).

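A minimal sketch of what that build step could look like is shown below, assuming the Docker SDK for Python is available where the build runs; the generated Dockerfile, the paths, and the image names are illustrative assumptions, not part of this design.

```python
# Hedged sketch of building and pushing the runtime image from a trainer package.
import os
import docker


def build_runtime_image(package_dir, base_image, runtime_image):
    # Write a Dockerfile that layers the trainer package and its Python
    # dependencies on top of the base image (contents are an assumption).
    dockerfile = (
        "FROM %s\n"
        "COPY . /trainer_package\n"
        "RUN pip install -r /trainer_package/requirements.txt\n" % base_image)
    with open(os.path.join(package_dir, "Dockerfile"), "w") as f:
        f.write(dockerfile)

    client = docker.from_env()
    image, _ = client.images.build(path=package_dir, tag=runtime_image, rm=True)
    client.images.push(runtime_image)  # push to the registry Kubernetes pulls from
    return image


build_runtime_image("/example/word2vec", "paddlepaddle/paddle", "yancey1989/paddle-job")
```
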
There are several benefits to building the runtime Docker image on the JobServer:

> **Contributor:** Did we finally decide to build the image on the JobServer or locally?
>
> **Author:** In this phase all of the code lives on cloud storage, so we do not build an image at all; directly copying the trainer_package is simpler. Based on the earlier discussion, building the runtime Docker image on the JobServer still seems more appropriate for later versions.
>
> **Author:** Done.

- On Paddle Cloud, users run the trainer code in a Jupyter Notebook, which is a Kubernetes Pod. If we wanted to execute `docker build` in that Pod, we would have to mount the host's `docker.sock` into the Pod, so the user's code would connect to the host's Docker Engine directly, which is not safe.
- Users only need to upload the trainer package files; they do not need to install the Docker Engine or a Docker registry as dependencies.
- If we want to switch to another image format, such as rkt, users do not need to care about it.


- Deploy Parameter Server, Trainer and Master Processes

`POST /v1/trainer/job` receives the distributed training parameters and deploys the job as follows (a deployment sketch follows the reviewer discussion below):
- Deploy the PaddlePaddle Parameter Server processes as a Kubernetes ReplicaSet.
- Deploy the PaddlePaddle Trainer processes as a Kubernetes Job.
- Deploy the PaddlePaddle Master process as a Kubernetes ReplicaSet.

> **Contributor:** There is only one master; do we need a ReplicaSet (or is something else more suitable)? I'm not quite sure which is better for deployment — asking for advice.
>
> **Author:** A ReplicaSet is simpler; the main point is that when the Master node goes down, Kubernetes will reschedule the Master. Using a bare Pod would lose this property.
>
> **Contributor:** OK, understood!

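The sketch below shows how the trainer processes could be deployed as a Kubernetes Job using the official Kubernetes Python client; the namespace, labels, and container command are illustrative assumptions, and the Parameter Server and Master ReplicaSets would be created analogously through the AppsV1Api.

```python
# Hedged sketch: deploy the trainers as a Kubernetes Job.
from kubernetes import client, config


def deploy_trainers(job_name, runtime_image, entry_point, trainers, namespace="paddle-cloud"):
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    container = client.V1Container(
        name="trainer",
        image=runtime_image,
        command=["sh", "-c", entry_point])
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"paddle-job": job_name}),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"))
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name="%s-trainer" % job_name),
        spec=client.V1JobSpec(parallelism=trainers, completions=trainers, template=template))
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)
```
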

> **Contributor:** Need a paragraph to describe the flow in a big picture. The paragraph below goes directly to the detail of `paddle.dist_train`; the reader needs a big picture to follow the concept.
>
> **Author:** Done.