-
Notifications
You must be signed in to change notification settings - Fork 5.9k
Design doc: submit a distributed job #1770
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 15 commits
3331bda
7faa331
718f901
fd9e1c2
c095a93
d74d9ba
f4c7bd2
bb7263f
005c3e1
0f113d3
b6969f9
1771707
a21743a
d643295
5827dc1
02d18b2
68ff895
1fd8900
a57bb04
1987b45
6f097a3
bfdd1a3
b56e7e7
b8e63d9
6cbf80d
cb39a81
063805d
e2e6875
5ec1deb
53b5afa
198d0d1
838509b
259731a
a5a0aeb
080e633
05d6e00
8486227
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,157 @@ | ||
| # Submit a Distributed Training Job | ||
|
|
||
| If a user wants to start up a local train, he will start up a PaddlePaddle product Docker container firstly, and then | ||
| execute `python train.py` in the Docker container.The details about PaddlePaddle Docker image is [here](../../../paddle/scripts/docker/README.md) | ||
|
|
||
| If a user wants to start up a distributed training job, he will submit the distributed training job in python code, or use a command line tool. | ||
|
||
|
|
||
| The relation of PaddlePaddle, kubernetes and docker: | ||
|
||
|
|
||
|
|
||
| # Runtime Environment On kubernetes | ||
|
||
|
|
||
| For a distributed training job, there is two docker image called `runtime docker image` and `base docker image`, the `runtime docker image` is actually running in kubernetes. | ||
|
||
|
|
||
| - Base Docker Image | ||
|
|
||
| Usually, the `base docker image` is PaddlePaddle product docker image including paddle binary files and trainer startup script file. And of course, users can specify any image name hosted on any docker registry which users have the right access. | ||
|
|
||
| - Runtime Docker Image | ||
|
|
||
| Package the trainer package which user upload and some python dependencies into a `runtime docker image` base on `base docker image`, this is done automatically by Job Server. | ||
|
|
||
| - Python Dependencies | ||
|
|
||
| Users will provide a `requirments.txt` file in trainer packages, to list python dependencies packages, such as: | ||
|
||
| ```txt | ||
| pillow | ||
| protobuf==3.1.0 | ||
| ``` | ||
| some other details about `requirements` is [here](https://pip.readthedocs.io/en/1.1/requirements.html). | ||
|
||
|
|
||
| Here is an example project: | ||
|
||
| ```bash | ||
| paddle_example | ||
| |-quick_start | ||
| |-trainer.py | ||
| |-dataset.py | ||
| |-requirments.txt | ||
| ``` | ||
| Execute the command: `paddle train --trainer-package=./paddle_eample/quick_start ...`, PaddlePaddle client will upload the trainer package(quick_start)and setup parameters to [Job Server](#job-server) | ||
|
||
|
|
||
| ## Submit a Distributed Training Job In Python Code | ||
|
||
| <img src="./submit-job-python.png" width="800"> | ||
|
||
|
|
||
| Users will call `paddle.dist_train` and provide distributed training configuration as the parameters. | ||
|
||
| ```python | ||
| paddle.dist_train( | ||
|
||
| model, | ||
| trainer=paddle.trainer.SGD(..., | ||
| paddle.updater.Adam(...)), | ||
| reader=reader, | ||
| job_name="quickstart", | ||
| trainers=8, | ||
|
||
| pservers=4, | ||
| input=/quickstart/input, | ||
|
||
| output=/quickstart/output, | ||
| base_image="paddlepaddle/paddle:0.10.rc2", | ||
| use_gpu=False) | ||
| ``` | ||
|
|
||
| - Build Runtime Docker Image on Kubernetes | ||
|
|
||
| `paddle.dist_train` will deploy a kubernetes job, build and push runtime docker image in the pod. Parameter Server and Trainer pod will use the runtime docker image. | ||
|
|
||
| There are some benefits for building Docker image on the kubernetes: | ||
| - `Docker in Docker` should mount `docker.sock` in the container and set `--privileged`, if the code running in a kubernetes pod, it's not safety. | ||
| - Users only need to upload the training package files, does not dependency docker engine, docker registry. | ||
| - If we want to change another image type, such as RKT, the user does not need to care about it. | ||
|
|
||
| - Startup Parameter Server and Trainer Job | ||
| - Deploy parameter server job, it's a kubernetes StatefulSet. | ||
| - Deploy trainer job, it's a kubernetes Job. | ||
|
|
||
| ## Submit a Distributed Training Job With a Command Line Tool | ||
| <img src="./submit-job-command-line.png" width="800"> | ||
|
|
||
| - Configurate PaddlePaddle Client | ||
|
|
||
| Users should configure PaddlePaddle client by the configuration file firstly, the default path: | ||
| `$HOME/.paddle/config`. | ||
|
|
||
| ```yaml | ||
| apiVersion: v1 | ||
| dockerRegistry: | ||
| domain: domain.com //default is docker.io | ||
| username: <username> | ||
| password: <password> | ||
| jobServer: http://<job server domain>:<job server port> | ||
| ``` | ||
|
|
||
| - Submit a Distributed Training Job | ||
| Users will execute the command `paddle job submit` and provides distributed training configuration as the parameters. | ||
| ```bash | ||
| paddle job submit\ | ||
| --job-name=cluster-quickstart \ | ||
| --trainer-package=$PWD/quick_start \ | ||
| --entry-point="python train.py" \ | ||
| --input=<input-dir> \ | ||
| --output=<output-dir> \ | ||
| --trainers=4 \ | ||
|
||
| --pservers=2 \ | ||
| --base-image:<paddle-image> \ | ||
| --use-gpu=true \ | ||
| --trainer-gpu-num=1 \ | ||
| --env="NUM_PASS=5" | ||
| ``` | ||
| - `job-name`: you should specify a unique job name | ||
| - `trainer-package`: python package files on your host | ||
| - `entry-point`: an entry point for startup trainer process | ||
| - `input`: input directory on distributed file system | ||
| - `output`: output directory on distributed file system | ||
| - `trainers`: if `use-gpu=false`, users should configurate the trainer process count | ||
| - `pserver`: parameter process count | ||
| - `base-image`: your trainer docker image, include your trainer files and dependencies. | ||
| - `use-gpu`: whether it is a GPU train | ||
| - `trainer-gpu-num`: how much GPU card for one paddle trainer process, it's requirements only if `use-gpu=true`, | ||
| - `env`: environment variable | ||
|
|
||
| The command `paddle train` will package the trainer package to a `trainer.tar.gz` file, call `POST /v1/package` to upload the trainer package file. and then call `POST /v1/trainer/job` to start up a distributed job. | ||
|
|
||
| - PaddlePaddle Client Commands: | ||
| The command line tool also supports the following subcommands: | ||
| - `paddle train`: start a training job | ||
| - `paddle list`: list all PaddlePaddle jobs in current namespace | ||
| - `paddle cancel`: cancel a running job. | ||
| - `paddle status`: status of a PaddlePaddle job | ||
| - `paddle version`: show job client and job server version info. | ||
| - `paddle upload`: upload training data to distributed storage. | ||
| - `paddle download`: download training data from a distributed storage. | ||
|
|
||
|
|
||
| # Job Server | ||
| Job server is running on kubernetes, users will configure the server address in [PaddlePaddle client configuration file](#configurate-paddlepaddle-client) | ||
|
|
||
| - RESTful API | ||
|
|
||
| Job server provides a RESTful HTTP server receives the trainer packages, list PaddlePaddle job etc... | ||
| - `POST /v1/package` receive the trainer package and save them on GlustereFS | ||
| - `POST /v1/trainer/job` submit a trainer job | ||
| - `GET /v1/jobs/` list all job | ||
| - `GET /v1/jobs/<job-name>` the status of a job | ||
| - `DELETE /v1/jobs/<job-name>` cancel a job | ||
| - `GET /v1/version` job server version | ||
|
|
||
| - Build Runtime Docker Image On Kubernetes | ||
|
|
||
| Job Server deploys a kubernetes Job and builds runtime docker image in Pod, pserver and trainer pod will use this runtime docker image to startup pserver and trainer process. | ||
|
|
||
| - Start Up PSrvers and Trainers Job | ||
| - Deploy pserver job, it's a kubernetes StatefulSet. | ||
| - Deploy trainer job, it's a kubernetes Job. | ||
|
|
||
| # Work Feature | ||
| - V1 | ||
| - Submit a distributed training job in python code. | ||
| - V2 | ||
| - Submit a distributed training job with command line tool, build/push docker image and deploy pserver/trainer job in Job Server. | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
product -> production