Design doc: submit a distributed job #1770
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,137 @@ | ||
|
|
||
| # PaddlePaddle Client | ||
| The PaddlePaddle client is a command-line tool. You can use it to start a local training run or to submit a distributed training job to a Kubernetes cluster. | ||
|
|
||
| The relationship among PaddlePaddle, Kubernetes, and Docker: | ||
|
||
|
|
||
| <img src="./submit-job.png" width="500"> | ||
|
||
|
|
||
|
|
||
| # Running your training locally | ||
| Execute `paddle local train` to run training locally. | ||
| ```bash | ||
| paddle local train \ | ||
|   --package-path=./demo \ | ||
|   --module=demo.train \ | ||
|   --input=<input_dir> \ | ||
|   --output=<output_dir> \ | ||
|   --image=<paddle_image> \ | ||
|   --e=NUM_PASS=4 | ||
|
||
| ``` | ||
| - package-path: the Python package that contains your trainer code. | ||
| - module: the module that contains the main function, which is the trainer's entry point. | ||
| - input: input directory; for local training, it's a host path. | ||
| - output: output directory; for local training, it's a host path. | ||
| - image: the PaddlePaddle production image. | ||
| - e: environment variable. | ||
|
|
||
| When you start local training, the client starts a Docker container like this: | ||
|
||
| ```bash | ||
| docker run --rm \ | ||
|   -v <input_dir>:/train/input \ | ||
|   -v <output_dir>:/train/output \ | ||
|   -v <package-path>:/train/package \ | ||
|   -e NUM_PASS=4 <paddle_image> \ | ||
|   python /train/package/train.py | ||
| ``` | ||
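The mapping from the `paddle local train` flags to the `docker run` invocation above can be sketched as a small shell function. This is only an illustration under assumptions: the function name `build_local_train_cmd` and its positional-argument order are hypothetical and not part of the PaddlePaddle client. The mount points and the `train.py` entry point are copied from the example above. The function prints the command instead of running it.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the PaddlePaddle client): prints the
# docker command that corresponds to a local training request.
# Args: <input_dir> <output_dir> <package_path> <image> [KEY=VALUE ...]
build_local_train_cmd() {
  if [ "$#" -lt 4 ]; then
    echo "usage: build_local_train_cmd <input> <output> <package> <image> [KEY=VALUE ...]" >&2
    return 1
  fi
  local input_dir=$1 output_dir=$2 package_path=$3 image=$4
  shift 4

  # Mount the host directories at the container paths used in the example.
  local cmd="docker run --rm"
  cmd+=" -v ${input_dir}:/train/input"
  cmd+=" -v ${output_dir}:/train/output"
  cmd+=" -v ${package_path}:/train/package"

  # Each remaining argument becomes one -e KEY=VALUE option.
  local kv
  for kv in "$@"; do
    cmd+=" -e ${kv}"
  done

  cmd+=" ${image} python /train/package/train.py"
  printf '%s\n' "$cmd"
}
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without Docker and makes the flag-to-option mapping easy to inspect.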
|
|
||
|
|
||
| # Submit distributed training job | ||
| You can use `paddle job submit train <job-name>` to submit a distributed training job. | ||
|
|
||
| ```bash | ||
| paddle job submit train <job-name> \ | ||
|   --package-path=/train/quick_start \ | ||
|   --module=quick_start.train \ | ||
|   --input=<input_dir> \ | ||
|   --output=<output_dir> \ | ||
|   --trainers=4 \ | ||
|   --pservers=2 \ | ||
|   --image=<your image> \ | ||
|   -e=NUM_PASS=5 | ||
| ``` | ||
|
|
||
| - job-name: a unique name for the job. | ||
| - package-path: your Python package files. | ||
| - module: the module that contains the main function, which is the trainer's entry point. | ||
| - input: input directory on the distributed file system. | ||
| - output: output directory on the distributed file system. | ||
| - trainers: number of trainer processes. | ||
| - pservers: number of parameter server processes. | ||
| - image: your trainer Docker image, which includes your trainer files and dependencies. | ||
| - command: | ||
| - e: environment variable | ||
|
|
||
| ## Build your docker image | ||
| Before submitting a distributed training job, you should build your Docker image. Here is a simple example project: | ||
| ```bash | ||
| paddle_example | ||
|   |-Dockerfile | ||
|   `-quick_start | ||
|       |-train.py | ||
|       `-dataset.py | ||
| ``` | ||
| Execute `docker build -t <your repo>/paddle_dist_example .` in the `paddle_example` directory, and then | ||
| push the image with `docker push <your repo>/paddle_dist_example`. | ||
| The `Dockerfile` should add your Python package to the image: | ||
| ```dockerfile | ||
| FROM paddlepaddle/paddle:0.10.0rc2 | ||
| ADD ./quick_start /train/quick_start | ||
| CMD ["python", "/train/quick_start/train.py"] | ||
| ``` | ||
|
|
||
| ## Master process | ||
| The master process bootstraps and manages a distributed job: it deploys the parameter server processes and the trainer processes, and dispatches tasks to the trainers. It is implemented in Go. | ||
|
|
||
| - Setup master process | ||
|
|
||
| When a user submits a distributed training job, the PaddlePaddle client deploys the master process on Kubernetes as a Job resource named `<job-name>-master`. | ||
|
|
||
| - Startup pservers and trainers | ||
|
|
||
| The master process deploys the pservers and trainers on Kubernetes. They are also Job resources, named `<job-name>-pserver` and `<job-name>-trainer`. Because the trainers need the IPs of the pservers, the startup order matters: | ||
| - Deploy the pserver job and wait for its status to become `RUNNING`. | ||
| - Fetch the IPs of all pservers to pass to the trainers as parameters. | ||
| - Deploy the trainer job. | ||
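The startup ordering above can be sketched in shell. This is a minimal illustration, not the master's actual implementation (which the document says is written in Go). The helpers `deploy_job`, `job_status`, and `pserver_ips` are hypothetical placeholders that only echo canned values, including made-up IPs; a real master would query Kubernetes instead. Only the `<job-name>-pserver` and `<job-name>-trainer` names and the `RUNNING` status come from the text.

```bash
#!/usr/bin/env bash
# Hypothetical placeholders standing in for calls to Kubernetes.
deploy_job()  { echo "deploy $*"; }            # would create a Job resource
job_status()  { echo "RUNNING"; }              # would report the Job's status
pserver_ips() { echo "10.0.0.1,10.0.0.2"; }    # made-up IPs for illustration

# Start the pservers first, then the trainers, as described above.
# Args: <job-name> [max_polls]
start_job() {
  local job_name=$1 max_polls=${2:-30}
  local pserver="${job_name}-pserver" trainer="${job_name}-trainer"

  # 1. Deploy the pserver job and wait until it reports RUNNING.
  deploy_job "$pserver"
  local polls=0
  until [ "$(job_status "$pserver")" = "RUNNING" ]; do
    polls=$((polls + 1))
    if [ "$polls" -ge "$max_polls" ]; then
      echo "timed out waiting for $pserver" >&2
      return 1
    fi
    sleep 1
  done

  # 2. Fetch the pserver IPs so they can be passed to the trainers.
  local ips
  ips=$(pserver_ips "$pserver")

  # 3. Deploy the trainer job with the pserver IPs as a parameter.
  deploy_job "$trainer" "$ips"
}
```

The polling limit is an assumption added so the sketch cannot wait forever if the pservers never start; the design text does not specify a timeout.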
|
|
||
| - Dispatch task to trainer | ||
|
|
||
| A detailed description is [here](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/dist#master-process). | ||
|
|
||
| ## Data source | ||
| - Distributed file system | ||
|
|
||
| You can upload your training data to a distributed file system, such as GlusterFS. | ||
| PaddlePaddle supports a default reader for reading data from a distributed file system. | ||
| - HTTP server | ||
|
|
||
| TODO | ||
|
||
| - Real-time data | ||
|
|
||
| TODO | ||
|
|
||
| ## PaddlePaddle client commands | ||
|
||
| - `paddle local` | ||
|
|
||
| - `paddle local train` | ||
|
|
||
| Start a local training run. | ||
| - `paddle job` | ||
|
|
||
| Start up a distributed job | ||
| - `paddle job submit` | ||
|
|
||
| Submit a PaddlePaddle distributed training job using kubectl. | ||
| - `paddle job status` | ||
|
|
||
| Check the job status. | ||
| - `paddle job list` | ||
|
|
||
| List existing PaddlePaddle distributed jobs on Kubernetes. | ||
| - `paddle job cancel` | ||
|
|
||
| Cancel a running PaddlePaddle distributed job. | ||
| - `paddle version` | ||
|
|
||
| Show the PaddlePaddle client version. | ||
Since you are describing the PaddlePaddle client, do you need to describe its full feature set, such as local training, showing the version, etc.?