Add demo about Click-Through-Rate distributed training with PaddlePaddle #434
Conversation
| type: "" | ||
| name: seqdata | ||
| containers: | ||
| - image: sivanzcw/edlctr:v1 |
let's move this image to volcanosh
Already replaced with volcanosh image
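(For reference, the replaced line presumably looks like the following; the exact image tag is an assumption, carried over unchanged from the original.)

```yaml
containers:
- image: volcanosh/edlctr:v1  # moved from sivanzcw/edlctr:v1 to the volcanosh org; tag assumed unchanged
```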
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: k82cn, sivanzcw

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
```yaml
- name: TRAINER_PACKAGE
  value: /workspace
- name: PADDLE_INIT_NICS
  value: eth2
```
I am not familiar with PaddlePaddle, but what's this?
PADDLE_INIT_NICS is used to pass the --nics parameter, which specifies the network card, to the paddle pserver or paddle train command here:

```sh
start_pserver() {
    stdbuf -oL paddle pserver \
        --use_gpu=0 \
        --port=$PADDLE_INIT_PORT \
        --ports_num=$PADDLE_INIT_PORTS_NUM \
        --ports_num_for_sparse=$PADDLE_INIT_PORTS_NUM_FOR_SPARSE \
        --nics=$PADDLE_INIT_NICS \
        --comment=paddle_process_k8s \
        --num_gradient_servers=$PADDLE_INIT_NUM_GRADIENT_SERVERS
}
```

The default value is taken from the relevant configuration in the PaddlePaddle EDL demo here.
What if eth2 does not exist in a container?
After testing, setting a non-existent NIC for PaddlePaddle does not affect the training process in this demo, so this parameter, which came from the original Baidu demo, has been removed. In the implementation of the demo, each pod filters out the pserver and trainer components through the preset PServer pod label and Trainer pod label, and thereby obtains the IP lists of the pserver and trainer components. While building the pserver IP list, the system also assigns a port to each pserver; once a pserver gets its assigned port, it starts listening on that port and serving requests. In the end, every pserver and trainer in the computing cluster knows the exact communication address of every other component, so the components can communicate directly without any network-card configuration.
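To make the label-based discovery concrete, here is a minimal shell sketch, not the demo's actual script: the label key paddle-job-pserver and the base port 30236 are assumptions for illustration.

```sh
# Sketch only: discover pserver pods by their preset label and build the
# ip:port endpoint list trainers connect to. No NIC name is needed because
# the pod IPs are used directly.
PSERVER_LABEL="paddle-job-pserver=${JOB_NAME}"  # hypothetical label key
BASE_PORT=30236                                 # hypothetical assigned port

# Collect the pod IPs of every pserver pod in the job.
PSERVER_IPS=$(kubectl get pods -l "$PSERVER_LABEL" \
    -o jsonpath='{.items[*].status.podIP}')

# Pair each IP with the assigned port to form the endpoint list.
ENDPOINTS=""
for ip in $PSERVER_IPS; do
    ENDPOINTS="${ENDPOINTS:+$ENDPOINTS,}${ip}:${BASE_PORT}"
done
echo "pserver endpoints: $ENDPOINTS"
```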
Force-pushed adf9388 to 9e93105
Force-pushed 9e93105 to 3d10085
/lgtm
Add demo about distributed training with PaddlePaddle on Volcano; the source demo is taken from https://github.com/PaddlePaddle/edl/tree/develop/example/ctr
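For readers new to Volcano, a minimal sketch of the kind of Volcano Job such a demo defines, with separate pserver and trainer tasks; the job name, replica counts, and minAvailable value here are illustrative, not the demo's actual spec.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ctr-demo            # illustrative name
spec:
  minAvailable: 3           # gang-schedule all pservers and trainers together
  schedulerName: volcano
  tasks:
  - name: pserver
    replicas: 1             # illustrative count
    template:
      spec:
        containers:
        - name: pserver
          image: volcanosh/edlctr:v1
        restartPolicy: OnFailure
  - name: trainer
    replicas: 2             # illustrative count
    template:
      spec:
        containers:
        - name: trainer
          image: volcanosh/edlctr:v1
        restartPolicy: OnFailure
```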