Fault tolerant distributed training, just work version, with etcd #2849
Changes from 6 commits: 5c14ec2, c27303b, 9a7cbcf, 7cab079, 39d5857, b75f190, a585318
The first file in the diff is the uci_housing training example script:

```diff
@@ -1,5 +1,23 @@
 import paddle.v2 as paddle
 import paddle.v2.dataset.uci_housing as uci_housing
+import paddle.v2.master as master
+import os
+import cPickle as pickle
+
+etcd_ip = os.getenv("MASTER_IP", "127.0.0.1")
```
Contributor: Sorry, I missed this in the last review. Should "MASTER_IP" be changed to "ETCD_IP"? The master does not seem to have anything to do with etcd.

Contributor (Author): Currently, when paddlecloud starts a job, etcd runs inside the master pod. We will revisit this later for the case where multiple jobs share a single etcd.
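A minimal sketch of the rename the reviewer suggests; the ETCD_IP variable name and the MASTER_IP fallback are assumptions for illustration, not part of this PR:

```python
import os

# Hypothetical: prefer an explicit ETCD_IP, fall back to the existing
# MASTER_IP for compatibility with the current paddlecloud setup.
etcd_ip = os.getenv("ETCD_IP", os.getenv("MASTER_IP", "127.0.0.1"))
etcd_endpoint = "http://" + etcd_ip + ":2379"
```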
```diff
+etcd_endpoint = "http://" + etcd_ip + ":2379"
+
+
+def cloud_reader():
```
Contributor: We can put this into

Contributor (Author): Yes! Will do this in next PR.
```diff
+    print "connecting to master, etcd endpoints: ", etcd_endpoint
+    master_client = master.client(etcd_endpoint, 5, 64)
+    master_client.set_dataset(
+        ["/pfs/dlnel/public/dataset/uci_housing/uci_housing-*-of-*"])
+    while 1:
+        r, e = master_client.next_record()
+        if not r:
+            break
+        yield pickle.loads(r)
+
+
 def main():
```
```diff
@@ -22,13 +40,13 @@ def main():
     # create optimizer of new remote updater to pserver
     optimizer = paddle.optimizer.Momentum(momentum=0)

     #TODO(zhihong) : replace optimizer with new OptimizerConfig

+    print "etcd endoint: ", etcd_endpoint
     trainer = paddle.trainer.SGD(cost=cost,
                                  parameters=parameters,
                                  update_equation=optimizer,
                                  is_local=False,
-                                 pserver_spec="localhost:3000")
+                                 pserver_spec=etcd_endpoint,
+                                 use_etcd=True)

     # event_handler to print training and testing info
     def event_handler(event):
```
```diff
@@ -47,11 +65,11 @@ def event_handler(event):
                 print "Test %d, %.2f" % (event.pass_id, result.cost)

     # training
+    # NOTE: use uci_housing.train() as reader for non-paddlecloud training
     trainer.train(
         reader=paddle.batch(
             paddle.reader.shuffle(
-                uci_housing.train(), buf_size=500),
-            batch_size=2),
+                cloud_reader, buf_size=500), batch_size=2),
         feeding={'x': 0,
                  'y': 1},
         event_handler=event_handler,
```
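The NOTE comment above hints that the same script can fall back to the local uci_housing reader when it is not running on paddlecloud. Below is a small sketch of one way that switch could look; the PADDLE_CLOUD environment variable and the check itself are assumptions for illustration, not something this PR defines, and trainer, event_handler, cloud_reader and uci_housing refer to the names in the example script above.

```python
import os

# Hypothetical switch: use the etcd-backed cloud_reader on paddlecloud,
# otherwise fall back to reading the uci_housing dataset locally.
if os.getenv("PADDLE_CLOUD"):
    reader = cloud_reader          # reads records through the master / etcd
else:
    reader = uci_housing.train()   # plain local reader, no etcd required

trainer.train(
    reader=paddle.batch(
        paddle.reader.shuffle(reader, buf_size=500), batch_size=2),
    feeding={'x': 0,
             'y': 1},
    event_handler=event_handler)
```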
The second file in the diff is the NewRemoteParameterUpdater C++ implementation:

```diff
@@ -28,6 +28,17 @@ NewRemoteParameterUpdater::NewRemoteParameterUpdater(
       newGradients_(nullptr),
       pserverSpec_(pserverSpec) {}

+NewRemoteParameterUpdater::NewRemoteParameterUpdater(
+    const OptimizationConfig &config,
+    const std::string pserverSpec,
+    const bool useEtcd)
+    : trainerConfig_(config),
+      parameterClient_(-1),
+      newParameters_(nullptr),
+      newGradients_(nullptr),
+      pserverSpec_(pserverSpec),
```
Contributor: This is sometimes the pserver addresses and sometimes the etcd address. Should it be renamed to something like remoteAddr?
```diff
+      useEtcd_(useEtcd) {}
+
 void NewRemoteParameterUpdater::init(
     const std::vector<ParameterPtr> &parameters) {
   ParameterUpdater::init(parameters);
```
```diff
@@ -38,8 +49,13 @@ void NewRemoteParameterUpdater::init(
   }

   // create parameter server client.
-  parameterClient_ = paddle_new_pserver_client((char *)pserverSpec_.c_str(),
-                                               FLAGS_trainer_id == 0);
+  if (useEtcd_) {
+    parameterClient_ = paddle_new_etcd_pserver_client(
+        (char *)pserverSpec_.c_str(), FLAGS_trainer_id == 0);
+  } else {
+    parameterClient_ = paddle_new_pserver_client((char *)pserverSpec_.c_str(),
+                                                 FLAGS_trainer_id == 0);
+  }

   // init new parameter and gradient.
   newParameters_ = initNewParameter(PARAMETER_VALUE);
```
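To make the reviewer's point about pserverSpec_ concrete: the same string carries either a pserver address or an etcd endpoint, depending on useEtcd. A minimal illustration from the Python side, using only the two values that appear in this diff:

```python
# use_etcd=False: pserver_spec is the parameter server address itself,
# as in the old example code.
pserver_spec = "localhost:3000"

# use_etcd=True: pserver_spec is an etcd endpoint, and the client looks up
# the actual pserver addresses through etcd.
pserver_spec = "http://127.0.0.1:2379"
```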
The third file in the diff is the paddle.v2 trainer:

```diff
@@ -51,7 +51,8 @@ def __init__(self,
                  update_equation,
                  extra_layers=None,
                  is_local=True,
-                 pserver_spec=None):
+                 pserver_spec=None,
+                 use_etcd=True):
```
Contributor: Maybe we can make the default require fewer dependencies, by defaulting
```diff
         if not isinstance(parameters, v2_parameters.Parameters):
             raise TypeError('parameters should be parameters')
@@ -66,6 +67,7 @@ def __init__(self,
         self.__topology_in_proto__ = topology.proto()
         self.__is_local__ = is_local
         self.__pserver_spec__ = pserver_spec
+        self.__use_etcd__ = use_etcd

         self.__use_sparse_updater__ = self.__topology__.use_sparse_updater()
         # # In local mode, disable sparse_remote_update.
@@ -130,7 +132,7 @@ def train(self, reader, num_passes=1, event_handler=None, feeding=None):
         self.__parameter_updater__ = self.__optimizer__.create_updater(
             self.__is_local__, num_passes, self.__use_sparse_updater__,
-            self.__pserver_spec__)
+            self.__pserver_spec__, self.__use_etcd__)
         self.__parameter_updater__.init(self.__gradient_machine__)

         self.__gradient_machine__.start()
```
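For context, a rough sketch of how the new use_etcd flag would be used from a training script; this is not code from this PR, and cost, parameters and optimizer are assumed to be built exactly as in the example script earlier in the diff:

```python
import paddle.v2 as paddle

# Fault-tolerant mode: pserver_spec is an etcd endpoint and the pservers
# are discovered through etcd (use_etcd=True is the current default).
trainer = paddle.trainer.SGD(cost=cost,
                             parameters=parameters,
                             update_equation=optimizer,
                             is_local=False,
                             pserver_spec="http://127.0.0.1:2379",
                             use_etcd=True)

# Static mode: pserver_spec names the pserver directly and etcd is not
# required (this is what the reviewer suggests making the default).
trainer = paddle.trainer.SGD(cost=cost,
                             parameters=parameters,
                             update_equation=optimizer,
                             is_local=False,
                             pserver_spec="localhost:3000",
                             use_etcd=False)
```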
Could you remove the TODO, since it's completed?

Done.