How do I use MXNet's distributed key-value store in this framework? #5

Closed
aodhan-domhnaill opened this issue Dec 4, 2017 · 5 comments


@aodhan-domhnaill

aodhan-domhnaill commented Dec 4, 2017

Looking at your MXNet training script documentation, I see,

hosts (list[str]): The list of host names running in the SageMaker Training Job cluster.

The only way I have seen to do distributed training in MXNet is with Distributed Key-Value Stores which run on DMLC via MPI/SSH like,

$ mxnet_path/tools/launch.py -H hostfile -n 2 python myprog.py

This launch script is not something that could easily be changed.

So how am I supposed to use the hosts list you pass into my SageMaker training function? (see this too).

@winstonaws
Contributor

winstonaws commented Dec 5, 2017

From the "Distributed MXNet training" section of the readme in https://github.com/aws/sagemaker-python-sdk :

You can run a multi-machine, distributed MXNet training using the MXNet Estimator. By default,
MXNet objects will submit single-machine training jobs to SageMaker. If you set train_instance_count to be greater than one, multi-machine training jobs will be launched when fit is called. When you run multi-machine training, SageMaker will import your training script and invoke train on each host in the cluster.
When you develop MXNet distributed learning algorithms, you often want to use an MXNet kvstore to store and share model parameters. To learn more about writing distributed MXNet programs, please see Distributed Training in the MXNet docs.
When using an MXNet Estimator, SageMaker automatically starts MXNet kvstore server and scheduler processes on hosts in your training job cluster. Your script runs as an MXNet worker task. SageMaker runs one server process on each host in your cluster. One host is selected arbitrarily to run the scheduler process.

So essentially, SageMaker will set up the cluster for MXNet distributed training. However, your user code will need to specify the type of KVStore to use. The code you need to write to do this will depend on which MXNet API you are using.

One of our examples shows how to set this parameter when using the Module API: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/mxnet_mnist/mnist.py#L45-L56

You can see how the hosts parameter is used there to determine whether to use local (if running with a single machine) or dist_sync (if running with multiple machines).
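That pattern can be sketched roughly like this (a minimal sketch; `get_kvstore_type` is a hypothetical helper name, not the exact code from the linked example):

```python
def get_kvstore_type(hosts):
    # Hypothetical helper: choose an MXNet kvstore type from the
    # `hosts` list SageMaker passes to your training function.
    if len(hosts) == 1:
        return 'local'       # single machine: no parameter server needed
    return 'dist_sync'       # multiple machines: synchronous distributed kvstore
```

The resulting string would then be passed as the `kvstore` argument when fitting the Module, e.g. `mod.fit(..., kvstore=get_kvstore_type(hosts))`.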

@aodhan-domhnaill
Author

aodhan-domhnaill commented Dec 5, 2017

I tried playing with this, but when I made changes to the Gluon code and ran on a single host, I got the following error:

ValueError: Error training sagemaker-mxnet-py2-cpu-{some date and time}: Failed Reason: AlgorithmError: uncaught exception during training: [18:42:49] src/postoffice.cc:16: Check  notnull: Environment::Get()->find("DMLC_NUM_WORKER")

My naive feeling is that either SageMaker can't handle sending local scripts, or if you run on only one host, DMLC is never started, so the query for the environment variable fails. I am running on multiple hosts to check.

@jesterhazy jesterhazy self-assigned this Dec 5, 2017
@jesterhazy
Contributor

@aidan-plenert-macdonald right now our framework doesn't set the DMLC_NUM_WORKER variable when your training cluster only has one host. We'll look into setting that or at least improving the error message.

In the meantime your code ought to work if you

a. run your job on two or more instances, or
b. set the environment variable yourself before you create the kvstore:

import os

# Workaround: tell DMLC there is a single worker so the kvstore's
# environment check passes on a one-host cluster.
os.environ['DMLC_NUM_WORKER'] = '1'

@jesterhazy jesterhazy removed their assignment Dec 5, 2017
@winstonaws
Contributor

@aidan-plenert-macdonald Also, this only happens when you are using a distributed kvstore on a single-machine cluster. It looks like there's a bug in the if statement you added:
https://github.com/aidan-plenert-macdonald/amazon-sagemaker-examples/commit/8890bc600aa0632c073bea14de18494acd5540d0#diff-d392b07f45d064989278867357e33ca7R44

hosts is a list of hostnames, and in Python 2 comparing a list to an int always evaluates the list as greater, so that's how you're hitting this case. Fixing that if statement to use len(hosts) instead should also unblock you.
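For illustration, the Python 2 pitfall and the fix look roughly like this (a sketch under the assumption that the check chooses between `'local'` and `'dist_sync'`; variable names are illustrative):

```python
hosts = ['algo-1', 'algo-2']  # SageMaker passes a list of hostnames

# Buggy check: in Python 2, `hosts > 1` compares a list against an int,
# which always evaluates the list as greater, so the distributed branch
# is taken even on a single-host cluster:
#     kvstore = 'dist_sync' if hosts > 1 else 'local'

# Fixed check: compare the number of hosts instead.
kvstore = 'dist_sync' if len(hosts) > 1 else 'local'
```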

@aodhan-domhnaill
Author

Thanks! I ran it with multiple instances and it worked. I'll make those changes and submit a PR for the examples.

laurenyu pushed a commit to laurenyu/sagemaker-python-sdk that referenced this issue May 31, 2018
nmadan pushed a commit to nmadan/sagemaker-python-sdk that referenced this issue Apr 18, 2023
feature: pathways job side driver code