How do I use MXNet's distributed key-value store in this framework? #5
Comments
From the "Distributed MXNet training" section of the readme in https://github.com/aws/sagemaker-python-sdk :
So essentially, SageMaker will set up the cluster for MXNet distributed training. However, your user code will need to specify the type of KVStore to use. The code you need to write to do this will depend on which MXNet API you are using. One of our examples shows how to set this parameter when using the Module API: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/mxnet_mnist/mnist.py#L45-L56 You can see how the hosts parameter is used there to determine whether to use local (if running with a single machine) or dist_sync (if running with multiple machines).
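As a minimal sketch of the selection described above, the KVStore type can be derived from the number of entries in hosts. The function name select_kvstore_type and the host names are illustrative assumptions, not part of the SageMaker or MXNet APIs:

```python
def select_kvstore_type(hosts):
    """Return the KVStore type string for a list of host names.

    `hosts` is assumed to be the list of hostnames that SageMaker passes
    to the training function. The function name is hypothetical.
    """
    # One host: no distributed store is needed, so use the local one.
    # Several hosts: use the synchronous distributed store.
    return 'local' if len(hosts) == 1 else 'dist_sync'


# Illustrative host names only.
print(select_kvstore_type(['algo-1']))
print(select_kvstore_type(['algo-1', 'algo-2']))
```

The resulting string can then be passed as the kvstore argument, for example to Module.fit in the Module API or to gluon.Trainer in Gluon.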
I tried playing with this, but when I made changes to the Gluon code and ran it on a single host, I got the following errors,
My naive feeling is that either SageMaker can't handle sending local scripts, or, if you run on only one host, DMLC is never started, so the query for the environment variable fails. I am running on multiple hosts to check.
@aidan-plenert-macdonald right now our framework doesn't set the DMLC_NUM_WORKER variable when your training cluster only has one host. We'll look into setting that or at least improving the error message. In the meantime, your code ought to work if you a. run your job on two or more instances, or
@aidan-plenert-macdonald Also, this only happens when you are using a distributed kvstore on a single-machine cluster. It looks like there's a bug in the if statement you added: hosts is a list of hostnames, and apparently Python 2 treats lists as greater than ints, so that's how you're hitting this case. Changing that if statement to use len(hosts) instead should also unblock you.
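To illustrate the comparison pitfall, here is a small sketch. The helper name is_multi_host and the host names are made up for the example. Under Python 2, a list compared with an int evaluates as greater, so hosts > 1 is true even for a single host; under Python 3 the same comparison raises TypeError. Comparing len(hosts) behaves the same in both versions:

```python
def is_multi_host(hosts):
    """Return True when the training cluster has more than one host."""
    return len(hosts) > 1


single = ['algo-1']  # illustrative host name

try:
    # The buggy form: Python 2 evaluates this to True for any list,
    # Python 3 raises TypeError instead.
    buggy = single > 1
except TypeError:
    buggy = None  # Python 3 refuses to order a list against an int

print(is_multi_host(single))
print(is_multi_host(['algo-1', 'algo-2']))
```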
Thanks! I ran it with multiple instances and it worked. I'll make those changes and submit a PR for the examples.
Looking at your MXNet training script documentation, I see,
The only way I have seen to do distributed training in MXNet is with Distributed Key-Value Stores which run on DMLC via MPI/SSH like,
This launch script is not something that could easily be changed.
So how am I supposed to use the hosts list you pass into my SageMaker training function? (see this too).