Skip to content

Conversation

@jeet4320
Copy link
Contributor

@jeet4320 jeet4320 commented May 5, 2021

Issue #, if available:

PR Checklist

  • I've prepended PR tag with frameworks/job this applies to : [mxnet, tensorflow, pytorch] | [ei/neuron] | [build] | [test] | [benchmark] | [ec2, ecs, eks, sagemaker]
  • (If applicable) I've documented below the DLC image/dockerfile this relates to
  • (If applicable) I've documented below the tests I've run on the DLC image
  • (If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the Apache Software Foundation Third Party License Policy Category A or Category B license list. See https://www.apache.org/legal/resolved.html.
  • (If applicable) I've scanned the updated and new binaries to make sure they do not have vulnerabilities associated with them.

Pytest Marker Checklist

  • (If applicable) I have added the marker @pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
  • (If applicable) I have added the marker @pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
  • (If applicable) I have added the marker @pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
  • (If applicable) I have added the marker @pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type

EIA/NEURON Checklist

  • When creating a PR:
  • I've modified src/config/build_config.py in my PR branch by setting ENABLE_EI_MODE = True or ENABLE_NEURON_MODE = True
  • When PR is reviewed and ready to be merged:
  • I've reverted the code change on the config file mentioned above

Benchmark Checklist

  • When creating a PR:
  • I've modified src/config/test_config.py in my PR branch by setting ENABLE_BENCHMARK_DEV_MODE = True
  • When PR is reviewed and ready to be merged:
  • I've reverted the code change on the config file mentioned above

Reviewer Checklist

  • For reviewer, before merging, please cross-check:
  • I've verified the code change on the config file mentioned above has already been reverted

Description:

Tests run:

DLC image/dockerfile:

Additional context:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@jeet4320 jeet4320 changed the title Add RDMAV_FORK_SAFE [pytorch][tensorflow][build][test] Add RDMAV_FORK_SAFE May 5, 2021
docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /example, /Dockerfile., *DEVICE_TYPE ]
context:
<<: *TRAINING_CONTEXT
BuildCPUPTInferencePy3DockerImage:
Copy link
Contributor Author

@jeet4320 jeet4320 May 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be reverted, skipping inference as training has fix

@tejaschumbalkar
Copy link
Contributor

Changes looks good. Please confirm the same with EFA team as well

@jeet4320
Copy link
Contributor Author

jeet4320 commented May 5, 2021

Adding RDMAV_FORK_SAFE fixes tests, and there is no other way to fix those if EFA is installed

@karan6181
Copy link
Contributor

karan6181 commented May 5, 2021

Does the training log emits that the flag RDMAV_FORK_SAFE has been set ?

@jeet4320
Copy link
Contributor Author

jeet4320 commented May 5, 2021

Will confirm by pulling DLC image.

@karan6181
Copy link
Contributor

karan6181 commented May 5, 2021

Also please check the EFA has been selected as a provider. This (NCCL INFO NET/OFI Selected Provider is efa) is the line you should look at. Also, ensure singlenode and multi-node test passes.

@jeet4320
Copy link
Contributor Author

jeet4320 commented May 5, 2021

Verified image

root@5e4759b5785c:/# echo $RDMAV_FORK_SAFE
1

akhilmehra
akhilmehra previously approved these changes May 5, 2021
@mansimane
Copy link
Contributor

mansimane commented May 5, 2021

Changes look good to me. One question - Do you know why we do not have benchmark test for pytorch similar to tensorflow -test/dlc_tests/benchmark/sagemaker/tensorflow/training/resources/tf_sm_benchmark.py ? Is equivalent test present in some other folder?

@jeet4320
Copy link
Contributor Author

jeet4320 commented May 5, 2021

For PT, its not there

ENABLE_NEURON_MODE = False
# Frameworks for which you want to disable both builds and tests
DISABLE_FRAMEWORK_TESTS = []
DISABLE_FRAMEWORK_TESTS = ["mxnet", "huggingface_pytorch", "huggingface_tensorflow"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be reverted changes in this file

@jeet4320
Copy link
Contributor Author

jeet4320 commented May 5, 2021

All tests for TF24.1 and PT1.8.1 Passed

long_name = framework_name
short_name = frameworks[long_name]
codebuild_version = os.getenv("CODEBUILD_RESOLVED_SOURCE_VERSION")[0:7]
num_nodes = 1 if is_pr_context() else 3 if long_name != "pytorch" else 4
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will be reverted, this is needed to run multinode eks on PR

"-x HOROVOD_HIERARCHICAL_ALLREDUCE=1 "
"-x HOROVOD_FUSION_THRESHOLD=16777216 "
"-x TF_CPP_MIN_LOG_LEVEL=3 "
"-x RDMAV_FORK_SAFE=1"
Copy link
Contributor Author

@jeet4320 jeet4320 May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing RDMAV_FORK_SAFE from tests

indhub
indhub previously approved these changes May 6, 2021
@jeet4320
Copy link
Contributor Author

jeet4320 commented May 6, 2021

FAILED integration/sagemaker/test_mnist.py::test_smdataparallel_smmodelparallel_mnist[gpu-3]

rerunning it as a single test

efa_tests = [mark for mark in item.iter_markers("efa")]
if not efa_tests:
pytest.skip("Skipping non-efa tests")
if efa_tests and are_efa_tests_disabled():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will remove it, by mistake added it

Copy link
Contributor

@saimidu saimidu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved

&& rm -rf /tmp/efa \
&& rm -rf /tmp/aws-efa-installer-${EFA_VERSION}.tar.gz

RUN echo "pml = ob1" >> /opt/amazon/openmpi/etc/openmpi-mca-params.conf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: [ADDRESS LATER] Use OPEN_MPI_PATH because we have already assigned /opt/amazon/openmpi to an ARG.

RUN echo NCCL_DEBUG=INFO >> /etc/nccl.conf

ENV LD_LIBRARY_PATH=$OPEN_MPI_PATH/lib:$LD_LIBRARY_PATH
RUN echo "pml = ob1" >> /opt/amazon/openmpi/etc/openmpi-mca-params.conf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: [ADDRESS LATER] Use OPEN_MPI_PATH because we have already assigned /opt/amazon/openmpi to an ARG.

Comment on lines 126 to 131
if job_type == "training":
if framework == "tensorflow":
if framework_major_version == "2":
integration_path = f"integration/sagemaker/test_mnist.py::test_smdataparallel_smmodelparallel_mnist"
else:
integration_path = f"integration/sagemaker/test_tuning_model_dir.py"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must be reverted.

Comment on lines 55 to 56
if efa_tests and are_efa_tests_disabled():
pytest.skip('Skipping EFA tests as EFA tests are disabled.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must be reverted.

@jeet4320 jeet4320 merged commit d6f0e97 into aws:master May 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants