[pytorch][tensorflow][build][test] Add RDMAV_FORK_SAFE #1090

jeet4320 · 2021-05-05T03:02:27Z

Issue #, if available:

PR Checklist

I've prepended PR tag with frameworks/job this applies to : [mxnet, tensorflow, pytorch] | [ei/neuron] | [build] | [test] | [benchmark] | [ec2, ecs, eks, sagemaker]
(If applicable) I've documented below the DLC image/dockerfile this relates to
(If applicable) I've documented below the tests I've run on the DLC image
(If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the Apache Software Foundation Third Party License Policy Category A or Category B license list. See https://www.apache.org/legal/resolved.html.
(If applicable) I've scanned the updated and new binaries to make sure they do not have vulnerabilities associated with them.

Pytest Marker Checklist

(If applicable) I have added the marker @pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
(If applicable) I have added the marker @pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
(If applicable) I have added the marker @pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
(If applicable) I have added the marker @pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type

EIA/NEURON Checklist

When creating a PR:

I've modified src/config/build_config.py in my PR branch by setting ENABLE_EI_MODE = True or ENABLE_NEURON_MODE = True

When PR is reviewed and ready to be merged:

I've reverted the code change on the config file mentioned above

Benchmark Checklist

When creating a PR:

I've modified src/config/test_config.py in my PR branch by setting ENABLE_BENCHMARK_DEV_MODE = True

When PR is reviewed and ready to be merged:

I've reverted the code change on the config file mentioned above

Reviewer Checklist

For reviewer, before merging, please cross-check:

I've verified the code change on the config file mentioned above has already been reverted

Description:

Tests run:

DLC image/dockerfile:

Additional context:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

jeet4320 · 2021-05-05T03:30:23Z

pytorch/buildspec.yml

    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /example, /Dockerfile., *DEVICE_TYPE ]
    context:
      <<: *TRAINING_CONTEXT
-  BuildCPUPTInferencePy3DockerImage:


This will be reverted; skipping the inference build because the training image has the fix.

tejaschumbalkar · 2021-05-05T03:54:45Z

Changes look good. Please confirm the same with the EFA team as well.

jeet4320 · 2021-05-05T04:19:17Z

Adding RDMAV_FORK_SAFE fixes the tests, and there is no other way to fix them if EFA is installed.

karan6181 · 2021-05-05T04:20:43Z

Does the training log show that the RDMAV_FORK_SAFE flag has been set?

jeet4320 · 2021-05-05T04:22:30Z

Will confirm by pulling DLC image.

karan6181 · 2021-05-05T04:35:33Z

Also, please check that EFA has been selected as the provider. This (NCCL INFO NET/OFI Selected Provider is efa) is the line you should look for. Also, ensure that the single-node and multi-node tests pass.
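A minimal sketch of how that log check could be automated, assuming the training log is available as text. The marker string is the one quoted in the comment above; the helper name `selected_ofi_provider` and the sample text are hypothetical and not part of this PR.

```python
import re

# Marker text taken from the review comment; everything else here is illustrative.
PROVIDER_PATTERN = re.compile(r"NCCL INFO NET/OFI Selected Provider is (\S+)")


def selected_ofi_provider(log_text):
    """Return the provider named in the first matching NCCL line, or None."""
    match = PROVIDER_PATTERN.search(log_text)
    return match.group(1) if match else None


# Hypothetical sample: one unrelated line plus the marker line from the comment.
sample_log = "unrelated line\nNCCL INFO NET/OFI Selected Provider is efa\n"

provider = selected_ofi_provider(sample_log)
print(provider == "efa")
```

Returning the provider name rather than a boolean makes it easier to report which provider was actually picked when it is not EFA.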

jeet4320 · 2021-05-05T04:37:14Z

Verified image

root@5e4759b5785c:/# echo $RDMAV_FORK_SAFE
1

mansimane · 2021-05-05T05:22:08Z

Changes look good to me. One question: do you know why we do not have a benchmark test for PyTorch similar to the TensorFlow one (test/dlc_tests/benchmark/sagemaker/tensorflow/training/resources/tf_sm_benchmark.py)? Is an equivalent test present in some other folder?

jeet4320 · 2021-05-05T05:29:14Z

For PT, it's not there.

jeet4320 · 2021-05-05T05:41:10Z

src/config/build_config.py

 ENABLE_NEURON_MODE = False
 # Frameworks for which you want to disable both builds and tests
-DISABLE_FRAMEWORK_TESTS = []
+DISABLE_FRAMEWORK_TESTS = ["mxnet", "huggingface_pytorch", "huggingface_tensorflow"]


The changes in this file are to be reverted.

jeet4320 · 2021-05-05T15:59:20Z

All tests for TF 2.4.1 and PT 1.8.1 passed.


jeet4320 · 2021-05-06T19:15:31Z

test/testrunner.py

    long_name = framework_name
    short_name = frameworks[long_name]
    codebuild_version = os.getenv("CODEBUILD_RESOLVED_SOURCE_VERSION")[0:7]
-    num_nodes = 1 if is_pr_context() else 3 if long_name != "pytorch" else 4


Will be reverted; this is needed to run the multi-node EKS tests on the PR.
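For readers unsure how the removed line evaluates: chained conditional expressions in Python group to the right, so the framework check only applies outside a PR context. A small sketch of that grouping, where `num_nodes_for` and its boolean `pr_context` parameter are hypothetical stand-ins for the inline expression and `is_pr_context()`:

```python
def num_nodes_for(long_name, pr_context):
    """Spell out the grouping of the removed one-line expression."""
    if pr_context:
        # In a PR context the original expression always yields 1.
        return 1
    # Outside a PR context the framework name decides between 3 and 4.
    return 3 if long_name != "pytorch" else 4


# Compare against the original chained form for every combination.
for name in ("pytorch", "tensorflow"):
    for pr in (True, False):
        original = 1 if pr else 3 if name != "pytorch" else 4
        assert original == num_nodes_for(name, pr)
```

This is why removing the line matters for the PR run: with it in place, a PR context pins the node count to 1 regardless of framework.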

jeet4320 · 2021-05-06T19:15:55Z

test/dlc_tests/benchmark/sagemaker/tensorflow/training/resources/tf_sm_benchmark.py

                  "-x HOROVOD_HIERARCHICAL_ALLREDUCE=1 "
                  "-x HOROVOD_FUSION_THRESHOLD=16777216 "
-                  "-x TF_CPP_MIN_LOG_LEVEL=3 "
-                  "-x RDMAV_FORK_SAFE=1"


removing RDMAV_FORK_SAFE from tests

jeet4320 · 2021-05-06T22:34:56Z

FAILED integration/sagemaker/test_mnist.py::test_smdataparallel_smmodelparallel_mnist[gpu-3]

rerunning it as a single test

jeet4320 · 2021-05-06T23:27:06Z

test/sagemaker_tests/tensorflow/tensorflow2_training/integration/conftest.py

        efa_tests = [mark for mark in item.iter_markers("efa")]
        if not efa_tests:
            pytest.skip("Skipping non-efa tests")
+        if efa_tests and are_efa_tests_disabled():


Will remove this; it was added by mistake.

saimidu

Approved

saimidu · 2021-05-06T23:23:18Z

pytorch/training/docker/1.8/py3/cu111/Dockerfile.gpu

  && rm -rf /tmp/efa \
  && rm -rf /tmp/aws-efa-installer-${EFA_VERSION}.tar.gz

+RUN echo "pml = ob1" >> /opt/amazon/openmpi/etc/openmpi-mca-params.conf


nit: [ADDRESS LATER] Use OPEN_MPI_PATH because we have already assigned /opt/amazon/openmpi to an ARG.

saimidu · 2021-05-06T23:23:26Z

tensorflow/training/docker/2.4/py3/cu110/Dockerfile.gpu

 RUN echo NCCL_DEBUG=INFO >> /etc/nccl.conf
-
-ENV LD_LIBRARY_PATH=$OPEN_MPI_PATH/lib:$LD_LIBRARY_PATH
+RUN echo "pml = ob1" >> /opt/amazon/openmpi/etc/openmpi-mca-params.conf


nit: [ADDRESS LATER] Use OPEN_MPI_PATH because we have already assigned /opt/amazon/openmpi to an ARG.

saimidu · 2021-05-06T23:28:28Z

test/test_utils/sagemaker.py

+    if job_type == "training":
+        if framework == "tensorflow":
+            if framework_major_version == "2":
+                integration_path = f"integration/sagemaker/test_mnist.py::test_smdataparallel_smmodelparallel_mnist"
+            else:
+                integration_path = f"integration/sagemaker/test_tuning_model_dir.py"


This must be reverted.

saimidu · 2021-05-06T23:28:40Z

test/sagemaker_tests/tensorflow/tensorflow2_training/integration/conftest.py

+        if efa_tests and are_efa_tests_disabled():
+            pytest.skip('Skipping EFA tests as EFA tests are disabled.')


This must be reverted.

jeet4320 added 2 commits May 4, 2021 19:58

add RDMAV_FORK_SAFE

238c347

pytorch change

ba3413a

jeet4320 changed the title ~~Add RDMAV_FORK_SAFE~~ [pytorch][tensorflow][build][test] Add RDMAV_FORK_SAFE May 5, 2021

just build training and remove RDMAV_FORK_SAFE from tests

cad43a0

jeet4320 commented May 5, 2021

View reviewed changes

akhilmehra previously approved these changes May 5, 2021

View reviewed changes

jeet4320 commented May 5, 2021

View reviewed changes

Run only benchmark tests

c1d6266

saimidu dismissed akhilmehra’s stale review via c1d6266 May 5, 2021 17:22

saimidu and others added 11 commits May 5, 2021 10:23

Disable new builds

4f35834

update ld library path for efa

7a491cd

Merge branch 'add-RDMAV_FORK_SAFE' of github.com:jeet4320/deep-learni…

7922237

…ng-containers into add-RDMAV_FORK_SAFE * 'add-RDMAV_FORK_SAFE' of github.com:jeet4320/deep-learning-containers: Disable new builds Run only benchmark tests

config

bd07c92

fix slash

c6fd4e5

run all tests

162eeaa

run skipped eks test

0a9d4b1

run skipped eks test

7afa1a5

run skipped eks test pt and tf

79fb5f0

remove missed RDMAV_FORK_SAFE

d5f0113

update pml ob1 to openmpi

857664d

jeet4320 commented May 6, 2021

View reviewed changes

indhub previously approved these changes May 6, 2021

View reviewed changes

run failing test_smdataparallel_smmodelparallel_mnist

69bcc9d

jeet4320 dismissed indhub’s stale review via 69bcc9d May 6, 2021 22:45

jeet4320 commented May 6, 2021

View reviewed changes

saimidu reviewed May 6, 2021

View reviewed changes

revert

2363faf

saimidu approved these changes May 6, 2021

View reviewed changes

jeet4320 merged commit d6f0e97 into aws:master May 6, 2021

7 participants
