[pytorch][tensorflow][build][test] Add RDMAV_FORK_SAFE #1090
| docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /example, /Dockerfile., *DEVICE_TYPE ] | ||
| context: | ||
| <<: *TRAINING_CONTEXT | ||
| BuildCPUPTInferencePy3DockerImage: |
This will be reverted. Skipping inference, as training has the fix.
|
Changes look good. Please confirm the same with the EFA team as well. |
|
Adding RDMAV_FORK_SAFE fixes tests, and there is no other way to fix those if EFA is installed |
|
Does the training log emit that the flag |
|
Will confirm by pulling DLC image. |
|
Also, please check that EFA has been selected as a provider. This ( |
|
Verified image |
|
Changes look good to me. One question: do you know why we do not have a benchmark test for PyTorch similar to TensorFlow - |
|
For PT, it's not there.
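For readers unfamiliar with the flag discussed above: RDMAV_FORK_SAFE is an environment variable read by libibverbs, and setting it asks the library to behave as if fork support had been initialized. The following is only a minimal sketch of passing it to a child process from Python. The helper name `run_with_fork_safe` is hypothetical, and the child command is a stand-in that prints the variable, not a training job from this PR.

```python
import os
import subprocess
import sys


def run_with_fork_safe(cmd):
    """Run cmd with RDMAV_FORK_SAFE=1 added to a copy of the current environment."""
    env = dict(os.environ)          # copy, so the parent environment is untouched
    env["RDMAV_FORK_SAFE"] = "1"    # the flag under discussion in this PR
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# Stand-in child process: it only echoes the variable back.
result = run_with_fork_safe(
    [sys.executable, "-c", "import os; print(os.environ['RDMAV_FORK_SAFE'])"]
)
child_value = result.stdout.strip()
```

Setting the variable in the image (for example through an ENV line) or on the launcher command line are alternatives; which one this PR ends up using is determined by the diff, not by this sketch.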
src/config/build_config.py
Outdated
| ENABLE_NEURON_MODE = False | ||
| # Frameworks for which you want to disable both builds and tests | ||
| DISABLE_FRAMEWORK_TESTS = [] | ||
| DISABLE_FRAMEWORK_TESTS = ["mxnet", "huggingface_pytorch", "huggingface_tensorflow"] |
Changes in this file are to be reverted.
|
All tests for TF 2.4.1 and PT 1.8.1 passed. |
…ng-containers into add-RDMAV_FORK_SAFE * 'add-RDMAV_FORK_SAFE' of github.com:jeet4320/deep-learning-containers: Disable new builds Run only benchmark tests
| long_name = framework_name | ||
| short_name = frameworks[long_name] | ||
| codebuild_version = os.getenv("CODEBUILD_RESOLVED_SOURCE_VERSION")[0:7] | ||
| num_nodes = 1 if is_pr_context() else 3 if long_name != "pytorch" else 4 |
Will be reverted; this is needed to run multinode EKS tests on the PR.
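The chained conditional in the `num_nodes` line above is easy to misread, so here is a sketch of how Python parses it. The function name `pick_num_nodes` and the `pr_context` parameter are hypothetical stand-ins for the `is_pr_context()` call in the hunk; nothing here is taken from the repository beyond that one expression.

```python
def pick_num_nodes(long_name, pr_context):
    """Spell out `1 if pr_context else 3 if long_name != "pytorch" else 4`."""
    if pr_context:
        return 1    # PR runs use a single node
    if long_name != "pytorch":
        return 3    # non-PyTorch frameworks outside a PR
    return 4        # PyTorch outside a PR


def pick_num_nodes_inline(long_name, pr_context):
    """The same logic as the one-line conditional expression from the hunk."""
    return 1 if pr_context else 3 if long_name != "pytorch" else 4
```

Conditional expressions group from the right, so the one-liner reads as `1 if pr_context else (3 if long_name != "pytorch" else 4)`.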
| "-x HOROVOD_HIERARCHICAL_ALLREDUCE=1 " | ||
| "-x HOROVOD_FUSION_THRESHOLD=16777216 " | ||
| "-x TF_CPP_MIN_LOG_LEVEL=3 " | ||
| "-x RDMAV_FORK_SAFE=1" |
Removing RDMAV_FORK_SAFE from tests.
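For context, the `-x NAME=value` arguments in the hunk above are how Open MPI's `mpirun` exports environment variables to the processes it launches. Below is a minimal sketch of assembling such an argument string from a mapping. The helper `build_mpi_env_flags` is hypothetical and is not part of this repository, which concatenates string literals instead.

```python
def build_mpi_env_flags(env):
    """Return `-x NAME=value` arguments, space-separated, in insertion order."""
    return " ".join(f"-x {name}={value}" for name, value in env.items())


# Values copied from the hunk under review.
flags = build_mpi_env_flags({
    "HOROVOD_HIERARCHICAL_ALLREDUCE": 1,
    "HOROVOD_FUSION_THRESHOLD": 16777216,
    "TF_CPP_MIN_LOG_LEVEL": 3,
    "RDMAV_FORK_SAFE": 1,
})
```

Building the string from a mapping makes it straightforward to drop a single variable, such as RDMAV_FORK_SAFE, without touching the neighbouring string literals.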
|
FAILED integration/sagemaker/test_mnist.py::test_smdataparallel_smmodelparallel_mnist[gpu-3]; rerunning it as a single test. |
| efa_tests = [mark for mark in item.iter_markers("efa")] | ||
| if not efa_tests: | ||
| pytest.skip("Skipping non-efa tests") | ||
| if efa_tests and are_efa_tests_disabled(): |
Will remove it; it was added by mistake.
saimidu
left a comment
Approved
| && rm -rf /tmp/efa \ | ||
| && rm -rf /tmp/aws-efa-installer-${EFA_VERSION}.tar.gz | ||
|
|
||
| RUN echo "pml = ob1" >> /opt/amazon/openmpi/etc/openmpi-mca-params.conf |
nit: [ADDRESS LATER] Use OPEN_MPI_PATH because we have already assigned /opt/amazon/openmpi to an ARG.
| RUN echo NCCL_DEBUG=INFO >> /etc/nccl.conf | ||
|
|
||
| ENV LD_LIBRARY_PATH=$OPEN_MPI_PATH/lib:$LD_LIBRARY_PATH | ||
| RUN echo "pml = ob1" >> /opt/amazon/openmpi/etc/openmpi-mca-params.conf |
nit: [ADDRESS LATER] Use OPEN_MPI_PATH because we have already assigned /opt/amazon/openmpi to an ARG.
test/test_utils/sagemaker.py
Outdated
| if job_type == "training": | ||
| if framework == "tensorflow": | ||
| if framework_major_version == "2": | ||
| integration_path = f"integration/sagemaker/test_mnist.py::test_smdataparallel_smmodelparallel_mnist" | ||
| else: | ||
| integration_path = f"integration/sagemaker/test_tuning_model_dir.py" |
This must be reverted.
| if efa_tests and are_efa_tests_disabled(): | ||
| pytest.skip('Skipping EFA tests as EFA tests are disabled.') |
This must be reverted.
Issue #, if available:
PR Checklist
Pytest Marker Checklist
@pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
@pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
@pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
@pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type
EIA/NEURON Checklist
src/config/build_config.py in my PR branch by setting ENABLE_EI_MODE = True or ENABLE_NEURON_MODE = True
Benchmark Checklist
src/config/test_config.py in my PR branch by setting ENABLE_BENCHMARK_DEV_MODE = True
Reviewer Checklist
Description:
Tests run:
DLC image/dockerfile:
Additional context:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.