[huggingface_pytorch, huggingface_tensorflow][build] Huggingface inference DLC #1077
Conversation
huggingface/pytorch/buildspec.yml
Outdated
tag_python_version: &TAG_PYTHON_VERSION py36
os_version: &OS_VERSION ubuntu18.04
transformers_version: &TRANSFORMERS_VERSION 4.6.0
inference_toolkit_version: &INFERENCE_TOOLKIT_VERSION 1.0.0
Not sure if we should put this in the Dockerfile or in the buildspec.yaml.
Good point. Question of usability of Dockerfiles vs code manageability as this version number is used in all the containers. Would recommend leaving it here for now and moving it into Dockerfiles if need arises.
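For context, the YAML anchors in buildspec.yml are what let a single version number feed every image definition in the file. A minimal sketch of that mechanism — the image name and build_args below are illustrative, not the actual buildspec contents:

```yaml
# Sketch of YAML anchor reuse in a buildspec-style file; the image entry
# and build_args are hypothetical examples, not the real buildspec.
transformers_version: &TRANSFORMERS_VERSION 4.6.0

images:
  BuildHuggingFacePytorchCPUInferenceImage:
    build_args:
      TRANSFORMERS_VERSION: *TRANSFORMERS_VERSION   # resolves to 4.6.0
```

Moving the version into each Dockerfile would make individual images self-describing, at the cost of updating it in several places.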
…00 and version for git script (aws#1069)
… to support EFA (aws#1075) * [tensorflow, pytorch][build][sagemaker] Updated smdataparallel binary to support EFA Co-authored-by: Jeetendra Patil <[email protected]>
* [test] Fix smclarify test * Fix failing pytorch sanity test
@@ -0,0 +1,5 @@
vmargs=-XX:+UseContainerSupport -XX:InitialRAMPercentage=8.0 -XX:MaxRAMPercentage=10.0 -XX:-UseLargePages -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError
-XX:-UseContainerSupport? Did we test this configuration?
Could we address this comment?
Yes, I have done all the current testing with this configuration so far.
can you add the suggestion?
Suggested change: replace the option -XX:+UseContainerSupport above with -XX:-UseContainerSupport.
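Applied to the vmargs file quoted above, the suggested change would make the entry read as follows (a sketch of the reviewer's suggestion, not a tested configuration):

```properties
# vmargs with container support disabled, per the suggestion above;
# all other JVM flags are unchanged from the original line.
vmargs=-XX:-UseContainerSupport -XX:InitialRAMPercentage=8.0 -XX:MaxRAMPercentage=10.0 -XX:-UseLargePages -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError
```

Note that with -XX:-UseContainerSupport the JVM sizes its heap from host memory rather than the container's cgroup limit, so the RAM-percentage flags apply to the host's total memory.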
…to all SM remote tests (aws#1089)
…nd PT1.8.1 (aws#1081) Co-authored-by: Sai Parthasarathy Miduthuri <[email protected]> Co-authored-by: Tejas Chumbalkar <[email protected]>
…test script from ecr image (aws#1104)
* update TS to 0.4.0 for inference PT1.8.1 * enable safety test * revert back Co-authored-by: Tejas Chumbalkar <[email protected]>
@saimidu can you take a look at why the build fails?
@@ -9,6 +9,7 @@ ARG MMS_VERSION=1.1.2
 ARG PYTHON=python3
 ARG PYTHON_VERSION=3.6.10
 ARG HEALTH_CHECK_VERSION=1.7.0
+ARG OPENSSL_VERSION=1.1.1k
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we modifying MXNet containers as part of this PR? If this is unintentional, could we revert it?
@@ -15,8 +15,7 @@ ARG MX_URL=https://aws-mxnet-pypi.s3-us-west-2.amazonaws.com/1.6.0/aws_mxnet_mkl
 ARG PYTHON=python
 ARG PYTHON_PIP=python-pip
 ARG PIP=pip
-ARG OPENSSL_VERSION=1.1.1g
+ARG OPENSSL_VERSION=1.1.1k
Same here. Could we revert this change? If it was intentional, we could pull it in as a separate PR.
@@ -1,8 +1,9 @@
 account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
 region: &REGION <set-$REGION-in-environment>
 framework: &FRAMEWORK mxnet
-version: &VERSION 1.5.1
 os_version: &OS_VERSION ubuntu16.04
+version: &VERSION 1.8.0
Is this change required for HF containers?
@@ -0,0 +1,145 @@
FROM ubuntu:18.04
Is this change required for HF containers?
The PR seems to include unintentional non-HF changes. We might need to rebase the PR.
Issue #, if available:
PR Checklist
Pytest Marker Checklist

- I have added @pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
- I have added @pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
- I have added @pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
- I have added @pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type

EIA/NEURON Checklist

- I have modified src/config/build_config.py in my PR branch by setting ENABLE_EI_MODE = True or ENABLE_NEURON_MODE = True

Benchmark Checklist

- I have modified src/config/test_config.py in my PR branch by setting ENABLE_BENCHMARK_DEV_MODE = True

Reviewer Checklist
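As a concrete illustration of the marker conventions in the checklist above, a hypothetical DLC test might be decorated like this (the model name, feature, and processor values are examples, not prescribed values):

```python
import pytest

# Hypothetical test showing marker placement per the PR checklist;
# a real DLC test would exercise the container, this stub does not.
@pytest.mark.model("bert-base-uncased")
@pytest.mark.integration("huggingface_inference_toolkit")
@pytest.mark.processor("cpu")
def test_hf_inference_stub():
    assert True

# pytest attaches the marks to the function's `pytestmark` attribute,
# which the test runner uses for filtering (e.g. `pytest -m "processor"`).
marks = {m.name: m.args for m in test_hf_inference_stub.pytestmark}
print(sorted(marks))  # -> ['integration', 'model', 'processor']
```

Custom marker names like these should also be registered in pytest.ini or conftest.py to avoid unknown-marker warnings at collection time.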
Description:
This PR introduces new Hugging Face Deep Learning Containers for Inference. It contains CPU and GPU images for PyTorch and TensorFlow. I also tried to adjust the buildspec.yaml. Looking forward to your feedback.
Tests run:
DLC image/dockerfile:
The Hugging Face DLCs
Additional context:
A docker build is not yet possible, since the sagemaker_huggingface_inference_toolkit is not released. But this PR is ready for review.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.