[huggingface_pytorch, huggingface_tensorflow][build] Huggingface inference DLC #1077
Conversation
huggingface/pytorch/buildspec.yml
Outdated
tag_python_version: &TAG_PYTHON_VERSION py36
os_version: &OS_VERSION ubuntu18.04
transformers_version: &TRANSFORMERS_VERSION 4.6.0
inference_toolkit_version: &INFERENCE_TOOLKIT_VERSION 1.0.0
Not sure if we should put this in the Dockerfile or in the buildspec.yaml.
Good point. Question of usability of Dockerfiles vs code manageability as this version number is used in all the containers. Would recommend leaving it here for now and moving it into Dockerfiles if need arises.
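For context, the YAML anchors in buildspec.yml are what let a single version number feed every image definition in the file. A minimal sketch of that mechanism — the image name and build_args below are illustrative, not the actual buildspec contents:

```yaml
# Sketch of YAML anchor reuse in a buildspec-style file; the image entry
# and build_args are hypothetical examples, not the real buildspec.
transformers_version: &TRANSFORMERS_VERSION 4.6.0

images:
  BuildHuggingFacePytorchCPUInferenceImage:
    build_args:
      TRANSFORMERS_VERSION: *TRANSFORMERS_VERSION   # resolves to 4.6.0
```

Moving the version into each Dockerfile would make individual images self-describing, at the cost of updating it in several places.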
…00 and version for git script (aws#1069)
… to support EFA (aws#1075) * [tensorflow, pytorch][build][sagemaker] Updated smdataparallel binary to support EFA Co-authored-by: Jeetendra Patil <[email protected]>
* [test] Fix smclarify test * Fix failing pytorch sanity test
@@ -0,0 +1,5 @@
vmargs=-XX:+UseContainerSupport -XX:InitialRAMPercentage=8.0 -XX:MaxRAMPercentage=10.0 -XX:-UseLargePages -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError
-XX:-UseContainerSupport? Did we test this configuration?
Could we address this comment?
Yes, I have done all the current testing with this configuration so far.
can you add the suggestion?
Suggested change: replace the option -XX:+UseContainerSupport above with -XX:-UseContainerSupport.
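Applied to the vmargs file quoted above, the suggested change would make the entry read as follows (a sketch of the reviewer's suggestion, not a tested configuration):

```properties
# vmargs with container support disabled, per the suggestion above;
# all other JVM flags are unchanged from the original line.
vmargs=-XX:-UseContainerSupport -XX:InitialRAMPercentage=8.0 -XX:MaxRAMPercentage=10.0 -XX:-UseLargePages -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError
```

Note that with -XX:-UseContainerSupport the JVM sizes its heap from host memory rather than the container's cgroup limit, so the RAM-percentage flags apply to the host's total memory.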
…to all SM remote tests (aws#1089)
…nd PT1.8.1 (aws#1081) Co-authored-by: Sai Parthasarathy Miduthuri <[email protected]> Co-authored-by: Tejas Chumbalkar <[email protected]>
…test script from ecr image (aws#1104)
* update TS to 0.4.0 for inference PT1.8.1 * enable safety test * revert back Co-authored-by: Tejas Chumbalkar <[email protected]>
@saimidu can you take a look at why the build fails?
@@ -9,6 +9,7 @@ ARG MMS_VERSION=1.1.2
 ARG PYTHON=python3
 ARG PYTHON_VERSION=3.6.10
 ARG HEALTH_CHECK_VERSION=1.7.0
+ARG OPENSSL_VERSION=1.1.1k
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we modifying MXNet containers as part of this PR? If this is unintentional, could we revert it?
@@ -15,8 +15,7 @@ ARG MX_URL=https://aws-mxnet-pypi.s3-us-west-2.amazonaws.com/1.6.0/aws_mxnet_mkl
 ARG PYTHON=python
 ARG PYTHON_PIP=python-pip
 ARG PIP=pip
-ARG OPENSSL_VERSION=1.1.1g
+ARG OPENSSL_VERSION=1.1.1k
Same here. Could we revert this change? If it was intentional, we could pull it in as a separate PR.
@@ -1,8 +1,9 @@
 account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
 region: &REGION <set-$REGION-in-environment>
 framework: &FRAMEWORK mxnet
-version: &VERSION 1.5.1
 os_version: &OS_VERSION ubuntu16.04
+version: &VERSION 1.8.0
Is this change required for HF containers?
@@ -0,0 +1,145 @@
FROM ubuntu:18.04
Is this change required for HF containers?
The PR seems to include unintentional non-HF changes. We might need to rebase the PR.
Issue #, if available:
PR Checklist
Pytest Marker Checklist

- I have added @pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
- I have added @pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
- I have added @pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
- I have added @pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type

EIA/NEURON Checklist

- I have modified src/config/build_config.py in my PR branch by setting ENABLE_EI_MODE = True or ENABLE_NEURON_MODE = True

Benchmark Checklist

- I have modified src/config/test_config.py in my PR branch by setting ENABLE_BENCHMARK_DEV_MODE = True

Reviewer Checklist
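As a concrete illustration of the marker conventions in the checklist above, a hypothetical DLC test might be decorated like this (the model name, feature, and processor values are examples, not prescribed values):

```python
import pytest

# Hypothetical test showing marker placement per the PR checklist;
# a real DLC test would exercise the container, this stub does not.
@pytest.mark.model("bert-base-uncased")
@pytest.mark.integration("huggingface_inference_toolkit")
@pytest.mark.processor("cpu")
def test_hf_inference_stub():
    assert True

# pytest attaches the marks to the function's `pytestmark` attribute,
# which the test runner uses for filtering (e.g. `pytest -m "processor"`).
marks = {m.name: m.args for m in test_hf_inference_stub.pytestmark}
print(sorted(marks))  # -> ['integration', 'model', 'processor']
```

Custom marker names like these should also be registered in pytest.ini or conftest.py to avoid unknown-marker warnings at collection time.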
Description:
This PR introduces new Hugging Face Deep Learning Containers for Inference. It contains CPU and GPU images for PyTorch and TensorFlow. I also tried to adjust the buildspec.yaml. Looking forward to your feedback.
Tests run:
DLC image/dockerfile:
The Hugging Face DLCs
Additional context:
A docker build is not yet possible, since the sagemaker_huggingface_inference_toolkit is not released. But this PR is ready for review.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.