
[huggingface_pytorch, huggingface_tensorflow][build] Huggingface inference DLC #1077


Closed
wants to merge 58 commits into from

Conversation

philschmid
Contributor

Issue #, if available:

PR Checklist

  • I've prepended PR tag with frameworks/job this applies to : [mxnet, tensorflow, pytorch] | [ei/neuron] | [build] | [test] | [benchmark] | [ec2, ecs, eks, sagemaker]
  • (If applicable) I've documented below the DLC image/dockerfile this relates to
  • (If applicable) I've documented below the tests I've run on the DLC image
  • (If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the Apache Software Foundation Third Party License Policy Category A or Category B license list. See https://www.apache.org/legal/resolved.html.
  • (If applicable) I've scanned the updated and new binaries to make sure they do not have vulnerabilities associated with them.

Pytest Marker Checklist

  • (If applicable) I have added the marker @pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
  • (If applicable) I have added the marker @pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
  • (If applicable) I have added the marker @pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
  • (If applicable) I have added the marker @pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type
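The markers above can be stacked on a single test function; a minimal illustrative sketch (the test name, model, and feature tag are hypothetical, chosen only to show the marker syntax):

```python
import pytest

@pytest.mark.model("bert-base-uncased")            # hypothetical model name
@pytest.mark.integration("huggingface_inference")  # hypothetical feature tag
@pytest.mark.processor("gpu")
def test_hf_gpu_inference():
    pass  # a real test body would exercise the DLC image
```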

EIA/NEURON Checklist

  • When creating a PR:
  • I've modified src/config/build_config.py in my PR branch by setting ENABLE_EI_MODE = True or ENABLE_NEURON_MODE = True
  • When PR is reviewed and ready to be merged:
  • I've reverted the code change on the config file mentioned above

Benchmark Checklist

  • When creating a PR:
  • I've modified src/config/test_config.py in my PR branch by setting ENABLE_BENCHMARK_DEV_MODE = True
  • When PR is reviewed and ready to be merged:
  • I've reverted the code change on the config file mentioned above
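For reference, the toggles named in the EIA/NEURON and Benchmark checklists above are presumably plain module-level booleans; a sketch assuming that shape (file paths come from the checklists, default values are assumed):

```python
# Sketch of the feature toggles referenced in the checklists above.
# Paths and names are taken from the checklist text; defaults are assumed.

# src/config/build_config.py
ENABLE_EI_MODE = False      # set to True in the PR branch for EI changes
ENABLE_NEURON_MODE = False  # set to True in the PR branch for Neuron changes

# src/config/test_config.py
ENABLE_BENCHMARK_DEV_MODE = False  # set to True while benchmarking; revert before merge
```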

Reviewer Checklist

  • For reviewer, before merging, please cross-check:
  • I've verified the code change on the config file mentioned above has already been reverted

Description:

This PR introduces new Hugging Face Deep Learning Containers (DLCs) for inference. It contains CPU and GPU images for both PyTorch and TensorFlow. I have also adjusted the buildspec.yaml.

Looking forward to your feedback.

Tests run:

DLC image/dockerfile:

The Hugging Face DLCs

Additional context:

A docker build is not yet possible, since the sagemaker_huggingface_inference_toolkit has not been released yet, but this PR is ready for review.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

tag_python_version: &TAG_PYTHON_VERSION py36
os_version: &OS_VERSION ubuntu18.04
transformers_version: &TRANSFORMERS_VERSION 4.6.0
inference_toolkit_version: &INFERENCE_TOOLKIT_VERSION 1.0.0
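For context, the `&NAME` syntax in the lines above defines YAML anchors that other parts of the buildspec can reference via `*NAME` aliases, so the version is written once and reused; a hypothetical sketch (the `images` section below is illustrative, not the actual buildspec layout):

```yaml
transformers_version: &TRANSFORMERS_VERSION 4.6.0
images:
  BuildHuggingFacePytorchCpuInferenceImage:      # hypothetical key
    transformers_version: *TRANSFORMERS_VERSION  # alias resolves to 4.6.0
```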
Contributor Author


Not sure if we should put this in the Dockerfile or in the buildspec.yaml.

Contributor


Good point. It's a question of Dockerfile usability vs. code manageability, since this version number is used in all the containers. I'd recommend leaving it here for now and moving it into the Dockerfiles if the need arises.

@@ -0,0 +1,5 @@
vmargs=-XX:+UseContainerSupport -XX:InitialRAMPercentage=8.0 -XX:MaxRAMPercentage=10.0 -XX:-UseLargePages -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError
Contributor


-XX:-UseContainerSupport? Did we test this configuration?

Contributor


Could we address this comment?

Contributor Author


Yes, I have done all the current testing with this configuration so far.

Contributor Author


Can you add the suggestion?

Contributor


Replace the option -XX:+UseContainerSupport above with -XX:-UseContainerSupport (the + changes to -).
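Applying that suggestion, the vmargs line quoted earlier would become (only the UseContainerSupport sign flips; all other flags are unchanged):

```properties
vmargs=-XX:-UseContainerSupport -XX:InitialRAMPercentage=8.0 -XX:MaxRAMPercentage=10.0 -XX:-UseLargePages -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError
```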


lxning and others added 3 commits June 1, 2021 10:46
* update TS to 0.4.0 for inference PT1.8.1

* enable safety test

* revert back

Co-authored-by: Tejas Chumbalkar <[email protected]>
@philschmid
Contributor Author

@saimidu can you take a look at why the build fails?

@@ -9,6 +9,7 @@ ARG MMS_VERSION=1.1.2
ARG PYTHON=python3
ARG PYTHON_VERSION=3.6.10
ARG HEALTH_CHECK_VERSION=1.7.0
+ARG OPENSSL_VERSION=1.1.1k
Contributor


Why are we modifying MXNet containers as a part of this PR? If this is unintentional, could we revert this back?

@@ -15,8 +15,7 @@ ARG MX_URL=https://aws-mxnet-pypi.s3-us-west-2.amazonaws.com/1.6.0/aws_mxnet_mkl
ARG PYTHON=python
ARG PYTHON_PIP=python-pip
ARG PIP=pip

-ARG OPENSSL_VERSION=1.1.1g
+ARG OPENSSL_VERSION=1.1.1k
Contributor


Same here. Could we revert this change? If this was intentional, we could pull it in as a separate PR.

@@ -1,8 +1,9 @@
account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
region: &REGION <set-$REGION-in-environment>
framework: &FRAMEWORK mxnet
-version: &VERSION 1.5.1
os_version: &OS_VERSION ubuntu16.04
+version: &VERSION 1.8.0
Contributor


Is this change required for HF containers?

@@ -0,0 +1,145 @@
FROM ubuntu:18.04
Contributor


Is this change required for HF containers?

Contributor

@vdantu left a comment


The PR seems to include unintentional non-HF changes. We might need to rebase the PR.
