[PyTorch][Training][EC2][SageMaker]PyTorch 2.7.0 Currency Release #4799

Open: wants to merge 90 commits into base: master
90 commits
7284509
Pytorch 2.7 Training EC2/SM
Apr 28, 2025
c0397e0
do build true
Apr 28, 2025
6f6ce0a
do build true
Apr 28, 2025
09641b0
added tests
Apr 28, 2025
9bba30f
added build spec pointer:
Apr 28, 2025
0994e14
Moved to py3
Apr 28, 2025
ff1eda1
fixed docker repo name
Apr 28, 2025
0b67994
fixed args
Apr 28, 2025
5a29d7f
fixed args
Apr 28, 2025
89dbd84
fixed versions
Apr 29, 2025
0028897
Merge branch 'master' into pt-training2.7
Jyothirmaikottu Apr 29, 2025
7db4758
image build size
Apr 29, 2025
43249b2
Merge branch 'master' into pt-training2.7
Jyothirmaikottu Apr 29, 2025
151ff18
image build logic
Apr 29, 2025
c63d3db
revert max workers
Apr 30, 2025
2310202
Merge branch 'master' into pt-training2.7
Jyothirmaikottu Apr 30, 2025
df74472
Merge branch 'master' into pt-training2.7
Jyothirmaikottu Apr 30, 2025
9cf503b
fixed versions
Apr 30, 2025
a4f7b42
fixed pip
Apr 30, 2025
7e67580
fixed torch version
Apr 30, 2025
17ffc3f
fixed torch version
Apr 30, 2025
8fee3d8
fixed torch version
Apr 30, 2025
3041bb6
fixed flash attn
Apr 30, 2025
19e472a
fixed flash attn
May 1, 2025
95c7007
fixed s3 bucket issue
May 1, 2025
bc5152d
version changes
May 1, 2025
8a3f86f
dobuild true
May 1, 2025
11ee1b7
pip command
May 1, 2025
2885376
pip command versions
May 2, 2025
04a17fc
added allowlists
May 2, 2025
d4d43ae
removed allowlists
May 2, 2025
b04deef
fixed sitecustomize
May 2, 2025
c0b4be2
fixed nccl socket
May 4, 2025
8f254dc
removed jinja version
May 5, 2025
f458216
removed jinja2
May 5, 2025
f96b30f
Merge branch 'master' into pt-training2.7
Jyothirmaikottu May 5, 2025
8e3e190
commented sitecustomize and added processpoolexec logic
May 5, 2025
3f18dba
fixed nccl
May 5, 2025
e1c194b
logging
May 5, 2025
72704b5
docker logs
May 5, 2025
89f25a3
reverted thread pool logic
May 5, 2025
5aaccc0
Merge branch 'master' into pt-training2.7
Jyothirmaikottu May 5, 2025
4d950fc
jinja version
May 6, 2025
0a12a08
build with sitecustomize
May 6, 2025
0cbb925
build with sitecustomize
May 6, 2025
8a08db9
build with sitecustomize
May 6, 2025
45da8a4
reorganized
May 6, 2025
62b6b59
added python short version
May 6, 2025
4b81980
removed build true
May 7, 2025
89106d2
test package imports
May 7, 2025
5550587
Rebuild with specific versions
May 7, 2025
09b1cb8
Rebuild with index url
May 8, 2025
1673595
split installation
May 8, 2025
7329d89
fixing pip
May 8, 2025
15fef74
modified dlc template
May 8, 2025
2abde26
reverted debug
May 8, 2025
640e1b2
changed image size
May 9, 2025
13cedcc
debug mode
May 9, 2025
7077bca
Merge branch 'master' into ptTraining2.7
Jyothirmaikottu May 9, 2025
3533652
increased build size
May 9, 2025
cab9ce2
Merge branch 'master' into ptTraining2.7
Jyothirmaikottu May 9, 2025
be664f4
Merge branch 'master' into ptTraining2.7
Jyothirmaikottu May 12, 2025
d3ce116
do build false and eks test
May 13, 2025
6b89165
template changes
May 13, 2025
ab2048a
fix torch data
May 13, 2025
db70833
test utlity imports
May 13, 2025
e9f9b99
added new test for pytorch imports
May 13, 2025
badb254
new code to test pytorch import
May 13, 2025
95328af
revert new test for utility import
May 13, 2025
d475aae
modified site customize
May 13, 2025
4e0b552
removed sleep timer
May 14, 2025
3e391d3
Merge branch 'master' into ptTraining2.7
Jyothirmaikottu May 14, 2025
e09b9d0
removed if=main logic from dlc_template, retrying build
May 14, 2025
619dc3d
made changes to telemetry test added timer
May 14, 2025
10eca04
increased timer
May 15, 2025
d43cd76
add env var
May 15, 2025
c170002
added new logic for import test
May 15, 2025
754bd0e
added new logic for import test
May 15, 2025
94a7f52
added testing logic
May 15, 2025
84bc352
log exception
May 15, 2025
ab49659
added custom timeout
May 15, 2025
567ad3e
Merge branch 'master' into ptTraining2.7
Jyothirmaikottu May 15, 2025
e94a5bd
increased waiter
May 19, 2025
5890331
Merge branch 'master' into ptTraining2.7
Jyothirmaikottu May 20, 2025
8fb6f2f
build sm image
May 20, 2025
724b3a4
fixed build errors -sm
May 20, 2025
2e335a5
Merge branch 'master' into ptTraining2.7
Jyothirmaikottu May 20, 2025
48b9f64
making do build false to test
May 20, 2025
8e3635d
sm build with logs
May 21, 2025
37590e8
Merge branch 'master' into ptTraining2.7
Jyothirmaikottu May 21, 2025
7 changes: 4 additions & 3 deletions dlc_developer_config.toml
@@ -36,13 +36,14 @@ deep_canary_mode = false

 [build]
 # Add in frameworks you would like to build. By default, builds are disabled unless you specify building an image.

 # available frameworks - ["base", "autogluon", "huggingface_tensorflow", "huggingface_pytorch", "huggingface_tensorflow_trcomp", "huggingface_pytorch_trcomp", "pytorch_trcomp", "tensorflow", "pytorch", "stabilityai_pytorch"]
-build_frameworks = []
+build_frameworks = ["pytorch"]
+

 # By default we build both training and inference containers. Set true/false values to determine which to build.
 build_training = true
-build_inference = true
+build_inference = false

 # Set do_build to "false" to skip builds and test the latest image built by this PR
 # Note: at least one build is required to set do_build to "false"
@@ -104,7 +105,7 @@ use_scheduler = false
 ### TRAINING PR JOBS ###

 # Standard Framework Training
-dlc-pr-pytorch-training = ""
+dlc-pr-pytorch-training = "pytorch/training/buildspec-2-7-sm.yml"
 dlc-pr-tensorflow-2-training = ""
 dlc-pr-autogluon-training = ""

45 changes: 31 additions & 14 deletions miscellaneous_scripts/dlc_template.py
@@ -1,14 +1,31 @@
-import os
-
-try:
-    if os.path.exists("/usr/local/bin/deep_learning_container.py") and (
-        os.getenv("OPT_OUT_TRACKING") is None or os.getenv("OPT_OUT_TRACKING", "").lower() != "true"
-    ):
-        import threading
-
-        cmd = "python /usr/local/bin/deep_learning_container.py --framework {FRAMEWORK} --framework-version {FRAMEWORK_VERSION} --container-type {CONTAINER_TYPE} &>/dev/null"
-        x = threading.Thread(target=lambda: os.system(cmd))
-        x.setDaemon(True)
-        x.start()
-except Exception:
-    pass
+def main():
+    import os
+
+    if os.getenv("OPT_OUT_TRACKING", "").lower() == "true":
+        return
+
+    try:
+        if os.path.exists("/usr/local/bin/deep_learning_container.py"):
+            import sys
+            import subprocess
+
+            subprocess.Popen(
+                [
+                    sys.executable,
+                    "/usr/local/bin/deep_learning_container.py",
+                    "--framework",
+                    "{FRAMEWORK}",
+                    "--framework-version",
+                    "{FRAMEWORK_VERSION}",
+                    "--container-type",
+                    "{CONTAINER_TYPE}",
+                ],
+                stdout=subprocess.DEVNULL,
+                stderr=subprocess.DEVNULL,
+                start_new_session=True,
+            )
+    except:
+        pass
+
+
+main()
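The rewritten template replaces a daemon thread running `os.system` with a detached `subprocess.Popen` call. The same pattern is sketched below in isolation. `launch_detached` is a hypothetical name, the child command is a harmless stand-in for the telemetry script, and `start_new_session` only has an effect on POSIX systems.

```python
import os
import subprocess
import sys


def launch_detached(args, opt_out_var="OPT_OUT_TRACKING"):
    """Start a fire-and-forget child unless the opt-out variable is "true".

    Returns the Popen handle, or None when opted out or when the launch fails.
    """
    if os.getenv(opt_out_var, "").lower() == "true":
        return None
    try:
        return subprocess.Popen(
            args,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            start_new_session=True,  # POSIX: the child gets its own session
        )
    except OSError:
        return None


# Stand-in child process; the real template runs deep_learning_container.py.
os.environ["OPT_OUT_TRACKING"] = "true"
skipped = launch_detached([sys.executable, "-c", "pass"])

os.environ["OPT_OUT_TRACKING"] = "false"
child = launch_detached([sys.executable, "-c", "pass"])
if child is not None:
    child.wait()  # waited on here only to keep the example tidy
```

Passing an argument list avoids shell parsing, and `sys.executable` runs the child under the same interpreter rather than whichever `python` comes first on `PATH`.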
72 changes: 72 additions & 0 deletions pytorch/training/buildspec-2-7-ec2.yml
@@ -0,0 +1,72 @@
+account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
+prod_account_id: &PROD_ACCOUNT_ID 763104351884
+region: &REGION <set-$REGION-in-environment>
+framework: &FRAMEWORK pytorch
+version: &VERSION 2.7.0
+short_version: &SHORT_VERSION "2.7"
+arch_type: x86
+# autopatch_build: "True"
+
+repository_info:
+  training_repository: &TRAINING_REPOSITORY
+    image_type: &TRAINING_IMAGE_TYPE training
+    root: !join [ *FRAMEWORK, "/", *TRAINING_IMAGE_TYPE ]
+    repository_name: &REPOSITORY_NAME !join [ pr, "-", *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE ]
+    repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]
+    release_repository_name: &RELEASE_REPOSITORY_NAME !join [ *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE ]
+    release_repository: &RELEASE_REPOSITORY !join [ *PROD_ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *RELEASE_REPOSITORY_NAME ]
+
+context:
+  training_context: &TRAINING_CONTEXT
+    start_cuda_compat:
+      source: docker/build_artifacts/start_cuda_compat.sh
+      target: start_cuda_compat.sh
+    dockerd_entrypoint:
+      source: docker/build_artifacts/dockerd_entrypoint.sh
+      target: dockerd_entrypoint.sh
+    changehostname:
+      source: docker/build_artifacts/changehostname.c
+      target: changehostname.c
+    start_with_right_hostname:
+      source: docker/build_artifacts/start_with_right_hostname.sh
+      target: start_with_right_hostname.sh
+    example_mnist_file:
+      source: docker/build_artifacts/mnist.py
+      target: mnist.py
+    deep_learning_container:
+      source: ../../src/deep_learning_container.py
+      target: deep_learning_container.py
+
+images:
+  BuildEC2CPUPTTrainPy3DockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &PYTORCH_CPU_TRAINING_PY3 false
+    image_size_baseline: 6500
+    device_type: &DEVICE_TYPE cpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py312
+    os_version: &OS_VERSION ubuntu22.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION, "-ec2" ]
+    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION, "-ec2" ]
+    # build_tag_override: "True"
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /Dockerfile., *DEVICE_TYPE ]
+    target: ec2
+    context:
+      <<: *TRAINING_CONTEXT
+  BuildEC2GPUPTTrainPy3cu128DockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &PYTORCH_GPU_TRAINING_PY3 false
+    image_size_baseline: 24000
+    device_type: &DEVICE_TYPE gpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py312
+    cuda_version: &CUDA_VERSION cu128
+    os_version: &OS_VERSION ubuntu22.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-ec2" ]
+    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-ec2" ]
+    # build_tag_override: "True"
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *CUDA_VERSION, /Dockerfile.,
+                         *DEVICE_TYPE ]
+    target: ec2
+    context:
+      <<: *TRAINING_CONTEXT
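The `!join` entries appear to concatenate their list items with no separator, since the dashes and slashes are written out explicitly. Under that assumption, the GPU image's tag and Dockerfile path would resolve as sketched here. `join` is a hypothetical stand-in for the repository's YAML constructor, which this diff does not show.

```python
def join(*parts: str) -> str:
    """Stand-in for the buildspec's !join tag: plain concatenation, no separator."""
    return "".join(parts)


# Anchor values copied from the GPU image entry in buildspec-2-7-ec2.yml.
VERSION = "2.7.0"
SHORT_VERSION = "2.7"
DEVICE_TYPE = "gpu"
DOCKER_PYTHON_VERSION = "py3"
TAG_PYTHON_VERSION = "py312"
CUDA_VERSION = "cu128"
OS_VERSION = "ubuntu22.04"

tag = join(
    VERSION, "-", DEVICE_TYPE, "-", TAG_PYTHON_VERSION, "-",
    CUDA_VERSION, "-", OS_VERSION, "-ec2",
)
docker_file = join(
    "docker/", SHORT_VERSION, "/", DOCKER_PYTHON_VERSION, "/",
    CUDA_VERSION, "/Dockerfile.", DEVICE_TYPE,
)

print(tag)          # 2.7.0-gpu-py312-cu128-ubuntu22.04-ec2
print(docker_file)  # docker/2.7/py3/cu128/Dockerfile.gpu
```

The CPU entry omits `*CUDA_VERSION` from both lists, so its tag and Dockerfile path have no `cu128` segment.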
72 changes: 72 additions & 0 deletions pytorch/training/buildspec-2-7-sm.yml
@@ -0,0 +1,72 @@
+account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
+prod_account_id: &PROD_ACCOUNT_ID 763104351884
+region: &REGION <set-$REGION-in-environment>
+framework: &FRAMEWORK pytorch
+version: &VERSION 2.7.0
+short_version: &SHORT_VERSION "2.7"
+arch_type: x86
+# autopatch_build: "True"
+
+repository_info:
+  training_repository: &TRAINING_REPOSITORY
+    image_type: &TRAINING_IMAGE_TYPE training
+    root: !join [ *FRAMEWORK, "/", *TRAINING_IMAGE_TYPE ]
+    repository_name: &REPOSITORY_NAME !join [ pr, "-", *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE ]
+    repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]
+    release_repository_name: &RELEASE_REPOSITORY_NAME !join [ *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE ]
+    release_repository: &RELEASE_REPOSITORY !join [ *PROD_ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *RELEASE_REPOSITORY_NAME ]
+
+context:
+  training_context: &TRAINING_CONTEXT
+    start_cuda_compat:
+      source: docker/build_artifacts/start_cuda_compat.sh
+      target: start_cuda_compat.sh
+    dockerd_entrypoint:
+      source: docker/build_artifacts/dockerd_entrypoint.sh
+      target: dockerd_entrypoint.sh
+    changehostname:
+      source: docker/build_artifacts/changehostname.c
+      target: changehostname.c
+    start_with_right_hostname:
+      source: docker/build_artifacts/start_with_right_hostname.sh
+      target: start_with_right_hostname.sh
+    example_mnist_file:
+      source: docker/build_artifacts/mnist.py
+      target: mnist.py
+    deep_learning_container:
+      source: ../../src/deep_learning_container.py
+      target: deep_learning_container.py
+
+images:
+  BuildSageMakerCPUPTTrainPy3DockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &PYTORCH_CPU_TRAINING_PY3 false
+    image_size_baseline: 6200
+    device_type: &DEVICE_TYPE cpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py312
+    os_version: &OS_VERSION ubuntu22.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION, "-sagemaker" ]
+    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION, "-sagemaker" ]
+    # build_tag_override: "True"
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /Dockerfile., *DEVICE_TYPE ]
+    target: sagemaker
+    context:
+      <<: *TRAINING_CONTEXT
+  BuildSageMakerGPUPTTrainPy3DockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &PYTORCH_GPU_TRAINING_PY3 false
+    image_size_baseline: 24000
+    device_type: &DEVICE_TYPE gpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py312
+    cuda_version: &CUDA_VERSION cu128
+    os_version: &OS_VERSION ubuntu22.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-sagemaker" ]
+    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-sagemaker" ]
+    # build_tag_override: "True"
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *CUDA_VERSION, /Dockerfile.,
+                         *DEVICE_TYPE ]
+    target: sagemaker
+    context:
+      <<: *TRAINING_CONTEXT
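The `repository` and `release_repository` entries build ECR URIs from the account, region, and repository name. A minimal sketch of that composition for the SageMaker CPU image follows. The account ID `123456789012` and region `us-west-2` are placeholder assumptions (the buildspec expects both from the environment), `ecr_repository` is a hypothetical helper, and pairing the URI with the tag as `repository:tag` follows the usual Docker convention rather than anything shown in this diff.

```python
# Placeholder values; the buildspec reads ACCOUNT_ID and REGION from the environment.
ACCOUNT_ID = "123456789012"
REGION = "us-west-2"
PROD_ACCOUNT_ID = "763104351884"  # prod_account_id from the buildspec
FRAMEWORK = "pytorch"
IMAGE_TYPE = "training"


def ecr_repository(account_id: str, region: str, name: str) -> str:
    """Compose an ECR repository URI in the shape used by the buildspec's !join lists."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{name}"


# PR builds go to a "pr-" prefixed repository; releases use the plain name.
pr_repository = ecr_repository(ACCOUNT_ID, REGION, f"pr-{FRAMEWORK}-{IMAGE_TYPE}")
release_repository = ecr_repository(PROD_ACCOUNT_ID, REGION, f"{FRAMEWORK}-{IMAGE_TYPE}")

# CPU SageMaker tag, following the tag entry of BuildSageMakerCPUPTTrainPy3DockerImage.
cpu_tag = "-".join(["2.7.0", "cpu", "py312", "ubuntu22.04"]) + "-sagemaker"

print(f"{pr_repository}:{cpu_tag}")
print(f"{release_repository}:{cpu_tag}")
```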
2 changes: 1 addition & 1 deletion pytorch/training/buildspec.yml
@@ -1 +1 @@
-buildspec_pointer: buildspec-2-6-sm.yml
+buildspec_pointer: buildspec-2-7-ec2.yml
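`buildspec.yml` holds only a pointer to the versioned buildspec. Below is a minimal sketch of how such a pointer could be followed, assuming the target sits in the same directory as the pointer file. `resolve_buildspec` is a hypothetical helper, and the one-line parse is a simplification that avoids a YAML dependency.

```python
import tempfile
from pathlib import Path


def resolve_buildspec(path: Path) -> Path:
    """Follow a single `buildspec_pointer: <file>` line, if present."""
    for line in path.read_text().splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "buildspec_pointer":
            return path.parent / value.strip()
    return path  # no pointer found: treat the file itself as the buildspec


with tempfile.TemporaryDirectory() as tmp:
    pointer = Path(tmp) / "buildspec.yml"
    pointer.write_text("buildspec_pointer: buildspec-2-7-ec2.yml\n")
    resolved = resolve_buildspec(pointer)
    print(resolved.name)  # buildspec-2-7-ec2.yml
```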