
Commit f349230

Merge branch 'master' into inference-id
2 parents 0d4ab15 + 570c678 commit f349230

23 files changed (+1459 −104 lines)

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -1,5 +1,11 @@
 # Changelog
 
+## v2.23.6 (2021-01-20)
+
+### Bug Fixes and Other Changes
+
+* add artifact, action, context to visualizer
+
 ## v2.23.5 (2021-01-18)
 
 ### Bug Fixes and Other Changes

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-2.23.6.dev0
+2.23.7.dev0

doc/api/training/smd_data_parallel.rst

Lines changed: 4 additions & 4 deletions
@@ -1,6 +1,6 @@
-###################################
+##########################
 Distributed data parallel
-###################################
+##########################
 
 SageMaker's distributed data parallel library extends SageMaker’s training
 capabilities on deep learning models with near-linear scaling efficiency,
@@ -68,5 +68,5 @@ model.
 .. toctree::
    :maxdepth: 2
 
-   smd_data_parallel_pytorch
-   smd_data_parallel_tensorflow
+   sdp_versions/smd_data_parallel_pytorch
+   sdp_versions/smd_data_parallel_tensorflow
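
For context, the pages above document the data parallel library, which is enabled through the `distribution` parameter of a SageMaker Python SDK estimator. A minimal sketch, assuming a hypothetical `train.py` entry point and placeholder role and S3 input; the instance type must be one the library supports, such as `ml.p3dn.24xlarge`:

```python
from sagemaker.pytorch import PyTorch

# Hypothetical script, role, and data location; values are placeholders,
# not recommendations.
estimator = PyTorch(
    entry_point="train.py",
    role="<your-sagemaker-execution-role>",
    framework_version="1.6.0",
    py_version="py36",
    instance_count=2,
    instance_type="ml.p3dn.24xlarge",
    # Enable the SageMaker distributed data parallel library:
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit("s3://<bucket>/<training-data>")  # hypothetical S3 input
```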

doc/api/training/smd_model_parallel.rst

Lines changed: 21 additions & 10 deletions
@@ -20,20 +20,31 @@ Use the following sections to learn more about the model parallelism and the lib
 <https://integ-docs-aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html#model-parallel-customize-container>`__
 for more information.
 
-How to Use this Guide
-=====================
+Use with the SageMaker Python SDK
+=================================
+
+Use the following page to learn how to configure and enable the distributed model parallel library
+when you configure an Amazon SageMaker Python SDK `Estimator`.
+
+.. toctree::
+   :maxdepth: 1
+
+   smd_model_parallel_general
+
+API Documentation
+=================
 
 The library contains a Common API that is shared across frameworks, as well as APIs
-that are specific to supported frameworks, TensorFlow and PyTorch. To use the library, reference the
+that are specific to supported frameworks, TensorFlow and PyTorch.
+
+Select a version to see the API documentation for that version. To use the library, reference the
 **Common API** documentation alongside the framework-specific API documentation.
 
 .. toctree::
    :maxdepth: 1
 
-   smd_model_parallel_general
-   smd_model_parallel_common_api
-   smd_model_parallel_pytorch
-   smd_model_parallel_tensorflow
+   smp_versions/v1_2_0.rst
+   smp_versions/v1_1_0.rst
 
 It is recommended to use this documentation alongside `SageMaker Distributed Model Parallel
 <http://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html>`__ in the Amazon SageMaker
@@ -49,11 +60,11 @@ developer guide. This developer guide documentation includes:
 - `Configuration tips and pitfalls
   <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-tips-pitfalls.html>`__
 
-Latest Updates
-==============
 
-New features, bug fixes, and improvements are regularly made to the SageMaker distributed model parallel library.
+Release Notes
+=============
 
+New features, bug fixes, and improvements are regularly made to the SageMaker distributed model parallel library.
 To see the latest changes made to the library, refer to the library
 `Release Notes
 <https://github.com/aws/sagemaker-python-sdk/blob/master/doc/api/training/smd_model_parallel_release_notes/>`_.
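
The new "Use with the SageMaker Python SDK" page covers enabling the library through the `Estimator` `distribution` parameter. A minimal sketch, with illustrative parameter values that are assumptions rather than recommendations:

```python
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        # Illustrative values; see the modelparallel parameter table
        # in smd_model_parallel_general for the full list.
        "partitions": 2,
        "microbatches": 4,
        "pipeline": "interleaved",
    },
}

estimator = PyTorch(
    entry_point="train.py",  # hypothetical training script
    role="<your-sagemaker-execution-role>",
    framework_version="1.6.0",
    py_version="py36",
    instance_count=1,
    instance_type="ml.p3.16xlarge",
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {"enabled": True, "processes_per_host": 8},  # the library runs over MPI
    },
)
```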

doc/api/training/smd_model_parallel_general.rst

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@
 .. _sm-sdk-modelparallel-params:
 
 SageMaker Python SDK ``modelparallel`` parameters
--------------------------------------------------
+=================================================
 
 The TensorFlow and PyTorch ``Estimator`` objects contain a ``distribution`` parameter,
 which is used to enable and specify parameters for the
@@ -306,7 +306,7 @@ table are optional.
 .. _ranking-basics:
 
 Ranking Basics
---------------
+==============
 
 The library maintains a one-to-one mapping between processes and available GPUs:
 for each GPU, there is a corresponding CPU process. Each CPU process
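
To make the ranking description concrete, a sketch using the rank queries from `smdistributed.modelparallel.torch` (runnable only inside a SageMaker training job where the library is installed; the exact set of rank helpers shown is assumed from the library's common API):

```python
import torch
import smdistributed.modelparallel.torch as smp

smp.init()

# One-to-one mapping between processes and GPUs:
# pin this CPU process to its corresponding device.
torch.cuda.set_device(smp.local_rank())

print(
    f"global rank {smp.rank()} of {smp.size()} | "
    f"model-parallel rank {smp.mp_rank()} | "
    f"data-parallel rank {smp.dp_rank()}"
)
```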

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md

Lines changed: 41 additions & 0 deletions
@@ -1,3 +1,44 @@
1+
# Sagemaker Distributed Model Parallel 1.2.0 Release Notes
2+
3+
- New Features
4+
- Bug Fixes
5+
- Known Issues
6+
7+
## New Features
8+
9+
### PyTorch
10+
11+
#### Add support for PyTorch 1.7
12+
13+
- Adds support for `gradient_as_bucket_view` (PyTorch 1.7 only), `find_unused_parameters` (PyTorch 1.7 only) and `broadcast_buffers` options to `smp.DistributedModel`. These options behave the same as the corresponding options (with the same names) in
14+
`torch.DistributedDataParallel` API. Please refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_pytorch.html#smp.DistributedModel) for more information.
15+
16+
- Adds support for `join` (PyTorch 1.7 only) context manager, which is to be used in conjunction with an instance of `smp.DistributedModel` to be able to train with uneven inputs across participating processes.
17+
18+
- Adds support for `_register_comm_hook` (PyTorch 1.7 only) which will register the callable as a communication hook for DDP. NOTE: Like in DDP, this is an experimental API and subject to change.
19+
20+
### Tensorflow
21+
22+
- Adds support for Tensorflow 2.4
23+
24+
## Bug Fixes
25+
26+
### PyTorch
27+
28+
- `Serialization`: Fix a bug with serialization/flattening where instances of subclasses of dict/OrderedDicts were serialized/deserialized or internally flattened/unflattened as
29+
regular dicts.
30+
31+
### Tensorflow
32+
33+
- Fix a bug that may cause a hang during evaluation when there is no model input for one partition.
34+
35+
## Known Issues
36+
37+
### PyTorch
38+
39+
- A performance regression was observed when training on SMP with PyTorch 1.7.1 compared to 1.6. The rootcause was found to be the slowdown in performance of `.grad` method calls in PyTorch 1.7.1 compared to 1.6. Please see the related discussion: https://github.com/pytorch/pytorch/issues/50636.
40+
41+
142
# Sagemaker Distributed Model Parallel 1.1.0 Release Notes
243

344
- New Features
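
A hedged sketch of the new PyTorch 1.7 options from the 1.2.0 notes above, assuming they mirror the identically named `torch.nn.parallel.DistributedDataParallel` options (runnable only inside a SageMaker training job where `smdistributed` is installed):

```python
import torch
import smdistributed.modelparallel.torch as smp

smp.init()

model = smp.DistributedModel(
    torch.nn.Linear(10, 10),        # toy module standing in for a real model
    broadcast_buffers=True,
    gradient_as_bucket_view=True,   # PyTorch 1.7 only
    find_unused_parameters=False,   # PyTorch 1.7 only
)

# `join` (PyTorch 1.7 only) tolerates uneven input counts across processes;
# assumed here to behave like DistributedDataParallel.join.
with model.join():
    pass  # training loop goes here
```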
