Commit 0dc0e52

Merge branch 'master' into hf-inference

2 parents 86ca70e + f4ad8b2


43 files changed: +3382 / -406 lines

CHANGELOG.md

Lines changed: 62 additions & 0 deletions

@@ -1,5 +1,67 @@
 # Changelog

+## v2.47.2 (2021-06-30)
+
+### Bug Fixes and Other Changes
+
+* handle tags when upsert pipeline
+
+## v2.47.1 (2021-06-27)
+
+### Bug Fixes and Other Changes
+
+* revert "fix: jsonGet interpolation issue 2426 + allow step depends on pass in step instance (#2477)"
+
+## v2.47.0 (2021-06-25)
+
+### Features
+
+* support job_name_prefix for Clarify
+
+### Bug Fixes and Other Changes
+
+* Add configuration option with headers for Clarify Explainability
+* jsonGet interpolation issue 2426 + allow step depends on pass in step instance
+* add default retries to feature group ingestion.
+* Update using_pytorch.rst
+* kms key does not propagate in register model step
+* Correctly interpolate Callback output parameters
+
+## v2.46.1 (2021-06-22)
+
+### Bug Fixes and Other Changes
+
+* Register model step tags
+
+### Documentation Changes
+
+* update to include new batch_get_record api call
+* Correct type annotation for TrainingStep inputs
+* introduce input mode FastFile
+* update hf transformer version
+
+## v2.46.0 (2021-06-15)
+
+### Features
+
+* Add HF transformer version 4.6.1
+
+### Bug Fixes and Other Changes
+
+* encode localmode payload to UTF-8
+* call DescribeDomain as fallback in get_execution_role
+* parameterize PT and TF version for HuggingFace tests
+
+### Documentation Changes
+
+* Add import statement in Batch Transform Overview doc
+
+## v2.45.0 (2021-06-07)
+
+### Features
+
+* Add support for Callback steps in model building pipelines
+
 ## v2.44.0 (2021-06-01)

 ### Features
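Two of the entries above touch the pipelines Callback API: the v2.45.0 Callback-step feature and the v2.47.0 output-parameter interpolation fix. A minimal hedged sketch of defining a ``CallbackStep`` with an output parameter; the step name, queue URL, and inputs are placeholders:

.. code:: python

    from sagemaker.workflow.callback_step import (
        CallbackOutput,
        CallbackOutputTypeEnum,
        CallbackStep,
    )

    # Output parameter that later pipeline steps can reference; its value is
    # supplied by whatever worker consumes the SQS message and calls back.
    status = CallbackOutput(
        output_name="status", output_type=CallbackOutputTypeEnum.String
    )

    step = CallbackStep(
        name="MyCallbackStep",  # placeholder step name
        sqs_queue_url="https://sqs.us-east-1.amazonaws.com/111122223333/my-queue",  # placeholder queue
        inputs={"input1": "value1"},  # placeholder inputs
        outputs=[status],
    )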

VERSION

Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-2.44.1.dev0
+2.47.3.dev0

doc/amazon_sagemaker_featurestore.rst

Lines changed: 7 additions & 0 deletions

@@ -291,6 +291,13 @@ example identifier to retrieve the record.

     record_identifier_value = str(2990130)
     featurestore_runtime.get_record(FeatureGroupName=transaction_feature_group_name, RecordIdentifierValueAsString=record_identifier_value)

+You can use the ``batch_get_record`` function to retrieve multiple records simultaneously from your feature store. The following example uses this API to retrieve a batch of records.
+
+.. code:: python
+
+    record_identifier_values = ["573291", "109382", "828400", "124013"]
+    featurestore_runtime.batch_get_record(Identifiers=[{"FeatureGroupName": transaction_feature_group_name, "RecordIdentifiersValueAsString": record_identifier_values}])
+
 An example response from the fraud detection example:

 .. code:: python
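The example response block above is truncated in this hunk's context. For orientation, a hedged sketch of iterating over a ``batch_get_record`` response, with field names as in the boto3 ``sagemaker-featurestore-runtime`` API; the client, feature group name, and identifier list come from the snippets above:

.. code:: python

    response = featurestore_runtime.batch_get_record(
        Identifiers=[
            {
                "FeatureGroupName": transaction_feature_group_name,
                "RecordIdentifiersValueAsString": record_identifier_values,
            }
        ]
    )
    for item in response["Records"]:
        # Each item carries the identifier plus a list of
        # {"FeatureName": ..., "ValueAsString": ...} pairs.
        features = {f["FeatureName"]: f["ValueAsString"] for f in item["Record"]}
        print(item["RecordIdentifierValueAsString"], features)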

doc/api/training/sdp_versions/latest.rst

Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@

-Version 1.2.0 (Latest)
+Version 1.2.x (Latest)
 ======================

 .. toctree::

doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst

Lines changed: 1 addition & 1 deletion

@@ -157,7 +157,7 @@ TensorFlow API

 .. rubric:: Supported versions

-**TensorFlow 2.3.1, 2.4.1**
+**TensorFlow 2.3.1, 2.4.1, 2.5.0**

 .. function:: smdistributed.dataparallel.tensorflow.init()
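The ``init()`` entry point named in the context line above initializes the process group. A minimal hedged sketch of calling it and pinning each worker to its GPU, assuming a SageMaker training job with the library installed:

.. code:: python

    import tensorflow as tf
    import smdistributed.dataparallel.tensorflow as sdp

    sdp.init()  # initialize the smdistributed.dataparallel process group

    # Pin this worker process to its local GPU.
    gpus = tf.config.experimental.list_physical_devices("GPU")
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], "GPU")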

doc/api/training/smd_data_parallel.rst

Lines changed: 6 additions & 4 deletions

@@ -101,8 +101,10 @@ Select a version to see the API documentation for version.
 Release Notes
 =============

-New features, bug fixes, and improvements are regularly made to the SageMaker distributed data parallel library.
+New features, bug fixes, and improvements are regularly made to the SageMaker
+distributed data parallel library.

-To see the the latest changes made to the library, refer to the library
-`Release Notes
-<https://github.com/aws/sagemaker-python-sdk/blob/master/doc/api/training/smd_data_parallel_release_notes/>`_.
+.. toctree::
+   :maxdepth: 1
+
+   smd_data_parallel_release_notes/smd_data_parallel_change_log

doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.md

Lines changed: 0 additions & 91 deletions
This file was deleted.

doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.rst

Lines changed: 170 additions & 0 deletions

@@ -0,0 +1,170 @@
+SageMaker Distributed Data Parallel 1.2.1 Release Notes
+=======================================================
+
+*Date: June 29, 2021*
+
+**New Features:**
+
+- Added support for TensorFlow 2.5.0.
+
+**Improvements:**
+
+- Improved performance on a single node and small clusters (2-4 nodes).
+
+**Bug Fixes:**
+
+- Enable ``sparse_as_dense`` by default in the SageMaker distributed data
+  parallel library's TensorFlow APIs ``DistributedGradientTape`` and
+  ``DistributedOptimizer`` (see the sketch below).
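A hedged sketch of where the tape wrapper sits in a TensorFlow training step, assuming ``sdp.init()`` has already run as in the earlier sketch; the model, loss, and optimizer are placeholders:

.. code:: python

    import tensorflow as tf
    import smdistributed.dataparallel.tensorflow as sdp

    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # placeholder model
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    opt = tf.keras.optimizers.SGD(0.01 * sdp.size())  # scale LR by worker count

    @tf.function
    def train_step(images, labels):
        with tf.GradientTape() as tape:
            loss = loss_fn(labels, model(images, training=True))
        # Wrap the tape for distributed AllReduce; as of v1.2.1,
        # sparse_as_dense is enabled by default in this wrapper.
        tape = sdp.DistributedGradientTape(tape)
        grads = tape.gradient(loss, model.trainable_variables)
        opt.apply_gradients(zip(grads, model.trainable_variables))
        return loss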
+**Migration to AWS Deep Learning Containers**
+
+This version passed benchmark testing and is migrated to the following
+AWS Deep Learning Containers:
+
+- TensorFlow 2.5.0 DLC release: `v1.0-tf-2.5.0-tr-py37
+  <https://github.com/aws/deep-learning-containers/releases/tag/v1.0-tf-2.5.0-tr-py37>`__
+
+  .. code::
+
+     763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.5.0-gpu-py37-cu112-ubuntu18.04-v1.0
+
+----
33+
Release History
34+
===============
35+
36+
Sagemaker Distributed Data Parallel 1.2.0 Release Notes
37+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
38+
39+
- New features
40+
- Bug Fixes
41+
42+
**New features:**
43+
44+
- Support of `EFA network
45+
interface <https://aws.amazon.com/hpc/efa/>`__ for distributed
46+
AllReduce. For best performance, it is recommended you use an
47+
instance type that supports Amazon Elastic Fabric Adapter
48+
(ml.p3dn.24xlarge and ml.p4d.24xlarge) when you train a model using
49+
Sagemaker Distributed data parallel.
50+
51+
**Bug Fixes:**
52+
53+
- Improved performance on single node and small clusters.
54+
55+
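As referenced in the EFA note above, a hedged sketch of launching a training job with the library enabled on an EFA-capable instance type; the entry point, role, versions, and S3 input are placeholders:

.. code:: python

    from sagemaker.tensorflow import TensorFlow

    estimator = TensorFlow(
        entry_point="train.py",  # placeholder training script
        role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role
        instance_type="ml.p4d.24xlarge",  # EFA-capable type recommended above
        instance_count=2,
        framework_version="2.4.1",
        py_version="py37",
        # Enable the SageMaker distributed data parallel library.
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )
    estimator.fit("s3://my-bucket/my-training-data")  # placeholder S3 input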
+----
+
+SageMaker Distributed Data Parallel 1.1.2 Release Notes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- Bug Fixes
+- Known Issues
+
+**Bug Fixes:**
+
+- Fixed a bug that caused some TensorFlow operations to not work with
+  certain data types. Operations forwarded from C++ have been extended
+  to support every dtype supported by NCCL.
+
+**Known Issues:**
+
+- SageMaker distributed data parallel has slower throughput than NCCL
+  when run using a single node. For the best performance, use
+  multi-node distributed training with ``smdistributed.dataparallel``.
+  Use a single node only for experimental runs while preparing your
+  training pipeline.
+
+----
+
+SageMaker Distributed Data Parallel 1.1.1 Release Notes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- New Features
+- Bug Fixes
+- Known Issues
+
+**New Features:**
+
+- Adds support for PyTorch 1.8.1 (see the sketch below)
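A hedged sketch of the library's PyTorch API as documented for these 1.x releases; the model and device wiring are illustrative:

.. code:: python

    import torch
    import smdistributed.dataparallel.torch.distributed as dist
    from smdistributed.dataparallel.torch.parallel.distributed import (
        DistributedDataParallel as DDP,
    )

    dist.init_process_group()  # initialize the process group

    # Pin this worker process to its local GPU and wrap the model.
    local_rank = dist.get_local_rank()
    torch.cuda.set_device(local_rank)
    model = torch.nn.Linear(10, 2).to(local_rank)  # placeholder model
    model = DDP(model)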
+**Bug Fixes:**
+
+- Fixes a bug that was causing gradients from one of the worker nodes
+  to be added twice, resulting in incorrect ``all_reduce`` results under
+  some conditions.
+
+**Known Issues:**
+
+- SageMaker distributed data parallel is still not efficient when run
+  using a single node. For the best performance, use multi-node
+  distributed training with ``smdistributed.dataparallel``. Use a
+  single node only for experimental runs while preparing your training
+  pipeline.
+
+----
+
+SageMaker Distributed Data Parallel 1.1.0 Release Notes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- New Features
+- Bug Fixes
+- Improvements
+- Known Issues
+
+**New Features:**
+
+- Adds support for PyTorch 1.8.0 with CUDA 11.1 and cuDNN 8
+
+**Bug Fixes:**
+
+- Fixes a crash when importing ``smdataparallel`` before PyTorch
+
+**Improvements:**
+
+- Update the ``smdataparallel`` name in Python packages, descriptions,
+  and log outputs
+
+**Known Issues:**
+
+- SageMaker DataParallel is not efficient when run using a single node.
+  For the best performance, use multi-node distributed training with
+  ``smdataparallel``. Use a single node only for experimental runs
+  while preparing your training pipeline.
+
+Getting Started
+---------------
+
+For getting started, refer to the `SageMaker Distributed Data Parallel
+Python SDK
+Guide <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`__.
+
+----
+
+SageMaker Distributed Data Parallel 1.0.0 Release Notes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- First Release
+- Getting Started
+
+First Release
+-------------
+
+SageMaker’s distributed data parallel library extends SageMaker’s
+training capabilities on deep learning models with near-linear scaling
+efficiency, achieving fast time-to-train with minimal code changes.
+SageMaker Distributed Data Parallel:
+
+- optimizes your training job for AWS network infrastructure and EC2
+  instance topology.
+- takes advantage of gradient updates to communicate between nodes with
+  a custom AllReduce algorithm.
+
+The library currently supports TensorFlow v2 and PyTorch via `AWS Deep
+Learning
+Containers <https://aws.amazon.com/machine-learning/containers/>`__.
+
+Getting Started
+---------------
+
+For getting started, refer to the `SageMaker Distributed Data Parallel
+Python SDK
+Guide <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`__.
