[doc][c10d] fixup fsdp tutorial #1297
Conversation
✅ Deploy Preview for pytorch-examples-preview canceled.
Looks like running the Python example failed?
Force-pushed from 2cfc3b8 to 0834097.
Unrelated to my change, but I fixed it anyway. Needed to update to a newer Python version in CI. See the additional diff I made to
Force-pushed from 752efec to 0834097.
| print(f" checkpoint key len = {len(ck)} and \n keys = {ck}") | ||
|
|
||
| dist_cp.load_state_dict( |
The DCP usage is pretty outdated. Should we also update it?
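For reference, the newer DCP entry points replace `save_state_dict`/`load_state_dict` with `save`/`load`. A minimal sketch, assuming a recent PyTorch release (2.2+); the module and checkpoint path here are placeholders, not the tutorial's actual code:

```python
import torch
import torch.distributed.checkpoint as dcp

# Stand-in module; in the tutorial this would be the FSDP-wrapped T5 model.
model = torch.nn.Linear(4, 4)
state_dict = {"model": model.state_dict()}

# Newer entry points: dcp.save/dcp.load replace the deprecated
# save_state_dict/load_state_dict calls used in the tutorial.
dcp.save(state_dict, checkpoint_id="checkpoints/t5")  # hypothetical path
dcp.load(state_dict, checkpoint_id="checkpoints/t5")  # restores tensors in place
```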
I will update this in a subsequent change, if that's ok with you?
This change is already too large, as I am fixing up the Python tests that broke.
The breakage is unrelated to this change.
Sure
Force-pushed from e0fba21 to 7dcd080.
@fduwjj - CI is green now.
Force-pushed from 7dcd080 to 162b0dc.
This change will be rebased on #1299 to fix the failing Python Examples.
Summary: Fix up the FSDP tutorial to get it functional again.

1. Add missing import for `load_dataset`.
2. Use `checkpoint` instead of `_shard.checkpoint` to get rid of a warning.
3. Add nlp to requirements.txt.
4. Get rid of `load_metric`, as this function does not exist in the new `datasets` module.
5. Add `legacy=False` to get rid of tokenizer warnings.

Test Plan: Ran the tutorial as follows and ensured that it ran successfully:

```
torchrun --nnodes=1 --nproc_per_node=2 T5_training.py
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793]
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793] *****************************************
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793] *****************************************
dict_keys(['train', 'validation', 'test'])
Size of train dataset: (157252, 3)
Size of Validation dataset: (5599, 3)
dict_keys(['train', 'validation', 'test'])
Size of train dataset: (157252, 3)
Size of Validation dataset: (5599, 3)
bFloat16 enabled for mixed precision - using bfSixteen policy
```
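In code, items 1, 2, 4, and 5 above boil down to a few import and argument changes (item 3 is a requirements.txt edit). A minimal sketch, assuming the tutorial uses the `t5-base` tokenizer; the exact file layout is not shown in this thread:

```python
from datasets import load_dataset                # 1. previously missing import
import torch.distributed.checkpoint as dist_cp   # 2. was torch.distributed._shard.checkpoint
from transformers import T5Tokenizer

# 4. datasets.load_metric no longer exists in current `datasets` releases,
#    so the call is removed outright in this change.

# 5. legacy=False silences the tokenizer's legacy-behavior warning.
tokenizer = T5Tokenizer.from_pretrained("t5-base", legacy=False)
```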
Force-pushed from 162b0dc to cb00288.