Accelerate nd-parallel #3006

---
title: "Accelerate ND-Parallel: A Guide to Efficient Multi-GPU Training"
thumbnail: /blog/assets/accelerate-nd-parallel/thumbnail.png
authors:
- user: siro1
- user: smohammadi
  guest: true
  org: axolotl-ai-co
- user: winglian
  guest: true
  org: axolotl-ai-co
- user: djsaunde
  guest: true
  org: axolotl-ai-co
---

# Accelerate ND-Parallel: A Guide to Efficient Multi-GPU Training

Training large models on multiple GPUs can be challenging due to the complexities of combining different parallelism strategies. In Accelerate, together with [Axolotl](https://huggingface.co/axolotl-ai-co), we have integrated a quick and easy way to use any combination of parallelism strategies when training your models!

Here is how to add it to your training script:

```python
from transformers import AutoModelForCausalLM
from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig

pc = ParallelismConfig(
    dp_shard_size=2,
    dp_replicate_size=2,
    cp_size=2,
    tp_size=2,
)

accelerator = Accelerator(
    parallelism_config=pc,
)
model = AutoModelForCausalLM.from_pretrained(
    "your-model-name",
    tp_size=pc.tp_size,
    device_mesh=accelerator.torch_device_mesh,
)
model = accelerator.prepare(model)
```

To compose a variety of fine-tuning techniques and further streamline fine-tuning models at scale, we've integrated this technique into Axolotl. Check out the [Axolotl ND-Parallelism docs](https://docs.axolotl.ai/docs/nd_parallelism.html) to get started in just a few minutes. The equivalent configuration in Axolotl looks like this:

```yaml
dp_shard_size: 2
dp_replicate_size: 2
context_parallel_size: 2
tensor_parallel_size: 2
```

To get up and running quickly, you can check the examples in the [accelerate repository](https://github.com/huggingface/accelerate/blob/main/examples/fsdp2/nd_parallel.py) or their counterpart in [Axolotl](TODO).

You can see we are using the `ParallelismConfig` class to define the parallelism combination and its shape, but how do we figure out what shape will work best for our case? Understanding the different parallelism strategies and how they interact is the primary challenge when training models at the scale of tens or hundreds of billions of parameters. In this post, we'll walk through the different parallelism strategies and how to compose them to enable training at such scales.

## Data Parallelism (dp_replicate_size)

Data parallelism (DP) is the most common technique for training models across multiple GPUs. It involves replicating the model, gradients, and optimizer states on each device, evenly distributing data batches between GPUs, and synchronising gradients across devices before updating parameters. This can significantly increase throughput compared to single-device training, but requires that your model fits on a single GPU. We can control the number of replicas of the model with the `dp_replicate_size` parameter.

DP is a top-most-level parallelism strategy: if we set `dp_replicate_size=2` and compose it with other parallelism strategies, there would be 2 replicas of the model, each further partitioned by the other strategies. For example, with `dp_replicate_size=2` and `tp_size=2`, we would have 2 replicas of the model, each with 2 tensor parallel shards, but more on that later.

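As a toy illustration (a framework-free sketch, not Accelerate's actual implementation), the scheme boils down to each replica computing gradients on its own slice of the batch, averaging them across replicas, and applying the identical update everywhere:

```python
# Toy data parallelism: two "replicas" fit y = w * x, each on its own
# batch shard; gradients are averaged (the all-reduce) before the update.

def local_gradient(w, batch):
    # gradient of the mean squared error of y = w * x on this shard
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(grads):
    # stand-in for gradient synchronisation across devices
    return sum(grads) / len(grads)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
shards = [data[:2], data[2:]]  # dp_replicate_size = 2

w = 0.0
for _ in range(100):
    grads = [local_gradient(w, shard) for shard in shards]
    w -= 0.05 * all_reduce_mean(grads)  # every replica applies the same step

# w converges to 2.0
```

Because every replica applies the same averaged gradient, the model copies never drift apart, which is exactly the invariant real DP implementations maintain.
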
## Fully Sharded Data Parallelism (dp_shard_size)

What if our model is too large to fit on a single GPU? Fully sharded data parallel (FSDP) addresses this issue by sharding (distributing evenly) the model’s weights, gradients, and optimizer states across GPUs (inspired by DeepSpeed’s ZeRO-3), whilst each device still receives its portion of the full batch of data. As you may notice from the diagram above, rather than requiring a full copy of the entire model on each device, we only gather the weights for a single layer at a time before the forward pass, after which the weights are sharded again.

In this way, we trade memory usage for the communication overhead of gathering sharded parameters before each forward and backward pass, and reduce-scattering local gradients. We can control this trade-off in FSDP by tuning the granularity at which parameters are gathered. At one extreme, we can gather and reshard every layer of our model, which results in the lowest peak memory usage but incurs the highest communication cost. In practice, a common approach is to gather the weights for an entire transformer decoder block at a time.

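The gather-compute-reshard cycle can be sketched with plain Python lists standing in for parameter shards (a hypothetical illustration, not FSDP's real code path):

```python
# FSDP-style execution with 2 "devices": every layer's weights live
# sharded across devices, and the full layer is all-gathered only for
# the duration of its forward computation.

def shard(weights, world_size):
    per = len(weights) // world_size
    return [weights[i * per:(i + 1) * per] for i in range(world_size)]

def all_gather(shards):
    # reassemble the full layer from each device's local shard
    return [w for s in shards for w in s]

layers = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
sharded = [shard(layer, world_size=2) for layer in layers]

activation = 1.0
for layer_shards in sharded:
    full = all_gather(layer_shards)  # gather this one layer only
    activation = sum(w * activation for w in full)
    del full  # the gathered copy is freed; only the local shard remains
```

At no point does a device hold more than one fully gathered layer, which is why peak memory depends on the gathering granularity rather than the total model size.
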
Whilst we can make further memory-compute trade-offs by offloading model parameters and gradients to the CPU to train larger models, this can be prohibitively slow. Instead, let’s consider how we can effectively utilise even more devices to train larger models whilst maintaining high data throughput.

We use the term node to refer to a single machine which hosts multiple GPUs (often 8), with fast intra-node communication channels between devices using e.g. NVLink. When using multiple nodes for training, we rely on relatively slower inter-node communication channels between machines using e.g. InfiniBand. We also refer to the total number of devices in the process pool as the world size - e.g. a single node with 8 GPUs represents a world size of 8, and 4 nodes would represent a world size of 32.

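Whichever strategies you combine, their sizes must multiply to the world size, since each GPU occupies exactly one coordinate of the device mesh. A quick arithmetic sanity check using the example configuration from earlier:

```python
# The parallelism dimensions multiply to the world size: a mesh of shape
# (dp_replicate, dp_shard, cp, tp) requires exactly that many GPUs.
dp_replicate_size, dp_shard_size, cp_size, tp_size = 2, 2, 2, 2
world_size = dp_replicate_size * dp_shard_size * cp_size * tp_size
# 16 GPUs total, e.g. 2 nodes with 8 GPUs each
```
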