
Enhancing Hugging Face Models with Tensor Parallelism for Large-Scale Model Support 🚀 #32470

@SeungyounShin

Feature request

Description

This feature proposal aims to update Hugging Face's support for tensor parallelism (TP) to accommodate the increasing size and complexity of models such as LLaMA 3.1, Nemotron-4-340B-Instruct, and others, which have surpassed the capabilities of current training frameworks like TRL + DeepSpeed.

The Hugging Face codebase currently lags behind these advancements. Although parallelism always requires some customization for the hardware setup, dataset size, sequence length, and model size, implementing TP across the many Hugging Face models is crucial.

Proposal

With the introduction of native tensor parallelism in PyTorch 2.0, the previous Megatron-style approach of spawning a process per device and hand-sharding each model is no longer the most efficient option.

Key Changes:

  1. Refactoring Code for TP:

    • Remove the reliance on kwargs in favor of more straightforward, TP-friendly implementations, since PyTorch parallelization plans do not accommodate keyword arguments.
    • Refactor the PyTorch modeling code so TP can be applied cleanly (a minimal sketch follows this list).
  2. Current Limitations:

    • Existing implementations, such as in modeling_llama, are not trainable and are incompatible with torch.compile for inference optimization.
  3. Future Integration:

    • As models scale to large sizes, 8-way Tensor Parallel is becoming standard.
    • This change would enable Accelerate to later support TP + FSDP (Fully Sharded Data Parallel), which many users could benefit from.
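
To make the direction concrete, below is a minimal sketch that applies PyTorch's DTensor-based TP API (torch.distributed.tensor.parallel) to the linear projections of a Hugging Face LLaMA model. It is only an illustration under several assumptions: the module names follow transformers' LlamaDecoderLayer, the head-count adjustment assumes the attention module exposes num_heads / num_key_value_heads attributes (this varies across transformers versions), and loading the full checkpoint on every rank before sharding is only practical for smaller models.

```python
# Minimal sketch, not a proposed implementation (see assumptions above).
# Launch with: torchrun --nproc_per_node=8 tp_sketch.py
import os

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
tp_size = dist.get_world_size()
mesh = init_device_mesh("cuda", (tp_size,))

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16
).cuda()

# Column-parallel for the projections that fan out from the hidden size,
# row-parallel for the projections that reduce back to it (output all-reduce).
layer_plan = {
    "self_attn.q_proj": ColwiseParallel(),
    "self_attn.k_proj": ColwiseParallel(),
    "self_attn.v_proj": ColwiseParallel(),
    "self_attn.o_proj": RowwiseParallel(),
    "mlp.gate_proj": ColwiseParallel(),
    "mlp.up_proj": ColwiseParallel(),
    "mlp.down_proj": RowwiseParallel(),
}
for layer in model.model.layers:
    # The attention forward reshapes by head count, so the per-rank head
    # counts must match the sharded projections (attribute names depend on
    # the transformers version).
    layer.self_attn.num_heads //= tp_size
    layer.self_attn.num_key_value_heads //= tp_size
    parallelize_module(layer, mesh, layer_plan)
```

Because this route is built on DTensor, it is intended to compose with torch.compile and with FSDP for the TP + FSDP combination mentioned above; the open question is where this plumbing should live (the modeling code, Accelerate, or a separate utility).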

Personal Contribution

I have personally developed code that runs LLaMA entirely with TP and observed that it handles longer token sequences with less memory than FSDP. However, I have not submitted a pull request because it would require comprehensive refactoring of the existing modeling code.

Call to Action

If Hugging Face acknowledges this need, I am willing to contribute further if there is an overarching plan for abstraction and integration.

Motivation

The motivation behind this proposal is to address the limitations and frustrations experienced when using Hugging Face with the current parallelism approaches, especially for large-scale models like LLaMA 3.1 and Nemotron-4-340B-Instruct. As models grow in complexity, existing frameworks struggle to support efficient training and inference.

Current Issues with Existing Solutions:

  • NVIDIA Megatron-LM: Lacks compiler-level optimization and is somewhat outdated.
  • Tensor Parallel by BlackSamorez: Also lacks compiler-level optimization and is outdated.
  • DeepSpeed: Primarily relies on data parallelism (DP); ZeRO is closer to model parallelism (MP) than to tensor parallelism (TP), and ZeRO Stage 3 has known issues.
  • AWS Neuron Distributed: Potentially supports TP in distributed settings, though not tested extensively.
  • PyTorch Lightning: Implements TP but is not applicable to Hugging Face models.
  • NVIDIA NeMo: Built on PyTorch Lightning, which underscores the need for Hugging Face to adopt TP itself, along with the coding conventions it requires (such as avoiding kwargs).

Implementing tensor parallelism (TP) in Hugging Face models is crucial to keep up with the trend towards larger models and to enhance compatibility with modern optimization techniques like torch.compile.

Your contribution

I am willing to contribute to implementing tensor parallelism (TP) within the Hugging Face ecosystem. To facilitate this, I would appreciate guidance on the following aspects:

  1. Integration Approach: Clarification on whether TP should be applied during model initialization (e.g., within AutoModelForCausalLM), or managed externally using torchrun (a hypothetical sketch of both options follows this list).

  2. Automatic Initialization: Whether the implementation should initialize torch.distributed automatically, without requiring explicit setup from users.
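
To make these questions concrete, here is a hypothetical sketch of what each option could look like from the user's side. The tp_plan keyword in option 1 and the apply_tensor_parallel helper in option 2 do not exist in transformers; they are placeholders for whatever abstraction the maintainers prefer.

```python
# Hypothetical API sketches only; `tp_plan` and `apply_tensor_parallel` are
# placeholders, not existing transformers APIs.
# Launched via: torchrun --nproc_per_node=8 script.py
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from transformers import AutoModelForCausalLM

# Option 1: TP applied at model initialization time.
# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Meta-Llama-3.1-8B",
#     tp_plan="auto",  # hypothetical kwarg: library builds the mesh and shards
# )

# Option 2: TP managed externally; torch.distributed is initialized either
# explicitly by the user or lazily by the library (the question in point 2).
if not dist.is_initialized():
    dist.init_process_group("nccl")
mesh = init_device_mesh("cuda", (dist.get_world_size(),))
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
# apply_tensor_parallel(model, mesh)  # hypothetical helper wrapping
#                                     # torch.distributed.tensor.parallel.parallelize_module
```

Either way, the design decision mainly determines who owns the device mesh and when the process group is created.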

With a defined plan or abstraction level, I can work on refactoring the necessary code and submit a pull request to integrate TP effectively. My experience with TP, particularly with LLaMA, has demonstrated its efficiency in handling large models with reduced memory usage compared to current methods.
