Set ccl and KMP param in simple launch by jiqing-feng · Pull Request #3575 · huggingface/accelerate

jiqing-feng · 2025-05-16T08:18:29Z

I don't see why CCL_WORKER_COUNT is assigned only when there is more than one machine: a single CPU machine can also run distributed training or tensor parallelism (TP).

I also added KMP parameters to get better performance on CPU.
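As a rough illustration of the idea described above (not the code merged in `src/accelerate/utils/launch.py`), the sketch below builds a launch environment keyed on the number of processes rather than the number of machines. The helper name `build_cpu_launch_env` and the specific `KMP_AFFINITY` / `KMP_BLOCKTIME` values are assumptions chosen for illustration; they are commonly suggested Intel OpenMP settings, and the values used by the PR may differ.

```python
import os
from typing import Dict, Optional


def build_cpu_launch_env(
    num_processes: int,
    ccl_worker_count: int = 1,
    base_env: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Hypothetical helper: return a copy of the environment with CCL/KMP variables set."""
    env = dict(os.environ if base_env is None else base_env)

    # Decide on the number of processes, not the number of machines:
    # a single machine can still run several ranks.
    if num_processes > 1:
        env["CCL_WORKER_COUNT"] = str(ccl_worker_count)

    # Assumed defaults (not taken from the PR); values already set by the user win.
    env.setdefault("KMP_AFFINITY", "granularity=fine,compact,1,0")
    env.setdefault("KMP_BLOCKTIME", "1")
    return env


if __name__ == "__main__":
    env = build_cpu_launch_env(num_processes=2, base_env={})
    for key in sorted(env):
        print(f"{key}={env[key]}")
```

Using `setdefault` for the KMP variables is a design choice in this sketch: it leaves any value the user exported before launching untouched.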

With this PR, we can run a transformers TP model and get a 40% speed-up on a 4th Gen Intel Xeon.
The accelerate config is:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_CPU
downcast_bf16: 'no'
enable_cpu_affinity: false
ipex_config:
  ipex: false
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
mpirun_config:
  mpirun_ccl: '1'
  mpirun_hostfile: /home/jiqingfe/hostfile
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true

The script below is launched with `accelerate launch script.py`:

import os
import torch.distributed as dist
from transformers import AutoTokenizer, AutoModelForCausalLM
import oneccl_bindings_for_pytorch  # noqa: F401  (imported for its side effect: registers the "ccl" backend)

import time
import torch

print(f"Using {torch.get_num_threads()} threads (PyTorch)")
print(f"OMP_NUM_THREADS={os.getenv('OMP_NUM_THREADS')}")

model_id = "meta-llama/Llama-3.1-8B-Instruct"

os.environ['RANK'] = str(os.environ.get('PMI_RANK', 0))
os.environ['WORLD_SIZE'] = str(os.environ.get('PMI_SIZE', 1))

def main(is_tp, rank, world_size) -> None:
    print("is_tp, rank, world_size: ", is_tp, rank, world_size)
    model_kwargs = dict(torch_dtype=torch.bfloat16)
    if is_tp:
        model_kwargs["tp_plan"] = "auto"
    else:
        model_kwargs["device_map"] = "cpu"

    # Retrieve tensor parallel model
    model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
    if dist.is_initialized():
        print("Backend:", dist.get_backend())
    else:
        print("Distributed process group is not initialized.")

    # Prepare input tokens
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    prompt = "It is done, and submitted. You can play 'Survival of the Tastiest' on Android, and on the web. Playing on the web works, but you have to simulate multiple touch for table moving and that can be a bit confusing. There is a lot I'd like to talk about. I will go through every topic, insted of making the typical what went right/wrong list. Concept Working over the theme was probably one of the hardest tasks which I had to face. Originally, I had an idea of what kind of game I wanted to develop, gameplay wise - something with a lot of enemies/actors"
    inputs = tokenizer(prompt, return_tensors="pt", max_length=512).to(model.device)
    print(f"input shape is {inputs.input_ids.shape}")

    model.generation_config.cache_implementation = "static"

    if is_tp:
        model.config.hidden_size = model.config.hidden_size // world_size
        model.config.num_key_value_heads = model.config.num_key_value_heads // world_size

    for i in range(1):
        with torch.no_grad():
            start = time.time()
            outputs = model.generate(**inputs, do_sample=False, max_new_tokens=128, min_new_tokens=128)
            end = time.time()
            print(f"time cost {(end-start)*1000} ms")

    # Synchronize all ranks after the eager (uncompiled) run
    if is_tp:
        dist.barrier()

    if rank == 0:
        print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

    model.forward = torch.compile(model.forward)
    # Synchronize all ranks before timing the compiled runs
    if is_tp:
        dist.barrier()

    for i in range(4):
        with torch.no_grad():
            start = time.time()
            outputs = model.generate(**inputs, do_sample=False, max_new_tokens=128, min_new_tokens=128)
            if is_tp:
                dist.barrier()

            end = time.time()
            print(f"time cost {(end-start)*1000} ms")

    if rank == 0:
        print(tokenizer.batch_decode(outputs, skip_special_tokens=True))


if __name__ == "__main__":
    rank = int(os.environ["RANK"]) if "RANK" in os.environ else 0
    world_size = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1
    is_tp = world_size > 1
    main(is_tp, rank, world_size)
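The script above copies `PMI_RANK` and `PMI_SIZE` (set by the MPI launcher) into `RANK` and `WORLD_SIZE` and falls back to a single process when neither is present. A minimal sketch of that lookup as a standalone function is shown below. The function name `resolve_rank_and_world_size` is hypothetical, and unlike the script, which always overwrites `RANK` from `PMI_RANK`, this version prefers an existing `RANK` / `WORLD_SIZE` if one is already set.

```python
import os
from typing import Mapping, Tuple


def resolve_rank_and_world_size(env: Mapping[str, str]) -> Tuple[int, int]:
    """Hypothetical helper: derive (rank, world_size) from launcher-provided variables.

    Lookup order: RANK / WORLD_SIZE, then PMI_RANK / PMI_SIZE, then the
    single-process defaults (rank 0, world size 1).
    """
    rank = int(env.get("RANK", env.get("PMI_RANK", 0)))
    world_size = int(env.get("WORLD_SIZE", env.get("PMI_SIZE", 1)))
    return rank, world_size


if __name__ == "__main__":
    rank, world_size = resolve_rank_and_world_size(os.environ)
    # Mirrors the script's convention: tensor parallelism only with more than one rank.
    is_tp = world_size > 1
    print(f"rank={rank} world_size={world_size} is_tp={is_tp}")
```

Taking the environment as a parameter keeps the function easy to check without an MPI launcher: pass a plain dict instead of `os.environ`.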

jiqing-feng · 2025-05-16T08:18:55Z

@sywangyi @yao-matrix, please review this PR. Thanks!

yao-matrix · 2025-05-19T23:49:26Z

Does `ipex: true` work?
And what is the behavior when there is no accelerate config?

jiqing-feng · 2025-05-20T02:57:29Z

Hi @yao-matrix. I have verified that both cases work as before: with `ipex: true`, and with no accelerate config.

jiqing-feng · 2025-05-21T01:42:49Z

Hi @SunMarc. Could you please review this PR? Thanks!

SunMarc

Thanks, left a couple of comments

src/accelerate/utils/launch.py

SunMarc

LGTM, just a small nit

src/accelerate/utils/launch.py

jiqing-feng · 2025-05-26T02:41:09Z

> LGTM, just a small nit

Fixed!

SunMarc

Thanks! LGTM!

HuggingFaceDocBuilderDev · 2025-05-26T13:39:45Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

jiqing-feng marked this pull request as draft May 16, 2025 08:18

Even 1 CPU mechine can also run multi process

d9e5b76

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

jiqing-feng added 2 commits May 20, 2025 10:44

fix ccl and kml param setting

7f1b141

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

Merge branch 'huggingface:main' into tp

9f76402

jiqing-feng marked this pull request as ready for review May 21, 2025 01:42

SunMarc reviewed May 21, 2025

src/accelerate/utils/launch.py (outdated, resolved)

src/accelerate/utils/launch.py (resolved)

jiqing-feng added 2 commits May 23, 2025 09:18

set master addr only when processes > 1

65ed0f2

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

fix num process check

fb2d5a1

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

SunMarc reviewed May 23, 2025

src/accelerate/utils/launch.py (outdated, resolved)

fix ccl args check

aba6d74

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

SunMarc approved these changes May 26, 2025

SunMarc merged commit 4f3abb7 into huggingface:main May 26, 2025
24 of 25 checks passed

SunMarc merged 6 commits into huggingface:main from jiqing-feng:tp
