
[bug] reduce_aux_losses_tracker_across_ranks all_reduce bug with num_layers==pp stage #2418

@lk137095576

Description


In `reduce_aux_losses_tracker_across_ranks`:

```python
def reduce_aux_losses_tracker_across_ranks(track_names: Optional[List[str]] = None):
    ...
    torch.distributed.all_reduce(
        values, group=parallel_state.get_pipeline_model_parallel_group()
    )
```

When the number of pipeline stages equals the number of transformer layers (one layer per stage, e.g. `num_layers=4`, `pp=4`), the `all_reduce` call fails at runtime. With `num_layers=4` and `pp=2` it runs correctly.
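
For context, a minimal standalone sketch of the failure class this points to (not the actual Megatron-LM call path): `all_reduce` is only well-formed when the tensor has the same shape on every rank in the group, so if each pipeline stage's tracker tensor were sized by its local layer count rather than the global one, the shapes could diverge across ranks. The per-rank sizes below, and the diagnosis itself, are assumptions for illustration only.

```python
# Hypothetical repro of the failure class (2 ranks, gloo backend).
# If each rank sizes its "tracker" tensor differently, the collective
# is ill-formed: it errors or produces undefined results, depending
# on the backend. Sizing every rank at the global layer count (the
# commented-out line) keeps the shapes equal and the reduce valid.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Assumption for illustration: each stage allocates by a *local*
    # layer count, so tensor shapes differ across ranks.
    values = torch.zeros(rank + 1)
    # values = torch.zeros(4)  # global num_layers: shapes match, reduce works

    dist.all_reduce(values)  # mismatched shapes -> ill-formed collective
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

Under that assumption, allocating the tracker tensor at the global layer count on every pipeline rank would keep the collective well-formed; whether this matches the actual root cause here would need confirmation from the maintainers.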
