@tianhaodongbd (Contributor) commented Jun 25, 2025

PR Category

Distributed Strategy

PR Types

Bug fixes

Description

Pcard-90602
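When pipeline parallelism (pp) is enabled, recreating the NCCL comm can hang. The cause is that, within a single communication group, the `unique_key` values are fetched in no fixed order during TCP communication, so ranks end up waiting on each other. This PR fixes the problem by replacing the unordered map with an ordered map.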
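
The ordering issue can be illustrated with a small sketch. This is not Paddle's actual code: `CommTable`, `ExchangeOrder`, `CheckRanksAgree`, and the `pp_group_*` keys are hypothetical names chosen for illustration, and the blocking TCP exchange is only modeled as "the order in which keys are visited". With `std::unordered_map` the C++ standard leaves iteration order unspecified, so two ranks holding the same keys are not guaranteed to visit them in the same sequence. With `std::map`, iteration is always in sorted key order, regardless of insertion order.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-rank table of communicators to recreate.
// The key plays the role of unique_key; the value is a placeholder.
using CommTable = std::map<std::string, int>;

// Returns the order in which a rank would perform its (blocking) key
// exchanges. std::map iterates in sorted key order, so the result depends
// only on the set of keys, not on how the table was filled.
std::vector<std::string> ExchangeOrder(const CommTable& table) {
  std::vector<std::string> order;
  for (const auto& kv : table) {
    order.push_back(kv.first);
  }
  return order;
}

// Two ranks register the same groups in different insertion orders and
// still agree on the exchange sequence.
void CheckRanksAgree() {
  CommTable rank0;
  rank0["pp_group_1"] = 1;
  rank0["pp_group_0"] = 0;

  CommTable rank1;
  rank1["pp_group_0"] = 0;
  rank1["pp_group_1"] = 1;

  assert(ExchangeOrder(rank0) == ExchangeOrder(rank1));
}
```

If `CommTable` were an `std::unordered_map`, the equality checked in `CheckRanksAgree` would no longer be guaranteed by the standard, which is the kind of mismatch that can leave peers blocked on different keys.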

@gongweibao (Contributor) left a comment

LGTM

@SylarTiaNII (Contributor) left a comment

LGTM

@SylarTiaNII merged commit 1acc974 into PaddlePaddle:incubate/fleety_20250421 on Jun 26, 2025
10 of 14 checks passed
tianhaodongbd added a commit to tianhaodongbd/Paddle that referenced this pull request Jul 22, 2025
tianhaodongbd added a commit to tianhaodongbd/Paddle that referenced this pull request Aug 30, 2025
3 participants