Skip to content

[v5.0.x] ofi: follow user specified include/exclude list to select providers #12251

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 22, 2024

Conversation

wenduwan
Copy link
Contributor

This PR addresses #12233

Since 5.0.x, we introduced an optional FI_HMEM capability in ofi provider selection logic(both mtl and btl) in order to support accelerator memory. As described in the issue, this introduced a bug that can cause the wrong ofi provider to be selected, even if the user explicitly includes/excludes the provider name.

This change refactors the selection logic to correctly handle the include/exclude list, and therefore fixes the bug.

Signed-off-by: Wenduo Wang [email protected]
(cherry picked from commit 29efcef)

This PR addresses open-mpi#12233

Since 5.0.x, we introduced an optional FI_HMEM capability in ofi provider
selection logic(both mtl and btl) in order to support accelerator memory.
As described in the issue, this introduced a bug that can cause the wrong
ofi provider to be selected, even if the user explicitly includes/excludes
the provider name.

This change refactors the selection logic to correctly handle the
include/exclude list, and therefore fixes the bug.

Signed-off-by: Wenduo Wang <[email protected]>
(cherry picked from commit 29efcef)
@shijin-aws
Copy link
Contributor

shijin-aws commented Jan 19, 2024

why don't you backport the FI_REMOTE_COMM addition?

@wenduwan
Copy link
Contributor Author

why don't you backport the FI_REMOTE_COMM addition?

@shijin-aws Added thanks for reminder

@wenduwan
Copy link
Contributor Author

@hppritcha @shijin-aws I removed FI_REMOTE_COMM commit due to #12255

@wenduwan
Copy link
Contributor Author

@tmh97 Tom did you get a chance to verify this change on main branch? We later discovered an issue with commit 3a09492 - we are reverting it. Please exclude it from your test.

@wenduwan wenduwan merged commit 20b79bf into open-mpi:v5.0.x Jan 22, 2024
@tmh97
Copy link

tmh97 commented Jan 22, 2024

@wenduwan giving it a look now! Thanks

@tmh97
Copy link

tmh97 commented Jan 22, 2024

@wenduwan Verified this change! OPX provider is found without issue, I tested using OSU MB's, and with the following options, which are typically preferred for OPX runs --mca mtl ofi --mca btl self,vader

@wenduwan
Copy link
Contributor Author

@tmh97 Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants