-
Notifications
You must be signed in to change notification settings - Fork 903
rmaps/base: fix logic (crash, in some cases) when num_procs > num_obj… #7643
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ects Signed-off-by: Alex Margolin <[email protected]>
Can one of the admins verify this patch? |
That doesn't look correct to me - it sounds like there is something more fundamentally wrong here, and returning |
ok to test |
@rhc54 granted, "unsupported" may misrepresent the case where not all processes have been ranked during the function call. My thought was to hit that "if (rc == unsupported) && (!specified) { fallback... }". To help explain the issue I encountered, this output (only partial patch) may help (notice the locale describes the "core" object location, and only reaches the fourth core, where locale bitfield is 0x8000000008):
|
@rhc54 This one seems to have been closed because i wrote the word "fix" before refering to this patch, but IMHO should still be applied... Sorry for not being clear about the difference between this patch and the one for prrte. |
Every mapper is required to set the locale, which is why it is an error if the locale attribute isn't found. Likewise, it is an error for any mapper to set a NULL locale as it makes no sense. However, I can see that maybe some compiler or static code checker might want to see concrete evidence we checked it - so check it in the right place. Backport the equivalent code from PRRTE as we know that works - more confidence than trying to add another patch to this old code. Signed-off-by: Ralph Castain <[email protected]>
Yeah, github can catch you that way. I've brought over the changes from PRRTE and refreshed the area of the code that had the bug. I'd prefer to keep the two in sync rather than worry about having to validate a unique patch over every use-case again - takes a long time to cover all the corners. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracks PRRTE openpmix/prrte#512 but is not a direct cherry-pick of:
openpmix/prrte@c32127b
openpmix/prrte@a60fb1c
from that repo.
makes sense, thanks! |
@alex--m Thank you for not only finding the problem, but providing a fix! |
This fixes an issue I encountered, similar to #7311 . Basically, the problem is that when the number of objects is smaller than the number of processes - some processes never get ranked, which results in a "violent" crash with some nasty PMIx errors.
For example, when running
mpirun -n 4 --rank-by core hello_world
on a single 2-socket machine - the "object" is core. Suppose you have 20 core per socket (40 total) - only the first 4 cores ended up being tested (because num_procs is 4) and the processes mapped to the second socket were not ranked (keeping the rank INVALID and making PMIx unhappy). The fix make sure keep ranking until we exhaust both the processes and the objects.I also fixed some small issues I found on the way - preventing "partial ranking" (now makes us fallback to the next ranking option), missing NULL-checks and incorrect print-outs.
Signed-off-by: Alex Margolin [email protected]