rmaps/base: fix logic (crash, in some cases) when num_procs > num_obj… #7643


Merged (2 commits) Apr 24, 2020

Conversation

alex--m
Contributor

@alex--m alex--m commented Apr 17, 2020

This fixes an issue I encountered, similar to #7311. Basically, the problem is that when the number of objects is smaller than the number of processes, some processes never get ranked, which results in a "violent" crash with some nasty PMIx errors.

For example, when running mpirun -n 4 --rank-by core hello_world on a single 2-socket machine, the "object" is core. Suppose you have 20 cores per socket (40 total): only the first 4 cores ended up being tested (because num_procs is 4), and the processes mapped to the second socket were not ranked (keeping the rank INVALID and making PMIx unhappy). The fix makes sure we keep ranking until we exhaust both the processes and the objects.
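The loop shape described above can be sketched as follows. This is a minimal illustration, not the actual Open MPI code; all names (rank_all, proc_obj, etc.) are hypothetical. The key point is that the object loop must cycle over all objects until every process is ranked, rather than stopping after the first num_procs objects:

```c
#include <assert.h>
#include <stddef.h>

#define RANK_INVALID (-1)

/* Hypothetical sketch: proc_obj[p] gives the index of the object
 * (e.g. core) that process p is located on. Before the fix, the
 * object loop effectively stopped after num_procs objects had been
 * visited, so processes located on later objects kept RANK_INVALID.
 * Iterating over ALL objects until every process is ranked avoids
 * that. */
static int rank_all(const int *proc_obj, int num_procs, int num_objs,
                    int *ranks /* out, size num_procs */)
{
    int num_ranked = 0, next_rank = 0;

    for (int p = 0; p < num_procs; p++)
        ranks[p] = RANK_INVALID;

    /* cycle over ALL objects, not just the first num_procs of them */
    for (int obj = 0; obj < num_objs && num_ranked < num_procs; obj++) {
        for (int p = 0; p < num_procs; p++) {
            if (ranks[p] != RANK_INVALID)
                continue;           /* already ranked */
            if (proc_obj[p] != obj)
                continue;           /* proc is not on this object */
            ranks[p] = next_rank++;
            num_ranked++;
        }
    }
    return num_ranked;              /* caller errors out if < num_procs */
}
```

With 4 processes spread across 40 cores, this version ranks all 4 even when some of them sit on cores beyond index 3.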

I also fixed some small issues I found along the way: preventing "partial ranking" (which now makes us fall back to the next ranking option), missing NULL-checks, and incorrect print-outs.

Signed-off-by: Alex Margolin [email protected]

@ompiteam-bot

Can one of the admins verify this patch?

@rhc54
Contributor

rhc54 commented Apr 17, 2020

That doesn't look correct to me - it sounds like there is something more fundamentally wrong here, and returning not_supported is incorrect. I'll take a look at this.

@awlauria
Contributor

ok to test

@alex--m
Contributor Author

alex--m commented Apr 17, 2020

@rhc54 granted, "unsupported" may misrepresent the case where not all processes have been ranked during the function call. My thought was to hit that "if ((rc == unsupported) && (!specified)) { fallback... }" path.
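The fallback path mentioned above can be sketched roughly like this. This is a hedged illustration with hypothetical names (RC_NOT_SUPPORTED, assign_ranks, user_specified), not the actual rmaps code:

```c
/* Hypothetical sketch of the fallback idea: if a ranking method reports
 * it could not rank every process, and the user did not explicitly
 * request that method, fall through to the next ranking option instead
 * of aborting the launch with an error. */
enum { RC_SUCCESS = 0, RC_NOT_SUPPORTED = 1 };

static int assign_ranks(int method_rc, int user_specified)
{
    if (RC_NOT_SUPPORTED == method_rc && !user_specified) {
        /* fallback: try the next ranking option (e.g. rank by slot) */
        return RC_SUCCESS;
    }
    /* hard error only when the user insisted on this ranking method */
    return method_rc;
}
```

The design choice here is that "unsupported" is only fatal when the user asked for that specific ranking; otherwise it is a signal to try something else.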

To help explain the issue I encountered, this output (with only a partial patch applied) may help. Notice the locale describes the "core" object location, and the ranking only reaches the fourth core, whose cpuset bitfield is 0x8000000008:

[hpc@thunder8 orte]$ /mnt/central/users/amargolin/ompi/build/bin/mpirun -np 4 --mca rmaps_base_verbose 9 --display-map --rank-by core
[thunder8.thunder:35160] [[44185,0],0] rmaps:base set policy with NULL device NONNULL
[thunder8.thunder:35160] mca:rmaps:select: checking available component mindist
[thunder8.thunder:35160] mca:rmaps:select: Querying component [mindist]
[thunder8.thunder:35160] mca:rmaps:select: checking available component ppr
[thunder8.thunder:35160] mca:rmaps:select: Querying component [ppr]
[thunder8.thunder:35160] mca:rmaps:select: checking available component rank_file
[thunder8.thunder:35160] mca:rmaps:select: Querying component [rank_file]
[thunder8.thunder:35160] mca:rmaps:select: checking available component resilient
[thunder8.thunder:35160] mca:rmaps:select: Querying component [resilient]
[thunder8.thunder:35160] mca:rmaps:select: checking available component round_robin
[thunder8.thunder:35160] mca:rmaps:select: Querying component [round_robin]
[thunder8.thunder:35160] mca:rmaps:select: checking available component seq
[thunder8.thunder:35160] mca:rmaps:select: Querying component [seq]
[thunder8.thunder:35160] [[44185,0],0]: Final mapper priorities
[thunder8.thunder:35160]        Mapper: ppr Priority: 90
[thunder8.thunder:35160]        Mapper: seq Priority: 60
[thunder8.thunder:35160]        Mapper: resilient Priority: 40
[thunder8.thunder:35160]        Mapper: mindist Priority: 20
[thunder8.thunder:35160]        Mapper: round_robin Priority: 10
[thunder8.thunder:35160]        Mapper: rank_file Priority: 0
[thunder8.thunder:35160] mca:rmaps: mapping job [44185,1]
[thunder8.thunder:35160] mca:rmaps: setting mapping policies for job [44185,1] nprocs 4
[thunder8.thunder:35160] mca:rmaps[193] mapping not set by user - using bynuma
[thunder8.thunder:35160] mca:rmaps[339] binding not given - using bynuma
[thunder8.thunder:35160] mca:rmaps:ppr: job [44185,1] not using ppr mapper PPR NULL policy PPR NOTSET
[thunder8.thunder:35160] mca:rmaps:seq: job [44185,1] not using seq mapper
[thunder8.thunder:35160] mca:rmaps:resilient: cannot perform initial map of job [44185,1] - no fault groups
[thunder8.thunder:35160] mca:rmaps:mindist: job [44185,1] not using mindist mapper
[thunder8.thunder:35160] mca:rmaps:rr: mapping job [44185,1]
[thunder8.thunder:35160] AVAILABLE NODES FOR MAPPING:
[thunder8.thunder:35160]     node: thunder8 daemon: 0
[thunder8.thunder:35160] mca:rmaps:rr: mapping no-span by NUMANode for job [44185,1] slots 36 num_procs 4
[thunder8.thunder:35160] mca:rmaps:rr: found 2 NUMANode objects on node thunder8
[thunder8.thunder:35160] mca:rmaps:rr: calculated nprocs 36
[thunder8.thunder:35160] mca:rmaps:rr: assigning nprocs 36
[thunder8.thunder:35160] mca:rmaps: assigning locations for job [44185,1]
[thunder8.thunder:35160] mca:rmaps:ppr: job [44185,1] not using ppr assign: round_robin
[thunder8.thunder:35160] mca:rmaps:resilient: job [44185,1] not using resilient assign: round_robin
[thunder8.thunder:35160] mca:rmaps:mindist: job [44185,1] not using mindist mapper
[thunder8.thunder:35160] mca:rmaps:rr: assign locations for job [44185,1]
[thunder8.thunder:35160] mca:rmaps:rr: assigning locations by NUMANode for job [44185,1]
[thunder8.thunder:35160] mca:rmaps:rr: found 2 NUMANode objects on node thunder8
[thunder8.thunder:35160] RANKING POLICY: CORE
[thunder8.thunder:35160] mca:rmaps: computing ranks by core for job [44185,1]
[thunder8.thunder:35160] mca:rmaps:rank_by: found 36 objects on node thunder8 with 4 procs

CHECK: locale=0x14aaf80 locale->cpuset=3ffff00003ffff obj=0x14985b0 obj->cpuset=1000000001
[thunder8.thunder:35160] mca:rmaps:rank_by: proc in position 0 is on object 0 assigned rank 0
[thunder8.thunder:35160] mca:rmaps:rank_by skipping proc [[44185,1],0] - already ranked, num_ranked 1

CHECK: locale=0x14ab1b0 locale->cpuset=ffc0000ffffc0000 obj=0x1498d30 obj->cpuset=2000000002
[thunder8.thunder:35160] mca:rmaps:rank_by: proc at position 1 is not on object 1

CHECK: locale=0x14aaf80 locale->cpuset=3ffff00003ffff obj=0x1498d30 obj->cpuset=2000000002
[thunder8.thunder:35160] mca:rmaps:rank_by: proc in position 2 is on object 1 assigned rank 1
[thunder8.thunder:35160] mca:rmaps:rank_by skipping proc [[44185,1],0] - already ranked, num_ranked 2

CHECK: locale=0x14ab1b0 locale->cpuset=ffc0000ffffc0000 obj=0x14993b0 obj->cpuset=4000000004
[thunder8.thunder:35160] mca:rmaps:rank_by: proc at position 1 is not on object 2
[thunder8.thunder:35160] mca:rmaps:rank_by skipping proc [[44185,1],1] - already ranked, num_ranked 2

CHECK: locale=0x14ab1b0 locale->cpuset=ffc0000ffffc0000 obj=0x14993b0 obj->cpuset=4000000004
[thunder8.thunder:35160] mca:rmaps:rank_by: proc at position 3 is not on object 2
[thunder8.thunder:35160] mca:rmaps:rank_by skipping proc [[44185,1],0] - already ranked, num_ranked 2

CHECK: locale=0x14ab1b0 locale->cpuset=ffc0000ffffc0000 obj=0x1499a30 obj->cpuset=8000000008
[thunder8.thunder:35160] mca:rmaps:rank_by: proc at position 1 is not on object 3
[thunder8.thunder:35160] mca:rmaps:rank_by skipping proc [[44185,1],1] - already ranked, num_ranked 2

CHECK: locale=0x14ab1b0 locale->cpuset=ffc0000ffffc0000 obj=0x1499a30 obj->cpuset=8000000008
[thunder8.thunder:35160] mca:rmaps:rank_by: proc at position 3 is not on object 3
[thunder8.thunder:35160] [[44185,0],0] ORTE_ERROR_LOG: Not supported in file base/rmaps_base_ranking.c at line 594
[thunder8.thunder:35160] [[44185,0],0] ORTE_ERROR_LOG: Not supported in file base/odls_base_default_fns.c at line 632
[hpc@thunder8 orte]$

@alex--m
Contributor Author

alex--m commented Apr 18, 2020

@rhc54 This one seems to have been closed because I wrote the word "fix" before referring to this patch, but IMHO it should still be applied... Sorry for not being clear about the difference between this patch and the one for PRRTE.

Every mapper is required to set the locale, which is why it is an error
if the locale attribute isn't found. Likewise, it is an error for any
mapper to set a NULL locale as it makes no sense. However, I can see
that maybe some compiler or static code checker might want to see
concrete evidence we checked it - so check it in the right place.

Backport the equivalent code from PRRTE as we know that works - more
confidence than trying to add another patch to this old code.

Signed-off-by: Ralph Castain <[email protected]>
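The defensive check described in the commit message can be sketched as below. Names and types are hypothetical (this is not the Open MPI source); the point is simply that a NULL locale, while nominally a mapper bug, is caught and reported instead of being dereferenced:

```c
#include <stddef.h>

/* Hypothetical stand-in for the hwloc object carried as a locale. */
typedef struct {
    const char *cpuset; /* e.g. "0x8000000008" */
} locale_sketch_t;

/* Every mapper is required to set a non-NULL locale, so a missing or
 * NULL locale is an internal error. Checking it here satisfies static
 * analyzers and fails cleanly rather than crashing on a dereference. */
static int check_locale(const locale_sketch_t *locale)
{
    if (NULL == locale) {
        return -1; /* report an error instead of segfaulting */
    }
    return 0;      /* safe to proceed with ranking against the cpuset */
}
```

This mirrors the "check it in the right place" remark: the invariant still holds, but the check makes the failure mode an error return rather than a crash.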
@rhc54 rhc54 reopened this Apr 18, 2020
@rhc54
Contributor

rhc54 commented Apr 18, 2020

Yeah, github can catch you that way. I've brought over the changes from PRRTE and refreshed the area of the code that had the bug. I'd prefer to keep the two in sync rather than worry about having to validate a unique patch over every use-case again - takes a long time to cover all the corners.

@rhc54 rhc54 added bug and removed ⚠️ WIP-DNM! labels Apr 18, 2020
@rhc54 rhc54 added this to the v4.0.4 milestone Apr 18, 2020
Contributor

@rhc54 rhc54 left a comment


Tracks PRRTE openpmix/prrte#512 but is not a direct cherry-pick of:

openpmix/prrte@c32127b
openpmix/prrte@a60fb1c

from that repo.

@alex--m
Contributor Author

alex--m commented Apr 18, 2020

> Yeah, github can catch you that way. I've brought over the changes from PRRTE and refreshed the area of the code that had the bug. I'd prefer to keep the two in sync rather than worry about having to validate a unique patch over every use-case again - takes a long time to cover all the corners.

makes sense, thanks!

@rhc54
Contributor

rhc54 commented Apr 18, 2020

@alex--m Thank you for not only finding the problem, but providing a fix!

@gpaulsen gpaulsen merged commit 64b78cc into open-mpi:v4.0.x Apr 24, 2020
@alex--m alex--m deleted the topic/rmaps_fix branch July 23, 2020 06:40