Skip to content

Openib dynamic add proc race conditions #1248

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Dec 23, 2015

Conversation

artpol84
Copy link
Contributor

@hjelmn This is relatively big PR so I'm postin it now to make you aware about it and to let you start the review during your daytime.
As for now I can say that this fixes problems I was observing with some of the OSU collectives tests. I've launched MTT to do more general evaluation. I will post the results when I'll get them.
Also this PR is not a final solution. If you'll be OK with it I will provide more fixes on top of it.

The problem was in mca_btl_openib_proc_create. This function may be called
from several places simultaneously:
* from the main thread when somebody wants to do `MPI_Send()` (for example) for
the first time;
* from udcm if the counterpart peer is trying to connect and `mca_btl_openib_get_ep()`
is called.

In this case one of the threads may add an uninitialized proc structure
to the `mca_btl_openib_component.ib_procs` and the other will read it and
treat as initialized.

This commit turns ib_proc initialization into a single atomic operation.
@artpol84
Copy link
Contributor Author

:bot:retest

@artpol84
Copy link
Contributor Author

aside the problems that are fixed with this changeset I'm seeing race conditions here: https://github.com/open-mpi/ompi/blob/master/opal/mca/btl/openib/btl_openib.c#L1021. Although opal_pointer_array_add() has lock-protection this protection is only enabled if OPAL thread support is enabled.
I've configured my OMPI by default with regad to this aspect. And I see that udcm is running in the service thread causing those race conditions.

@artpol84
Copy link
Contributor Author

Update: MTT shows noticeable improvement: 2 errors after fix versus 27 before.

@hjelmn
Copy link
Member

hjelmn commented Dec 21, 2015

Why not also make the locks in add_procs not dependent on opal_using_threads()? The function is not in the critical path.

@hjelmn
Copy link
Member

hjelmn commented Dec 21, 2015

@hppritcha Will probably want this in 2.0.0 final. Should get this merged into master today or tomorrow.

@artpol84
Copy link
Contributor Author

@hjelmn Not sure I understand what you're saying. Can you explain?

2015-12-21 22:25 GMT+06:00 Nathan Hjelm [email protected]:

Why not also make the locks in add_procs not dependent on
opal_using_threads(). The function is not in the critical path.


Reply to this email directly or view it on GitHub
#1248 (comment).

С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov

@artpol84
Copy link
Contributor Author

@hjelmn please, note that I've added one additional commit on top of previous ones:
3c2f6d5

this fixes problems with opal_pointer_array_add() I previously mentioned.

@artpol84 artpol84 force-pushed the openib_proc_init_race branch from b39b115 to 3c2f6d5 Compare December 22, 2015 12:33
@hjelmn hjelmn modified the milestones: v2.0.0, v2.x Dec 22, 2015
@hjelmn
Copy link
Member

hjelmn commented Dec 22, 2015

Looks good to me.

@artpol84
Copy link
Contributor Author

@hjelmn ok, here is one more fix that I'd like to submit in this PR.
Tomorrow I'll work on more precise way to account the number of processes. Right now it is possible that one process will be accounted twice. This may only happen in race condition case so shouldn't affect much. But this certainly has to be fixed.
But I suggest to review this as part of the separate PR when this one will be merged.
This change is not yet tested, I'll check it, run MTT and report back in few hours.
This final fix is small and unlikely to cause problems.

@artpol84 artpol84 force-pushed the openib_proc_init_race branch 2 times, most recently from e9f8a1f to aee2c31 Compare December 22, 2015 16:43
@artpol84 artpol84 force-pushed the openib_proc_init_race branch from aee2c31 to 08ad835 Compare December 22, 2015 16:44
@artpol84
Copy link
Contributor Author

@hppritcha when does 2.0.0 scheduled?

@artpol84
Copy link
Contributor Author

@hjelmn All my checks are passed. Please, feel free to merge it.

@jladd-mlnx
Copy link
Member

Well done, Artem!

On Tue, Dec 22, 2015 at 12:55 PM, Artem Polyakov [email protected]
wrote:

@hjelmn https://github.com/hjelmn All my checks are passed. Please,
feel free to merge it.


Reply to this email directly or view it on GitHub
#1248 (comment).

@hppritcha
Copy link
Member

@artpol84 can't speak for @jsquyres but we'll have an rc2 likely in second week of 1/16.

@hppritcha
Copy link
Member

@artpol84 please open a PR on v2.x for this.

@artpol84
Copy link
Contributor Author

Are we ok to merge this?

среда, 23 декабря 2015 г. пользователь Howard Pritchard написал:

@artpol84 https://github.com/artpol84 please open a PR on v2.x for this.


Reply to this email directly or view it on GitHub
#1248 (comment).


Best regards, Artem Polyakov
(Mobile mail)

@hjelmn
Copy link
Member

hjelmn commented Dec 23, 2015

Yup. Ok to merge.

hjelmn added a commit that referenced this pull request Dec 23, 2015
Openib dynamic add proc race conditions
@hjelmn hjelmn merged commit 84d890b into open-mpi:master Dec 23, 2015
@artpol84
Copy link
Contributor Author

Cool, I'll PR to 2.x.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants