Skip to content
This repository was archived by the owner on Sep 30, 2022. It is now read-only.

v2.0.0 patcher fixes #1128

Merged
merged 9 commits into from
May 5, 2016
Merged

v2.0.0 patcher fixes #1128

merged 9 commits into from
May 5, 2016

Conversation

hjelmn
Copy link
Member

@hjelmn hjelmn commented May 3, 2016

Fix a number of issues identified by @PHHargrove in the latest rc.

:bot:assign: @PHHargrove
:bot🏷️bug
:bot:milestone:v2.0.0

hjelmn added 5 commits May 2, 2016 16:31
This commit fixes a compile error when the system has mremap but not
MREMAP_FIXED. In this case we do not care about the value of
new_address as the argument does not exist.

Signed-off-by: Nathan Hjelm <[email protected]>

(cherry picked from open-mpi/ompi@14c34ae)

Signed-off-by: Nathan Hjelm <[email protected]>
Fixed a couple of typos in ia64 code.

Signed-off-by: Nathan Hjelm <[email protected]>

(cherry picked from open-mpi/ompi@f8b3be6)

Signed-off-by: Nathan Hjelm <[email protected]>
Signed-off-by: Nathan Hjelm <[email protected]>

(cherry picked from open-mpi/ompi@52edb43)

Signed-off-by: Nathan Hjelm <[email protected]>
The function signature of mremap on BSD (NetBSD, FreeBSD) differs from
the linux version. Added support for the BSD style of mremap.

Signed-off-by: Nathan Hjelm <[email protected]>

(cherry picked from open-mpi/ompi@eb14b34)

Signed-off-by: Nathan Hjelm <[email protected]>
Add a feature check for clflush before trying to use the clflush
instruction. As far as I can tell there is no equivalent before the
SSE2 instruction set.

Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from commit open-mpi/ompi@581e47c)
Signed-off-by: Nathan Hjelm <[email protected]>
@ompiteam-bot ompiteam-bot added this to the v2.0.0 milestone May 3, 2016
@PHHargrove
Copy link
Member

@hjelmn
I have reviewed the text of the patches and things look fine as far as I can tell.
As for testing, I am only equipped to test from a release (candidate) tarball (one in which autogen.pl has already been run).
If somebody could provide a URL to such a tarball which corresponds to this PR, then I will retest on the various platforms where I saw issues.

I should note that I see nothing in the PR that would address the patcher SEGVs seen on ARMv6 or PPC

-Paul

@hjelmn
Copy link
Member Author

hjelmn commented May 3, 2016

@PHHargrove Still working on those problems. They are most likely related and I should have a fix for the PPC one tomorrow.

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1592/ for details.

@PHHargrove
Copy link
Member

Nevermind my request for a tarball - I was able to generate one on my own (after hunting down a system w/ new enough autotools).

@PHHargrove
Copy link
Member

@hjelmn

On the Linux/Pentium-III and NetBSD7-i386 systems I get past the previous failure only to hit a new one, introduced by your "checker: check for cflush" commit:

libtool: compile:  gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I/home/phargrov/OMPI/openmpi-pr1128-netbsd7-i386/openmpi-gitclone/opal/mca/patcher -I../../../opal/include -I../../../ompi/include -I../../../oshmem/include -I../../../opal/mca/hwloc/hwloc1112/hwloc/include/private/autogen -I../../../opal/mca/hwloc/hwloc1112/hwloc/include/hwloc/autogen -I../../../ompi/mpiext/cuda/c -I/home/phargrov/OMPI/openmpi-pr1128-netbsd7-i386/openmpi-gitclone -I../../.. -I/home/phargrov/OMPI/openmpi-pr1128-netbsd7-i386/openmpi-gitclone/opal/include -I/home/phargrov/OMPI/openmpi-pr1128-netbsd7-i386/openmpi-gitclone/orte/include -I../../../orte/include -I/home/phargrov/OMPI/openmpi-pr1128-netbsd7-i386/openmpi-gitclone/ompi/include -I/home/phargrov/OMPI/openmpi-pr1128-netbsd7-i386/openmpi-gitclone/oshmem/include -I/home/phargrov/OMPI/openmpi-pr1128-netbsd7-i386/openmpi-gitclone/opal/mca/hwloc/hwloc1112/hwloc/include -I/home/phargrov/OMPI/openmpi-pr1128-netbsd7-i386/BLD/opal/mca/hwloc/hwloc1112/hwloc/include -I/home/phargrov/OMPI/openmpi-pr1128-netbsd7-i386/openmpi-gitclone/opal/mca/event/libevent2022/libevent -I/home/phargrov/OMPI/openmpi-pr1128-netbsd7-i386/openmpi-gitclone/opal/mca/event/libevent2022/libevent/include -I/home/phargrov/OMPI/openmpi-pr1128-netbsd7-i386/BLD/opal/mca/event/libevent2022/libevent/include -g -finline-functions -fno-strict-aliasing -pthread -MT base/patcher_base_patch.lo -MD -MP -MF base/.deps/patcher_base_patch.Tpo -c /home/phargrov/OMPI/openmpi-pr1128-netbsd7-i386/openmpi-gitclone/opal/mca/patcher/base/patcher_base_patch.c  -fPIC -DPIC -o base/.libs/patcher_base_patch.o
/home/phargrov/OMPI/openmpi-pr1128-netbsd7-i386/openmpi-gitclone/opal/mca/patcher/base/patcher_base_patch.c: In function 'flush_and_invalidate_cache':
/home/phargrov/OMPI/openmpi-pr1128-netbsd7-i386/openmpi-gitclone/opal/mca/patcher/base/patcher_base_patch.c:87:9: error: PIC register clobbered by 'ebx' in 'asm'
         __asm__ volatile ("cpuid" :
         ^
*** Error code 1

The problem comes from the fact that "-fPIC" on x86 uses EBX as the base register (and NetBSD and Darwin produce PIE executables by default, even w/o -fPIC). So, clobbering EBX is NOT AN OPTION and you must save/restore it. Code to do the save/restore exists in opal/include/opal/sys/ia32/timer.h:opal_sys_timer_get_cycles(). If you choose to clone that logic, then please drop the "{" and "}" in the inline asm (but not the characters between them) to avoid breaking things with Solaris Studio compilers (see https://www.open-mpi.org/community/lists/devel/2015/07/17585.php).

On IA64 I get past build, but "make check" fails the dl_open test with a SEGV in the memory_patcher code:

Core was generated by `.libs/dlopen_test'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x2000000000630f01 in ?? ()
   from /eng/home/PHHargrove/OMPI/openmpi-pr1128-linux-ia64/INST/lib/libopen-pal.so.20
(gdb) where
#0  0x2000000000630f01 in ?? ()
   from /eng/home/PHHargrove/OMPI/openmpi-pr1128-linux-ia64/INST/lib/libopen-pal.so.20
#1  0x200000000084ce20 in memory_patcher_get_shm_seg_size (shmaddr=0x2000000001000000)
    at /eng/home/PHHargrove/OMPI/openmpi-pr1128-linux-ia64/openmpi-gitclone/opal/mca/memory/patcher/memory_patcher_component.c:316
#2  0x200000000084d2b0 in intercept_shmdt (shmaddr=0x2000000001000000)
    at /eng/home/PHHargrove/OMPI/openmpi-pr1128-linux-ia64/openmpi-gitclone/opal/mca/memory/patcher/memory_patcher_component.c:370
#3  0x2000000000eb1a00 in sysv_runtime_query (module=0x600ffffffffeeec8, priority=0x600ffffffffeeed0,
    hint=0x0)
    at /eng/home/PHHargrove/OMPI/openmpi-pr1128-linux-ia64/openmpi-gitclone/opal/mca/shmem/sysv/shmem_sysv_component.c:203
#4  0x200000000086c5e0 in opal_shmem_base_runtime_query (best_module=0x600ffffffffeef08,
    best_component=0x600ffffffffeef00)
    at /eng/home/PHHargrove/OMPI/openmpi-pr1128-linux-ia64/openmpi-gitclone/opal/mca/shmem/base/shmem_base_select.c:98
#5  0x200000000086cc30 in opal_shmem_base_select ()
    at /eng/home/PHHargrove/OMPI/openmpi-pr1128-linux-ia64/openmpi-gitclone/opal/mca/shmem/base/shmem_base_select.c:196
#6  0x2000000000661630 in opal_init (pargc=0x600ffffffffeef60, pargv=0x600ffffffffeef68)
    at /eng/home/PHHargrove/OMPI/openmpi-pr1128-linux-ia64/openmpi-gitclone/opal/runtime/opal_init.c:494
#7  0x40000000000018b0 in main (argc=1, argv=0x600ffffffffef208)
    at /eng/home/PHHargrove/OMPI/openmpi-pr1128-linux-ia64/openmpi-gitclone/ompi/debuggers/dlopen_test.c:134

On the old RHEL AS4 system (where MREMAP_FIXED is in linux/mman.h), I was able to build. Additionally, I verified that MREMAP_FIXED was actually used on that system, since you had made it optional.

On NetBSD-7/amd64 I had to --enable-mca-no-build=io-ompio to get around an unrelated issue.
However, things work fine with your changes on that platform as far as I can tell.

And one additional observation for you:

/home/phhargrove/OMPI/openmpi-pr1128-gilbert-no-pshm/openmpi-gitclone/opal/mca/memory/patcher/memory_patcher_component.c:370: warning: passing arg 1 of `opal_mem_hooks_release_hook' discards qualifiers from pointer target type

@PHHargrove
Copy link
Member

:bot:assign: @hjelmn

@ompiteam-bot ompiteam-bot assigned hjelmn and unassigned PHHargrove May 3, 2016
ebx can not be clobbered when using -fPIC so save and restore the
register instead of allowing it to be clobbered.

Signed-off-by: Nathan Hjelm <[email protected]>

(cherry picked from open-mpi/ompi@2d0e2b6)

Signed-off-by: Nathan Hjelm <[email protected]>
@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1606/ for details.

@PHHargrove
Copy link
Member

@hjelmn
With the fix for preservation of ebx I am able to "make", "make check" and run ring_c on the Linux/Pentium3 and NetBSD/i386 systems which would not "make" previously.

@hjelmn
Copy link
Member Author

hjelmn commented May 4, 2016

Almost done with the patcher fixes. I can now run with ppc32. Just one more bug to track down.

@PHHargrove
Copy link
Member

@hjelmn
Those are famous last words:

Just one more bug to track down

Recall that in addition to PPC, I have seen ARM and IA64 failures in the patcher code, and MIPS failures that I could not isolate due to no working debugger. I can provide pointers to the corresponding devel list postings for ARM and MIPS if needed.

@hjelmn
Copy link
Member Author

hjelmn commented May 4, 2016

@PHHargrove The ARM and PPC issues are more or less the same. I don't have an ARM system to test on but the backtrace is identical to one I saw on PPC. IA64.. Not sure what I can do about that since we don't have any ia64 systems and qemu doesn't support ia64.

@PHHargrove
Copy link
Member

PHHargrove commented May 4, 2016

@hjelmn

I have ARM h/w I could give you an account on if QEMU is too horribly slow.
It is a $35 Raspberry Pi - you should consider buying one.

I do have both MIPS64 and IA64 h/w access, but "second hand" such that I probably cannot get you an account on either.

If there is no way for you and I to fix IA64 support in a reasonable time frame, then I would suggest that "patcher" should be disabled on IA64.
Similarly if you cannot get a positive confirmation that the 2.0.0 RCs have run successfully on SPARC.

@hjelmn
Copy link
Member Author

hjelmn commented May 4, 2016

It looks like the linux patcher isn't quite ready for primetime. It gets all instances of munmap except those called within glibc. I will bring set of patches that will 1) disable overwrite on ia64 (memory hooks on this hardware are not worth the time), 2) disable patcher/linux until it can be fixed, and 3) fix the PPC TOC patching to only apply to PPC64. Once these are in please test.

hjelmn and others added 3 commits May 4, 2016 18:58
The table of contents (TOC) code only appears to only apply to
ppc64. The code was incorrectly assuming the existence of the TOC on
ppc32. This commit updates the necessary code to only apply to ppc64.

Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from commit open-mpi/ompi@71be36d)
Signed-off-by: Nathan Hjelm <[email protected]>
Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from commit open-mpi/ompi@6c9a0e1)
Signed-off-by: Nathan Hjelm <[email protected]>
@hjelmn
Copy link
Member Author

hjelmn commented May 5, 2016

@PHHargrove Should be good to go. Includes the patches you just tested on master.

@PHHargrove
Copy link
Member

Tests queued... should have a report late tonight.

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1617/ for details.

@PHHargrove
Copy link
Member

@hjelmn 👍
All the platforms I tested in the PR for master have been tested with this PR's branch.
They all PASS with the exception of ppc64 w/ -m32 which fails to build due to lack of the hwloc fix.
However, ppc32 (on 32-bit h/w) is fine.

@jsquyres
Copy link
Member

jsquyres commented May 5, 2016

@hjelmn Your comment above makes me nervous: "It looks like the linux patcher isn't quite ready for primetime". Indeed -- it looks like you removed the linux patcher component as the last commit in this PR (at last as of now).

Where do we stand with this PR -- are we good to go with the Linux patcher components in v2.0.0?

Also, do we need open-mpi/ompi@ff2a54b in this PR?

@hppritcha
Copy link
Member

I think we do want open-mpi/ompi@ff2a54b, given the reasons why ucx did a cleanup.

@hjelmn
Copy link
Member Author

hjelmn commented May 5, 2016

@jsquyres We are good to go. The overwrite patcher is working on x86, x86_64, ppc, and ppc64. I removed the linux patcher because it misses the munmap calls made by free. UCX gets around that by using the glibc malloc hook interface. I think for now that the overwrite patcher is sufficient and I plan to fix ia64 support and add sparcv9 support in v2.1.0. ia64 is very low priority and sparcv9 is medium.

@hjelmn
Copy link
Member Author

hjelmn commented May 5, 2016

@hppritcha I left open-mpi/ompi@ff2a54b in master and will only bring it over if I can figure out the best way to handle hooking free(). The cleanup was just to remove a lot of duplicate code. The code is now a bit easier to follow.

@jsquyres
Copy link
Member

jsquyres commented May 5, 2016

@hjelmn Ok.

@jsquyres jsquyres merged commit 74d8ea0 into open-mpi:v2.x May 5, 2016
@hppritcha
Copy link
Member

I'm good with this in its current state.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants