
WIP: Emulate the vsyscall page in userspace in the x86_64 Docker image #157


Closed
wants to merge 4 commits

Conversation


@geofft commented Feb 9, 2018

This is an LD_PRELOAD library to catch segfaults from a kernel booted with vsyscall=none and turn them into normal syscalls. It works well enough to run bash from wheezy, but I'm not sure how well it works beyond that, and it's probably extremely buggy in the general case. I'm mostly posting it here to run Travis against the Dockerfile and for @markrwilliams' feedback / testing. :) I'll update this comment once it's close to ready for merge.

Since some recent distros are shipping with vsyscall=none by default,
the manylinux1 Docker image doesn't work. Fortunately, we can emulate
everything in userspace by catching segmentation faults for the vsyscall
addresses and forcing the program to execute a normal syscall instead.

Install a global preload library to do this, and also attempt to keep
other segmentation fault handlers working. This is brittle and isn't
intended for anything other than running the Docker image long enough to
build some wheels.
@njsmith commented Feb 9, 2018

Oh my.

From The Linux Programming Interface, p. 452:

    SIGBUS, SIGFPE, SIGILL, and SIGSEGV can be generated as a
    consequence of a hardware exception.... SUSv3 specifies that the
    behavior of a process is undefined if it returns from a handler
    for the signal, or if it ignores it or blocks the signal...

    * Blocking the signal: ... On Linux 2.4 and earlier, the kernel
      simply ignores attempts to block a hardware-generated signal;
      the signal is delivered to the process anyway, and then either
      terminates the process or is caught by a signal handler, if one
      has been established. Starting with Linux 2.6, if the signal is
      blocked, then the process is always immediately killed by that
      signal, even if the process has installed a handler for the
      signal. (The rationale for the Linux 2.6 change in the treatment
      of blocked hardware-generated signals was that the Linux 2.4
      behavior hid bugs and could cause deadlocks in threaded
      programs.)

Some programs, like rpm, aggressively block signals with masks that
include SIGSEGV.  Under normal circumstances on Linux 2.6 or later,
this always terminates the process.

However, with our user-space SIGSEGV-based vsyscall handling,
sometimes these programs terminate, as described above, while other
times they resume execution at the instruction that accessed the
offending address and enter an infinite loop.  Presumably, under
Linux 2.4, our signal handler would simply have run!

This commit patches sigprocmask(2) to remove SIGSEGV from a new signal
set in a way that should be invisible to the program that's installing
it.
@markrwilliams commented:

Whew!

I've been testing this patch in Docker containers on my desktop with vsyscall=none. It's looking good!

I opened a PR to fix a segfault in yum related to sigprocmask. However, I'm still getting one in gcc even with an empty C file.

I think maybe gcc is installing its own handler?

https://gist.github.com/markrwilliams/786d855e56ca88ba2ea76c4304b68a02#file-gcc-sigsegv-L303

@njsmith commented Feb 9, 2018

Wouldn't it be simpler and more reliable to mmap() a MAP_FIXED executable page onto the vsyscall address?

...I guess "simpler and more reliable" is not necessarily the goal here though, given that we already have a working patch for glibc.

@geofft commented Feb 9, 2018

@njsmith It would in fact be much simpler and preferable, but the address is in kernelspace so we can't map it from userspace.

@geofft commented Feb 9, 2018

A code grep indicates gcc is probably calling signal() instead of sigaction() - try something like ba9c11c. (Probably I won't have time to look in detail until tonight.)

To expand a bit on why I think this is a better plan than patching glibc:

  • It allows the Docker container to be built on host machines with vsyscall=none (which is useful for local testing and also future-proofs us against potential changes in our CI servers).
  • It's not specific to the target system - if we decide manylinux3 should be based on Debian or something instead of RHEL, the same library would work.
  • It works around the problem with Travis timing out on glibc rebuilds.

@njsmith commented Feb 9, 2018

the address is in kernelspace

Ah, darn.

My general concern is that there's a risk we'll find ourselves hunting down weird corner cases here for several years. These images have a very wide user base, most of whom have no idea what this witchcraft is, and the symptoms are likely to be super obscure, so it'll be hard to even diagnose why things are breaking and get the right people looking at them.

It allows the Docker container to be built on host machines with vsyscall=none (which is useful for local testing and also future-proofs us against potential changes in our CI servers).

One option would be to use this just for building the patched glibc, in case we need to bootstrap that on a system with vsyscall=none.

It's not specific to the target system - if we decide manylinux3 should be based on Debian or something instead of RHEL, the same library would work.

YAGNI. If this unlikely event occurs, we can always reevaluate.

It works around the problem with Travis timing out on glibc rebuilds.

It seems like there are probably other, less risky ways to address this. Do we just need a spinner process to tell Travis that we haven't frozen? What about using CircleCI or the auto-builders that the image repositories use?

Re: sigprocmask, don't we have the same issue with pthread_sigmask and the pselect family?

@geofft commented Feb 9, 2018

These images have a very wide user base, most of whom have no idea what this witchcraft is, and the symptoms are likely to be super obscure, so it'll be hard to even diagnose why things are breaking and get the right people looking at them.

One thing I'd really like to do is make this module have no effect on systems booted without vsyscall=none (do not register a segfault handler, have the preloaded functions pass through). Would doing that + also printing a "WARNING: using userspace vsyscall emulation, if you see unexpected segfaults try these things instead" to stderr be helpful?

Then this patch is strictly an improvement: it only has an effect if you were going to segfault anyway, and it eliminates some segfaults (hopefully all of them), but in the worst case you still segfault.

Re: sigprocmask, don't we have the same issue with pthread_sigmask and the pselect family?

pthread_sigmask yes, pselect unlikely because a) the usual use is that you mask fewer signals in the mask you pass to pselect (so that those signals can interrupt pselect but don't interrupt your work) and b) it only ends up mattering if someone tries to make a vsyscall from a signal handler.

Which I acknowledge is an argument that this approach is pretty brittle... I was hoping to only do the things needed to compile software using the compilers in the image, but it's certainly true that someone might be wgetting some random commercial compiler during their build or whatever.

(Also, another approach that doesn't actually help the build but might be worth doing: add something to detect if the vsyscall page is missing, print a clear error to stderr about what's going on, and exit instead of just segfaulting.)

@markrwilliams commented:

@geofft patching signal appears to have satisfied gcc! Unfortunately there's a new segfault that appears in the build log:

Running Transaction
  Updating       : tzdata                                                  1/44
  Updating       : glibc-common                                            2/44
Non-fatal POSTIN scriptlet failure in rpm package glibc-common-2.5-123.el5_11.3.x86_64

error: %post(glibc-common-2.5-123.el5_11.3.x86_64) scriptlet failed, signal 11
  Updating       : glibc                                                   3/44

The only %post scriptlet for glibc-common in the SPEC file is:

%post common -p /usr/sbin/build-locale-archive

Running /usr/sbin/build-locale-archive in a CentOS 5.11 Docker image with vsyscall-emu.so indeed segfaults in a disappointing way:

# gdb -q /usr/sbin/build-locale-archive
Reading symbols from /usr/sbin/build-locale-archive...(no debugging symbols found)...done.
(gdb) run
Starting program: /usr/sbin/build-locale-archive 
warning: Error disabling address space randomization: Operation not permitted

Program received signal SIGSEGV, Segmentation fault.
0xffffffffff600000 in ?? ()
(gdb) bt
#0  0xffffffffff600000 in ?? ()
#1  0x000000000043ea6d in ?? ()
#2  0x000000000043d82d in ?? ()
#3  0x0000000000401258 in ?? ()
#4  0x00000000004018fd in ?? ()
#5  0x0000000000403f33 in ?? ()
#6  0x000000000040072e in ?? ()
#7  0x00000000004054b0 in ?? ()
#8  0x00000000004001b9 in ?? ()
#9  0x00007fffffffeb18 in ?? ()
#10 0x0000000000000000 in ?? ()

The conspicuous absence of libc.so.6 from the backtrace implies build-locale-archive was statically linked against glibc. Unfortunately this is easy to confirm:

# objdump -D /usr/sbin/build-locale-archive | grep 0xffffffffff600000
  43ea64:       48 c7 c0 00 00 60 ff    mov    $0xffffffffff600000,%rax
# ldd /usr/sbin/build-locale-archive
        not a dynamic executable

I don't think this means this approach is dead in the water yet. An obvious next step is to patch out gettimeofday and any other vsyscalls from build-locale-archive and any other statically linked executables used by glibc. This could be easy: we could use the glibc patch to build a glibc without vsyscalls, then use bsdiff to derive patches between this and the standard, vsyscall-full glibc.

@njsmith I agree that the immediate value of this PR is that it would allow us to patch glibc on a machine without vsyscalls, which in turn would allow us to use Travis or whatever to build the CentOS 6 base image. It would also be useful to people who want to continue to build manylinux1 wheels for some reason, and finally to anybody who needs to run a Docker image that expects vsyscalls to be enabled.

I also think it will take non-trivial effort to get this right :( Maybe it's better to move this to its own repository?

Regardless, this PR has convinced me that patching glibc isn't the only way we can work around the vsyscall problem, and as a result I'm going to change the wording of PEP 571's vsyscall section to indicate that we're doing something that might still result in segmentation faults in edge cases. That will allow us to switch to a different strategy later without editing the PEP.

@geofft commented Feb 10, 2018

https://github.com/geofft/manylinux/blob/ptrace/docker/vsyscall_emu/vsyscall_trace.c is a ptrace-based alternative that ought to work well with static binaries, and also has the benefit of not requiring any fiddling with signal masks or handlers. I think it is already complete other than error handling, etc. (Compile with -ldl.)

I haven't tested this against Docker, but I have confirmed that attaching it to the bash in my terminal and running a bash + libc from Wheezy causes things to work fine, and there's no perceivable slowdown. As soon as I ^C the tracer, python, bash, etc. start dying again. @markrwilliams (or anyone) - can you try running this on the pid of your Docker daemon on a vsyscall=none host, and see if the CentOS 5 Docker image or the normal manylinux1 Docker image will run?

I'm not familiar enough with Docker to know whether we can ship this inside the container unprivileged, or whether this would just be a separate tool you'd have to run manually (which would require access to the Docker socket but not a reboot, so it should still help with CI, etc.). Perhaps the right answer is to somehow get this into Docker itself, to preserve Docker's implicit ABI-compatibility promise.

@geofft commented Feb 23, 2018

Closing in favor of #158, which uses the ptrace-based approach.

@geofft closed this Feb 23, 2018