
WIP: Emulate the vsyscall page in userspace in the x86_64 Docker image #157


Closed
wants to merge 4 commits

Conversation


@geofft commented Feb 9, 2018

This is an LD_PRELOAD library to catch segfaults from a kernel booted with vsyscall=none and turn them into normal syscalls. It works well enough to run bash from wheezy, but I'm not sure how well it works beyond that, and it's probably extremely buggy in the general case. I'm mostly posting it here to run Travis against the Dockerfile and for @markrwilliams' feedback / testing. :) I'll update this comment once it's close to ready for merge.

Since some recent distros are shipping with vsyscall=none by default,
the manylinux1 Docker image doesn't work. Fortunately, we can emulate
everything in userspace by catching segmentation faults for the vsyscall
addresses and forcing the program to execute a normal syscall instead.

Install a global preload library to do this, and also attempt to keep
other segmentation fault handlers working. This is brittle and isn't
intended for anything other than running the Docker image long enough to
build some wheels.
@njsmith commented Feb 9, 2018

Oh my.

From The Linux Programming Interface, p. 452:

    SIGBUS, SIGFPE, SIGILL, and SIGSEGV can be generated as a
    consequence of a hardware exception.... SUSv3 specifies that the
    behavior of a process is undefined if it returns from a handler
    for the signal, or if it ignores it or blocks the signal...

    * Blocking the signal: ... On Linux 2.4 and earlier, the kernel
      simply ignores attempts to block a hardware-generated signal;
      the signal is delivered to the process anyway, and then either
      terminates the process or is caught by a signal handler, if one
      has been established. Starting with Linux 2.6, if the signal is
      blocked, then the process is always immediately killed by that
      signal, even if the process has installed a handler for the
      signal. (The rationale for the Linux 2.6 change in the treatment
      of blocked hardware-generated signals was that the Linux 2.4
      behavior hid bugs and could cause deadlocks in threaded
      programs.)

Some programs, like rpm, aggressively block signals with masks that
include SIGSEGV.  Under normal circumstances on Linux 2.6 or later,
this always terminates the process.

However, with our user-space SIGSEGV-based vsyscall handling,
sometimes these programs terminate, as described above, while other
times they resume execution at the instruction that accessed the
offending address and enter an infinite loop.  Presumably, under
Linux 2.4, our signal handler would simply have run!

This commit patches sigprocmask(2) to remove SIGSEGV from a new signal
set in a way that should be invisible to the program that's installing
it.
@markrwilliams commented:

Whew!

I've been testing this patch in Docker containers on my desktop with vsyscall=none. It's looking good!

I opened a PR to fix a segfault in yum related to sigprocmask. However, I'm still getting one in gcc even with an empty C file.

I think maybe gcc is installing its own handler?

https://gist.github.com/markrwilliams/786d855e56ca88ba2ea76c4304b68a02#file-gcc-sigsegv-L303

@njsmith commented Feb 9, 2018

Wouldn't it be simpler and more reliable to mmap() a MAP_FIXED executable page onto the vsyscall address?

...I guess "simpler and more reliable" is not necessarily the goal here though, given that we already have a working patch for glibc.

@geofft commented Feb 9, 2018

@njsmith It would in fact be much simpler and preferable, but the address is in kernelspace so we can't map it from userspace.

@geofft commented Feb 9, 2018

A code grep indicates gcc is probably calling signal() instead of sigaction() - try something like ba9c11c. (Probably I won't have time to look in detail until tonight.)

To expand a bit on why I think this is a better plan than patching glibc:

  • It allows the Docker container to be built on host machines with vsyscall=none (which is useful for local testing and also future-proofs us against potential changes in our CI servers).
  • It's not specific to the target system - if we decide manylinux3 should be based on Debian or something instead of RHEL, the same library would work.
  • It works around the problem with Travis timing out on glibc rebuilds.

@njsmith commented Feb 9, 2018

the address is in kernelspace

Ah, darn.

My general concern is that there's a risk we'll find ourselves hunting down weird corner cases here for several years. These images have a very wide user base, most of whom have no idea what this witchcraft is, and the symptoms are likely to be super obscure, so it'll be hard to even diagnose why things are breaking and get the right people looking at them.

It allows the Docker container to be built on host machines with vsyscall=none (which is useful for local testing and also future-proofs us against potential changes in our CI servers).

One option would be to use this just for building the patched glibc, in case we need to bootstrap that on a system with vsyscall=none.

It's not specific to the target system - if we decide manylinux3 should be based on Debian or something instead of RHEL, the same library would work.

YAGNI. If this unlikely event occurs, we can always reevaluate.

It works around the problem with Travis timing out on glibc rebuilds.

It seems like there are probably other, less risky ways to address this. Do we just need a spinner process to tell Travis that we haven't frozen? What about using CircleCI or the auto-builders that the image repositories use?

Re: sigprocmask, don't we have the same issue with pthread_sigmask and the pselect family?

@geofft commented Feb 9, 2018

These images have a very wide user base, most of whom have no idea what this witchcraft is, and the symptoms are likely to be super obscure, so it'll be hard to even diagnose why things are breaking and get the right people looking at them.

One thing I'd really like to do is make this module have no effect on systems booted without vsyscall=none (do not register a segfault handler, have the preloaded functions pass through). Would doing that + also printing a "WARNING: using userspace vsyscall emulation, if you see unexpected segfaults try these things instead" to stderr be helpful?

Then this patch is strictly an improvement: it only has an effect if you were going to segfault anyway, and it eliminates some segfaults (hopefully all of them), but in the worst case you still segfault.

Re: sigprocmask, don't we have the same issue with pthread_sigmask and the pselect family?

pthread_sigmask yes, pselect unlikely because a) the usual use is that you mask fewer signals in the mask you pass to pselect (so that those signals can interrupt pselect but don't interrupt your work) and b) it only ends up mattering if someone tries to make a vsyscall from a signal handler.

Which I acknowledge is an argument that this approach is pretty brittle... I was hoping to only do the things needed to compile software using the compilers in the image, but it's certainly true that someone might be wgetting some random commercial compiler during their build or whatever.

(Also, another approach that doesn't actually help the build but might be worth doing: add something to detect if the vsyscall page is missing, print a clear error to stderr about what's going on, and exit instead of just segfaulting.)

@markrwilliams commented:

@geofft patching signal appears to have satisfied gcc! Unfortunately there's a new segfault that appears in the build log:

Running Transaction
  Updating       : tzdata                                                  1/44
  Updating       : glibc-common                                            2/44
Non-fatal POSTIN scriptlet failure in rpm package glibc-common-2.5-123.el5_11.3.x86_64

error: %post(glibc-common-2.5-123.el5_11.3.x86_64) scriptlet failed, signal 11
  Updating       : glibc                                                   3/44

The only %post scriptlet for glibc-common in the SPEC file is:

%post common -p /usr/sbin/build-locale-archive

Running /usr/sbin/build-locale-archive in a CentOS 5.11 Docker image with vsyscall-emu.so indeed segfaults in a disappointing way:

# gdb -q /usr/sbin/build-locale-archive
Reading symbols from /usr/sbin/build-locale-archive...(no debugging symbols found)...done.
(gdb) run
Starting program: /usr/sbin/build-locale-archive 
warning: Error disabling address space randomization: Operation not permitted

Program received signal SIGSEGV, Segmentation fault.
0xffffffffff600000 in ?? ()
(gdb) bt
#0  0xffffffffff600000 in ?? ()
#1  0x000000000043ea6d in ?? ()
#2  0x000000000043d82d in ?? ()
#3  0x0000000000401258 in ?? ()
#4  0x00000000004018fd in ?? ()
#5  0x0000000000403f33 in ?? ()
#6  0x000000000040072e in ?? ()
#7  0x00000000004054b0 in ?? ()
#8  0x00000000004001b9 in ?? ()
#9  0x00007fffffffeb18 in ?? ()
#10 0x0000000000000000 in ?? ()

The conspicuous absence of libc.so.6 from the backtrace implies build-locale-archive was statically linked against glibc. Unfortunately this is easy to confirm:

# objdump -D /usr/sbin/build-locale-archive | grep 0xffffffffff600000
  43ea64:       48 c7 c0 00 00 60 ff    mov    $0xffffffffff600000,%rax
# ldd /usr/sbin/build-locale-archive
        not a dynamic executable

I don't think this means this approach is dead in the water yet. An obvious next step is to patch out gettimeofday and any other vsyscalls from build-locale-archive and any other statically linked executables used by glibc. This could be easy: we could use the glibc patch to build a glibc without vsyscalls, then use bsdiff to derive patches between this and the standard, vsyscall-full glibc.

@njsmith I agree that the immediate value of this PR is that it would allow us to patch glibc on a machine without vsyscalls, which in turn would allow us to use Travis or whatever to build the CentOS 6 base image. It would also be useful to people who want to continue to build manylinux1 wheels for some reason, and finally to anybody who needs to run a Docker image that expects vsyscalls to be enabled.

I also think it will take non-trivial effort to get this right :( Maybe it's better to move this to its own repository?

Regardless, this PR has convinced me that patching glibc isn't the only way we can work around the vsyscall problem, and as a result I'm going to change the wording of PEP 571's vsyscall section to indicate that we're doing something that might still result in segmentation faults in edge cases. That will allow us to switch to a different strategy later without editing the PEP.

@geofft commented Feb 10, 2018

https://github.com/geofft/manylinux/blob/ptrace/docker/vsyscall_emu/vsyscall_trace.c is a ptrace-based alternative that ought to work well with static binaries, and also has the benefit of not requiring any fiddling with signal masks or handlers. I think it is already complete other than error handling, etc. (Compile with -ldl.)

I haven't tested this against Docker, but I have confirmed that attaching it to the bash in my terminal and running a bash + libc from Wheezy causes things to work fine, and there's no perceivable slowdown. As soon as I ^C the tracer, python, bash, etc. start dying again. @markrwilliams (or anyone) - can you try running this on the pid of your Docker daemon on a vsyscall=none host, and see if the CentOS 5 Docker image or the normal manylinux1 Docker image will run?

I'm not familiar enough with Docker to know whether we can ship this inside the container unprivileged, or whether this would just be a separate tool you'd have to run manually (which would require access to the Docker socket but not a reboot, so it should still help with CI, etc.). Perhaps the right answer is to somehow get this into Docker itself, to preserve Docker's implicit ABI-compatibility promise.

@geofft commented Feb 23, 2018

Closing in favor of #158, which uses the ptrace-based approach.

@geofft closed this Feb 23, 2018