In 1.25.0, running the rustc tests on a ryzen gets invalid opcodes and eventual reboot #49751

Closed
zarniwhoop73 opened this issue Apr 7, 2018 · 23 comments
Labels
A-testsuite Area: The testsuite used to check the correctness of rustc C-bug Category: This is a bug. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@zarniwhoop73

I build from source, and I'm now using a Ryzen 1300X. Before I got that I had stopped running the tests in rustc because they take so long and the results were known. But I'd been using 1.22.1 and for firefox-60 I'll need something newer, so I built 1.25.0. In fact I built it several times, and installed in /opt (I had had weird scripting issues with 1.23 which I never got to the bottom of, but 1.24 on another machine seemed fine). Then I used that to build ff-59.0.2, which I am now running, and a couple of test builds of ff-60b9 with different settings.

Everything looked good, so I decided to install in /usr and to run the tests before installing. This is with the shipped llvm; I built the basic system a week before 6.0 was released. But when I came back to the machine after running the tests, it had rebooted. In my syslog was:

Apr 6 04:44:05 origin kernel: [32318.734501] traps: backtrace.stage[3292] trap invalid opcode ip:7f770d540498 sp:7fff8a04b2f0 error:0 in libstd-23815cc482a70678.so[7f770d4e6000+155000]
Apr 6 04:44:05 origin kernel: [32318.761267] traps: backtrace.stage[3305] trap invalid opcode ip:7f2a670f3498 sp:7ffdf2cd6060 error:0 in libstd-23815cc482a70678.so[7f2a67099000+155000]
Apr 6 04:47:48 origin kernel: [32541.823158] segfault-no-out[16965]: segfault at 0 ip 000055fd937ceb69 sp 00007fff8f94f100 error 6 in segfault-no-out-of-stack.stage2-x86_64-unknown-linux-gnu[55fd937cb000+5000]
Apr 6 04:47:55 origin kernel: [32548.615545] signal-exit-sta[17770]: segfault at 1 ip 000055cc448b9efc sp 00007fff3879c310 error 6 in signal-exit-status.stage2-x86_64-unknown-linux-gnu[55cc448b7000+4000]
Apr 6 04:47:58 origin kernel: [32551.276777] traps: simd-target-fea[18051] trap invalid opcode ip:562ab9dfc89c sp:7fff6be1eb50 error:0 in simd-target-feature-mixup.stage2-x86_64-unknown-linux-gnu[562ab9df9000+7000]

In case it was a weird one-off problem, I retried and got a similar result. On ryzen, a few instructions which were present in previous micro-architectures such as Kaveri are no longer present, but binaries built for older AMD machines do work, so I figured a non-ryzen instruction had crept in. But how to get rid of it?

If (with 1.22.1 currently in /usr) I use
rustc -C target-cpu=help

it tells me target-cpu=native will use znver1 which should be fine. But I cannot manage to get it to avoid the invalid instructions (although as I said, without running the tests all seems fine).

The results from google were ambiguous (maybe things have changed over time), but I tried:

export RUSTC_FLAGS='-C target-cpu=native'
and =x86-64
and =skylake (I would expect that to fail earlier, so it suggests that the envvar was not recognized)
which all trapped and segfaulted early in the tests (I was attentive and interrupted the builds instead of letting them carry on and reboot)

Then I tried, in config.toml

[build]
rustflags = "-C target-cpu=native"

which was very quickly spat out as an unknown field.

Then I tried

export RUSTFLAGS='-C target-cpu=native'
and =x86-64 and =amdfam10 : all trapped and segfaulted

similarly
export CARGO_BUILD_RUSTFLAGS='-C target-cpu=native'
and =x86-64 and =amdfam10

This is "disappointing". NB - using RUST_BACKTRACE=1 didn't give me any information about what had generated the invalid opcode.
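
One way to check the "non-ryzen instruction" theory independently of any target-cpu flag is to ask the running CPU what it advertises. The following is a minimal sketch, assuming a Rust toolchain recent enough to provide the `is_x86_feature_detected!` macro; the function name `detected_features` and the sample of feature names are chosen here for illustration and are not taken from the issue.

```rust
/// Report which of a few x86 CPU features the running processor advertises.
/// The feature list is an arbitrary sample chosen for illustration.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn detected_features() -> Vec<(&'static str, bool)> {
    vec![
        ("sse2", is_x86_feature_detected!("sse2")),
        ("sse4.2", is_x86_feature_detected!("sse4.2")),
        ("popcnt", is_x86_feature_detected!("popcnt")),
        ("avx", is_x86_feature_detected!("avx")),
        ("avx2", is_x86_feature_detected!("avx2")),
        ("fma", is_x86_feature_detected!("fma")),
        ("bmi1", is_x86_feature_detected!("bmi1")),
        ("bmi2", is_x86_feature_detected!("bmi2")),
    ]
}

/// On other architectures the macro does not exist, so report nothing.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn detected_features() -> Vec<(&'static str, bool)> {
    Vec::new()
}

fn main() {
    // Print one line per sampled feature with a yes/no flag.
    for (name, present) in detected_features() {
        println!("{:<8} {}", name, if present { "yes" } else { "no" });
    }
}
```

If a feature that a given target-cpu setting enables shows up as "no" here, a binary built with that setting could plausibly hit an invalid opcode on this machine. The sketch says nothing about whether that is what happened in this report.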

@ishitatsuyuki
Contributor

Is this reproducible? Are you aware of the Ryzen SEGV defect?

@zarniwhoop73
Author

zarniwhoop73 commented Apr 7, 2018 via email

@zarniwhoop73
Author

After getting results from somebody else who builds on a ryzen, and a long process of trying different possibilities, I now conclude that the debug build generates invalid opcodes on ryzen, but if I build gdb before the build then the tests complete ok (1 failure, issue-37131.rs, which needs thumbv6m-none-eabi, and I'm only building for x86).

From my syslog:
Apr 10 20:59:57 origin kernel: [ 8467.800995] traps: backtrace.stage[20598] trap invalid opcode ip:7fa7d64064c8 sp:7ffc11b82980 error:0 in libstd-23815cc482a70678.so[7fa7d6398000+136000]
Apr 10 20:59:57 origin kernel: [ 8467.857580] traps: backtrace.stage[20610] trap invalid opcode ip:7fb046faf4c8 sp:7fffcf095770 error:0 in libstd-23815cc482a70678.so[7fb046f41000+136000]
Apr 10 21:03:23 origin kernel: [ 8673.748998] segfault-no-out[2023]: segfault at 0 ip 00005603f228bb59 sp 00007ffd7a742310 error 6 in segfault-no-out-of-stack.stage2-x86_64-unknown-linux-gnu[5603f2288000+5000]
Apr 10 21:03:30 origin kernel: [ 8680.380639] signal-exit-sta[2943]: segfault at 1 ip 0000559ab8dceeec sp 00007ffd56770ec0 error 6 in signal-exit-status.stage2-x86_64-unknown-linux-gnu[559ab8dcc000+4000]
Apr 10 21:03:32 origin kernel: [ 8682.536323] traps: simd-target-fea[3192] trap invalid opcode ip:56396ae7589c sp:7fff4ab39590 error:0 in simd-target-feature-mixup.stage2-x86_64-unknown-linux-gnu[56396ae72000+7000]

@zarniwhoop73
Author

After getting myself confused by the many core/ directories, I actually got core files for ./build/x86_64-unknown-linux-gnu/test/run-pass/core and ./build/x86_64-unknown-linux-gnu/test/run-pass/panic-runtime/core but since the processes seem to have completed I guess the core files are unusable.

@XAMPPRocky
Member

@zarniwhoop73 Are you still having this issue with the latest stable?

@XAMPPRocky XAMPPRocky added A-testsuite Area: The testsuite used to check the correctness of rustc T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. C-bug Category: This is a bug. labels Jun 30, 2018
@zarniwhoop73
Author

Yes. From 1.27.0, in my syslog -

Jul 1 19:11:05 origin kernel: [37438.310517] traps: abort-on-c-abi.[24649] trap invalid opcode ip:5592e24b4c4b sp:7ffff2aba820 error:0 in abort-on-c-abi.stage2-x86_64-unknown-linux-gnu[5592e24b1000+6000]
Jul 1 19:11:15 origin kernel: [37448.106185] traps: backtrace.stage[27617] trap invalid opcode ip:7f8346820184 sp:7fffb2399780 error:0 in libstd-dcb7ecd0c7e4a1a0.so[7f83467c3000+183000]
Jul 1 19:11:15 origin kernel: [37448.148619] traps: backtrace.stage[27631] trap invalid opcode ip:7f3ef1c44184 sp:7ffe110a0070 error:0 in libstd-dcb7ecd0c7e4a1a0.so[7f3ef1be7000+183000]
Jul 1 19:14:11 origin kernel: [37624.541303] segfault-no-out[7802]: segfault at 0 ip 00005601f8151619 sp 00007ffd663019a0 error 6 in segfault-no-out-of-stack.stage2-x86_64-unknown-linux-gnu[5601f814e000+5000]
Jul 1 19:14:17 origin kernel: [37630.456909] signal-exit-sta[8580]: segfault at 1 ip 000055cbb516ebec sp 00007ffd18f9efc0 error 6 in signal-exit-status.stage2-x86_64-unknown-linux-gnu[55cbb516c000+4000]
Jul 1 19:14:20 origin kernel: [37632.890457] traps: simd-target-fea[8891] trap invalid opcode ip:55b805051d4c sp:7ffdf238cb10 error:0 in simd-target-feature-mixup.stage2-x86_64-unknown-linux-gnu[55b80504e000+7000]

@zarniwhoop73
Author

Just a note that I'm now testing 1.31.1 and I get ten traps for invalid opcodes.

This does not exactly give me great confidence about the correctness of the test code, or of rust itself, but it seems to be ok when I actually use it.

@jonas-schievink
Contributor

It would be helpful if you could open one of the failing tests in a debugger and find out which opcode is causing the problem (and whether the exact opcode remains the same over multiple runs)
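
As a first step toward that, the syslog lines already posted can be reduced to an offset inside the mapped object, which is the address to look up in a disassembly. The sketch below parses the "trap invalid opcode" line format seen in this thread and subtracts the mapping base from the instruction pointer. The names `Trap`, `hex_after` and `parse_trap` are hypothetical, the "segfault at" lines use a different layout and are not handled, and treating the offset as a file offset assumes the mapping starts at offset 0 of the file.

```rust
/// Fields pulled out of a kernel "trap invalid opcode" log line.
#[derive(Debug, PartialEq)]
struct Trap {
    object: String, // name of the mapped file, taken from after " in "
    ip: u64,        // faulting instruction pointer
    base: u64,      // start address of the mapping, from "[base+len]"
}

impl Trap {
    /// Offset of the faulting instruction from the start of the mapping.
    fn offset(&self) -> Option<u64> {
        self.ip.checked_sub(self.base)
    }
}

/// Parse the run of hex digits that immediately follows `key` in `line`.
fn hex_after(line: &str, key: &str) -> Option<u64> {
    let start = line.find(key)? + key.len();
    let rest = &line[start..];
    let end = rest
        .find(|c: char| !c.is_ascii_hexdigit())
        .unwrap_or(rest.len());
    u64::from_str_radix(&rest[..end], 16).ok()
}

/// Parse one "... trap invalid opcode ip:X sp:Y error:0 in NAME[BASE+LEN]" line.
fn parse_trap(line: &str) -> Option<Trap> {
    let ip = hex_after(line, " ip:")?;
    let (_, tail) = line.rsplit_once(" in ")?;
    let (object, range) = tail.trim_end().rsplit_once('[')?;
    let (base, _len) = range.trim_end_matches(']').split_once('+')?;
    let base = u64::from_str_radix(base, 16).ok()?;
    Some(Trap { object: object.to_string(), ip, base })
}

fn main() {
    // Two lines copied from the first syslog excerpt in this issue.
    let lines = [
        "Apr 6 04:44:05 origin kernel: [32318.734501] traps: backtrace.stage[3292] trap invalid opcode ip:7f770d540498 sp:7fff8a04b2f0 error:0 in libstd-23815cc482a70678.so[7f770d4e6000+155000]",
        "Apr 6 04:44:05 origin kernel: [32318.761267] traps: backtrace.stage[3305] trap invalid opcode ip:7f2a670f3498 sp:7ffdf2cd6060 error:0 in libstd-23815cc482a70678.so[7f2a67099000+155000]",
    ];
    for line in lines {
        if let Some(trap) = parse_trap(line) {
            if let Some(off) = trap.offset() {
                println!("{} +{:#x}", trap.object, off);
            }
        }
    }
}
```

Both sample lines reduce to the same offset, 0x5a498, inside libstd-23815cc482a70678.so even though the load addresses differ. That suggests the same instruction trapped in both processes.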

@zarniwhoop73
Author

Are there any straightforward instructions for running the full set of tests with a debugger? I'm on sysvinit, with gdb present and using
ulimit -c unlimited && ./x.py test --verbose --no-fail-fast
the tests all get run, without crashing the machine (that was the original problem), but no core dumps are created and I therefore have no way of using gdb.

This has now become more than a mere annoyance. With 1.31.0 the build was usable. With 1.32.0 on an old AMD Phenom, things are fine (just 4 errors in the tests, three of which are because I do not build for ARM, and it is able to build the current firefox release), but on the ryzen I get 18 segfaults in various lto tests, and trying to build either the latest firefox-65-beta or the firefox-64.0.2 release fails (segfault when trying to compile gkrust).

That is with

@zarniwhoop73
Author

(hit enter too soon) - the failures on 1.32.0, for the little that is worth, are with system llvm, both 7.0.0 and 7.0.1.

@jonas-schievink
Contributor

If you're using system LLVM, this might be a similar issue to #57762, where the distro LLVM is missing critical patches.

@zarniwhoop73
Author

Thanks, yes the new problems in 1.32.0 do seem to be the same as that. I rebuilt with the shipped rust (and also for ARM as well as X86, trying to get a clean test result - down to one failure). Unfortunately that system was unusable (errors E0484 and E0514, it found previous versions of libs but nothing from 1.32.0).

I've blown that away before reviewing what I had in config.toml.

Please note that the original problem (traps in the tests, with eventual segfaults logged in the syslog) still persists even without the llvm-related failures.

@mati865
Contributor

mati865 commented Jan 21, 2019

Are you sure your CPU is not faulty (is it an early batch)?
Over time I have built Rust multiple times (without changing target-cpu) and many crates with target-cpu=native, and never faced such an issue on 2 Ryzen-based systems.

If you can post step by step instructions how to reproduce it I'll be glad to try.

@zarniwhoop73
Author

zarniwhoop73 commented Jan 21, 2019 via email

@zarniwhoop73
Author

zarniwhoop73 commented Jan 22, 2019 via email

@mati865
Contributor

mati865 commented Jan 22, 2019

Indeed, I'm running a systemd-based distro, which is why I didn't see any issues.
I grepped through syslog for opcodes and I do see them, yet the tests pass for me (although I had to disable 3 ARM tests and 1 GDB test for unrelated reasons).
It'd be great to fix this, but it's beyond my skills.

Some tests are designed to crash (take the guard page tests as an example). Have your tests failed due to segfaults?
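
To illustrate how a test can pass even though its child process dies from a signal, here is a minimal Unix-only sketch that runs a child and reports the signal that ended it. It is not the rustc test harness: the helper name `terminating_signal` is hypothetical, and `sh` sending SIGILL to itself merely stands in for a child that is expected to crash.

```rust
use std::io;
use std::os::unix::process::ExitStatusExt; // Unix-only: adds ExitStatus::signal()
use std::process::Command;

/// Run `script` under `sh -c` and return the signal that ended it, if any.
fn terminating_signal(script: &str) -> io::Result<Option<i32>> {
    let status = Command::new("sh").arg("-c").arg(script).status()?;
    Ok(status.signal())
}

fn main() -> io::Result<()> {
    // The shell sends SIGILL to itself, standing in for a child meant to crash.
    let crashed = terminating_signal("kill -s ILL $$")?;
    // A child that exits normally reports no signal at all.
    let clean = terminating_signal("exit 0")?;
    println!("crashing child: signal {:?}", crashed);
    println!("clean child:    signal {:?}", clean);
    Ok(())
}
```

A parent that expects the crash can treat the signal as the success condition, which is how a passing test run and crash lines in the syslog can coexist. The kernel lines quoted in this thread come from real faulting instructions; a self-sent signal like the one in this sketch does not produce them.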

@zarniwhoop73
Author

zarniwhoop73 commented Jan 22, 2019 via email

@zarniwhoop73
Author

With the benefit of hindsight, I can see that my "mitigations" of installing gdb and allowing core dumps were wishful thinking. As shown above, the invalid opcodes are real. I think what was happening when I raised this was that I was using an -rc kernel, and something (presumably the invalid opcode, since the reboots only happened when running the tests) caused a triple-fault.

But still, as a mere user I find the creation of an invalid opcode in a safe programming language to be unexpected.

@zarniwhoop73
Author

Hmm, looking at a different machine where there had not been any issues, that too had traps for invalid opcodes - so I suppose they are deliberate and only got noticed because I had reboots.

Looks nasty, but I guess all is well. Closing.

@mati865
Contributor

mati865 commented Jan 23, 2019

What CPU does the other machine have?

I believe this issue should remain open unless somebody running Linux on a modern Intel platform reports similar opcodes in syslog.

@zarniwhoop73
Author

zarniwhoop73 commented Jan 23, 2019 via email

@zarniwhoop73
Author

zarniwhoop73 commented Jan 23, 2019 via email

5 participants