In 1.25.0, running the rustc tests on a ryzen gets invalid opcodes and eventual reboot #49751
Is this reproducible? Are you aware of the Ryzen SEGV defect?
On Fri, Apr 06, 2018 at 09:25:46PM -0700, Tatsuyuki Ishi wrote:
Is this reproducible? Are you aware of the Ryzen SEGV defect?
Yes, on this machine it reproduces every time. And yes, I am aware
of the "multiple crashes" when stressing early ryzens - one of the
workarounds for that was apparently to disable multithreading : the
Ryzen 3 1300X is a 'detuned' 1500X and it lacks multithreading, only
4 (real) cores.
When compiling other large software packages (e.g. llvm, qt5,
qtwebengine) the machine is fully stable - as indeed when building
rust and firefox, it is only the rust tests which show this problem.
After getting results from somebody else who builds on a ryzen, and a long process of trying different possibilities, I now conclude that the debug build generates invalid opcodes on ryzen, but if I build gdb before the build then the tests complete ok (1 failure, issue-37131.rs, which needs thumbv6m-none-eabi and I'm only building for x86). From my syslog:
After getting myself confused by the many core/ directories, I actually got core files for ./build/x86_64-unknown-linux-gnu/test/run-pass/core and ./build/x86_64-unknown-linux-gnu/test/run-pass/panic-runtime/core but since the processes seem to have completed I guess the core files are unusable.
@zarniwhoop73 Are you still having this issue with the latest stable?
Yes. From 1.27.0, in my syslog:
Jul 1 19:11:05 origin kernel: [37438.310517] traps: abort-on-c-abi.[24649] trap invalid opcode ip:5592e24b4c4b sp:7ffff2aba820 error:0 in abort-on-c-abi.stage2-x86_64-unknown-linux-gnu[5592e24b1000+6000]
Just a note that I'm now testing 1.31.1 and I get ten traps for invalid opcodes. This does not exactly give me great confidence about the correctness of the test code, or of rust itself, but it seems to be ok when I actually use it. |
It would be helpful if you could open one of the failing tests in a debugger and find out which opcode is causing the problem (and whether the exact opcode remains the same over multiple runs) |
Are there any straightforward instructions for running the full set of tests with a debugger? I'm on sysvinit, with gdb present. This has now become more than a mere annoyance: with 1.31.0 the build was usable, and on 1.32.0 on an old AMD phenom things are fine (just 4 errors in the tests, three of which are because I do not build for arm; it is able to build the current firefox release). But on the ryzen I get 18 segfaults in various lto tests, and trying to build either the latest firefox-65-beta or the firefox-64.0.2 release fails (segfault when trying to compile gkrust). That is with
(hit enter too soon) - the failures on 1.32.0, for what little that is worth, are with system llvm, both 7.0.0 and 7.0.1.
If you're using system LLVM, this might be a similar issue to #57762, where the distro LLVM is missing critical patches. |
Thanks, yes the new problems in 1.32.0 do seem to be the same as that. I rebuilt with the shipped rust (and also for ARM as well as X86, trying to get a clean test result - down to one failure). Unfortunately that system was unusable (errors E0484 and E0514, it found previous versions of libs but nothing from 1.32.0). I've blown that away before reviewing what I had in config.toml. Please note that the original problem (traps in the tests, with eventual segfaults logged in the syslog) still persists even without the llvm-related failures.
Are you sure your CPU is not faulty (is it an early batch)? If you can post step by step instructions how to reproduce it I'll be glad to try.
On Mon, Jan 21, 2019 at 05:17:55PM +0000, Mateusz Mikuła wrote:
Are you sure your CPU is not faulty (is it an early batch)?
Over time I have built Rust multiple times (without changing `target-cpu`) and many crates with `target-cpu=native`, and never faced such an issue on 2 Ryzen-based systems.
I'm fairly confident about this CPU (1300X so only 4 real cores, no
hyperthreading), I bought the system in March 2018, it was assembled
to order but obviously I've no idea how long the CPU might have been
in stock. It has been reasonably stable for months (I use -rc linux
kernels, but no more problems than on other boxes). Somebody else
also reported the same traps on a ryzen - they were using systemd
and that prevented the box from rebooting.
I've no idea if this one will still reboot, the 'ulimit' workaround
only failed once, and that was months ago. Also, once I've got a
version of rust which is "good enough" for current firefox I do not
run the tests, and I only try newer versions when I know that
firefox-beta needs them.
If you can post step by step instructions how to reproduce it I'll be glad to try.
My build has barely changed for several months. The sysv version
(systemd instructions are the same) of our last release (BLFS-8.3)
can be seen at:
http://www.linuxfromscratch.org/blfs/view/stable/general/rust.html
In more recent builds I have commented out quiet-tests = true because it
got rejected.
The system llvm is currently 7.0.1, but for versions of rustc up to
and including 1.31.0 the process has worked with that, 7.0.0, and
even 6.0.1.
The same instructions also *build* rustc-1.32.0, but that needs a
non-released version of llvm - 1.32.0 tests get a lot of segfaults
on this box, and also on an intel i5 running in a kvm. The point
is that the traps have been present on this machine ever since I
first ran the rustc testsuite (and the first one appears within a
few minutes of starting to run that, perhaps while test programs
are being compiled).
ĸen
On Mon, Jan 21, 2019 at 09:08:23PM +0000, Ken Moffat wrote:
On Mon, Jan 21, 2019 at 05:17:55PM +0000, Mateusz Mikuła wrote:
> If you can post step by step instructions how to reproduce it I'll be glad to try.
>
I got what looks like a working build of 1.32.0, and I was watching
the syslog during the test. I started to run the tests at about
01:47, the first tests themselves began at 01:48:48. I noticed that
ALL the traps were while the rust-test-helpers suite was running,
first was reported as soon as that started (or perhaps even in the
last compile for that suite). And all of them were during that
suite.
But there are a huge number of files called 'core'. 'find | xargs
file' shows me that a few are real core files for segfaults, but I'm
not sure that any of them are directly related to the traps. And
only two segfaults were logged.
Jan 22 01:50:06 origin kernel: [231365.913230] traps: a[8744] trap invalid opcode ip:55aff9517b6b sp:7ffeeeb778d0 error:0 in a[55aff9516000+3000]
Jan 22 01:50:18 origin kernel: [231378.136222] traps: a[12543] trap invalid opcode ip:7f01ca4ddaef sp:7ffce860c680 error:0 in libstd-be3f5b84c0422bf0.so[7f01ca4d3000+71000]
Jan 22 01:50:18 origin kernel: [231378.171555] traps: a[12545] trap invalid opcode ip:7fee4e6d8aef sp:7ffec7f66300 error:0 in libstd-be3f5b84c0422bf0.so[7fee4e6ce000+71000]
Jan 22 01:51:08 origin kernel: [231428.225047] traps: a[27624] trap invalid opcode ip:56090422c601 sp:7ffce2c96ca0 error:0 in a[56090422b000+2000]
Jan 22 01:52:49 origin kernel: [231528.867689] a[32364]: segfault at 0 ip 0000564b3b6b19e8 sp 00007ffffd054420 error 6 in a[564b3b6b0000+2000]
Jan 22 01:52:49 origin kernel: [231528.867694] Code: 00 00 48 8b b4 24 00 01 00 00 48 85 f6 74 1e 48 8b bc 24 f8 00 00 00 ba 01 00 00 00 ff 15 58 35 00 00 eb 09 ff 15 a8 35 00 00 <c6> 00 01 48 8b 84 24 50 01 00 00 48 c1 e0 03 48 8d 1c 40 31 ed 4c
Jan 22 01:52:52 origin kernel: [231531.901533] a[792]: segfault at 1 ip 000056502d7bd5ae sp 00007ffc78f83c30 error 6 in a[56502d7bc000+2000]
Jan 22 01:52:52 origin kernel: [231531.901538] Code: 48 85 c0 74 16 48 c1 e0 03 48 8d 34 40 ba 08 00 00 00 4c 89 ff ff 15 b9 29 00 00 48 81 c4 a8 01 00 00 5b 41 5c 41 5e 41 5f c3 <48> c7 04 25 01 00 00 00 00 00 00 00 eb 90 48 8d 3d dd 26 00 00 31
Jan 22 01:52:54 origin kernel: [231534.026142] traps: a[1229] trap invalid opcode ip:5577f49a3b3f sp:7ffc5ea68cd0 error:0 in a[5577f49a2000+4000]
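For anyone wanting to dig into these dumps: the hex bytes in a kernel "Code:" line are the instruction bytes around the faulting ip, with the first byte of the faulting instruction wrapped in angle brackets (e.g. `<c6>`). A small sketch that pulls them out for feeding to a disassembler (the helper name is mine, not from any standard tool):

```python
import re

def parse_code_line(line):
    """Split a kernel 'Code:' log line into instruction bytes.

    Returns (byte_values, fault_index): the kernel wraps the first byte
    of the faulting instruction in angle brackets, e.g. <c6>.
    """
    tokens = re.findall(r'<?[0-9a-f]{2}>?', line.split('Code:')[1])
    fault_index = next(i for i, t in enumerate(tokens) if t.startswith('<'))
    return [int(t.strip('<>'), 16) for t in tokens], fault_index
```

The resulting bytes could then be written to a file and disassembled, e.g. with `objdump -D -b binary -m i386:x86-64`, to see which instruction actually trapped.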
Indeed, I'm running a systemd-based distro; that's why I didn't see any issues. Some tests are designed to crash (take the guard-page tests as an example). Have your tests failed due to segfaults?
On Tue, Jan 22, 2019 at 01:50:51PM -0800, Mateusz Mikuła wrote:
Indeed, I'm running a systemd-based distro; that's why I didn't see any issues.
I grepped through syslog for opcodes and I'm seeing them; the tests pass for me (although I had to disable 3 ARM and 1 GDB test for unrelated reasons).
It'd be great to fix it but it's beyond my skills.
Some tests are designed to crash (take the guard-page tests as an example). Have your tests failed due to segfaults?
When I first raised this, I assumed (wrongly) that the segfaults
were related to the invalid opcodes. I now realise that segfaults
can be deliberate (e.g. I see similar things when testing perl).
Mostly my tests do not fail due to segfaults - with 1.32.0 and
system llvm yes, segfaults in lto and a broken compiler. With
1.32.0 and the shipped llvm, only 3 ARM tests failed.
Thanks for confirming the invalid opcodes exist.
I'm now thinking that the crash might have been related to older
llvm, or to an older kernel - on reflection, I do not believe that
allowing the core files to be used, or having gdb present, provide
any likely mitigation. So I'm going to try without - but that might
take me a few days, got a lot of rebuilding (to measure time and
space for options, etc) before I'm ready to take the risk of it
crashing.
Meanwhile, the 1.32.0 tests are so much nicer than earlier versions.
With the benefit of hindsight, I can see that my "mitigations" of installing gdb and allowing core dumps were wishful thinking. As shown above, the invalid opcodes are real. I think what was happening when I raised this was that I was using an -rc kernel, and something (presumably the invalid opcode, since the reboots only happened when running the tests) caused a triple fault. But still, as a mere user I find the creation of an invalid opcode in a safe programming language to be unexpected.
Hmm, looking at a different machine where there had not been any issues, that too had traps for invalid opcodes - so I suppose they are deliberate and only got noticed because I had reboots. Looks nasty, but I guess all is well. Closing. |
What CPU does the other machine have? I believe this issue should remain open unless somebody running Linux on a modern Intel platform reports similar opcodes in syslog.
On Wed, Jan 23, 2019 at 04:25:16AM -0800, Mateusz Mikuła wrote:
What CPU does the other machine have?
I believe this issue should remain open unless somebody running Linux on modern Intel platform reports similar opcodes in syslog.
The other machine, as you suspect, is also an AMD - but an old phenom
II x4 from approximately 2012.
I closed it because searching for github/rust issue invalid opcode
on google found a windows i686 issue which appeared to be deliberate
use of a UD2 "opcode" to cause an abort. I hope to be building and
testing on an intel haswell in the next few days, this time I'll
check the syslog.
On Wed, Jan 23, 2019 at 06:11:57PM +0000, Ken Moffat wrote:
I've brought forward the haswell build:
model name : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Jan 23 21:14:23 plexi kernel: [57103.303855] traps: a[28282] trap invalid opcode ip:5591bca81b3b sp:7ffde1837ab0 error:0 in a[5591bca80000+3000]
Jan 23 21:14:32 plexi kernel: [57111.668337] traps: a[32325] trap invalid opcode ip:7f71d9fb7abf sp:7ffcd2fe5cb0 error:0 in libstd-be3f5b84c0422bf0.so[7f71d9fad000+71000]
Jan 23 21:14:32 plexi kernel: [57111.727370] traps: a[32348] trap invalid opcode ip:7fbcf8adbabf sp:7ffe1ce675d0 error:0 in libstd-be3f5b84c0422bf0.so[7fbcf8ad1000+71000]
Jan 23 21:15:05 plexi kernel: [57144.740146] traps: a[14811] trap invalid opcode ip:558e6d08d5d1 sp:7ffeeb354c30 error:0 in a[558e6d08c000+2000]
Jan 23 21:16:09 plexi kernel: [57208.996371] a[19691]: segfault at 0 ip 00005594f08569b8 sp 00007fff1572f0f0 error 6 in a[5594f0855000+2000]
and more of the same. So the traps are perfectly normal.
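The ip and mapping printed in these trap lines are enough to locate the faulting instruction in the binary: subtracting the mapping base from ip gives an offset inside the mapped image. A rough sketch (the helper name is mine):

```python
import re

def trap_offset(line):
    """From a kernel 'traps:' line, return (module, offset), where
    offset is ip minus the module's mapping base, i.e. a position
    inside the mapped image that a disassembler can seek to."""
    ip = int(re.search(r'ip:([0-9a-f]+)', line).group(1), 16)
    mod, base, size = re.search(
        r'in (\S+)\[([0-9a-f]+)\+([0-9a-f]+)\]', line).groups()
    base, size = int(base, 16), int(size, 16)
    if not base <= ip < base + size:
        raise ValueError('ip outside reported mapping')
    return mod, ip - base
```

With that offset one could then look at the code with e.g. `objdump -d --start-address`; note that mapping offsets and file offsets only line up per loaded segment, so that last step is an assumption about the binary's layout.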
thread 'main' panicked at 'giraffe',
/tmp/rustc-1.32.0-src/src/test/run-fail/while-panic.rs:17:13
I build from source, and I'm now using a Ryzen 1300X. Before I got that I had stopped running the tests in rustc because they take so long and the results were known. But I'd been using 1.22.1, and for firefox-60 I'll need something newer, so I built 1.25.0. In fact I built it several times, and installed in /opt (I had had weird scripting issues with 1.23 which I never got to the bottom of, but 1.24 on another machine seemed fine). Then I used that to build ff-59.0.2, which I am now running, and a couple of test builds of ff-60b9 with different settings.
Everything looked good, so I decided to install in /usr and to run the tests before installing. This is with the shipped llvm, I built the basic system a week before 6.0 was released. But when I came back to the machine after running the tests, it had rebooted. In my syslog was
Apr 6 04:44:05 origin kernel: [32318.734501] traps: backtrace.stage[3292] trap invalid opcode ip:7f770d540498 sp:7fff8a04b2f0 error:0 in libstd-23815cc482a70678.so[7f770d4e6000+155000]
Apr 6 04:44:05 origin kernel: [32318.761267] traps: backtrace.stage[3305] trap invalid opcode ip:7f2a670f3498 sp:7ffdf2cd6060 error:0 in libstd-23815cc482a70678.so[7f2a67099000+155000]
Apr 6 04:47:48 origin kernel: [32541.823158] segfault-no-out[16965]: segfault at 0 ip 000055fd937ceb69 sp 00007fff8f94f100 error 6 in segfault-no-out-of-stack.stage2-x86_64-unknown-linux-gnu[55fd937cb000+5000]
Apr 6 04:47:55 origin kernel: [32548.615545] signal-exit-sta[17770]: segfault at 1 ip 000055cc448b9efc sp 00007fff3879c310 error 6 in signal-exit-status.stage2-x86_64-unknown-linux-gnu[55cc448b7000+4000]
Apr 6 04:47:58 origin kernel: [32551.276777] traps: simd-target-fea[18051] trap invalid opcode ip:562ab9dfc89c sp:7fff6be1eb50 error:0 in simd-target-feature-mixup.stage2-x86_64-unknown-linux-gnu[562ab9df9000+7000]
In case it was a weird one-off problem, I retried and got a similar result. On ryzen, a few instructions which were present in previous micro-architectures such as Kaveri are no longer present, but binaries built for older AMD machines do work - so I figured a non-ryzen instruction had crept in. But how to get rid of it?
If (with 1.22.1 currently in /usr) I use
rustc -C target-cpu=help
it tells me target-cpu=native will use znver1, which should be fine. But I cannot manage to get it to avoid the invalid instructions (although, as I said, without running the tests all seems fine).
The results from google were ambiguous (maybe things have changed over time), but I tried:
export RUSTC_FLAGS='-C target-cpu=native'
and =x86-64
and =skylake (I would expect that to fail earlier, so it suggested that envvar was not recognized)
which all trapped and segfaulted early in the tests (I was attentive and interrupted the builds instead of letting them carry on and reboot).
Then I tried, in config.toml
[build]
rustflags = "-C target-cpu=native"
which was very quickly spat out as an unknown field.
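For what it's worth, a key of that shape does exist, but in Cargo's own configuration file rather than in rustbuild's config.toml, which is why rustbuild rejects it. A sketch of where it would go (this is Cargo's documented build.rustflags setting, not something the rustc build system accepts):

```toml
# .cargo/config.toml (Cargo's config, not the rustc build system's config.toml)
[build]
rustflags = ["-C", "target-cpu=native"]
```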
Then I tried
export RUSTFLAGS='-C target-cpu=native'
and =x86-64 and =amdfam10 : all trapped and segfaulted
similarly
export CARGO_BUILD_RUSTFLAGS='-C target-cpu=native'
and =x86-64 and =amdfam10
This is "disappointing". NB - using RUST_BACKTRACE=1 didn't give me any information about what had generated the invalid opcode.