runtime: sporadic but frequent "signal: illegal instruction" during all.bash on darwin/arm64 at CL 272258 PS 1 #42774
It's worth trying with
I think that issue is about running the Go darwin/amd64 port under Rosetta 2. This is about running the darwin/arm64 port natively. I can try it anyway to see what effect it has, thanks for the suggestion.
Interesting. Looks similar to #41702. If I'm not mistaken, all failed cases above involve starting processes. Maybe it is more likely to fail on ARM64 than AMD64? Maybe the workaround in #41702 is not effective enough? @tmm1 this is not Rosetta 2, so #42700 is irrelevant. That said,
I rarely see this on the DTK. I don't think I have seen it since the workaround for #41702 was added. It is interesting that it is more likely to fail on M1. Does
I've also seen this happen a few times. Interestingly, the place I ran into it most often is when running the
I believe it does:
I'm able to reproduce @rolandshoemaker's finding above with
I've tried
Okay, I can reproduce the SIGILL by running the io benchmarks in a loop on the DTK, although with a much lower failure rate. I'll take it from there. (I'll be OOO this afternoon, though, so I probably won't get back to this very soon.)
Crash report for this error: https://gist.github.com/tonistiigi/23fbd4be8d1562ff935aa36b2dcdb299 With
@tonistiigi thanks for the crash report. I can reproduce the failure and get similar crash reports. I also managed to get a core dump, which is also quite similar. In the crash report, it says it SIGILLs at

It could be the core file being somehow truncated. Or, if it is actually right, somehow a page of system functions is replaced with a page of 0. I'll try to get more core files. I haven't been able to get a crash while running under lldb (to eliminate the possibility of a truncated core file).

Either way, it seems it crashed in some system function, with SIGILL. I don't see how that could be possible from the Go runtime. Also, normally, when a Go program encounters a SIGILL it will print a stack trace. In this case it doesn't. The only possibilities I can think of are that it crashes before the runtime is initialized, or it SIGILLs while handling a signal (so it cannot handle another signal). From the crash report and the core dump, neither seems to be the case. Maybe the kernel directly kills it when it sees a SIGILL in a system library?
Hmmm, running in the debugger I saw the runtime do a few
The crash report from @tonistiigi #42774 (comment) doesn’t look like a genuine

https://gist.github.com/tonistiigi/23fbd4be8d1562ff935aa36b2dcdb299:
This is suspicious: a genuine illegal instruction in hardware will transform into
Still more suspicious stuff here: for a genuine hardware trap, the Mach exception will be transformed to a signal by the in-kernel
Looks to me like this isn’t a hardware trap at all, but a software

Is there anywhere a
If you see a thread in
I went through all my crash reports. While most of them are almost identical to the one I posted earlier (and with the same pc), I did find two that were quite different: https://gist.github.com/tonistiigi/9d0d40afe5248932cdac266c357d3135. One of them has a stack trace to
No, I don't think the Go program sends itself a SIGILL. The
Here is what I got for a process
I reproduced this

Are others seeing this only on a different OS version than I’m using? Or is the hardware the crucial difference?

OK, I only thought I reproduced it. 0/10 running
I did some dtrace. It appears something sends a SIGILL signal. The kernel stack varies. Apparently, the kernel entry points can be a syscall (e.g. madvise), somewhere in user code (not syscall, e.g. runtime.scanobject), or no explicit entry point.
The top 4 frames are essentially always the same, but lower frames vary. The user stack also varies. In contrast, for a program that pthread_kills itself with SIGILL,
It is a different code path, and the entry point is always __pthread_kill. I also traced the __pthread_kill syscall and I didn't see any SIGILL (all SIGURG).
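For reference, a baseline of this kind can be written in a few lines of C (this is a sketch of the shape of such a program, not the exact one that was traced): it sends itself SIGILL through pthread_kill, so the kernel's signal-posting path starts at the __pthread_kill syscall rather than inside fault or VM handling.

```c
#include <pthread.h>
#include <signal.h>

int main(void) {
    // Send SIGILL to the current thread via the __pthread_kill syscall.
    // The default disposition terminates the process, which is the
    // comparison case traced with dtrace above.
    pthread_kill(pthread_self(), SIGILL);
    return 0;
}
```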
That's definite progress! Can you symbolize those kernel frames or even try the debug kernel?
I don't know how to do that. Tried
I tried to manually symbolize a kernel stack trace (by comparing the kernel memory with the symbol information of a kernel with a nearby version):
The fifth frame is weird, as it is not a call instruction. Maybe a faked return address? I'll see if I can make sense of it...
This is another case:
Not quite sure about

For comparison, this is the case where it pthread_kills itself with SIGILL:
It seems to make sense.
@cherrymui #42774 (comment), #42774 (comment), #42774 (comment):
What OS version are you using (

On 11.0.1 20B29, the kernel is named
The point I’m getting at is that the kernel on your DTK may be a beta that’s at least 4 months old, either the original 11.0db1 20A5299w that it shipped with, or a very early intermediate beta up to 11.0db3 20a5323l. This is going to be a very different animal than what’s in the factory image now shipping with M1-based machines (11.0 20A2411), the current released kernel for both DTK and M1 (11.0.1 20B29/20B50), and the current beta for both (11.1b1 20C5048k). This isn’t a kernel that we should be testing against at this point. The KDK for those old releases where there was a separate T8027 kernel never included a
I have the original 20A5299w kernel. I tried to update the DTK earlier but somehow it failed, and I didn't try again. I understand this is not a good debugging experience. I'll try updating it again at some point.
I kept looking anyway, and as you may have found, the T8020 and T8027 kernels in those early releases were identical, so I was able to use the T8020
(

Nothing too new or interesting there, and we can’t do much with the filenames and line numbers without a source drop. So I looked for other sources of

At the bottom of

(x86_64 need not feel left out, it has the same

Here’s a sample:
(Another note:

How to dig deeper from here? Well, the Mach exception type that was recorded (here,

Since golang is a heavy user of

The remaining question concerns why the user-space signal handler couldn’t be invoked. It could be something like my example here, with an unusable signal stack. Conceivably, it could be stack exhaustion or some other memory problem, or it could be a problem with the user thread state or signal handler thread state. But I think that this new lead will be valuable in troubleshooting this further.
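As a concrete illustration of the "unusable signal stack" scenario mentioned above, here is a small C sketch (my own construction under that assumption, not the example referenced in the comment): the alternate stack points at memory the kernel cannot write, so it can never push a signal frame there and the handler is never invoked; on the xnu path discussed in this thread, the process then gets terminated instead.

```c
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void handler(int sig) {
    (void)sig;
    // Never reached: the kernel cannot write a signal frame onto the
    // PROT_NONE alternate stack, so this handler cannot be invoked.
    const char msg[] = "handler ran\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
}

int main(void) {
    // An "unusable" alternate signal stack: mapped, but not writable.
    void *stk = mmap(NULL, SIGSTKSZ, PROT_NONE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
    stack_t ss = { .ss_sp = stk, .ss_size = SIGSTKSZ, .ss_flags = 0 };
    sigaltstack(&ss, NULL);

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sa.sa_flags = SA_ONSTACK;   // request delivery on the alternate stack
    sigaction(SIGUSR1, &sa, NULL);

    raise(SIGUSR1);             // the kernel fails to build the frame here
    return 0;                   // typically not reached
}
```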
Interesting. The T8020 kernel I found isn't identical to the running T8027 kernel, just pretty close. There are small changes here and there, like off by one instruction. The relative PCs generally don't match. So I had to do some manual work.
Out of curiosity, I wonder why the kernel chooses SIGILL in this situation. (I hope it is not a typo of SIGKILL...)
For people who want to try, the CL above disables sigaltstack. It seems to fix this on my machine.
If you have ADC access, log in to it by clicking Downloads on the ADC page here and then download Kernel_Debug_Kit_11.0_build_20A5299w.dmg. If you don’t have an ADC account, find me internally.
I suspect that the conditions to trigger the bug are merely more likely with a

That said,
It’s been there forever: 10.0.0 /xnu-123.5/bsd/dev/ppc/unix_signal.c

Very interesting: #41702 is another way to get the same 🦃
Change https://golang.org/cl/273686 mentions this issue:
Apparently, it is more likely to fail when the system is under load (e.g. running multiple instances of the test in parallel). So one hypothesis is that it fails when the signal stack has not been faulted in at the time the signal arrives. So I tried to mlock the signal stack, and it does seem to fix this. CL https://golang.org/cl/273686 does the mlock. For people who want to try, let me know if that works on your machine. Thanks!
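To illustrate the idea only (the actual change in CL 273686 lives in the Go runtime, not in C), a minimal sketch of an mlock'ed alternate signal stack looks like this, under the hypothesis above that delivery can fail when the stack's pages are not resident:

```c
#include <signal.h>
#include <string.h>
#include <sys/mman.h>

static void handler(int sig) { (void)sig; }

int main(void) {
    size_t size = SIGSTKSZ;
    void *stk = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
    mlock(stk, size);   // fault in and pin the pages so they stay resident

    stack_t ss = { .ss_sp = stk, .ss_size = size, .ss_flags = 0 };
    sigaltstack(&ss, NULL);

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sa.sa_flags = SA_ONSTACK;
    sigaction(SIGUSR1, &sa, NULL);

    raise(SIGUSR1);     // handler runs on the locked alternate stack
    return 0;
}
```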
Oh, that’s a good find. I thought that this might have to do with faulting in the signal stack, but despite spending a half hour trying to reproduce the problem with |
I couldn't find the 20A5299w kernel debug kit on that download page... So I downloaded 20A2411 and hoped for the best... Anyway, I updated my kernel now.
I tried CL 272258 (aka commit 7716a2fbb7abc24f1069b1fc4c4b10b2274eee8a) with and without the above sigaltstack revert CL 273627 rebased on top. With the revert, all.bash passed 3/3 times. Without the revert, all.bash passed 5/5 times. When I filed this issue, all.bash was failing at 7716a2fbb7abc24f1069b1fc4c4b10b2274eee8a at what felt like an 80% rate; I'm not sure what has changed since then to cause it to start passing so often. (I checked and
I’ve come up with a reduced C testcase that exhibits the bug fairly readily:

If it doesn’t crash with
Other times, it will show the thread calling
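The reduced testcase itself is not reproduced here. As a rough, purely hypothetical sketch of the ingredients this thread describes (SA_ONSTACK handlers, freshly mmap'ed alternate stacks that user code never touches, and a steady stream of signals while the machine is under memory load), a program of roughly this shape is the kind involved; whether it actually triggers the kernel bug will depend on load and OS version.

```c
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NTHREADS 16

static void handler(int sig) { (void)sig; }

static void *worker(void *arg) {
    (void)arg;
    // Fresh anonymous mapping for the alternate stack; user code never
    // writes to it, so its pages may not be resident when a signal arrives.
    void *stk = mmap(NULL, SIGSTKSZ, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
    stack_t ss = { .ss_sp = stk, .ss_size = SIGSTKSZ, .ss_flags = 0 };
    sigaltstack(&ss, NULL);            // per-thread alternate signal stack
    for (;;)
        pause();                       // wait for signals forever
    return NULL;
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sa.sa_flags = SA_ONSTACK;          // deliver on each thread's alt stack
    sigaction(SIGUSR1, &sa, NULL);

    pthread_t tids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);

    // Keep hitting every thread with signals; running this alongside other
    // memory-hungry processes increases the chance that an alternate stack
    // page has been evicted when the kernel tries to use it.
    for (;;)
        for (int i = 0; i < NTHREADS; i++)
            pthread_kill(tids[i], SIGUSR1);
    return 0;
}
```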
The key to the

How is Go allocating the memory used for thread stacks and signal stacks?
When I posted my last comment, I hadn't seen the other comments that came in after #42774 (comment).
Ah, adding system load in the background while running all.bash made the initial issue very reproducible on my machine once again. I tried both CL 273627 and CL 273686, this time using background load during all.bash, and got more conclusive results:
¹ (details about the two inconclusive failures)
In summary, both CL 273627 and CL 273686 seem effective at resolving the original issue on my machine.
Thanks for the C++ reproducer!
I think they are allocated from the Go heap (free page pool), and if that cannot be satisfied the heap will grow using mmap MAP_ANONYMOUS. We always have SA_ONSTACK set. So mlocking the signal stacks should suffice as a workaround.
I just revised

I also found that if there’s only one child thread, it gives a reliable
(macOS’
I filed FB8922558 with Apple about these uncatchable spurious
Is / was this a duplicate of #35851? (Can that issue be closed now?)
@bcmills probably not. #35851 is about iOS, where we don't use sigaltstack. There is still the possibility that #35851 happens because, when the signal is delivered on the main stack and the main stack happens to be paged out, the kernel may still be buggy (at least I can't rule out this possibility), though that is much less likely. Even if this is the case, the workaround for this issue won't help, as it is specific to sigaltstack.
@cherrymui
@eliasnaur what versions of iOS do we support? Last time I checked, our iOS builder machine didn't support sigaltstack. The builder's kernel got updated recently, though. I could try again. For iOS simulators we always use sigaltstack, as for macOS/AMD64.
As luck would have it, Corellium recently added iOS 14 images. I've upgraded all 3 builders to iOS 14.2.
I don't think we have a policy for iOS versions. However, requiring iOS 14 for working Go programs is probably too strict a requirement. Perhaps disable signal preemption on iOS < 14?
The preemption signal is not the only issue here. It can be any signal, like user-generated signals. So I don't think disabling preemption is preferable. We could use sigaltstack+mlock on iOS >= 14. But before we do that, I think we would want to understand better whether #35851 has the same cause or not.
Ok, thank you for being thorough; my concern was to not leave iOS out of the fix for this issue in 1.16. How do we understand #35851 better? FWIW, I still have the iOS 13 images ready to run.
If there is some way we can reproduce the failure more easily (like running the io benchmarks in this issue), that would be helpful. Then we can play with the workaround and other ideas and see whether they are effective. It may also be helpful if you could get a crash report. Do you know how often the failure occurs? From the dashboard, it seems that it failed once in February, then did not fail for several months, then failed 3 times in a row on October 29, and has not failed since then.
Another possibility could be #41702. The workarounds (CL https://go-review.googlesource.com/c/go/+/262817 and https://go-review.googlesource.com/c/go/+/262438) only cover GOOS="darwin". Maybe we should extend them to ios as well.
CL 272258 has made it easy to run all.bash, so I ran it a few times. It has passed at least once, but most of the time some test will fail due to "signal: illegal instruction". This is on a MacBook Air (M1, 2020), macOS 11.0.1 (20B29).
This may get resolved as part of finishing the work on #42684, but I'm reporting it in case these failures are helpful to see sooner, since I don't think any existing macOS/ARM64 issues cover this failure mode specifically.
What version of Go are you using (go version)?

What operating system and processor architecture are you using (go env)?

go env Output

What did you do?
Built Go at commit 7716a2fbb7 (CL 272258 PS 1) successfully, pointed GOROOT_BOOTSTRAP to it, then checked out the same commit elsewhere and ran all.bash repeatedly.

What did you expect to see?
What did you see instead?
Most of the time, a failure like:
(Full log for the TestDWARF/testprog failure.)

CC @cherrymui.