runtime: hang in all.bash runtime test #14809
I just got another one of these on a trybot run:
@dvyukov @aclements Have either of you had a chance to look at this yet?
I spent all day trying to reproduce this on the darwin-amd64-10_10 gomote with no success. |
CL https://golang.org/cl/24110 mentions this issue. |
This is an attempt to get more information for #14809, which seems to occur rarely.

Updates #14809.

Change-Id: Idbeb136ceb57993644e03266622eb699d2685d02
Reviewed-on: https://go-review.googlesource.com/24110
Reviewed-by: Mikio Hara <[email protected]>
Reviewed-by: Austin Clements <[email protected]>
Unfortunately, we haven't had any TestCgoExternalThreadSignal failures since 2016-06-01T00:09:32-04acd62/darwin-amd64-10_10, which was before @ianlancetaylor's CL to get stack traces of child processes. (In fact, the only runtime test timeouts since June 2nd have been on Plan 9.) There was one interesting failure around that time that looks related to signals, though it wasn't this specific test: 2016-06-01T00:09:32-04acd62/linux-ppc64le-buildlet. @ianlancetaylor, it looks like goroutine 207880 got preempted at the beginning of sigpanic and never came back. Do you think that could be related?
I think it's impossible to tell whether the problem is related or not. They are timeouts, so that part is similar, but they are running different tests. Why would there be a runnable goroutine that nothing is running? That seems to imply that all the P's are in use, which does not seem to be the case; or that all the M's are doing something (but what?); or that all the M's are blocked waiting for something (but what?). Since we don't know what's going on, it's probably worth doing something so that if a runtime test times out we get a schedtrace(true).
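To make that concrete, here is a rough sketch (the helper name and timeout handling are my own, not from this thread) of a test wrapper that dumps every goroutine's stack when the body overruns a deadline. The runtime's schedtrace itself is not exported; runtime.Stack is the closest thing available to ordinary test code.

```go
package runtime_test

import (
	"runtime"
	"testing"
	"time"
)

// withTimeout runs body and, if it does not finish within d, dumps the
// stacks of all goroutines before failing the test. This is only an
// approximation of schedtrace(true), which lives inside the runtime.
func withTimeout(t *testing.T, d time.Duration, body func()) {
	done := make(chan struct{})
	go func() {
		body()
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(d):
		buf := make([]byte, 1<<20)
		n := runtime.Stack(buf, true) // true: include every goroutine, not just the caller
		t.Fatalf("test body timed out after %v\n\n%s", d, buf[:n])
	}
}
```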
No idea what is happening. Postponing to 1.8.
I've been able to consistently reproduce this, compiled/testing from current master (e5ff529). go env:
Steps taken:
The hang happens every few hundred to few thousand iterations, leading to the following output:
I also captured several hangs with … Lastly, there is an instance of a …
Interesting. When the process hangs like this, it is waiting for
in a loop. Wait for it to hang--presumably it will eventually hang, otherwise I don't know what is going on. Then press
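For anyone scripting this kind of loop, the following is a sketch of one way to do it (the test name comes from this issue; the timeout and package path are my assumptions): run the test repeatedly and send a hung iteration SIGQUIT, which by default makes a Go program dump its goroutine stacks before exiting, much like pressing the quit key in the terminal.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	for i := 0; ; i++ {
		cmd := exec.Command("go", "test", "-count=1", "-run", "TestCgoExternalThreadSignal", "runtime")
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Start(); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		done := make(chan error, 1)
		go func() { done <- cmd.Wait() }()
		select {
		case err := <-done:
			fmt.Printf("iteration %d: %v\n", i, err)
		case <-time.After(2 * time.Minute):
			// Looks hung: ask the child for a traceback, then give up.
			cmd.Process.Signal(syscall.SIGQUIT)
			<-done
			os.Exit(1)
		}
	}
}
```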
By the way, you can build |
For what it's worth, I ran the test 10,000 times, and ran
|
uname -v output for me is
If I run … running …
|
Thanks. Unfortunately the traceback you showed just tells me that the program is waiting for another program to exit. The other program is
I was able to use gdb to get the thread backtraces, and used delve to try and get some more info on the goroutines. If there's some other debugger info that'd be useful, please let me know. This was from running …
gdb-thread-backtraces.txt
*edit: also, wanted to mention that the hung process takes up ~100% of the CPU
Thanks! Your program seems to be in a state where at least some uses of the
If you can get a copy of |
Yes, the lack of reproducibility is unfortunate. I haven't been able to capture the hang under … I was able to get the same behavior with this reduced/standalone program:
Backtraces: backtraces.txt
I ran the program in a bash … It seems to be the same issue, but I may be reading things incorrectly. I'll keep trying to see if I can get it to hang under …
Finally was able to get the hang under dtruss: dtruss.txt
The total dtruss log is ~170 megs, but that's the tail of it.
Anecdotally, this appears easier to reproduce on a heavily loaded system.
CL https://golang.org/cl/33300 mentions this issue. |
I think I see the problem. The code in
On Darwin the raise system call is actually implemented as … CL 33300 adds a sleep to give the kernel a chance to deliver the signal. I would be grateful if anybody able to recreate this problem could try to recreate it with that CL applied. Thanks.
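Not the runtime's actual code, but a small self-contained sketch of the behavior being described: the signal is raised at the whole process, and the raiser then waits for its handler to observe it. Without some sleep in that wait loop, the waiter can spin at ~100% CPU (matching the observation earlier in this thread) while the kernel has not yet delivered the signal.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var delivered uint32

	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGUSR1)
	go func() {
		<-ch
		atomic.StoreUint32(&delivered, 1)
	}()

	// Stand-in for raise(): on Darwin this ends up as a process-directed
	// kill, so the kernel may hand the signal to any thread, and not
	// necessarily right away.
	if err := syscall.Kill(syscall.Getpid(), syscall.SIGUSR1); err != nil {
		fmt.Fprintln(os.Stderr, "kill:", err)
		os.Exit(1)
	}

	for atomic.LoadUint32(&delivered) == 0 {
		time.Sleep(time.Millisecond) // give the kernel a chance to deliver the signal
	}
	fmt.Println("signal delivered")
}
```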
I've been able to get it to hang with a recent commit (d338f2e). With this CL I haven't seen it hang. I kept it running for a while in the background (probably like 4 or 5 hrs).
Thanks for testing it. Sounds like it's the right fix. |
It looks like Darwin has a pthread_kill system call that's like Linux's tkill. Could we use that instead to explicitly target the signal at the current thread? |
I don't know. The comment in runtime/sys_darwin_amd64.s on
|
As far as I can tell there is no … That might be easier to implement if we first fix #17200.
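For what the pthread_kill idea looks like at the user level, here is a cgo sketch (mine, not from the thread); the runtime makes raw system calls on Darwin rather than going through libc, so it could not simply call this C function, which is part of why the change is not trivial.

```go
package main

/*
#include <pthread.h>
#include <signal.h>

// Direct the signal at the calling thread only, unlike a process-directed
// kill, where the kernel picks the receiving thread.
static int raise_on_this_thread(int sig) {
	return pthread_kill(pthread_self(), sig);
}
*/
import "C"

import "fmt"

func main() {
	// SIGWINCH is ignored by default, so this only demonstrates the call.
	if rc := C.raise_on_this_thread(C.SIGWINCH); rc != 0 {
		fmt.Println("pthread_kill failed, error code:", rc)
	}
}
```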
At 43ed65f (Mar 13 2016), running all.bash. The only change in my client is in os/user/user_test.go, which is clearly not involved.
/cc @dvyukov @aclements