os/signal: TestNohup flaky #33174
Same symptom on
Note that the case in #33174 (comment) is not the same symptom, but rather the reverse:
I think all these cases could be explained if there is an unusually long time, more than 100 milliseconds, between calling
Change https://golang.org/cl/194879 mentions this issue:
Still flaky, unfortunately:
Both of those failures are with
Change https://golang.org/cl/226138 mentions this issue:
Use a uniform function (named "quiesce") to wait for possible signals in a way that gives the kernel many opportunities to deliver them.

Simplify channel usage and concurrency in stress tests.

Use (*testing.T).Deadline instead of parsing the deadline in TestMain.

In TestStop, sleep forever in a loop if we expect the test to die from a signal. That should reduce the flakiness of TestNohup, since TestStop will no longer spuriously pass when run as a subprocess of TestNohup.

Since independent signals should not interfere, run the different signals in TestStop in parallel when testing in short mode.

Since TestNohup runs TestStop as a subprocess, and TestStop needs to wait many times for signals to quiesce, run its test subprocesses concurrently and in short mode, reducing the latency of that test by more than a factor of 2.

The above two changes reduce the running time of TestNohup on my workstation to ~345ms, making it possible to run much larger counts of the test in the same amount of wall time. If the test remains flaky after this CL, we can spend all or part of that latency improvement on a longer settle time.

Updates #33174

Change-Id: I09206f213d8c1888b50bf974f965221a5d482419
Reviewed-on: https://go-review.googlesource.com/c/go/+/226138
Run-TryBot: Bryan C. Mills <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
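For readers unfamiliar with the test, a quiesce-style helper along the lines the CL describes might look roughly like this (a sketch only; the settleTime value and the sleep granularity are assumptions, not the exact values used in the os/signal tests):

```go
package signaltest

import "time"

// settleTime is an assumed baseline; the real tests tune this per builder.
const settleTime = 100 * time.Millisecond

// quiesce sleeps in small increments rather than one long sleep, giving the
// kernel many opportunities to wake a thread and deliver any pending signal
// before the test moves on.
func quiesce() {
	start := time.Now()
	for time.Since(start) < settleTime {
		time.Sleep(settleTime / 10)
	}
}
```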
Change https://golang.org/cl/226458 mentions this issue:
Change https://golang.org/cl/226461 mentions this issue:
…SIGUSR1 on Android

In CL 226138, I updated TestStop to have more uniform behavior for its signals. However, that test seems to always fail for SIGUSR1 on the Android ARM builders. I'm not sure what's special about Android for this particular case, but let's skip the test to unbreak the builders while I investigate.

For #38165
Updates #33174

Change-Id: I35a70346cd9757a92acd505a020bf95e6871405c
Reviewed-on: https://go-review.googlesource.com/c/go/+/226458
Run-TryBot: Bryan C. Mills <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
For reasons unknown, SIGUSR1 appears to be blocked at process start for tests on the android-arm-corellium and android-arm64-corellium builders. (This has been observed before, too: see CL 203957.)

Make the test resilient to blocked signals by always calling Notify and waiting for potential signal delivery after sending any signal that is not known to be unblocked.

Also remove the initial SIGWINCH signal from testCancel. The behavior of an unhandled SIGWINCH is already tested in TestStop, so we don't need to re-test that same case: waiting for an unhandled signal takes a comparatively long time (because we necessarily don't know when it has been delivered), so this redundancy makes the overall test binary needlessly slow, especially since it is called from both TestReset and TestIgnore.

Since each signal is always unblocked while we have a notification channel registered for it, we don't need to modify any other tests: TestStop and testCancel are the only functions that send signals without a registered channel.

Fixes #38165
Updates #33174
Updates #15661

Change-Id: I215880894e954b62166024085050d34323431b63
Reviewed-on: https://go-review.googlesource.com/c/go/+/226461
Run-TryBot: Bryan C. Mills <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
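The "Notify, send, then wait" pattern the CL describes can be sketched as follows (the helper name, signature, and settle duration are illustrative assumptions, not the real test code):

```go
package signaltest

import (
	"os"
	"os/signal"
	"syscall"
	"testing"
	"time"
)

// sendUncaughtSignal registers a channel with Notify before raising a signal
// that may be blocked at process start, then waits for possible delivery so a
// blocked signal cannot leak into a later test.
func sendUncaughtSignal(t *testing.T, sig syscall.Signal) {
	c := make(chan os.Signal, 1)
	signal.Notify(c, sig)
	defer signal.Stop(c)

	if err := syscall.Kill(syscall.Getpid(), sig); err != nil {
		t.Fatal(err)
	}

	// Either the signal arrives, or (if it is blocked) we time out after a
	// settle period and move on without disturbing later tests.
	select {
	case <-c:
	case <-time.After(100 * time.Millisecond):
	}
}
```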
Change https://golang.org/cl/227649 mentions this issue:
I noticed a timeout in TestIgnore in https://build.golang.org/log/52d83a72f3a5ea9a16eb5d670c729694144f9624, which suggests that the settle time is currently set too low.

I've also added a check for the same GO_TEST_TIMEOUT_SCALE used in TestTerminalSignal, so that if this builder remains too slow we can increase the builder's scale factor rather than the test's baseline running time.

Updates #33174

Change-Id: I18b10eaa3bb5ae2f604300aedaaf6f79ee7ad567
Reviewed-on: https://go-review.googlesource.com/c/go/+/227649
Run-TryBot: Bryan C. Mills <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
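As a rough illustration of the scaling described here (the baseline value, variable name, and init-time wiring are assumptions, not the actual test code):

```go
package signaltest

import (
	"os"
	"strconv"
	"time"
)

// settleTime starts at an assumed 100ms baseline and is stretched by
// GO_TEST_TIMEOUT_SCALE, so a slow builder raises its own scale factor
// instead of raising the test's baseline running time for everyone.
var settleTime = 100 * time.Millisecond

func init() {
	if scale, err := strconv.Atoi(os.Getenv("GO_TEST_TIMEOUT_SCALE")); err == nil && scale > 1 {
		settleTime *= time.Duration(scale)
	}
}
```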
Rats, still flaky even with a longer settle time:

I'm tempted to dramatically overinflate the settle time on this builder just to see if it's actually dropping the signal.
Change https://golang.org/cl/228266 mentions this issue:
@rorth, are there any known signal-dropping bugs on Oracle Solaris that may account for this flakiness?
"Bryan C. Mills" <[email protected]> writes:
> @rorth, are there any known signal-dropping bugs on Oracle Solaris that may
> account for this flakiness?

None that I'm aware of.

When this first came up, I tried to do some debugging, but the experience was horrible. With golang, I couldn't get any code with sufficient debug info to learn anything. Still, the stacktrace below is obviously from golang.

Then I went to gccgo, rebuilt all of libgo with -g3 -O0, and could reproduce the problem every once in a while. However, I didn't manage to make much of the stacktrace I saw (ISTR that I only got the panic when running with coredumps enabled, not while running under gdb interactively) and so ultimately gave up.
I'd been running like this:
$ for x in `seq 1 1000`; do ~/net.test -test.timeout=10m0s -test.count=1 -test.run TestUnixAndUnixpacketServer; done
…--- FAIL: TestUnixAndUnixpacketServer (10.00s)
server_test.go:152: skipping unix @nettest/go/unix test
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x5c3c60]
goroutine 19 [running]:
panic(0x64d5a0, 0x858fa0)
/var/gcc/go/goroot/src/runtime/panic.go:1054 +0x420 fp=0xc0000c9a50 sp=0xc0000c99a8 pc=0x434af0
testing.tRunner.func1(0xc0001d6120)
/var/gcc/go/goroot/src/testing/testing.go:916 +0x604 fp=0xc0000c9b40 sp=0xc0000c9a50 pc=0x4d9b34
runtime.call32(0x0, 0x69da70, 0xc000132048, 0x800000008)
/var/gcc/go/goroot/src/runtime/asm_amd64.s:539 +0x3b fp=0xc0000c9b70 sp=0xc0000c9b40 pc=0x4632cb
runtime.reflectcallSave(0xc0000c9c90, 0x69da70, 0xc000132048, 0x8)
/var/gcc/go/goroot/src/runtime/panic.go:873 +0x58 fp=0xc0000c9ba0 sp=0xc0000c9b70 pc=0x434628
runtime.runOpenDeferFrame(0xc000184180, 0xc000132000, 0xc000198138)
/var/gcc/go/goroot/src/runtime/panic.go:847 +0x2b9 fp=0xc0000c9c30 sp=0xc0000c9ba0 pc=0x4344e9
panic(0x64d5a0, 0x858fa0)
/var/gcc/go/goroot/src/runtime/panic.go:961 +0x15d fp=0xc0000c9cd8 sp=0xc0000c9c30 pc=0x43482d
runtime.panicmem(...)
/var/gcc/go/goroot/src/runtime/panic.go:212
runtime.sigpanic()
/var/gcc/go/goroot/src/runtime/signal_unix.go:667 +0x3da fp=0xc0000c9d08 sp=0xc0000c9cd8 pc=0x44ac7a
net.TestUnixAndUnixpacketServer(0xc0001d6120)
/var/gcc/go/goroot/src/net/server_test.go:197 +0x670 fp=0xc0000c9f80 sp=0xc0000c9d08 pc=0x5c3c60
testing.tRunner(0xc0001d6120, 0x69d188)
/var/gcc/go/goroot/src/testing/testing.go:954 +0xdc fp=0xc0000c9fd0 sp=0xc0000c9f80 pc=0x4d5adc
runtime.goexit()
/var/gcc/go/goroot/src/runtime/asm_amd64.s:1375 +0x1 fp=0xc0000c9fd8 sp=0xc0000c9fd0 pc=0x465081
created by testing.(*T).Run
/var/gcc/go/goroot/src/testing/testing.go:1005 +0x357
goroutine 1 [chan receive]:
runtime.gopark(0x69d638, 0xc00019a238, 0x170e, 0x2)
/var/gcc/go/goroot/src/runtime/proc.go:304 +0xe0 fp=0xc00007bb48 sp=0xc00007bb28 pc=0x437790
runtime.chanrecv(0xc00019a1e0, 0xc00007bc5f, 0xc000000101, 0x4d5ea7)
/var/gcc/go/goroot/src/runtime/chan.go:563 +0x338 fp=0xc00007bbd8 sp=0xc00007bb48 pc=0x407078
runtime.chanrecv1(0xc00019a1e0, 0xc00007bc5f)
/var/gcc/go/goroot/src/runtime/chan.go:433 +0x2b fp=0xc00007bc08 sp=0xc00007bbd8 pc=0x406ceb
testing.(*T).Run(0xc0001d6120, 0x692710, 0x1b, 0x69d188, 0x4a5401)
/var/gcc/go/goroot/src/testing/testing.go:1006 +0x37e fp=0xc00007bcb8 sp=0xc00007bc08 pc=0x4d5ece
testing.runTests.func1(0xc0001d6000)
/var/gcc/go/goroot/src/testing/testing.go:1247 +0x78 fp=0xc00007bd08 sp=0xc00007bcb8 pc=0x4d9c08
testing.tRunner(0xc0001d6000, 0xc00007bde8)
/var/gcc/go/goroot/src/testing/testing.go:954 +0xdc fp=0xc00007bd58 sp=0xc00007bd08 pc=0x4d5adc
testing.runTests(0xc0001ce040, 0x86be40, 0xe4, 0xe4, 0x0)
/var/gcc/go/goroot/src/testing/testing.go:1245 +0x2a7 fp=0xc00007be18 sp=0xc00007bd58 pc=0x4d73d7
testing.(*M).Run(0xc0000b6000, 0x0)
/var/gcc/go/goroot/src/testing/testing.go:1162 +0x15f fp=0xc00007bef8 sp=0xc00007be18 pc=0x4d632f
net.TestMain(0xc0000b6000)
/var/gcc/go/goroot/src/net/main_test.go:52 +0x39 fp=0xc00007bf20 sp=0xc00007bef8 pc=0x5ad739
main.main()
_testmain.go:564 +0x135 fp=0xc00007bf88 sp=0xc00007bf20 pc=0x61aaa5
runtime.main()
/var/gcc/go/goroot/src/runtime/proc.go:203 +0x212 fp=0xc00007bfe0 sp=0xc00007bf88 pc=0x4373a2
runtime.goexit()
/var/gcc/go/goroot/src/runtime/asm_amd64.s:1375 +0x1 fp=0xc00007bfe8 sp=0xc00007bfe0 pc=0x465081
goroutine 2 [force gc (idle)]:
runtime.gopark(0x69d828, 0x86e0d0, 0x1411, 0x1)
/var/gcc/go/goroot/src/runtime/proc.go:304 +0xe0 fp=0xc000068fb0 sp=0xc000068f90 pc=0x437790
runtime.goparkunlock(...)
/var/gcc/go/goroot/src/runtime/proc.go:310
runtime.forcegchelper()
/var/gcc/go/goroot/src/runtime/proc.go:253 +0xb7 fp=0xc000068fe0 sp=0xc000068fb0 pc=0x437647
runtime.goexit()
/var/gcc/go/goroot/src/runtime/asm_amd64.s:1375 +0x1 fp=0xc000068fe8 sp=0xc000068fe0 pc=0x465081
created by runtime.init.6
/var/gcc/go/goroot/src/runtime/proc.go:242 +0x35
goroutine 3 [GC sweep wait]:
runtime.gopark(0x69d828, 0x86e2c0, 0x140c, 0x1)
/var/gcc/go/goroot/src/runtime/proc.go:304 +0xe0 fp=0xc0000697a8 sp=0xc000069788 pc=0x437790
runtime.goparkunlock(...)
/var/gcc/go/goroot/src/runtime/proc.go:310
runtime.bgsweep(0xc000090000)
/var/gcc/go/goroot/src/runtime/mgcsweep.go:70 +0x9c fp=0xc0000697d8 sp=0xc0000697a8 pc=0x42316c
runtime.goexit()
/var/gcc/go/goroot/src/runtime/asm_amd64.s:1375 +0x1 fp=0xc0000697e0 sp=0xc0000697d8 pc=0x465081
created by runtime.gcenable
/var/gcc/go/goroot/src/runtime/mgc.go:214 +0x5c
goroutine 4 [GC scavenge wait]:
runtime.gopark(0x69d828, 0x86e280, 0x140d, 0x1)
/var/gcc/go/goroot/src/runtime/proc.go:304 +0xe0 fp=0xc000069f50 sp=0xc000069f30 pc=0x437790
runtime.goparkunlock(...)
/var/gcc/go/goroot/src/runtime/proc.go:310
runtime.bgscavenge(0xc000090000)
/var/gcc/go/goroot/src/runtime/mgcscavenge.go:220 +0xd3 fp=0xc000069fd8 sp=0xc000069f50 pc=0x421953
runtime.goexit()
/var/gcc/go/goroot/src/runtime/asm_amd64.s:1375 +0x1 fp=0xc000069fe0 sp=0xc000069fd8 pc=0x465081
created by runtime.gcenable
/var/gcc/go/goroot/src/runtime/mgc.go:215 +0x7e
goroutine 18 [finalizer wait]:
runtime.gopark(0x69d828, 0x88aa70, 0x1410, 0x1)
/var/gcc/go/goroot/src/runtime/proc.go:304 +0xe0 fp=0xc000064758 sp=0xc000064738 pc=0x437790
runtime.goparkunlock(...)
/var/gcc/go/goroot/src/runtime/proc.go:310
runtime.runfinq()
/var/gcc/go/goroot/src/runtime/mfinal.go:175 +0xa3 fp=0xc0000647e0 sp=0xc000064758 pc=0x418f63
runtime.goexit()
/var/gcc/go/goroot/src/runtime/asm_amd64.s:1375 +0x1 fp=0xc0000647e8 sp=0xc0000647e0 pc=0x465081
created by runtime.createfing
/var/gcc/go/goroot/src/runtime/mfinal.go:156 +0x61
Abort (core dumped)
…el builder

This is an attempt to distinguish between a dropped signal and general builder slowness.

The previous attempt (increasing the settle time to 250ms) still resulted in a timeout:
https://build.golang.org/log/dd62939f6d3b512fe3e6147074a9c6db1144113f

For #33174

Change-Id: I79027e91ba651f9f889985975f38c7b01d82f634
Reviewed-on: https://go-review.googlesource.com/c/go/+/228266
Run-TryBot: Bryan C. Mills <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
No timeouts on
Unfortunately, it seems that 100ms may be too short on the other builders:
Change https://golang.org/cl/232378 mentions this issue:
Changing milestone to 1.16. |
Moving the milestone to Go1.17, but good to note that we haven't seen this flake in a long time. |
The errors seem to consistently be of the form
(for differing signals). This is where the process sends itself the signal, and then waits for it to arrive.
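A standalone program exercising that same send-to-self-and-wait step might look like this (a sketch; the choice of SIGWINCH and the 100ms wait are placeholders, not what the test actually uses):

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Register for the signal, send it to this process, then wait for delivery.
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGWINCH)

	if err := syscall.Kill(syscall.Getpid(), syscall.SIGWINCH); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	select {
	case s := <-c:
		fmt.Println("received", s)
	case <-time.After(100 * time.Millisecond):
		fmt.Println("timed out waiting for signal") // the failure mode seen in the logs
	}
}
```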
Every failure is with
That run is failing while waiting for a signal, although the signal that it is waiting for varies.
The failure always occurs on a separate run of
This is a real stretch, but as far as I can tell there are no atomic operations between setting
So while I doubt this is the problem, I'll try removing the test of
Change https://golang.org/cl/312131 mentions this issue:
Signals can be delivered on a different thread. There is no necessary happens-before relationship between setting sig.inuse in signal_enable and checking it in sigsend. It is theoretically possible, if unlikely, that sig.inuse is set by thread 1, thread 2 receives a signal, does not see that sig.inuse is set, and discards the signal. This could happen if the signal is received immediately after the first call to signal_enable.

For #33174

Change-Id: Idb0f1c77847b7d4d418bd139e801c0c4460531d2
Reviewed-on: https://go-review.googlesource.com/c/go/+/312131
Trust: Ian Lance Taylor <[email protected]>
Run-TryBot: Ian Lance Taylor <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Bryan C. Mills <[email protected]>
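To illustrate the ordering hazard in isolation (a simplified sketch with renamed variables, not the runtime's actual signal_enable/sigsend code: "enabled" stands in for sig.inuse):

```go
// Package sigorder sketches the memory-ordering hazard described above.
// Without a synchronizing (atomic) operation, the thread that receives a
// signal may not observe the store made by the thread that enabled handling,
// and would drop the signal.
package sigorder

import "sync/atomic"

var enabled atomic.Bool // Go 1.19+; an atomic.StoreUint32/LoadUint32 pair works on older versions

// enable is called by the thread turning signal handling on.
func enable() {
	enabled.Store(true) // atomic store publishes the flag to other threads
}

// shouldDeliver runs on whichever thread receives the signal; the atomic load
// is guaranteed to observe an enable() that happened before it.
func shouldDeliver() bool {
	return enabled.Load()
}
```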
Change https://go.dev/cl/403199 mentions this issue:
The settleTime is arbitrary. Ideally we should refactor the test to avoid it (using subprocesses instead of sleeps to isolate tests from each others' delayed signals), but as a shorter-term workaround let's try scaling it back to match linux/ppc64 (the other builder that empirically requires a longer settleTime).

For #51054.
Updates #33174.

Change-Id: I574fffaadd74c52c13d63974e87f20b6d3cf3c4b
Reviewed-on: https://go-review.googlesource.com/c/go/+/403199
TryBot-Result: Gopher Robot <[email protected]>
Auto-Submit: Bryan Mills <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Run-TryBot: Bryan Mills <[email protected]>
Seen on the solaris-amd64-oraclerel builder in https://build.golang.org/log/d5185d101eddc0f5a05103f11b74caf24421bf46:
See previously #8682, #5526.
CC @ianlancetaylor