test: locklinear.go is flaky #19276
Comments
cc @rsc |
Either flaky or detecting a real problem. I'll investigate on the dragonfly/amd64, assuming I can gomote that. |
Can't gomote. Can you try this on your s390x system: cp locklinear.go.txt locklinear.go and see what you get? |
Will do. The non-linearity I reported in the issue description appears to be operating system related. That machine runs RHEL 7.2 whereas the other machines I have access to run SLES 12 SP1 (the builder runs this) and Ubuntu 16.04 which both show linear behaviour. |
It crossed my mind that we might be tickling an OS problem, but I don't 100% see how, at least not in the locking being tested, which is all in user space. Unless the machine has memory for 64,000 goroutines but not memory for 128,000 goroutines. Seems unlikely, and the test tries smaller numbers before getting that far, and those failed too. Another worthwhile experiment is to rerun with 'const debug = false' changed to true. |
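For context, here is a minimal sketch of the kind of contended-lock measurement under discussion, with a debug switch like the one mentioned above. This is not the real locklinear.go; the names are hypothetical, and the assumption that flipping debug on wraps the run in a CPU profile (via runtime/pprof) is mine, based on the profile referenced in the next comment.

```go
// Hypothetical sketch, not the real locklinear.go: n goroutines each
// lock/unlock one shared mutex, and we time how long the whole burst takes.
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
	"sync"
	"time"
)

const debug = true // the constant mentioned above, flipped from false

// timeLocks measures how long n goroutines take to each lock and unlock
// a single shared mutex once.
func timeLocks(n int) time.Duration {
	var mu sync.Mutex
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			mu.Unlock()
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	if debug {
		// Assumption: debug mode writes a CPU profile so wall-clock time
		// can be compared against on-CPU time.
		f, err := os.Create("lock.prof")
		if err != nil {
			panic(err)
		}
		defer f.Close()
		if err := pprof.StartCPUProfile(f); err != nil {
			panic(err)
		}
		defer pprof.StopCPUProfile()
	}
	for _, n := range []int{1000, 64000, 128000} {
		fmt.Println(n, timeLocks(n))
	}
}
```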
Ran that 3 times on the builder (SLES 12 SP1); I don't think there are enough samples to see anything interesting:
|
Did the execution take 13 seconds while the profile only shows 2.2ms? If so, nearly all of the time was spent off-CPU (blocked in the kernel) rather than in the user-space locking code, and that does suggest the OS. :-) |
Another option is Linux perf if you have kernel symbols. |
The openbsd problem appears to be real flakiness: I see it get a little too close to the maximum number of tries before succeeding when using a gomote, but I've never seen it fail. I did see it fail in a trybot just now on one of my own CLs. I sent a CL to deflake that, and also to add more debugging output in the event of a failure. Because you are seeing consistent failures on s390x I don't expect my CL will help, but it may give us more information. |
CL https://golang.org/cl/37348 mentions this issue. |
The non-linearity I was seeing on RHEL 7.2 was because I was accidentally using an old version of Go on that machine. When I re-ran with tip it passed the test very reliably, usually needing only a couple of iterations. If I run the test on its own on the s390x builder it also seems to run reliably. When it is run as part of the test directory, however, it runs in parallel with other tests, including the compilation of those tests. The number of tests run in parallel is equal to the number of CPUs, 4 on the s390x builder, which means 3 other test files could be being compiled and run at the same time this one is executing. That seems bad for a test that requires consistent execution times; could it be the source of some of the flakiness? |
@mundaym, we can make a "test/serial" or "test/isolated" directory and have a separate cmd/dist test unit just for that directory with no parallelism. |
This should help on the openbsd systems where the test mostly passes. I don't expect it to help on s390x where the test reliably fails. But it should give more information when it does fail. For #19276. Change-Id: I496c291f2b4b0c747b8dd4315477d87d03010059 Reviewed-on: https://go-review.googlesource.com/37348 Run-TryBot: Russ Cox <[email protected]> Reviewed-by: Brad Fitzpatrick <[email protected]>
Let's see what the deflaking does before we add more mechanism to the test directory. We've been running maplinear with no problem at all for a long time. |
A new type of locklinear failure, out of memory: https://build.golang.org/log/73611f9141109e3c9dc158791a19783b64d2af90
|
More dragonfly failures: https://build.golang.org/log/f504bb806ac24358bbdddb7e5e3e118f9f971a84 The numbers are very noisy, quite often the
Maybe it would be better to drop the constraint that the 2N run must take at least twice as long as the N run? |
Before I had that constraint I saw the test passing incorrectly because the constant factors dominated the overhead. For example the 1000 vs 2000 being 15ms vs 13ms, clearly we're not testing linear growth at that point. CL 37543 for another attempt. If that fails then I'll (reluctantly) remove the lower bound entirely. |
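A rough sketch of the shape of the ratio check being discussed, including the lower bound that keeps constant factors (the 15ms-vs-13ms case above) from passing spuriously. The function name, thresholds, and retry policy here are illustrative only; the real test's values and failure handling differ in detail.

```go
package main

import (
	"fmt"
	"time"
)

// checkLinearish times f(n) and f(2n) and accepts when the ratio falls in
// [2, 2.5): at least 2x, so that constant overhead cannot make a noisy pair
// of timings pass by accident, and comfortably below the ~4x a quadratic
// implementation would show. If the runs are too short to measure reliably,
// n is doubled and the measurement is retried.
func checkLinearish(f func(n int)) error {
	timeF := func(n int) time.Duration {
		start := time.Now()
		f(n)
		return time.Since(start)
	}
	n := 1000
	for tries := 0; tries < 10; tries++ {
		t1, t2 := timeF(n), timeF(2*n)
		ratio := float64(t2) / float64(t1)
		if 2 <= ratio && ratio < 2.5 {
			return nil // growth looks linear
		}
		if t2 < 10*time.Millisecond {
			// Too fast for the timer granularity; constant factors dominate,
			// so grow the problem size and try again.
			n *= 2
		}
	}
	return fmt.Errorf("timings never settled into a ~2x ratio near n=%d", n)
}

var sink int

func main() {
	// A deliberately quadratic workload: the check should reject it.
	quadratic := func(n int) {
		s := 0
		for i := 0; i < n; i++ {
			for j := 0; j < n; j++ {
				s += j
			}
		}
		sink = s
	}
	fmt.Println(checkLinearish(quadratic))
}
```

Run against a quadratic workload like the one in main, the check fails deterministically; run against a real contended mutex, it is at the mercy of timer granularity and background load, which is exactly the flakiness being discussed in this thread.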
I've been seeing the out-of-memory locklinear failures on trybots: |
487 is ERROR_INVALID_ADDRESS, from the failure line "runtime: VirtualAlloc of 385548288 bytes failed with errno=487". Alex |
My pending CL should help: it should pass with smaller N. I also cut the max N (really the max time) by about 2X. |
The dragonfly builder is still failing occasionally unfortunately. https://build.golang.org/log/e813e071d00d101e749409db47569a0dbda9620a |
Trybot failure on openbsd/amd64: |
Another one on the dragonfly builder just now: https://build.golang.org/log/a1a3edd60e8e373bc06f9ad09e74faae59491793 |
This just happened locally (darwin/amd64). Interestingly, when it happened, the rest of the test directory did not complete; all.bash hung. I wonder whether that was cause or symptom. Unfortunately, the stack traces all printed together, so they're pretty illegible, but here's the result of the SIGQUIT. The top stack trace--where the process was hung--is the panic line in locklinear; it appears the process did not exit as a result of the panic, perhaps due to some bad internal mutex state (?).
|
@josharian Was the compiler broken? (Fair question since I assume you're working on it.) |
Definitely a fair question. :) I'm pretty confident that the answer was no--the only local modifications were well-trodden and uninteresting refactorings, and the rest of the tests had already passed. |
OK, are you saying that even though the ^\ appears later, you typed it before the locklinear panic stack trace appeared? Just a little confused about ordering. The transcript makes it look like locklinear exited (and the exit was reported) before the SIGQUIT. |
Sorry. You're right, that's what happened. (Too much context-switching for my single-threaded brain.) So I guess there's nothing new/interesting here, just another failure. |
OK, I will expand the valid ratios from [2, 2.5) to [2, 3) and also just give up in the test if we consistently find that N takes longer than 2*N. |
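A sketch of what that relaxation might look like as drop-in changes to the loop body of the checkLinearish sketch above, assuming an `inversions` counter declared alongside n there. Treating "give up" as returning success rather than failing is an assumption on my part, consistent with not wanting spurious failures on overloaded machines.

```go
// Widened acceptance window and a give-up path for hopelessly noisy timings.
if 2 <= ratio && ratio < 3 { // window widened from [2, 2.5) to [2, 3)
	return nil // growth looks linear
}
if t2 < t1 {
	// f(2N) finished faster than f(N): the measurements are pure noise.
	if inversions++; inversions >= 5 {
		return nil // give up instead of reporting a meaningless failure
	}
}
```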
CL https://golang.org/cl/39591 mentions this issue. |
Another one: https://storage.googleapis.com/go-build-log/265ee2c5/linux-386_4695dbaa.log
|
CL https://golang.org/cl/42431 mentions this issue. |
5 shards, each of which spins up NumCPU processes, each of which is running at GOMAXPROCS=NumCPU, is too much for one machine. It makes my laptop unusable. It might also be in part responsible for test flakes that require a moderately responsive system, like #18589 (backedge scheduling) and #19276 (locklinear). It's possible that Go should be a better neighbor in general; that's #17969. In the meantime, fix this corner of the world. Builders snapshot the world and run shards on different machines, so keeping sharding high for them is good. This is a partial reversion of CL 18199. Fixes #20141. Change-Id: I123cf9436f4f4da3550372896265c38117b78071 Reviewed-on: https://go-review.googlesource.com/42431 Reviewed-by: Brad Fitzpatrick <[email protected]>
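To put rough numbers on that commit message: on an 8-core machine, 5 shards, each spinning up NumCPU = 8 test processes, each running at GOMAXPROCS = 8, allows on the order of 5 × 8 × 8 = 320 runnable threads competing for 8 cores, so any single timing measurement in a test like locklinear can be delayed by an arbitrary amount of run-queue waiting.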
I haven't seen this in a while. Maybe fixed? |
I'm happy. |
Seen on the builders (page 1):
linux/s390x: https://build.golang.org/log/addcd4b8de793f6b79065f2dc9579f06115d4079
openbsd/386: https://build.golang.org/log/08035f048b8b36a58aeaf06a8fa90d75377f3417
dragonfly/amd64: https://build.golang.org/log/dfc83b1a92a75b3fbff84c1675ecffb0ab6fd214
dragonfly/amd64: https://build.golang.org/log/cb41b327d857187c92ec62aef17cde179d60db8f
dragonfly/amd64: https://build.golang.org/log/a7ac7ef8d0cdcffc3a52e1bb2751391e49d13ef7
dragonfly/amd64: https://build.golang.org/log/a8aec06e3e8be117ce5500d822f25bbaf2e78a86
dragonfly/amd64: https://build.golang.org/log/c36d2439a65b4614fe1ffb33f3f6aa0f637136c0
The error looks similar to:
I also can't get this test to pass at all on a linux/s390x machine set up for performance testing; it looks like the relationship isn't actually linear, and instead there is a 3-4x jump every time n is doubled. I'll open another issue for that once I have more information. EDIT: GOROOT was set and pointed to an old version of Go. I'm not sure this test can be made to run reliably on virtual machines as-is; perhaps there is an alternative way to test this behavior?