-
Notifications
You must be signed in to change notification settings - Fork 18k
test/stack.go fails intermittently on arm5 #1750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Labels
Comments
As requested here is the output from 10 passes of 100 runs of test/stack.go Methodology: % for j in {1..10} ; do (5g stack.go ; 5l stack.5 ; for i in {1..100} ; do ./5.out 2>&1 ; if [ $? -ne 0 ] ; then echo "FAIL $i" ; else echo "PASS $i" ; fi ; done) > ~/stack.out$j ; done summary: axentraserver(~) % grep -c FAIL stack.out* stack.out1:1 stack.out10:2 stack.out2:2 stack.out3:2 stack.out4:4 stack.out5:1 stack.out6:3 stack.out7:0 stack.out8:2 stack.out9:0 Attachments:
|
Now, with GOGC=off methodology: % for j in {1..10} ; do (5g stack.go ; 5l stack.5 ; for i in {1..100} ; do GOGC=off ./5.out 2>&1 ; if [ $? -ne 0 ] ; then echo "FAIL $i" ; else echo "PASS $i" ; fi ; done) > ~/stack-nogc.out$j ; done summary: % grep -c FAIL stack-gogc-off.out* stack-gogc-off.out1:4 stack-gogc-off.out10:1 stack-gogc-off.out2:1 stack-gogc-off.out3:1 stack-gogc-off.out4:0 stack-gogc-off.out5:3 stack-gogc-off.out6:3 stack-gogc-off.out7:5 stack-gogc-off.out8:2 stack-gogc-off.out9:2 Attachments:
|
Here is a sample output with GOGC=1. Runtimes were enormous but the results were the same % grep -c FAIL stack-gogc-1.out1 2 Attachments:
|
http://godashboard.appspot.com/log/eb7c64c4053af444efcaace011b111bcf33d033bd503124313b5648c463d6707 showed that test/chan/doubleselect.go is also a good test case. The stack trace is very different to deadlocks seen before throw: all goroutines are asleep - deadlock! [0x1051cfac] runtime.throw+0x40 /home/dfc/go/src/pkg/runtime/runtime.c:102 runtime.throw(0xcdf93, 0x1da94) [0x1051cfbc] nextgandunlock+0x174 /home/dfc/go/src/pkg/runtime/proc.c:393 nextgandunlock() [0x1051cfcc] schedule+0xe8 /home/dfc/go/src/pkg/runtime/proc.c:572 schedule(0x104e7000, 0x24800) [0x1051cfcc] runtime.mcall+0x34 /home/dfc/go/src/pkg/runtime/arm/asm.s:149 runtime.mcall(0x104e7000, 0x24800) [0x1051cfcc] main.sender /home/dfc/go/test/chan/doubleselect.go:21 main.sender(0x104e7000, 0x24800, 0x104e7d90, 0x0, 0x0, ...) [0x1051cfcc] main.sender /home/dfc/go/test/chan/doubleselect.go:21 main.sender(0x104e7000, 0x24800, 0x104e7d90, 0x0, 0x0, ...) [0x1051cfcc] main.sender /home/dfc/go/test/chan/doubleselect.go:21 main.sender(0x104e7000, 0x24800, 0x104e7d90, 0x0, 0x0, ...) I thought that calls though runtime.mcall were on different stacks? % grep -c FAIL double.out 23 Attachments:
|
mcall is what actually does the stack switch so it would appear in both and apparently confuses the backtrace routines. I think this is another send + received missed each other, except that there's a select doing one or the other op so we can't see the channel pointers on the stack trace. The real question is why the locks are not providing mutual exclusion. Either the lock implementation is broken in such a way that it only fails on ARM, or the Linux CAS implementation (which we jump to at ffff0fc0) is broken on your kernel (I looked at LXR's copy of that kernel without the -Netgear and it seemed okay). That's all I can think of. Russ |
Thanks Russ, (gdb) r Starting program: /home/dfc/go/src/cmd/gofix/5.out osopen.0:12:13: rewrote os.Open with O_CREATE but not O_TRUNC to os.Create osopen.0:13:2: rewrote os.Open to os.Create with permission not 0666 osopen.0:14:13: rewrote os.Open with O_CREATE but not O_TRUNC to os.Create osopen.0:15:2: rewrote os.Open to os.Create with permission not 0666 testdata/reflect.encode.go.in:299:7: cannot rewrite reflect type switch case with non-reflect type interfaceOrPtrValue Program received signal SIGSTOP, Stopped (signal). 0x0001f408 in ?? () (gdb) x/20i 0xffff0fc0 0xffff0fc0: Cannot access memory at address 0xffff0fc0 This should work, right ? |
Comment 11 by [email protected]: Same here (an old android G1): Linux localhost 2.6.27-00392-g8312baf #1 PREEMPT Sat May 23 20:19:35 CDT 2009 armv6l unknown Qualcomm MSM7201A (ARM11 core) (gdbserver attached to some process) Remote debugging using :1234 0xafe0d3e4 in ?? () (gdb) x/20i 0xffff0fc0 0xffff0fc0: Cannot access memory at address 0xffff0fc0 (gdb) x/20i 0xafe0d3e4 0xafe0d3e4: ldmia sp!, {r4, r7} 0xafe0d3e8: bx lr ... Here some related infos: http://lwn.net/Articles/314561/ http://osdir.com/ml/android-ndk/2010-06/msg00534.html and the code in bionic: http://www.google.com/codesearch/p?hl=en#XAzRy8oK4zA/libc/arch-arm/bionic/atomics_arm.S Martin |
Comment 12 by [email protected]: Hmm, it's there: arm5: arm-none-linux-gnueabi-objdump -D --target=binary -EL -z -w --architecture=arm mdump.dat mdump.dat: file format binary Disassembly of section .data: 00000000 <.data>: 0: e5923000 ldr r3, [r2] 4: e0533000 subs r3, r3, r0 8: 05821000 streq r1, [r2] c: e2730000 rsbs r0, r3, #0 10: e12fff1e bx lr 14: e1a00000 nop ; (mov r0, r0) 18: e1a00000 nop ; (mov r0, r0) 1c: e1a00000 nop ; (mov r0, r0) 20: e59f0008 ldr r0, [pc, #8] ; 0x30 24: e12fff1e bx lr 28: 00000000 andeq r0, r0, r0 2c: 00000000 andeq r0, r0, r0 30: 000924c0 andeq r2, r9, r0, asr #9 34: 00000000 andeq r0, r0, r0 38: 00000000 andeq r0, r0, r0 3c: 00000003 andeq r0, r0, r3 arm >=6: 00000000 <.data>: 0: e1923f9f ldrex r3, [r2] 4: e0533000 subs r3, r3, r0 8: 01823f91 strexeq r3, r1, [r2] c: 03330001 teqeq r3, #1 10: 0afffffa beq 0x0 14: e2730000 rsbs r0, r3, #0 18: e12fff1e bx lr 1c: 00000000 andeq r0, r0, r0 20: e59f0008 ldr r0, [pc, #8] ; 0x30 24: e12fff1e bx lr 28: 00000000 andeq r0, r0, r0 2c: 00000000 andeq r0, r0, r0 30: 000924c0 andeq r2, r9, r0, asr #9 34: 00000000 andeq r0, r0, r0 38: 00000000 andeq r0, r0, r0 3c: 00000003 andeq r0, r0, r3 |
Hi Russ, You were right, it is the locking. I patched in the old arm5 asm from http://code.google.com/p/go/source/detail?r=5a47c03cf0 and now cannot cause tests like gofix or chan/doubleselect.go to deadlock. I'm going to setup a few tests cases and run them overnight to see if there are any lingering deadlocks. Cheers Dave |
Okay, that's good to know. Thanks very much. Now we have to figure out the right fix. I don't want to have to select different assembly for each chip during the build process. That's the whole reason the kernel leaves an implementation at ffff0fc0. On ARM 5 the kernel is supposed to treat the cas implementation as a restartable atomic sequence: if a process gets interrupted in the middle of the sequence the kernel restarts it at the beginning, so that the read+compare+write is guaranteed not to have been interrupted and therefore ran (effectively) atomically. We must be doing something different compared to the C library that is making the kernel not restart the sequence properly. Do you know where the restarting code is? Russ |
Hi Russ, With the old arm5 swp code patched back in I was able complete 30 runs of test/run without a single failure. For reference I've uploaded the source published by Netgear, from a cursory look this is probably stock 2.6.22 plus some drivers. https://bitbucket.org/davecheney/2.6.22.18-netgear/src/e87e5d833b12/arch/arm/kernel/entry-armv.S#cl-770 From the comment block the 'restarting' is apparently done by clearing 'z' on a possible return from kernel space if the address of the faulting instruction (waves hands) https://bitbucket.org/davecheney/2.6.22.18-netgear/src/e87e5d833b12/arch/arm/kernel/entry-armv.S#cl-346 |
I'm sure that fixes the problem. I'd just like to figure out how to work with the Linux kernel implementation if at all possible, because it means we don't have to build binaries targeted to specific processors. Thank you! I see the problem. We are relying on the compare-and-swap doing a real compare-and-swap but in fact this implementation can return a false negative. This is fixed in later versions of the kernel (2.6.24+) after this change: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b49c0f24cf6744a3f4fd09289fe7cade349dead5 As that change description says, most uses are in a loop so it doesn't matter. But there is one - in chan.c - that depends on the compare and swap never returning a false negative. Please try 4436072. Russ |
This issue was closed by revision 540feaa. Status changed to Fixed. |
rsc
added a commit
that referenced
this issue
May 11, 2015
««« CL 4436072 / e280d98747be runtime, sync/atomic: fix arm cas Works around bug in kernel implementation on old ARM5 kernels. Bug was fixed on 26 Nov 2007 (between 2.6.23 and 2.6.24) but old kernels persist. Fixes #1750. R=dfc, golang-dev CC=golang-dev https://golang.org/cl/4436072 »»» TBR=adg CC=golang-dev https://golang.org/cl/4452064
This issue was closed.
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
The text was updated successfully, but these errors were encountered: