Skip to content

test/stack.go fails intermittently on arm5 #1750

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
davecheney opened this issue Apr 29, 2011 · 20 comments
Closed

test/stack.go fails intermittently on arm5 #1750

davecheney opened this issue Apr 29, 2011 · 20 comments

Comments

@davecheney
Copy link
Contributor

The linux-arm-arm5 dashboard build is unreliable, failing often during the src/all.bash
build with "throw: all goroutines are asleep - deadlock!". This issue spawns
from a discussion on golang-dev,

http://groups.google.com/group/golang-dev/browse_thread/thread/66223a50df58a39c/07a6b34f183bd381?lnk=gst&;q=arm5+#07a6b34f183bd381

What is the expected output?

Correct programs should not deadlock

What do you see instead?

throw: all goroutines are asleep - deadlock!

Which compiler are you using (5g, 6g, 8g, gccgo)?

Which operating system are you using?

% uname -a
Linux axentraserver.buckminster.mystora.com 2.6.22.18-Netgear #16 Sun Jun 13 19:54:46
EDT 2010 armv5tejl armv5tejl armv5tejl GNU/Linux

The OS is reportedly a varient of RHEL5 compiled for arm5

% cat /etc/redhat-release 
HipServ

Which revision are you using?  (hg identify)

+tip (this machine runs the linux-arm-arm5) dashboard build

Please provide any additional information below.

% cat /proc/cpuinfo 
Processor   : ARM926EJ-S rev 1 (v5l)
BogoMIPS    : 992.87
Features    : swp half thumb fastmult edsp 
CPU implementer : 0x56
CPU architecture: 5TE
CPU variant : 0x2
CPU part    : 0x131
CPU revision    : 1
Cache type  : write-back
Cache clean : cp15 c7 ops
Cache lockdown  : format C
Cache format    : Harvard
I size      : 16384
I assoc     : 4
I line length   : 32
I sets      : 128
D size      : 16384
D assoc     : 4
D line length   : 32
D sets      : 128

Hardware    : Feroceon-KW
Revision    : 0000
Serial      : 0000000000000000
@davecheney
Copy link
Contributor Author

Comment 1:

As requested here is the output from 10 passes of 100 runs of test/stack.go
Methodology:
% for j in {1..10} ; do (5g stack.go ; 5l stack.5 ; for i in {1..100} ; do  ./5.out 2>&1
; if [ $? -ne 0 ] ; then echo "FAIL $i" ; else echo "PASS $i" ; fi ; done) >
~/stack.out$j ; done
summary:
axentraserver(~) % grep -c FAIL stack.out* 
stack.out1:1
stack.out10:2
stack.out2:2
stack.out3:2
stack.out4:4
stack.out5:1
stack.out6:3
stack.out7:0
stack.out8:2
stack.out9:0

Attachments:

  1. stack.out1 (10295 bytes)
  2. stack.out2 (19798 bytes)
  3. stack.out3 (19891 bytes)
  4. stack.out4 (39176 bytes)
  5. stack.out5 (5246 bytes)
  6. stack.out6 (29301 bytes)
  7. stack.out7 (792 bytes)
  8. stack.out8 (19891 bytes)
  9. stack.out9 (792 bytes)
  10. stack.out10 (19823 bytes)

@davecheney
Copy link
Contributor Author

Comment 2:

Now, with GOGC=off
methodology: 
% for j in {1..10} ; do (5g stack.go ; 5l stack.5 ; for i in {1..100} ; do GOGC=off
./5.out 2>&1 ; if [ $? -ne 0 ] ; then echo "FAIL $i" ; else echo "PASS $i" ; fi ; done)
> ~/stack-nogc.out$j ; done
summary:
 % grep -c FAIL stack-gogc-off.out*
stack-gogc-off.out1:4
stack-gogc-off.out10:1
stack-gogc-off.out2:1
stack-gogc-off.out3:1
stack-gogc-off.out4:0
stack-gogc-off.out5:3
stack-gogc-off.out6:3
stack-gogc-off.out7:5
stack-gogc-off.out8:2
stack-gogc-off.out9:2

Attachments:

  1. stack-gogc-off.out1 (38804 bytes)
  2. stack-gogc-off.out2 (10295 bytes)
  3. stack-gogc-off.out3 (10388 bytes)
  4. stack-gogc-off.out4 (792 bytes)
  5. stack-gogc-off.out5 (29485 bytes)
  6. stack-gogc-off.out6 (29376 bytes)
  7. stack-gogc-off.out7 (48492 bytes)
  8. stack-gogc-off.out8 (19991 bytes)
  9. stack-gogc-off.out9 (19798 bytes)
  10. stack-gogc-off.out10 (10295 bytes)

@rsc
Copy link
Contributor

rsc commented Apr 29, 2011

Comment 3:

Thanks.  I am happy that the problem persists
with GOGC=off.  That eliminates the garbage collector.
It really looks like somehow the locks are broken.
The only way that can happen is if the Linux cas
implementation is wrong.
What does uname -a print?

@davecheney
Copy link
Contributor Author

Comment 4:

% uname -a
Linux axentraserver.buckminster.mystora.com 2.6.22.18-Netgear #16 Sun Jun 13 19:54:46
EDT 2010 armv5tejl armv5tejl armv5tejl GNU/Linux

@davecheney
Copy link
Contributor Author

Comment 5:

Here is a sample output with GOGC=1. Runtimes were enormous but the results were the same
% grep -c FAIL stack-gogc-1.out1 
2

Attachments:

  1. stack-gogc-1.out1 (19983 bytes)

@davecheney
Copy link
Contributor Author

Comment 6:

http://godashboard.appspot.com/log/eb7c64c4053af444efcaace011b111bcf33d033bd503124313b5648c463d6707
showed that test/chan/doubleselect.go is also a good test case. The stack trace is very
different to deadlocks seen before
throw: all goroutines are asleep - deadlock!
[0x1051cfac] runtime.throw+0x40 /home/dfc/go/src/pkg/runtime/runtime.c:102
        runtime.throw(0xcdf93, 0x1da94)
[0x1051cfbc] nextgandunlock+0x174 /home/dfc/go/src/pkg/runtime/proc.c:393
        nextgandunlock()
[0x1051cfcc] schedule+0xe8 /home/dfc/go/src/pkg/runtime/proc.c:572
        schedule(0x104e7000, 0x24800)
[0x1051cfcc] runtime.mcall+0x34 /home/dfc/go/src/pkg/runtime/arm/asm.s:149
        runtime.mcall(0x104e7000, 0x24800)
[0x1051cfcc] main.sender /home/dfc/go/test/chan/doubleselect.go:21
        main.sender(0x104e7000, 0x24800, 0x104e7d90, 0x0, 0x0, ...)
[0x1051cfcc] main.sender /home/dfc/go/test/chan/doubleselect.go:21
        main.sender(0x104e7000, 0x24800, 0x104e7d90, 0x0, 0x0, ...)
[0x1051cfcc] main.sender /home/dfc/go/test/chan/doubleselect.go:21
        main.sender(0x104e7000, 0x24800, 0x104e7d90, 0x0, 0x0, ...)
I thought that calls though runtime.mcall were on different stacks?
% grep -c FAIL double.out
23

Attachments:

  1. double.out (358686 bytes)

@rsc
Copy link
Contributor

rsc commented Apr 30, 2011

Comment 7:

mcall is what actually does the stack switch so it would appear in both
and apparently confuses the backtrace routines.  I think this is another
send + received missed each other, except that there's a select doing
one or the other op so we can't see the channel pointers on the stack trace.
The real question is why the locks are not providing mutual exclusion.
Either the lock implementation is broken in such a way that it only
fails on ARM, or the Linux CAS implementation (which we jump to at ffff0fc0)
is broken on your kernel (I looked at LXR's copy of that kernel
without the -Netgear
and it seemed okay).  That's all I can think of.
Russ

@davecheney
Copy link
Contributor Author

Comment 8:

I'm trying to get hold of the actual kernel source for the -netgear variant now.

@rsc
Copy link
Contributor

rsc commented Apr 30, 2011

Comment 9:

If you have a working gdb you could break into the program
(or insert a null pointer dereference) and then run
x/20i 0xffff0fc0
and just disassemble it.  That's more reliable than the kernel
sources anyway.
Russ

@davecheney
Copy link
Contributor Author

Comment 10:

Thanks Russ,
(gdb) r
Starting program: /home/dfc/go/src/cmd/gofix/5.out 
osopen.0:12:13: rewrote os.Open with O_CREATE but not O_TRUNC to os.Create
osopen.0:13:2: rewrote os.Open to os.Create with permission not 0666
osopen.0:14:13: rewrote os.Open with O_CREATE but not O_TRUNC to os.Create
osopen.0:15:2: rewrote os.Open to os.Create with permission not 0666
testdata/reflect.encode.go.in:299:7: cannot rewrite reflect type switch case with
non-reflect type interfaceOrPtrValue
Program received signal SIGSTOP, Stopped (signal).
0x0001f408 in ?? ()
(gdb) x/20i 0xffff0fc0
0xffff0fc0: Cannot access memory at address 0xffff0fc0
This should work, right ?

@gopherbot
Copy link
Contributor

Comment 11 by [email protected]:

Same here (an old android G1):
Linux localhost 2.6.27-00392-g8312baf #1 PREEMPT Sat May 23 20:19:35 CDT 2009 armv6l
unknown
Qualcomm MSM7201A (ARM11 core)
(gdbserver attached to some process)
Remote debugging using :1234
0xafe0d3e4 in ?? ()
(gdb) x/20i 0xffff0fc0
0xffff0fc0: Cannot access memory at address 0xffff0fc0
(gdb) x/20i 0xafe0d3e4
0xafe0d3e4: ldmia   sp!, {r4, r7}
0xafe0d3e8: bx  lr
...
Here some related infos:
http://lwn.net/Articles/314561/
http://osdir.com/ml/android-ndk/2010-06/msg00534.html
and the code in bionic:
http://www.google.com/codesearch/p?hl=en#XAzRy8oK4zA/libc/arch-arm/bionic/atomics_arm.S
Martin

@gopherbot
Copy link
Contributor

Comment 12 by [email protected]:

Hmm, it's there:
arm5:
arm-none-linux-gnueabi-objdump -D --target=binary -EL -z -w --architecture=arm mdump.dat
mdump.dat:     file format binary
Disassembly of section .data:
00000000 <.data>:
   0:   e5923000    ldr r3, [r2]
   4:   e0533000    subs    r3, r3, r0
   8:   05821000    streq   r1, [r2]
   c:   e2730000    rsbs    r0, r3, #0
  10:   e12fff1e    bx  lr
  14:   e1a00000    nop         ; (mov r0, r0)
  18:   e1a00000    nop         ; (mov r0, r0)
  1c:   e1a00000    nop         ; (mov r0, r0)
  20:   e59f0008    ldr r0, [pc, #8]    ; 0x30
  24:   e12fff1e    bx  lr
  28:   00000000    andeq   r0, r0, r0
  2c:   00000000    andeq   r0, r0, r0
  30:   000924c0    andeq   r2, r9, r0, asr #9
  34:   00000000    andeq   r0, r0, r0
  38:   00000000    andeq   r0, r0, r0
  3c:   00000003    andeq   r0, r0, r3
arm >=6:
00000000 <.data>:
   0:   e1923f9f    ldrex   r3, [r2]
   4:   e0533000    subs    r3, r3, r0
   8:   01823f91    strexeq r3, r1, [r2]
   c:   03330001    teqeq   r3, #1
  10:   0afffffa    beq 0x0
  14:   e2730000    rsbs    r0, r3, #0
  18:   e12fff1e    bx  lr
  1c:   00000000    andeq   r0, r0, r0
  20:   e59f0008    ldr r0, [pc, #8]    ; 0x30
  24:   e12fff1e    bx  lr
  28:   00000000    andeq   r0, r0, r0
  2c:   00000000    andeq   r0, r0, r0
  30:   000924c0    andeq   r2, r9, r0, asr #9
  34:   00000000    andeq   r0, r0, r0
  38:   00000000    andeq   r0, r0, r0
  3c:   00000003    andeq   r0, r0, r3

@davecheney
Copy link
Contributor Author

Comment 13:

Hi Russ,
You were right, it is the locking. I patched in the old arm5 asm from
http://code.google.com/p/go/source/detail?r=5a47c03cf0 and now cannot cause tests like
gofix or chan/doubleselect.go to deadlock.
I'm going to setup a few tests cases and run them overnight to see if there are any
lingering deadlocks.
Cheers
Dave

@rsc
Copy link
Contributor

rsc commented May 1, 2011

Comment 14:

Okay, that's good to know.  Thanks very much.
Now we have to figure out the right fix.  I don't want to
have to select different assembly for each chip during
the build process.  That's the whole reason the kernel
leaves an implementation at ffff0fc0.  On ARM 5 the
kernel is supposed to treat the cas implementation as
a restartable atomic sequence: if a process gets interrupted
in the middle of the sequence the kernel restarts it at
the beginning, so that the read+compare+write is
guaranteed not to have been interrupted and therefore
ran (effectively) atomically.
We must be doing something different compared to
the C library that is making the kernel not restart the
sequence properly.  Do you know where the restarting
code is?
Russ

@davecheney
Copy link
Contributor Author

Comment 15:

Hi Russ,
With the old arm5 swp code patched back in I was able complete 30 runs of test/run
without a single failure. For reference I've uploaded the source published by Netgear,
from a cursory look this is probably stock 2.6.22 plus some drivers.
https://bitbucket.org/davecheney/2.6.22.18-netgear/src/e87e5d833b12/arch/arm/kernel/entry-armv.S#cl-770
From the comment block the 'restarting' is apparently done by clearing 'z' on a possible
return from kernel space if the address of the faulting instruction (waves hands)
https://bitbucket.org/davecheney/2.6.22.18-netgear/src/e87e5d833b12/arch/arm/kernel/entry-armv.S#cl-346

@rsc
Copy link
Contributor

rsc commented May 2, 2011

Comment 16:

I'm sure that fixes the problem.  I'd just like to figure out how to work
with the Linux kernel implementation if at all possible, because it means
we don't have to build binaries targeted to specific processors.
Thank you!  I see the problem.  We are relying on the compare-and-swap
doing a real compare-and-swap but in fact this implementation can return
a false negative.  This is fixed in later versions of the kernel (2.6.24+) after
this change:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b49c0f24cf6744a3f4fd09289fe7cade349dead5
As that change description says, most uses are in a loop so it doesn't matter.
But there is one - in chan.c - that depends on the compare and swap
never returning a false negative.
Please try 4436072.
Russ

@davecheney
Copy link
Contributor Author

Comment 17:

Awesome, thanks Russ. I'll try this now and report back in a few hours.

@davecheney
Copy link
Contributor Author

Comment 18:

100 runs of test/doubleselect.go passed. So this is definitely an improvement.
fyi. I tried the test code in the linked change and this kernel does fail spuriously, so
this would look to be the solution.

@rsc
Copy link
Contributor

rsc commented May 2, 2011

Comment 19:

This issue was closed by revision 540feaa.

Status changed to Fixed.

@rsc
Copy link
Contributor

rsc commented May 3, 2011

Comment 20:

This issue was closed by revision 665c4ab05ffd.

rsc added a commit that referenced this issue May 11, 2015
««« CL 4436072 / e280d98747be
runtime, sync/atomic: fix arm cas

Works around bug in kernel implementation on old ARM5 kernels.
Bug was fixed on 26 Nov 2007 (between 2.6.23 and 2.6.24) but
old kernels persist.

Fixes #1750.

R=dfc, golang-dev
CC=golang-dev
https://golang.org/cl/4436072
»»»

TBR=adg
CC=golang-dev
https://golang.org/cl/4452064
@golang golang locked and limited conversation to collaborators Jun 24, 2016
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants