mark - world not stopped while calling C functions #785

alberts · 2010-05-17T14:00:12Z

What steps will reproduce the problem?

We have a program with a few goroutines that are all calling some C 
functions via Cgo. After running for a few minutes, the program dies with:

throw: mark - world not stopped

Right now the problem is very hard to reproduce. It seems very timing 
dependent.

The goroutines are typically sitting in chanrecv->gosched, cgocall-
>entersyscall and cgocall->exitsyscall and Go code calling runtime.mal-
>mal->mallocgc->gosched.

The throw is triggered because gp != g. Both gp and g have valid values 
pointing to goroutines created in newproc1.

Right now I would suspect the some of the bookkeeping dealing with 
goroutines and Cgo calls are assigning the wrong state to some of the 
goroutines.

What is the expected output? What do you see instead?

A typical stack:

panic PC=0x2b0ef4f41cd8
throw+0x3e /home/alberts/go/src/pkg/runtime/runtime.c:73
        throw(0xffffffff, 0x46a2b6)
mark+0x12a /home/alberts/go/src/pkg/runtime/mgc0.c:153
        mark()
gc+0x1ff /home/alberts/go/src/pkg/runtime/mgc0.c:320
        gc(0x436410, 0x2b0e00000020)
mallocgc+0x1ca /home/alberts/go/src/pkg/runtime/malloc.c:95
        mallocgc(0x2aaa00000000, 0x0, 0x0, 0x0, 0x2b0ef4f41db8, ...)
mal+0x36 /home/alberts/go/src/pkg/runtime/malloc.c:236
        mal(0x20, 0x100000000)
runtime.mal+0x1b /home/alberts/go/src/pkg/runtime/runtime1.c:3
        runtime.mal(0x20, 0x280200000006, 0x4162e3, 0x2aaa00000020)
<our go functions>
goexit /home/alberts/go/src/pkg/runtime/proc.c:145
        goexit()

goroutine 5 [3]:
runtime.entersyscall+0x50 /home/alberts/go/src/pkg/runtime/proc.c:563
        runtime.entersyscall()
cgocall+0x66 /home/alberts/go/src/pkg/runtime/cgocall.c:38
        cgocall(0x2b0ef35dc430, 0x2b0ef4e8b1d8, 0x1bfaf12c8, 0x40647f)
<our go functions>
goexit /home/alberts/go/src/pkg/runtime/proc.c:145
        goexit()

goroutine 4 [1]:
gosched+0x4e /home/alberts/go/src/pkg/runtime/proc.c:542
        gosched()
mallocgc+0x322 /home/alberts/go/src/pkg/runtime/malloc.c:34
        mallocgc(0x2b0ef4f3b8a0, 0x0, 0xaf021780763dbd00, 0x2000000000, 
0x432a6e, ...)
mal+0x36 /home/alberts/go/src/pkg/runtime/malloc.c:236
        mal(0x20, 0x100000000)
runtime.mal+0x1b /home/alberts/go/src/pkg/runtime/runtime1.c:3
        runtime.mal(0x20, 0x280200000006, 0x4162e3, 0x2aaa00000020)
<our go functions>
goexit /home/alberts/go/src/pkg/runtime/proc.c:145
        goexit()

goroutine 3 [1]:
gosched+0x4e /home/alberts/go/src/pkg/runtime/proc.c:542
        gosched()
semsleep2+0x16 /home/alberts/go/src/pkg/runtime/sema.c:94
        semsleep2()
semacquire+0x9c /home/alberts/go/src/pkg/runtime/sema.c:123
        semacquire(0x2b0ef4e8c008, 0x2b0ef4e8c008)
gc+0xab /home/alberts/go/src/pkg/runtime/mgc0.c:313
        gc(0x436410, 0x2b0e00000020)
mallocgc+0x1ca /home/alberts/go/src/pkg/runtime/malloc.c:95
        mallocgc(0x2aaa00000000, 0x0, 0x0, 0x0, 0x2b0ef4e8c0b8, ...)
mal+0x36 /home/alberts/go/src/pkg/runtime/malloc.c:236
        mal(0x20, 0x100000000)
runtime.mal+0x1b /home/alberts/go/src/pkg/runtime/runtime1.c:3
        runtime.mal(0x20, 0x280200000006, 0x4162e3, 0x2b0e00000020)
<our go functions>
goexit /home/alberts/go/src/pkg/runtime/proc.c:145
        goexit()

goroutine 2 [4]:
gosched+0x4e /home/alberts/go/src/pkg/runtime/proc.c:542
        gosched()
chanrecv+0x3e6 /home/alberts/go/src/pkg/runtime/chan.c:330
        chanrecv(0x2b0ef4ea6a10, 0x2b0ef4f3b960, 0x5, 0x2b0ef4ea8000, 
0x2b0ef4e8d220, ...)
runtime.chanrecv1+0x41 /home/alberts/go/src/pkg/runtime/chan.c:432
        runtime.chanrecv1(0x2b0ef4ea69c0, 0x2b0ef4e8d268)
<our go functions>
testing.tRunner+0x22 /home/alberts/go/src/pkg/testing/testing.go:132
        testing.tRunner(0x2b0ef4e83360, 0x0, 0x4013d1, 0x2b0ef4e83360)
goexit /home/alberts/go/src/pkg/runtime/proc.c:145
        goexit()

goroutine 1 [4]:
gosched+0x4e /home/alberts/go/src/pkg/runtime/proc.c:542
        gosched()
chanrecv+0x3e6 /home/alberts/go/src/pkg/runtime/chan.c:330
        chanrecv(0x2b0ef4ea6bf0, 0x2b0ef4e83320, 0x2b0e00000010, 
0xa0000000a, 0x8, ...)
runtime.chanrecv1+0x41 /home/alberts/go/src/pkg/runtime/chan.c:432
        runtime.chanrecv1(0x2b0ef4ea6ba0, 0x2b0ef4e89048)
<our go functions
mainstart+0xf /home/alberts/go/src/pkg/runtime/amd64/asm.s:60
        mainstart()
goexit /home/alberts/go/src/pkg/runtime/proc.c:145
        goexit()

What is your $GOOS?  $GOARCH?

GOARCH="amd64"
GOOS="linux"

GOMAXPROCS doesn't seem to matter.

Which revision are you using?  (hg identify)

release.2010-05-04

Please provide any additional information below.

alberts · 2010-05-17T16:12:16Z

Comment 1:

Can't quite figure out how a goroutine gets its status set to Grecovery other than in 
proc.c:panic(), but it seems to be happening.

alberts · 2010-05-18T06:45:33Z

Comment 2:

Missed the fall-through in the switch, so the goroutine causing the panic was in fact 
Grunning and not Grecovery. Investigating further.

alberts · 2010-05-18T08:13:47Z

Comment 3:

I added
printf("gp=%p, status=%d != g=%p, status=%d\n", gp, gp->status, g, g->status);
before the throw in mgc0.c:mark and modified runtime.c to do:
printf("\npanic g=%X PC=%X\n", g, (uint64)(uintptr)&unused);
and proc.c to do:
printf("\ngoroutine %d @ %p [%d]:\n", g->goid, g, g->status);
There seems to be some kind of race condition, because I see:
gp=0x2aea7ab40140, status=2 != g=0x2aea7ab40dc0, status=2
throw: mark - world not stopped
but then by the time the stacktraces are printed, gp's status is Grunnable (1):
panic g=0x2aea7ab40dc0 PC=0x2aea7ab44fa0
throw+0x3e /home/alberts/go/src/pkg/runtime/runtime.c:73
        throw(0xffffffff, 0x46a306)
mark+0x12d /home/alberts/go/src/pkg/runtime/mgc0.c:152
        mark()
gc+0x1ff /home/alberts/go/src/pkg/runtime/mgc0.c:319
        gc(0x436410, 0x2aea00000020)
mallocgc+0x1ca /home/alberts/go/src/pkg/runtime/malloc.c:95
        mallocgc(0x2aea00000000, 0x0, 0x0, 0x0, 0x2aea7ab45090, ...)
mal+0x36 /home/alberts/go/src/pkg/runtime/malloc.c:236
        mal(0x20, 0x100000000)
runtime.mal+0x1b /home/alberts/go/src/pkg/runtime/runtime1.c:7
        runtime.mal(0x20, 0x280200000006, 0x41648f, 0x2aea00000020)
<our go functions>
goexit /home/alberts/go/src/pkg/runtime/proc.c:145
        goexit()
goroutine 4 @ 0x2aea7ab40140 [1]:
gosched+0x4e /home/alberts/go/src/pkg/runtime/proc.c:542
        gosched()
mallocgc+0x322 /home/alberts/go/src/pkg/runtime/malloc.c:34
        mallocgc(0x2aaacac4c580, 0x0, 0xaf021780763dbd00, 0x2000000000, 0x432c25, 
...)
mal+0x36 /home/alberts/go/src/pkg/runtime/malloc.c:236
        mal(0x20, 0x100000000)
runtime.mal+0x1b /home/alberts/go/src/pkg/runtime/runtime1.c:7
        runtime.mal(0x20, 0x280200000006, 0x41648f, 0x2aaa00000020)
<our go functions>
goexit /home/alberts/go/src/pkg/runtime/proc.c:145
        goexit()

rsc · 2010-05-21T01:18:00Z

Comment 4:

Owner changed to [email protected].

Status changed to Accepted.

rsc · 2010-05-21T05:53:37Z

Comment 5:

Are you calling back into Go from C?

alberts · 2010-05-21T06:41:08Z

Comment 6:

No, we aren't going any callbacks.
The only special thing I can think of is that the time spent in the C function calls is 
very variable, depending on the data we are processing, which might introduce some 
interesting orderings in the completion of the Go routines.
I've been trying to get something you can reproduce by replacing our C code with C code 
that does random sleeps, but no luck yet.

rsc · 2010-05-21T06:50:52Z

Comment 7:

Don't spend too much time trying to reproduce just yet.
David Roundy has a reproducible case that smells like
something similar even though the crash is not exactly
the same.  Maybe the fix for his bug (not fixed yet) will fix yours.

alberts · 2010-05-21T07:01:44Z

Comment 8:

While I was investigating this bug, I was wondering if some state in the scheduler 
was perhaps getting corrupted.
I found this in runtime.h:
every C file linked into a Go program must include runtime.h
so that the C compiler knows to avoid other uses of these registers.
the Go compilers know to avoid them.
Is this going to be a problem when using Cgo to call already-compiled C code in third 
party libraries?

alberts · 2010-06-14T08:43:24Z

Comment 9:

Maybe this fix is also needed for amd64?
http://code.google.com/p/go/source/detail?r=c5e89a52d8a8ace22a5f1d382d597f3dd8700dbe#

rsc · 2010-06-25T17:24:38Z

Comment 10:

That fix shouldn't be necessary, because on amd64 we keep the per-thread state
in register (R14 and R15) instead of in memory.  Restoring the registers on the way
out of the signal handler will restore the state.
However, if your program is getting signals while the C code is running, that
would explain the bad behavior.  The Go runtime assumes its own private data
is in R14 and R15, but that's not true when calling into C.  Is your code taking
signals while running C code?

alberts · 2010-06-28T16:38:18Z

Comment 11:

I don't think the code is receiving any signals while in C code. Could it be that the C
code is overwriting those registers in another way?

nsf · 2010-06-29T14:53:26Z

Comment 12:

I think this bug is related to this one: https://golang.org/issue/886
Because the fix I've introduced in issue #886 also changes behaviour of cgo calls. And
now I look into "runtime/cgocall.c", it appears that cgo uses the same locking mechanics
as runtime.LockOSThread(), therefore it is very likely that this is exactly the same bug.
Short bug description: more than one running thread when GC is active, resulting in
shared usage of thread local memory cache (aka mcache), which is not thread safe. A race
condition.

rsc · 2010-06-30T05:09:54Z

Comment 13:

nsf, thanks for pointing out the connection to this bug.
I think that the fix to issue #886 may well have fixed this bug too.
Albert, can you try updating to tip and see whether you still
see the problem in your programs?
Thanks.
Russ

alberts · 2010-07-01T07:58:21Z

Comment 14:

Looks good. The program that crashed within a few seconds to minutes has been running
for hours. I think this bug has been fixed.

rsc · 2010-07-02T04:52:07Z

Comment 15:

Great.  Thanks again, nsf.

Status changed to Fixed.

alberts added fixed labels Jul 2, 2010

alberts assigned rsc Jul 2, 2010

golang locked and limited conversation to collaborators Jun 24, 2016

gopherbot added the FrozenDueToAge label Jun 24, 2016

rsc removed their assignment Jun 22, 2022

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mark - world not stopped while calling C functions #785

mark - world not stopped while calling C functions #785

alberts commented May 17, 2010

alberts commented May 17, 2010

alberts commented May 18, 2010

alberts commented May 18, 2010

rsc commented May 21, 2010

rsc commented May 21, 2010

alberts commented May 21, 2010

rsc commented May 21, 2010

alberts commented May 21, 2010

alberts commented Jun 14, 2010

rsc commented Jun 25, 2010

alberts commented Jun 28, 2010

nsf commented Jun 29, 2010

rsc commented Jun 30, 2010

alberts commented Jul 1, 2010

rsc commented Jul 2, 2010

mark - world not stopped while calling C functions #785

mark - world not stopped while calling C functions #785

Comments

alberts commented May 17, 2010

alberts commented May 17, 2010

alberts commented May 18, 2010

alberts commented May 18, 2010

rsc commented May 21, 2010

rsc commented May 21, 2010

alberts commented May 21, 2010

rsc commented May 21, 2010

alberts commented May 21, 2010

alberts commented Jun 14, 2010

rsc commented Jun 25, 2010

alberts commented Jun 28, 2010

nsf commented Jun 29, 2010

rsc commented Jun 30, 2010

alberts commented Jul 1, 2010

rsc commented Jul 2, 2010