Skip to content

mark - world not stopped while calling C functions #785

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
alberts opened this issue May 17, 2010 · 15 comments
Closed

mark - world not stopped while calling C functions #785

alberts opened this issue May 17, 2010 · 15 comments

Comments

@alberts
Copy link
Contributor

alberts commented May 17, 2010

What steps will reproduce the problem?

We have a program with a few goroutines that are all calling some C 
functions via Cgo. After running for a few minutes, the program dies with:

throw: mark - world not stopped

Right now the problem is very hard to reproduce. It seems very timing 
dependent.

The goroutines are typically sitting in chanrecv->gosched, cgocall-
>entersyscall and cgocall->exitsyscall and Go code calling runtime.mal-
>mal->mallocgc->gosched.

The throw is triggered because gp != g. Both gp and g have valid values 
pointing to goroutines created in newproc1.

Right now I would suspect the some of the bookkeeping dealing with 
goroutines and Cgo calls are assigning the wrong state to some of the 
goroutines.

What is the expected output? What do you see instead?

A typical stack:

panic PC=0x2b0ef4f41cd8
throw+0x3e /home/alberts/go/src/pkg/runtime/runtime.c:73
        throw(0xffffffff, 0x46a2b6)
mark+0x12a /home/alberts/go/src/pkg/runtime/mgc0.c:153
        mark()
gc+0x1ff /home/alberts/go/src/pkg/runtime/mgc0.c:320
        gc(0x436410, 0x2b0e00000020)
mallocgc+0x1ca /home/alberts/go/src/pkg/runtime/malloc.c:95
        mallocgc(0x2aaa00000000, 0x0, 0x0, 0x0, 0x2b0ef4f41db8, ...)
mal+0x36 /home/alberts/go/src/pkg/runtime/malloc.c:236
        mal(0x20, 0x100000000)
runtime.mal+0x1b /home/alberts/go/src/pkg/runtime/runtime1.c:3
        runtime.mal(0x20, 0x280200000006, 0x4162e3, 0x2aaa00000020)
<our go functions>
goexit /home/alberts/go/src/pkg/runtime/proc.c:145
        goexit()

goroutine 5 [3]:
runtime.entersyscall+0x50 /home/alberts/go/src/pkg/runtime/proc.c:563
        runtime.entersyscall()
cgocall+0x66 /home/alberts/go/src/pkg/runtime/cgocall.c:38
        cgocall(0x2b0ef35dc430, 0x2b0ef4e8b1d8, 0x1bfaf12c8, 0x40647f)
<our go functions>
goexit /home/alberts/go/src/pkg/runtime/proc.c:145
        goexit()

goroutine 4 [1]:
gosched+0x4e /home/alberts/go/src/pkg/runtime/proc.c:542
        gosched()
mallocgc+0x322 /home/alberts/go/src/pkg/runtime/malloc.c:34
        mallocgc(0x2b0ef4f3b8a0, 0x0, 0xaf021780763dbd00, 0x2000000000, 
0x432a6e, ...)
mal+0x36 /home/alberts/go/src/pkg/runtime/malloc.c:236
        mal(0x20, 0x100000000)
runtime.mal+0x1b /home/alberts/go/src/pkg/runtime/runtime1.c:3
        runtime.mal(0x20, 0x280200000006, 0x4162e3, 0x2aaa00000020)
<our go functions>
goexit /home/alberts/go/src/pkg/runtime/proc.c:145
        goexit()

goroutine 3 [1]:
gosched+0x4e /home/alberts/go/src/pkg/runtime/proc.c:542
        gosched()
semsleep2+0x16 /home/alberts/go/src/pkg/runtime/sema.c:94
        semsleep2()
semacquire+0x9c /home/alberts/go/src/pkg/runtime/sema.c:123
        semacquire(0x2b0ef4e8c008, 0x2b0ef4e8c008)
gc+0xab /home/alberts/go/src/pkg/runtime/mgc0.c:313
        gc(0x436410, 0x2b0e00000020)
mallocgc+0x1ca /home/alberts/go/src/pkg/runtime/malloc.c:95
        mallocgc(0x2aaa00000000, 0x0, 0x0, 0x0, 0x2b0ef4e8c0b8, ...)
mal+0x36 /home/alberts/go/src/pkg/runtime/malloc.c:236
        mal(0x20, 0x100000000)
runtime.mal+0x1b /home/alberts/go/src/pkg/runtime/runtime1.c:3
        runtime.mal(0x20, 0x280200000006, 0x4162e3, 0x2b0e00000020)
<our go functions>
goexit /home/alberts/go/src/pkg/runtime/proc.c:145
        goexit()

goroutine 2 [4]:
gosched+0x4e /home/alberts/go/src/pkg/runtime/proc.c:542
        gosched()
chanrecv+0x3e6 /home/alberts/go/src/pkg/runtime/chan.c:330
        chanrecv(0x2b0ef4ea6a10, 0x2b0ef4f3b960, 0x5, 0x2b0ef4ea8000, 
0x2b0ef4e8d220, ...)
runtime.chanrecv1+0x41 /home/alberts/go/src/pkg/runtime/chan.c:432
        runtime.chanrecv1(0x2b0ef4ea69c0, 0x2b0ef4e8d268)
<our go functions>
testing.tRunner+0x22 /home/alberts/go/src/pkg/testing/testing.go:132
        testing.tRunner(0x2b0ef4e83360, 0x0, 0x4013d1, 0x2b0ef4e83360)
goexit /home/alberts/go/src/pkg/runtime/proc.c:145
        goexit()

goroutine 1 [4]:
gosched+0x4e /home/alberts/go/src/pkg/runtime/proc.c:542
        gosched()
chanrecv+0x3e6 /home/alberts/go/src/pkg/runtime/chan.c:330
        chanrecv(0x2b0ef4ea6bf0, 0x2b0ef4e83320, 0x2b0e00000010, 
0xa0000000a, 0x8, ...)
runtime.chanrecv1+0x41 /home/alberts/go/src/pkg/runtime/chan.c:432
        runtime.chanrecv1(0x2b0ef4ea6ba0, 0x2b0ef4e89048)
<our go functions
mainstart+0xf /home/alberts/go/src/pkg/runtime/amd64/asm.s:60
        mainstart()
goexit /home/alberts/go/src/pkg/runtime/proc.c:145
        goexit()

What is your $GOOS?  $GOARCH?

GOARCH="amd64"
GOOS="linux"

GOMAXPROCS doesn't seem to matter.

Which revision are you using?  (hg identify)

release.2010-05-04

Please provide any additional information below.
@alberts
Copy link
Contributor Author

alberts commented May 17, 2010

Comment 1:

Can't quite figure out how a goroutine gets its status set to Grecovery other than in 
proc.c:panic(), but it seems to be happening.

@alberts
Copy link
Contributor Author

alberts commented May 18, 2010

Comment 2:

Missed the fall-through in the switch, so the goroutine causing the panic was in fact 
Grunning and not Grecovery. Investigating further.

@alberts
Copy link
Contributor Author

alberts commented May 18, 2010

Comment 3:

I added
printf("gp=%p, status=%d != g=%p, status=%d\n", gp, gp->status, g, g->status);
before the throw in mgc0.c:mark and modified runtime.c to do:
printf("\npanic g=%X PC=%X\n", g, (uint64)(uintptr)&unused);
and proc.c to do:
printf("\ngoroutine %d @ %p [%d]:\n", g->goid, g, g->status);
There seems to be some kind of race condition, because I see:
gp=0x2aea7ab40140, status=2 != g=0x2aea7ab40dc0, status=2
throw: mark - world not stopped
but then by the time the stacktraces are printed, gp's status is Grunnable (1):
panic g=0x2aea7ab40dc0 PC=0x2aea7ab44fa0
throw+0x3e /home/alberts/go/src/pkg/runtime/runtime.c:73
        throw(0xffffffff, 0x46a306)
mark+0x12d /home/alberts/go/src/pkg/runtime/mgc0.c:152
        mark()
gc+0x1ff /home/alberts/go/src/pkg/runtime/mgc0.c:319
        gc(0x436410, 0x2aea00000020)
mallocgc+0x1ca /home/alberts/go/src/pkg/runtime/malloc.c:95
        mallocgc(0x2aea00000000, 0x0, 0x0, 0x0, 0x2aea7ab45090, ...)
mal+0x36 /home/alberts/go/src/pkg/runtime/malloc.c:236
        mal(0x20, 0x100000000)
runtime.mal+0x1b /home/alberts/go/src/pkg/runtime/runtime1.c:7
        runtime.mal(0x20, 0x280200000006, 0x41648f, 0x2aea00000020)
<our go functions>
goexit /home/alberts/go/src/pkg/runtime/proc.c:145
        goexit()
goroutine 4 @ 0x2aea7ab40140 [1]:
gosched+0x4e /home/alberts/go/src/pkg/runtime/proc.c:542
        gosched()
mallocgc+0x322 /home/alberts/go/src/pkg/runtime/malloc.c:34
        mallocgc(0x2aaacac4c580, 0x0, 0xaf021780763dbd00, 0x2000000000, 0x432c25, 
...)
mal+0x36 /home/alberts/go/src/pkg/runtime/malloc.c:236
        mal(0x20, 0x100000000)
runtime.mal+0x1b /home/alberts/go/src/pkg/runtime/runtime1.c:7
        runtime.mal(0x20, 0x280200000006, 0x41648f, 0x2aaa00000020)
<our go functions>
goexit /home/alberts/go/src/pkg/runtime/proc.c:145
        goexit()

@rsc
Copy link
Contributor

rsc commented May 21, 2010

Comment 4:

Owner changed to [email protected].

Status changed to Accepted.

@rsc
Copy link
Contributor

rsc commented May 21, 2010

Comment 5:

Are you calling back into Go from C?

@alberts
Copy link
Contributor Author

alberts commented May 21, 2010

Comment 6:

No, we aren't going any callbacks.
The only special thing I can think of is that the time spent in the C function calls is 
very variable, depending on the data we are processing, which might introduce some 
interesting orderings in the completion of the Go routines.
I've been trying to get something you can reproduce by replacing our C code with C code 
that does random sleeps, but no luck yet.

@rsc
Copy link
Contributor

rsc commented May 21, 2010

Comment 7:

Don't spend too much time trying to reproduce just yet.
David Roundy has a reproducible case that smells like
something similar even though the crash is not exactly
the same.  Maybe the fix for his bug (not fixed yet) will fix yours.

@alberts
Copy link
Contributor Author

alberts commented May 21, 2010

Comment 8:

While I was investigating this bug, I was wondering if some state in the scheduler 
was perhaps getting corrupted.
I found this in runtime.h:
every C file linked into a Go program must include runtime.h
so that the C compiler knows to avoid other uses of these registers.
the Go compilers know to avoid them.
Is this going to be a problem when using Cgo to call already-compiled C code in third 
party libraries?

@alberts
Copy link
Contributor Author

alberts commented Jun 14, 2010

Comment 9:

Maybe this fix is also needed for amd64?
http://code.google.com/p/go/source/detail?r=c5e89a52d8a8ace22a5f1d382d597f3dd8700dbe#

@rsc
Copy link
Contributor

rsc commented Jun 25, 2010

Comment 10:

That fix shouldn't be necessary, because on amd64 we keep the per-thread state
in register (R14 and R15) instead of in memory.  Restoring the registers on the way
out of the signal handler will restore the state.
However, if your program is getting signals while the C code is running, that
would explain the bad behavior.  The Go runtime assumes its own private data
is in R14 and R15, but that's not true when calling into C.  Is your code taking
signals while running C code?

@alberts
Copy link
Contributor Author

alberts commented Jun 28, 2010

Comment 11:

I don't think the code is receiving any signals while in C code. Could it be that the C
code is overwriting those registers in another way?

@nsf
Copy link

nsf commented Jun 29, 2010

Comment 12:

I think this bug is related to this one: https://golang.org/issue/886
Because the fix I've introduced in issue #886 also changes behaviour of cgo calls. And
now I look into "runtime/cgocall.c", it appears that cgo uses the same locking mechanics
as runtime.LockOSThread(), therefore it is very likely that this is exactly the same bug.
Short bug description: more than one running thread when GC is active, resulting in
shared usage of thread local memory cache (aka mcache), which is not thread safe. A race
condition.

@rsc
Copy link
Contributor

rsc commented Jun 30, 2010

Comment 13:

nsf, thanks for pointing out the connection to this bug.
I think that the fix to issue #886 may well have fixed this bug too.
Albert, can you try updating to tip and see whether you still
see the problem in your programs?
Thanks.
Russ

@alberts
Copy link
Contributor Author

alberts commented Jul 1, 2010

Comment 14:

Looks good. The program that crashed within a few seconds to minutes has been running
for hours. I think this bug has been fixed.

@rsc
Copy link
Contributor

rsc commented Jul 2, 2010

Comment 15:

Great.  Thanks again, nsf.

Status changed to Fixed.

@golang golang locked and limited conversation to collaborators Jun 24, 2016
@rsc rsc removed their assignment Jun 22, 2022
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants