runtime: crash with signal 0xb near runtime.sweepone #11027

Closed
rhysh opened this issue Jun 1, 2015 · 10 comments

@rhysh
Contributor

rhysh commented Jun 1, 2015

$ go version
go version devel +8cd191b Sat May 30 12:21:56 2015 +0000 linux/amd64
$ uname -a | awk '$2="host"'
Linux host 3.13.0-52-generic #86~precise1-Ubuntu SMP Tue May 5 18:08:21 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

I have a process that receives data over a few hundred concurrent TCP connections and writes them to files. It's been crashing on recent versions of tip (it was stable on 1.4.1).
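
The program is roughly this shape (a minimal sketch with placeholder names and port, not the actual code):

package main

import (
	"fmt"
	"io"
	"net"
	"os"
)

func main() {
	ln, err := net.Listen("tcp", ":9000")
	if err != nil {
		panic(err)
	}
	for id := 0; ; id++ {
		conn, err := ln.Accept()
		if err != nil {
			panic(err)
		}
		go func(id int, c net.Conn) {
			defer c.Close()
			f, err := os.Create(fmt.Sprintf("conn-%d.dat", id))
			if err != nil {
				return
			}
			defer f.Close()
			io.Copy(f, c) // stream each connection straight to its own file
		}(id, conn)
	}
}

The crash output: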

fatal error: unexpected signal during runtime execution
[signal 0xb code=0x80 addr=0x0 pc=0x4246fd]

runtime stack:
runtime.throw(0x8ef170, 0x2a)
    /usr/local/go/src/runtime/panic.go:527 +0x96
runtime.sigpanic()
    /usr/local/go/src/runtime/sigpanic_unix.go:12 +0x5d
runtime.mSpan_Sweep(0x7f0550633990, 0x8ec450008ec00, 0xc200002401)
    /usr/local/go/src/runtime/mgcsweep.go:182 +0x1bd
runtime.sweepone(0x1)
    /usr/local/go/src/runtime/mgcsweep.go:97 +0x161
runtime.gosweepone.func1()
    /usr/local/go/src/runtime/mgcsweep.go:109 +0x28
runtime.systemstack(0xc2094bdf08)
    /usr/local/go/src/runtime/asm_amd64.s:278 +0xb1
runtime.gosweepone(0xa8dd18)
    /usr/local/go/src/runtime/mgcsweep.go:110 +0x44
runtime.mCentral_CacheSpan(0xa8f020, 0x7f0550acb570)
    /usr/local/go/src/runtime/mcentral.go:43 +0xab
runtime.mCache_Refill(0x7f0550ad31c0, 0xe, 0x7f0550acb570)
    /usr/local/go/src/runtime/mcache.go:118 +0xd5
runtime.mallocgc.func2()
    /usr/local/go/src/runtime/malloc.go:608 +0x32
runtime.systemstack(0xc208016000)
    /usr/local/go/src/runtime/asm_amd64.s:262 +0x7c
runtime.mstart()
    /usr/local/go/src/runtime/proc1.go:656

goroutine 78 [running]:
runtime.systemstack_switch()
    /usr/local/go/src/runtime/asm_amd64.s:216 fp=0xc2086dbca8 sp=0xc2086dbca0
runtime.mallocgc(0xd0, 0x815da0, 0x0, 0xc2086dbda8)
    /usr/local/go/src/runtime/malloc.go:609 +0x7b9 fp=0xc2086dbd78 sp=0xc2086dbca8
runtime.newobject(0x815da0, 0x7f055091bfb0)
    /usr/local/go/src/runtime/malloc.go:731 +0x49 fp=0xc2086dbda0 sp=0xc2086dbd78
redacted(0xc210331080, 0x7e, 0x80, 0x7f055091bfb0, 0xc2080bec90, 0x0, 0x0)
    /redacted.go:314 +0xf6 fp=0xc2086dbe18 sp=0xc2086dbda0
redacted(0xc210331080, 0x7e, 0x80, 0x7f055091bfb0, 0xc2080bec90, 0x0, 0x0)
    /redacted.go:300 +0x70 fp=0xc2086dbe58 sp=0xc2086dbe18
redacted(0xc20800f5f0, 0xc2080c0180, 0xc208010420, 0xc2080fc050)
    /redacted.go:147 +0x128 fp=0xc2086dbf90 sp=0xc2086dbe58
redacted(0xc2086a2000, 0xc20800f5f0, 0xc2080c0180, 0xc208010420, 0xc2080fc050)
    /redacted.go:111 +0x6d fp=0xc2086dbfb8 sp=0xc2086dbf90
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1670 +0x1 fp=0xc2086dbfc0 sp=0xc2086dbfb8
created by redacted
    /redacted.go:112 +0x37f

goroutine 1 [chan receive, 102 minutes]:
main.main()
    /redacted.go:230 +0xe7b

goroutine 5 [chan send]:
redacted(0xc20809e000)
    /redacted.go:199 +0x1f0
created by redacted.init.1
    /redacted.go:184 +0x6e

[snip]
ianlancetaylor added this to the Go1.5 milestone Jun 1, 2015
@ianlancetaylor
Contributor

CC @RLH @aclements

It would be very helpful to have a reproducible test case.

@aclements
Member

If it's reliable (or fairly reliable), but you can't reduce it to a test case you can share, it would be helpful if you could bisect it to a bad commit.

Another potentially useful thing to do would be to run with GODEBUG=efence=1, though that perturbs the memory layout enough that the crash conditions may not occur.
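
Concretely, that just means launching the binary with the variable set, for example (binary name is a placeholder):

$ GODEBUG=efence=1 ./yourserver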

@rhysh
Contributor Author

rhysh commented Jun 2, 2015

I've only seen this particular crash once in the past few days of running (#11023 is much more common), so I don't think I'll be able to bisect effectively.

I tried running with GODEBUG=efence=1, but my program crashes with "fatal error: out of memory (stackalloc)" every minute or so and I haven't seen it crash with any other errors of interest (it crashed once with #11023). Is the memory exhaustion expected even on amd64?

I'll see what I can do on a reduced test case.

@aclements
Member

> I tried running with GODEBUG=efence=1, but my program crashes with "fatal error: out of memory (stackalloc)" every minute or so and I haven't seen it crash with any other errors of interest (it crashed once with #11023). Is the memory exhaustion expected even on amd64?

That's a little surprising, but efence does change memory allocation patterns fairly substantially.

> I'll see what I can do on a reduced test case.

Thanks!

@aclements
Member

For reference, we've seen this twice on the build dashboard:

2015-05-07T21:08:29-9626561/nacl-amd64p32
2015-05-14T15:55:42-94934f8/linux-amd64-sid

I've also seen it once on a trybot:

https://storage.googleapis.com/go-build-log/98c65eab/freebsd-amd64-gce101_4e0f3a2f.log (https://go-review.googlesource.com/#/c/10481)

@gopherbot
Contributor

CL https://golang.org/cl/10714 mentions this issue.

@gopherbot
Contributor

CL https://golang.org/cl/10713 mentions this issue.

@gopherbot
Contributor

CL https://golang.org/cl/10791 mentions this issue.

methane pushed a commit to methane/go that referenced this issue Jun 6, 2015
Issues golang#10240, golang#10541, golang#10941, golang#11023, golang#11027 and possibly others are
indicating memory corruption in the runtime. One of the easiest places
to both get corruption and detect it is in the allocator's free lists
since they appear throughout memory and follow strict invariants. This
commit adds a check when sweeping a span that its free list is sane
and, if not, it prints the corrupted free list and panics. Hopefully
this will help us collect more information on these failures.

Change-Id: I6d417bcaeedf654943a5e068bd76b58bb02d4a64
aclements added a commit that referenced this issue Jun 7, 2015
Stack barriers assume that writes through pointers to frames above the
current frame will get write barriers, and hence these frames do not
need to be re-scanned to pick up these changes. For normal writes,
this is true. However, there are places in the runtime that use
typedmemmove to potentially write through pointers to higher frames
(such as mapassign1). Currently, typedmemmove does not execute write
barriers if the destination is on the stack. If there's a stack
barrier between the current frame and the frame being modified with
typedmemmove, and the stack barrier is not otherwise hit, it's
possible that the garbage collector will never see the updated pointer
and incorrectly reclaim the object.

Fix this by making heapBitsBulkBarrier (which lies behind typedmemmove
and its variants) detect when the destination is in the stack and
unwind stack barriers up to the point, forcing mark termination to
later rescan the affected frame and collect these pointers.

Fixes #11084. Might be related to #10240, #10541, #10941, #11023,
 #11027 and possibly others.

Change-Id: I323d6cd0f1d29fa01f8fc946f4b90e04ef210efd
Reviewed-on: https://go-review.googlesource.com/10791
Reviewed-by: Russ Cox <[email protected]>
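
For illustration only (hypothetical code, nothing from the runtime itself), the pattern that commit describes is a bulk copy through a pointer into a frame higher up the stack; when the copied type contains pointers, the compiler may lower such an assignment to typedmemmove:

package main

type big struct {
	p   *int
	pad [16]uintptr
}

//go:noinline
func fill(dst *big, p *int) {
	// Bulk write through a pointer into the caller's frame. If this copy
	// goes through typedmemmove and skips the write barrier because the
	// destination is a stack address, a stack barrier between fill and
	// main can hide the new pointer from the garbage collector.
	*dst = big{p: p}
}

func main() {
	var b big // lives in main's frame, above fill's
	x := 42
	fill(&b, &x)
	_ = b.p
}
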
aclements added a commit that referenced this issue Jun 16, 2015
Issues #10240, #10541, #10941, #11023, #11027 and possibly others are
indicating memory corruption in the runtime. One of the easiest places
to both get corruption and detect it is in the allocator's free lists
since they appear throughout memory and follow strict invariants. This
commit adds a check when sweeping a span that its free list is sane
and, if not, it prints the corrupted free list and panics. Hopefully
this will help us collect more information on these failures.

Change-Id: I6d417bcaeedf654943a5e068bd76b58bb02d4a64
Reviewed-on: https://go-review.googlesource.com/10713
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Russ Cox <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
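
The shape of that check, illustratively (hypothetical code and names, not the actual mgcsweep.go change): walk the span's singly-linked free list and verify that every link points back into the span.

package main

import (
	"fmt"
	"unsafe"
)

// checkFreeList walks a span's singly-linked free list -- each free slot
// stores the address of the next free slot in its first word -- and
// verifies that every link stays inside [spanStart, spanEnd).
func checkFreeList(spanStart, spanEnd, head uintptr) error {
	for p := head; p != 0; p = *(*uintptr)(unsafe.Pointer(p)) {
		if p < spanStart || p >= spanEnd {
			return fmt.Errorf("free list pointer %#x escapes span [%#x, %#x)", p, spanStart, spanEnd)
		}
	}
	return nil
}

func main() {
	// Fake "span" of four 16-byte slots with slots 0 and 2 on the free list.
	span := make([]byte, 64)
	base := uintptr(unsafe.Pointer(&span[0]))
	*(*uintptr)(unsafe.Pointer(base)) = base + 32   // slot 0 -> slot 2
	*(*uintptr)(unsafe.Pointer(base + 32)) = 0      // slot 2 ends the list
	fmt.Println(checkFreeList(base, base+64, base)) // <nil>
}
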
aclements added a commit that referenced this issue Jun 16, 2015
Currently, when shrinkstack computes whether the halved stack
allocation will have enough room for the stack, it accounts for the
stack space that's actively in use but fails to leave extra room for
the stack guard space. As a result, *if* the minimum stack size is
small enough or the guard large enough, it may shrink the stack and
leave less than enough room to run nosplit functions. If the next
function called after the stack shrink is a nosplit function, it may
overflow the stack without noticing and overwrite non-stack memory.

We don't think this is happening under normal conditions right now.
The minimum stack allocation is 2K and the guard is 640 bytes. The
"worst case" stack shrink is from 4K (4048 bytes after stack barrier
array reservation) to 2K (2016 bytes after stack barrier array
reservation), which means the largest "used" size that will qualify
for shrinking is 4048/4 - 8 = 1004 bytes. After copying, that leaves
2016 - 1004 = 1012 bytes of available stack, which is significantly
more than the guard space.

If we were to reduce the minimum stack size to 1K or raise the guard
space above 1012 bytes, the logic in shrinkstack would no longer leave
enough space.

It's also possible to trigger this problem by setting
firstStackBarrierOffset to 0, which puts stack barriers in a debug
mode that steals away *half* of the stack for the stack barrier array
reservation. Then, the largest "used" size that qualifies for
shrinking is (4096/2)/4 - 8 = 504 bytes. After copying, that leaves
(2096/2) - 504 = 8 bytes of available stack; much less than the
required guard space. This causes failures like those in issue #11027
because func gc() shrinks its own stack and then immediately calls
casgstatus (a nosplit function), which overflows the stack and
overwrites a free list pointer in the neighboring span. However, since
this seems to require the special debug mode, we don't think it's
responsible for issue #11027.

To forestall all of these subtle issues, this commit modifies
shrinkstack to correctly account for the guard space when considering
whether to halve the stack allocation.

Change-Id: I7312584addc63b5bfe55cc384a1012f6181f1b9d
Reviewed-on: https://go-review.googlesource.com/10714
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Russ Cox <[email protected]>
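
In rough terms the fix tightens the shrink decision so the guard space is counted alongside the in-use stack, along these lines (hypothetical names, not the actual shrinkstack code):

package main

import "fmt"

// okToShrink reports whether halving a stack allocation still leaves room
// for both the in-use portion and the nosplit guard space. Before the fix,
// the guard term was effectively missing from this comparison.
func okToShrink(used, guard, alloc uintptr) bool {
	return used+guard < alloc/4
}

func main() {
	// Numbers from the commit message: a 4048-byte usable allocation, a
	// 640-byte guard, and 1004 bytes in use.
	fmt.Println(okToShrink(1004, 640, 4048)) // false: 1004+640 is not below 4048/4
}
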
@aclements
Member

Hi @rhysh. We've fixed several memory corruption and lost write barrier issues in the runtime over the past few weeks. As with #11023, please try to reproduce the problem with current master and reopen this issue if it's still happening. Thanks!

@rhysh
Contributor Author

rhysh commented Jun 23, 2015

Hi @aclements - I ran my app with 8fa1a69 (from 17 Jun 2015) for a few days and did not observe any crashes. It looks like this is resolved. Thanks!

golang locked and limited conversation to collaborators Jun 25, 2016