Skip to content

runtime: invalidptr check triggered on linux/armv5 with cgo #12157

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rsc opened this issue Aug 15, 2015 · 19 comments
Closed

runtime: invalidptr check triggered on linux/armv5 with cgo #12157

rsc opened this issue Aug 15, 2015 · 19 comments

Comments

@rsc
Copy link
Contributor

rsc commented Aug 15, 2015

Split from issue #12138 to separate arm report (this one) from amd64 report (#12158).

Reported by @zeebo (2015-08-13 17:54:46)

I upgraded my system to use go1.5rc1 and let it run for a couple of days. I observed this panic:

runtime:objectstart Span weird: p=0x133a8580 k=0x99d4 s.start=0x13330000 s.limit=0x133b0000 s.state=4
fatal error: objectstart: bad pointer in unexpected span

runtime stack:
runtime.throw(0x9c2bb8, 0x2b)
    /home/jeff/go/src/runtime/panic.go:527 +0x78 fp=0xa07fec34 sp=0xa07fec28
runtime.heapBitsForObject(0x133a8580, 0x0, 0x0, 0x0, 0xa53a6f94)
    /home/jeff/go/src/runtime/mbitmap.go:214 +0x36c fp=0xa07fec58 sp=0xa07fec34
runtime.scanblock(0xb6143b90, 0x4, 0xccf990, 0xa07fecbc)
    /home/jeff/go/src/runtime/mgcmark.go:797 +0xd8 fp=0xa07fec9c sp=0xa07fec58
runtime.markroot(0x10f16000, 0x3)
    /home/jeff/go/src/runtime/mgcmark.go:102 +0x44c fp=0xa07fecec sp=0xa07fec9c
runtime.parfordo(0x10f16000)
    /home/jeff/go/src/runtime/parfor.go:95 +0x134 fp=0xa07fed40 sp=0xa07fecec
runtime.gcscan_m()
    /home/jeff/go/src/runtime/mgcmark.go:36 +0xa8 fp=0xa07fed64 sp=0xa07fed40
runtime.gc.func1()
    /home/jeff/go/src/runtime/mgc.go:996 +0x118 fp=0xa07fed84 sp=0xa07fed64
runtime.systemstack(0x11076000)
    /home/jeff/go/src/runtime/asm_arm.s:242 +0x80 fp=0xa07fed88 sp=0xa07fed84
runtime.mstart()
    /home/jeff/go/src/runtime/proc1.go:674 fp=0xa07fed88 sp=0xa07fed88

Just as last time (#11640) I'm willing to do whatever I can to help diagnose.

Comment by @aclements (2015-08-13 20:34:54)

@zeebo, and chance you're running with -race? We fixed a memory corruption bug after rc1 was branched, but it only affects running in -race mode.

If possible, it would be great if you can run your system with GODEBUG=gccheckmark=1. It slows things down quite a bit, but enables some very useful debug code.

Comment by @RLH (2015-08-13 20:36:40)

Also if you have any of the other go routine stacks that would help.

Comment by @aclements (2015-08-13 21:03:02)

@zeebo, is this also with GOOS=linux GOARCH=arm GOARM=5? If so, are you actually running on an ARMv5 or something newer?

Just some quick diagnosis from the traceback: the bad pointer is the *funcval field of a "specialfinalizer" finalizer record. It looks like a pretty reasonable pointer, but points in to a span that's in state _MSpanDead. There's only one way we're supposed to be able to observe a dead span in the h_spans array: if we free a large span and coalesce it with neighboring spans, only the first and last page in the coalesced range are guaranteed to be in state "free". The pages in between may point to any span, including one in state "dead". Given how far "p" is from "s.start", this seems the most likely way we found the dead span. If this is ARM, there's also always the possibility of a missing memory barrier, since h_spans is read by the garbage collector without locking.

If we in fact freed that span, that means we either somehow created a bogus finalizer record, or we created a fine finalizer record, but failed to mark the funcval in an earlier cycle and the span got collected. One weird thing is that I would expect that funcval to be in a small object span that was only one page and hence could not leave behind "dead" spans in h_spans. This is a long shot, @zeebo, but do you have any finalizers that are closures that access particularly large variables from their enclosing frame? Or, in general, any unusual finalizers?

Comment by @zeebo (2015-08-13 21:08:07)

It is not in race mode. It is an ARMv5 in the sense that no binaries will run on it if I don't specify GOARM=5. Unfortunately, I'm not much of a hardware guy so I don't know what that all entails.

I don't know if we have any unusual finalizers, but it's possible. We use them mostly for freeing resources in our openssl wrapper and some debug checks if we forget to close a file handle or something. I'll do a double check of the code base to see if anything sticks out as really strange.

Here is more of the stack dump. I'm not sure if it's all of it, but this is what I have:

runtime:objectstart Span weird: p=0x133a8580 k=0x99d4 s.start=0x13330000 s.limit=0x133b0000 s.state=4
fatal error: objectstart: bad pointer in unexpected span

runtime stack:
runtime.throw(0x9c2bb8, 0x2b)
    /home/jeff/go/src/runtime/panic.go:527 +0x78 fp=0xa07fec34 sp=0xa07fec28
runtime.heapBitsForObject(0x133a8580, 0x0, 0x0, 0x0, 0xa53a6f94)
    /home/jeff/go/src/runtime/mbitmap.go:214 +0x36c fp=0xa07fec58 sp=0xa07fec34
runtime.scanblock(0xb6143b90, 0x4, 0xccf990, 0xa07fecbc)
    /home/jeff/go/src/runtime/mgcmark.go:797 +0xd8 fp=0xa07fec9c sp=0xa07fec58
runtime.markroot(0x10f16000, 0x3)
    /home/jeff/go/src/runtime/mgcmark.go:102 +0x44c fp=0xa07fecec sp=0xa07fec9c
runtime.parfordo(0x10f16000)
    /home/jeff/go/src/runtime/parfor.go:95 +0x134 fp=0xa07fed40 sp=0xa07fecec
runtime.gcscan_m()
    /home/jeff/go/src/runtime/mgcmark.go:36 +0xa8 fp=0xa07fed64 sp=0xa07fed40
runtime.gc.func1()
    /home/jeff/go/src/runtime/mgc.go:996 +0x118 fp=0xa07fed84 sp=0xa07fed64
runtime.systemstack(0x11076000)
    /home/jeff/go/src/runtime/asm_arm.s:242 +0x80 fp=0xa07fed88 sp=0xa07fed84
runtime.mstart()
    /home/jeff/go/src/runtime/proc1.go:674 fp=0xa07fed88 sp=0xa07fed88

goroutine 50 [garbage collection scan]:
runtime.systemstack_switch()
    /home/jeff/go/src/runtime/asm_arm.s:187 +0x4 fp=0x10f1b558 sp=0x10f1b554
runtime.gc(0x0)
    /home/jeff/go/src/runtime/mgc.go:1005 +0x23c fp=0x10f1b7c4 sp=0x10f1b558
runtime.backgroundgc()
    /home/jeff/go/src/runtime/mgc.go:896 +0x34 fp=0x10f1b7dc sp=0x10f1b7c4
runtime.goexit()
    /home/jeff/go/src/runtime/asm_arm.s:1031 +0x4 fp=0x10f1b7dc sp=0x10f1b7dc
created by runtime.startGC
    /home/jeff/go/src/runtime/mgc.go:869 +0x178

goroutine 1 [chan receive, 3809 minutes]:
runtime.gopark(0xa70968, 0x11cd95b0, 0x94a138, 0xc, 0x17, 0x3)
    /home/jeff/go/src/runtime/proc.go:185 +0x148 fp=0x11cf9d70 sp=0x11cf9d5c
runtime.goparkunlock(0x11cd95b0, 0x94a138, 0xc, 0x17, 0x3)
    /home/jeff/go/src/runtime/proc.go:191 +0x44 fp=0x11cf9d8c sp=0x11cf9d70
runtime.chanrecv(0x783a38, 0x11cd9580, 0x11cf9e64, 0x11df3e01, 0x11df0000)
    /home/jeff/go/src/runtime/chan.go:505 +0x960 fp=0x11cf9df8 sp=0x11cf9d8c
runtime.chanrecv1(0x783a38, 0x11cd9580, 0x11cf9e64)
    /home/jeff/go/src/runtime/chan.go:349 +0x28 fp=0x11cf9e10 sp=0x11cf9df8
sm/flud/FludProcess.RunFludProcess(0xb28f50a0, 0x110440e0, 0x0, 0x0)
    /home/jeff/katamari/go/src/sm/flud/FludProcess/process.go:90 +0x210 fp=0x11cf9e6c sp=0x11cf9e10
main.runNode(0x0, 0x0)
    /home/jeff/katamari/go/src/sm/flud/bin/fludnode/main.go:123 +0x5b0 fp=0x11cf9f48 sp=0x11cf9e6c
main.main()
    /home/jeff/katamari/go/src/sm/flud/bin/fludnode/main.go:141 +0x20c fp=0x11cf9fa4 sp=0x11cf9f48
runtime.main()
    /home/jeff/go/src/runtime/proc.go:111 +0x2b4 fp=0x11cf9fcc sp=0x11cf9fa4
runtime.goexit()
    /home/jeff/go/src/runtime/asm_arm.s:1031 +0x4 fp=0x11cf9fcc sp=0x11cf9fcc

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
    /home/jeff/go/src/runtime/asm_arm.s:1031 +0x4 fp=0x10f30fd8 sp=0x10f30fd8

goroutine 2 [force gc (idle), 25 minutes]:
runtime.gopark(0xa70968, 0xce7358, 0x94b8f8, 0xf, 0x14, 0x1)
    /home/jeff/go/src/runtime/proc.go:185 +0x148 fp=0x10f1c7a8 sp=0x10f1c794
runtime.goparkunlock(0xce7358, 0x94b8f8, 0xf, 0x14, 0x1)
    /home/jeff/go/src/runtime/proc.go:191 +0x44 fp=0x10f1c7c4 sp=0x10f1c7a8
runtime.forcegchelper()
    /home/jeff/go/src/runtime/proc.go:152 +0xa8 fp=0x10f1c7dc sp=0x10f1c7c4
runtime.goexit()
    /home/jeff/go/src/runtime/asm_arm.s:1031 +0x4 fp=0x10f1c7dc sp=0x10f1c7dc
created by runtime.init.4
    /home/jeff/go/src/runtime/proc.go:141 +0x24

goroutine 3 [GC sweep wait]:
runtime.gopark(0xa70968, 0xce7648, 0x945ca8, 0xd, 0x81714, 0x1)
    /home/jeff/go/src/runtime/proc.go:185 +0x148 fp=0x10f1cf9c sp=0x10f1cf88
runtime.goparkunlock(0xce7648, 0x945ca8, 0xd, 0x14, 0x1)
    /home/jeff/go/src/runtime/proc.go:191 +0x44 fp=0x10f1cfb8 sp=0x10f1cf9c
runtime.bgsweep(0x10f12040)
    /home/jeff/go/src/runtime/mgcsweep.go:67 +0x13c fp=0x10f1cfd4 sp=0x10f1cfb8
runtime.goexit()
    /home/jeff/go/src/runtime/asm_arm.s:1031 +0x4 fp=0x10f1cfd4 sp=0x10f1cfd4
created by runtime.gcenable
    /home/jeff/go/src/runtime/mgc.go:205 +0x4c

goroutine 4 [finalizer wait]:
runtime.gopark(0xa70968, 0xcfd5b0, 0x94b6b8, 0xe, 0x14, 0x1)
    /home/jeff/go/src/runtime/proc.go:185 +0x148 fp=0x10f1d788 sp=0x10f1d774
runtime.goparkunlock(0xcfd5b0, 0x94b6b8, 0xe, 0x14, 0x1)
    /home/jeff/go/src/runtime/proc.go:191 +0x44 fp=0x10f1d7a4 sp=0x10f1d788
runtime.runfinq()
    /home/jeff/go/src/runtime/mfinal.go:154 +0xa4 fp=0x10f1d7dc sp=0x10f1d7a4
runtime.goexit()
    /home/jeff/go/src/runtime/asm_arm.s:1031 +0x4 fp=0x10f1d7dc sp=0x10f1d7dc
created by runtime.createfing
    /home/jeff/go/src/runtime/mfinal.go:135 +0x5c

goroutine 5 [syscall, 3810 minutes]:
runtime.notetsleepg(0xcfd940, 0xffffffff, 0xffffffff, 0x1)
    /home/jeff/go/src/runtime/lock_futex.go:202 +0x4c fp=0x10f1df98 sp=0x10f1df80
runtime.signal_recv(0x0)
    /home/jeff/go/src/runtime/sigqueue.go:111 +0x190 fp=0x10f1dfb8 sp=0x10f1df98
os/signal.loop()
    /home/jeff/go/src/os/signal/signal_unix.go:22 +0x14 fp=0x10f1dfdc sp=0x10f1dfb8
runtime.goexit()
    /home/jeff/go/src/runtime/asm_arm.s:1031 +0x4 fp=0x10f1dfdc sp=0x10f1dfdc
created by os/signal.init.1
    /home/jeff/go/src/os/signal/signal_unix.go:28 +0x30

goroutine 34 [select, 3810 minutes]:
runtime.gopark(0xa70984, 0x10f19700, 0x8ff6a0, 0x6, 0xcfd518, 0x2)
    /home/jeff/go/src/runtime/proc.go:185 +0x148 fp=0x10f195dc sp=0x10f195c8
runtime.selectgoImpl(0x10f19700, 0x0, 0xc)
    /home/jeff/go/src/runtime/select.go:392 +0x14d4 fp=0x10f196b8 sp=0x10f195dc
runtime.selectgo(0x10f19700)
    /home/jeff/go/src/runtime/select.go:212 +0x10 fp=0x10f196c8 sp=0x10f196b8
sm/discovery.respondUntilError(0xb28b5020, 0x1102c020, 0x11032018, 0x3039, 0xb6a41200, 0x0, 0x0)
    /home/jeff/katamari/go/src/sm/discovery/responder.go:93 +0x288 fp=0x10f19750 sp=0x10f196c8
sm/discovery.RespondBytesOnPort(0xb28b5020, 0x1102c020, 0x11032018, 0x3039)
    /home/jeff/katamari/go/src/sm/discovery/responder.go:63 +0x78 fp=0x10f19798 sp=0x10f19750
sm/discovery.RespondOnPort(0xb28b5020, 0x1102c020, 0x11032010, 0x3039)
    /home/jeff/katamari/go/src/sm/discovery/responder.go:51 +0x40 fp=0x10f197ac sp=0x10f19798
sm/flud/discovery.(*ResponderProtocol).Run.func1(0x10f0ab10, 0xb28b5020, 0x1102c020)
    /home/jeff/katamari/go/src/sm/flud/discovery/discovery.go:67 +0xe8 fp=0x10f197cc sp=0x10f197ac
runtime.goexit()
    /home/jeff/go/src/runtime/asm_arm.s:1031 +0x4 fp=0x10f197cc sp=0x10f197cc
created by sm/flud/discovery.(*ResponderProtocol).Run
    /home/jeff/katamari/go/src/sm/flud/discovery/discovery.go:72 +0x160

goroutine 11 [select, 3810 minutes]:
runtime.gopark(0xa70984, 0x10f1ef80, 0x8ff6a0, 0x6, 0x51818, 0x2)
    /home/jeff/go/src/runtime/proc.go:185 +0x148 fp=0x10f1ee70 sp=0x10f1ee5c
runtime.selectgoImpl(0x10f1ef80, 0x0, 0xc)
    /home/jeff/go/src/runtime/select.go:392 +0x14d4 fp=0x10f1ef4c sp=0x10f1ee70
runtime.selectgo(0x10f1ef80)
    /home/jeff/go/src/runtime/select.go:212 +0x10 fp=0x10f1ef5c sp=0x10f1ef4c
sm/space/utils/godump.platformListenAndDump()
    /home/jeff/katamari/go/src/sm/space/utils/godump/godump_others.go:24 +0x124 fp=0x10f1efd0 sp=0x10f1ef5c
sm/space/utils/godump.listenAndDump()
    /home/jeff/katamari/go/src/sm/space/utils/godump/godump.go:35 +0x3c fp=0x10f1efdc sp=0x10f1efd0
runtime.goexit()
    /home/jeff/go/src/runtime/asm_arm.s:1031 +0x4 fp=0x10f1efdc sp=0x10f1efdc
created by sm/space/utils/godump.init.1
    /home/jeff/katamari/go/src/sm/space/utils/godump/godump.go:25 +0x24

goroutine 12 [select, 3810 minutes, locked to thread]:
runtime.gopark(0xa70984, 0x10f1f78c, 0x8ff6a0, 0x6, 0x56318, 0x2)
    /home/jeff/go/src/runtime/proc.go:185 +0x148 fp=0x10f1f674 sp=0x10f1f660
runtime.selectgoImpl(0x10f1f78c, 0x0, 0xc)
    /home/jeff/go/src/runtime/select.go:392 +0x14d4 fp=0x10f1f750 sp=0x10f1f674
runtime.selectgo(0x10f1f78c)
    /home/jeff/go/src/runtime/select.go:212 +0x10 fp=0x10f1f760 sp=0x10f1f750
runtime.ensureSigM.func1()
    /home/jeff/go/src/runtime/signal1_unix.go:227 +0x428 fp=0x10f1f7dc sp=0x10f1f760
runtime.goexit()
    /home/jeff/go/src/runtime/asm_arm.s:1031 +0x4 fp=0x10f1f7dc sp=0x10f1f7dc
created by runtime.ensureSigM
    /home/jeff/go/src/runtime/signal1_unix.go:240 +0xe4

goroutine 14 [syscall, 3810 minutes]:
syscall.Syscall6(0x72, 0x4377, 0x10f19eb0, 0x0, 0x1104e000, 0x0, 0x0, 0x803c8, 0xb6a414b8, 0x6)
    /home/jeff/go/src/syscall/asm_linux_arm.s:48 +0x8 fp=0x10f19e64 sp=0x10f19e60
syscall.wait4(0x4377, 0x10f19eb0, 0x0, 0x1104e000, 0x50, 0x0, 0x0)
    /home/jeff/go/src/syscall/zsyscall_linux_arm.go:172 +0x64 fp=0x10f19e90 sp=0x10f19e64
syscall.Wait4(0x4377, 0x10f19ed4, 0x0, 0x1104e000, 0x0, 0x0, 0x0)
    /home/jeff/go/src/syscall/syscall_linux.go:256 +0x54 fp=0x10f19eb4 sp=0x10f19e90
os.(*Process).wait(0x11060240, 0x0, 0x0, 0x0)
    /home/jeff/go/src/os/exec_unix.go:22 +0xcc fp=0x10f19f0c sp=0x10f19eb4
os.(*Process).Wait(0x11060240, 0x0, 0x0, 0x0)
    /home/jeff/go/src/os/doc.go:45 +0x2c fp=0x10f19f20 sp=0x10f19f0c
os/exec.(*Cmd).Wait(0x10f56140, 0x0, 0x0)
    /home/jeff/go/src/os/exec/exec.go:380 +0x1b4 fp=0x10f19f88 sp=0x10f19f20
github.com/spacemonkeygo/spacelog.CaptureOutputToProcess.func1(0x10f56140)
    /home/jeff/katamari/go/src/github.com/spacemonkeygo/spacelog/capture.go:61 +0x1c fp=0x10f19fd4 sp=0x10f19f88
runtime.goexit()
    /home/jeff/go/src/runtime/asm_arm.s:1031 +0x4 fp=0x10f19fd4 sp=0x10f19fd4
created by github.com/spacemonkeygo/spacelog.CaptureOutputToProcess
    /home/jeff/katamari/go/src/github.com/spacemonkeygo/spacelog/capture.go:65 +0x204

goroutine 35 [IO wait]:
runtime.gopark(0xa7094c, 0xb6148424, 0x8f8fc8, 0x7, 0x1b, 0x5)
    /home/jeff/go/src/runtime/proc.go:185 +0x148 fp=0x10f2edd8 sp=0x10f2edc4
runtime.netpollblock(0xb6148410, 0x72, 0x0, 0x34234)
    /home/jeff/go/src/runtime/netpoll.go:338 +0x178 fp=0x10f2edfc sp=0x10f2edd8
net.runtime_pollWait(0xb6148410, 0x72, 0x10f0e060)
    /home/jeff/go/src/runtime/netpoll.go:157 +0x60 fp=0x10f2ee10 sp=0x10f2edfc
net.(*pollDesc).Wait(0x10f121b8, 0x72, 0x0, 0x0)
    /home/jeff/go/src/net/fd_poll_runtime.go:73 +0x34 fp=0x10f2ee20 sp=0x10f2ee10
net.(*pollDesc).WaitRead(0x10f121b8, 0x0, 0x0)
    /home/jeff/go/src/net/fd_poll_runtime.go:78 +0x30 fp=0x10f2ee34 sp=0x10f2ee20
net.(*netFD).readFrom(0x10f12180, 0x10f70a00, 0x80, 0x80, 0x0, 0x0, 0x0, 0xb6141030, 0x10f0e060)
    /home/jeff/go/src/net/fd_unix.go:259 +0x20c fp=0x10f2eea8 sp=0x10f2ee34
net.(*UDPConn).ReadFromUDP(0x11032020, 0x10f70a00, 0x80, 0x80, 0x0, 0x10f70a00, 0x0, 0x0)
    /home/jeff/go/src/net/udpsock_posix.go:61 +0xe4 fp=0x10f2ef18 sp=0x10f2eea8
sm/discovery.handleRequest(0xb28b5090, 0x11032020, 0x11032018, 0x0, 0x0)
    /home/jeff/katamari/go/src/sm/discovery/responder.go:117 +0x80 fp=0x10f2ef8c sp=0x10f2ef18
sm/discovery.handleRequests(0xb28b5090, 0x11032020, 0x11032018, 0x0, 0x0)
    /home/jeff/katamari/go/src/sm/discovery/responder.go:107 +0x44 fp=0x10f2efa4 sp=0x10f2ef8c
sm/discovery.respondUntilError.func1(0x10f122c0, 0xb28b5090, 0x11032020, 0x11032018)
    /home/jeff/katamari/go/src/sm/discovery/responder.go:89 +0x2c fp=0x10f2efc4 sp=0x10f2efa4
runtime.goexit()

I'll see if I can start running it with GODEBUG=gccheckmark=1. Is there anything in particular I should be observing with that flag?

Comment by @aclements (2015-08-13 21:37:10)

It is an ARMv5 in the sense that no binaries will run on it if I don't specify GOARM=5.

I believe the default is GOARM=6, so if it doesn't run with the default, you must have ARMv5. (I ask because ARM made a mess of its concurrency primitives, so it's basically impossible to run a multithreaded binary compiled for ARM v5 correctly on ARM v7 hardware.)

I'll see if I can start running it with GODEBUG=gccheckmark=1. Is there anything in particular I should be observing with that flag?

Panics. :) This enables an assertion in the runtime, so if it fails, it will panic with a bunch of debugging info.

Comment by @zeebo (2015-08-14 17:44:41)

I bet we have similar code of dubious nature in our code base. I haven't had any crashes with the debug environment variable yet.

I wonder if automated tooling can be written that just detects if the last usage of a variable in some function is in an unsafe.Pointer(&var) kind of form. I'd be ok with false positives if it helped me find all those issues.

Comment by @aclements (2015-08-14 17:51:23)

I bet we have similar code of dubious nature in our code base.

One of the things we're hoping to do for 1.6 is to write down the rules of pointers and Cgo. We've even tinkered with ways of enforcing these rules, but so far no concrete changes have come of that.

I wonder if automated tooling can be written that just detects if the last usage of a variable in some function is in an unsafe.Pointer(&var) kind of form.

I know that go vet does some checking of unsafe.Pointer use, though I'm not sure what exactly.

Comment by @zeebo (2015-08-14 18:02:28)

As far as I'm aware the only checking it does is to make sure you don't squirrel away pointer values as integer types so the GC is still aware of them. :(

@rsc
Copy link
Contributor Author

rsc commented Aug 15, 2015

@zeebo, if you can share the code you are using that involves the allocation and use of the objects for which you are setting finalizers, that might suffice to see what might be going on. As Austin said, it really does look like a finalizer is being set on an allocation that is being kept in C but not in Go, or something like that. Thanks.

@rsc rsc modified the milestones: Go1.5Maybe, Go1.5 Aug 15, 2015
@rsc
Copy link
Contributor Author

rsc commented Aug 15, 2015

@aclements, it looks like SetFinalizer never checks that the object is actually allocated. It checks that the object is from an in-use span, but if it had already been freed, SetFinalizer would still succeed. That might go unnoticed until the whole span was freed, and then at the next GC we'd see a crash like this.

This theory still implicates some kind of bug in the way SetFinalizer is being used in @zeebo's program, but maybe for Go 1.6 we can detect SetFinalizer on freed objects during the call? Or maybe it's not worth it.

@aclements
Copy link
Member

This theory still implicates some kind of bug in the way SetFinalizer is being used in @zeebo's program, but maybe for Go 1.6 we can detect SetFinalizer on freed objects during the call? Or maybe it's not worth it.

I suppose we could do this reasonably efficiently by first checking the second word of the object. If it's 0xdead... then we also have to scan the span's free list to see if it's really free, but this is quite unlikely. I could spin a patch for @zeebo to try so we can check this theory.

With my sweep-free allocation proposal, we would just have to check the span sweep offset and possibly the mark bitmap.

@zeebo
Copy link
Contributor

zeebo commented Aug 17, 2015

Unfortunately, I've been unable to reproduce any more panics with the environment variable set.

We have many finalizers in the code base. Is there any hint in the information provided to narrow down where the search is? Do you have any guesses as to what kind of invalid behavior to look for specifically? I'll look for finalizers on C allocated memory (I believe that's what @rsc meant in his first comment on this issue), but are there any more? I'd be willing to annotate all of our finalizer stuff, but I don't know what I'd annotate.

I'll begin looking now, regardless, to see if anything obvious sticks out.

@aclements
Copy link
Member

As Austin said, it really does look like a finalizer is being set on an allocation that is being kept in C but not in Go, or something like that.

It's weirder than that. I don't think the object with the finalizer is the problem. The bad pointer is the pointer to the finalizer function, not the object. I can't imagine why one would pass that in to C (it's not even possible without some serious gymnastics and I don't know what you'd do with it in C).

Unfortunately, I've been unable to reproduce any more panics with the environment variable set.

Thanks for trying. Did you get any without the environment variable set?

We have many finalizers in the code base. Is there any hint in the information provided to narrow down where the search is?

The finalizer function would have to be a closure that actually captures variables from its enclosing scope or a method (which turns in to a closure that captures the method receiver). This should be the only way we can see a (bad) heap-allocated function pointer. Otherwise the funcval would point in to static data, and we wouldn't see this panic.

I can also tell you that the closure is at most 1,408 bytes, but I doubt that narrows things down much. :)

I'll look for finalizers on C allocated memory (I believe that's what @rsc meant in his first comment on this issue), but are there any more?

Unfortunately, as I mentioned at the top of this comment in response to @rsc's comment, this isn't a good filter for this problem. It isn't about the finalized object, it's about the finalizer function.

@zeebo
Copy link
Contributor

zeebo commented Aug 17, 2015

Thanks for trying. Did you get any without the environment variable set?

So far, I've only seen the one reported in the ticket.

@rsc
Copy link
Contributor Author

rsc commented Aug 18, 2015

@zeebo Can you look at your calls to SetFinalizer for any func expressions (func(p *pointer) { ... }) or method values? (A method value is x.M where x is something with a method M. Normally you call it, as in x.M(); a method value is when you don't call it but instead save it for later.) If you can share those maybe something will jump out at us. Thanks.

@zeebo
Copy link
Contributor

zeebo commented Aug 18, 2015

Here's all the possible usages of SetFinalizer in the program:

src/github.com/jtolds/oom-watcher/eventfd/eventfd.go
32: runtime.SetFinalizer(efd, func(efd *EventFD) {
33-     efd.Close()
34- })

src/github.com/spacemonkeygo/openssl/cert.go
79: runtime.SetFinalizer(name, func(n *Name) {
80-     C.X509_NAME_free(n.name)
81- })

126:    runtime.SetFinalizer(c, func(c *Certificate) {
127-        C.X509_free(c.x)
128-    })

339:    runtime.SetFinalizer(x, func(x *Certificate) {
340-        C.X509_free(x.x)
341-    })

365:    runtime.SetFinalizer(key, func(key *pKey) {
366-        C.EVP_PKEY_free(key.key)
367-    })

src/github.com/spacemonkeygo/openssl/ciphers.go
133:    runtime.SetFinalizer(ctx, func(ctx *cipherCtx) {
134-        C.EVP_CIPHER_CTX_free(ctx.ctx)
135-    })

src/github.com/spacemonkeygo/openssl/conn.go
159:    runtime.SetFinalizer(c, func(c *Conn) {
160-        c.into_ssl.Disconnect(into_ssl_cbio)
161-        c.from_ssl.Disconnect(from_ssl_cbio)
162-        C.SSL_free(c.ssl)
163-    })

335:    runtime.SetFinalizer(cert, func(cert *Certificate) {
336-        C.X509_free(cert.x)
337-    })

src/github.com/spacemonkeygo/openssl/ctx.go
141:    runtime.SetFinalizer(c, func(c *Ctx) {
142-        C.SSL_CTX_free(c.ctx)
143-    })

328:    runtime.SetFinalizer(store, func(s *CertificateStore) {
329-        C.X509_STORE_free(s.store)
330-    })

402:    runtime.SetFinalizer(cert, func(cert *Certificate) {
403-        C.X509_free(cert.x)
404-    })

src/github.com/spacemonkeygo/openssl/dhparam.go
49: runtime.SetFinalizer(dhparams, func(dhparams *DH) {
50-     C.DH_free(dhparams.dh)
51- })

src/github.com/spacemonkeygo/openssl/engine.go
47: runtime.SetFinalizer(e, func(e *Engine) {
48-     C.ENGINE_finish(e.e)
49-     C.ENGINE_free(e.e)
50- })

src/github.com/spacemonkeygo/openssl/key.go
241:    runtime.SetFinalizer(p, func(p *pKey) {
242-        C.EVP_PKEY_free(p.key)
243-    })

276:    runtime.SetFinalizer(p, func(p *pKey) {
277-        C.EVP_PKEY_free(p.key)
278-    })

311:    runtime.SetFinalizer(p, func(p *pKey) {
312-        C.EVP_PKEY_free(p.key)
313-    })

333:    runtime.SetFinalizer(p, func(p *pKey) {
334-        C.EVP_PKEY_free(p.key)
335-    })

src/github.com/spacemonkeygo/openssl/sha1.go
45: runtime.SetFinalizer(hash, func(hash *SHA1Hash) { hash.Close() })

src/github.com/spacemonkeygo/openssl/sha256.go
45: runtime.SetFinalizer(hash, func(hash *SHA256Hash) { hash.Close() })

src/sm/flud/diskq/wrappers.go
// note: finalizerClose is a top level function. this is the body:
// func finalizerClose(f *FileWrapper) {
//  defer f.close_cb()
//  logger.Critf("leaked filehandle: %s", f.fh.Name())
//  mon.Event("leaked_filehandle")
//  logger.Errore(f.fh.Close())
// }
47: runtime.SetFinalizer(f, finalizerClose)

116:    runtime.SetFinalizer(f, nil)

src/sm/flud/utils/crypto/sha_linux.go
86: runtime.SetFinalizer(a, func(a *afALG) {
87-     a.Close()
88- })

145:    runtime.SetFinalizer(a, nil)

src/sm/leveldb-cgo/leveldb/leveldb.go
116:    runtime.SetFinalizer(batch, func(batch *WriteBatch) {
117-        batch.Close()
118-    })

126:        runtime.SetFinalizer(self, nil)

194:    runtime.SetFinalizer(db, func(db *DB) {
195-        db.Close(ctx)
196-    })

257:    runtime.SetFinalizer(self, nil)

437:    runtime.SetFinalizer(iterator, func(iterator *Iterator) {
438-        iterator.Close(ctx)
439-    })

504:    runtime.SetFinalizer(self, nil)

src/sm/space/crypto/aes_linux.go
227:    runtime.SetFinalizer(a, func(a *afALG) {
228-        a.Close()
229-    })

261:    runtime.SetFinalizer(a, nil)

src/sm/space/network/chunk.go
102:    runtime.SetFinalizer(c, func(c *chunkDownloader) { c.fec.Destroy() })

All of them seem innocuous from the body and almost all of them are performing CGO calls to free resources. Nothing stood out to me as weird. Let me know if you want to see the bodies of any of the dependent methods, and I'll try to get them.

A significant portion are open source, though. Only the sm/... packages are not.

@aclements
Copy link
Member

Thanks! These are the three that could be culpable, since they're the only ones that capture variables from their enclosing scope (I assume ctx in the latter two is a local variable/argument and not a global).

src/github.com/spacemonkeygo/openssl/conn.go
159:    runtime.SetFinalizer(c, func(c *Conn) {
160-        c.into_ssl.Disconnect(into_ssl_cbio)
161-        c.from_ssl.Disconnect(from_ssl_cbio)
162-        C.SSL_free(c.ssl)
163-    })

src/sm/leveldb-cgo/leveldb/leveldb.go
194:    runtime.SetFinalizer(db, func(db *DB) {
195-        db.Close(ctx)
196-    })

437:    runtime.SetFinalizer(iterator, func(iterator *Iterator) {
438-        iterator.Close(ctx)
439-    })

These all look fine.

Unfortunately, I'm not sure what else to do short of waiting for another panic and seeing what patterns we can extract.

@zeebo
Copy link
Contributor

zeebo commented Aug 18, 2015

In the latter two, ctx a context.Context. You can see the full source for the first at https://github.com/spacemonkeygo/openssl/blob/master/conn.go#L159

I'll keep watching for panics. Do you still think it's worth it to keep the environment variable? For example, could it be making the trigger because of some funny race condition?

@aclements
Copy link
Member

I'll keep watching for panics. Do you still think it's worth it to keep the environment variable? For example, could it be making the trigger because of some funny race condition?

Given that gccheckmark hasn't found anything yet, it's probably more valuable to run without it and get more panics if you can. You're right that it's entirely possible gccheckmark is perturbing things enough to prevent the problem from happening.

@rsc
Copy link
Contributor Author

rsc commented Aug 19, 2015

We won't do anything for this for Go 1.5 but I'm still interested to understand what is wrong.

@rsc
Copy link
Contributor Author

rsc commented Nov 24, 2015

Any new panics? It could be that this is due to the non-atomic pointer writes, fixed on the release branch and scheduled to be part of Go 1.5.2. If you can rebuild a custom Go version, use git fetch origin release-branch.go1.5.2 && git checkout release-branch.go1.5.2 and try that. Otherwise, Go 1.5.2 will be out in a few weeks.

@zeebo
Copy link
Contributor

zeebo commented Nov 24, 2015

I'll run code on go1.5.2 for a while and see if anything breaks. Thanks for the update.

@zeebo
Copy link
Contributor

zeebo commented Nov 24, 2015

It appears that the branch has not been made yet on the git repo at https://go.googlesource.com/go. Am I doing something wrong? I'll try again in a few hours.

@aclements
Copy link
Member

Russ meant to say git fetch origin release-branch.go1.5 && git checkout release-branch.go1.5.

@zeebo
Copy link
Contributor

zeebo commented Nov 24, 2015

Of course, thank you. I'm starting it now.

@zeebo
Copy link
Contributor

zeebo commented Dec 4, 2015

Well, there have been no panics in 10 days. I think everything is fine. It's too bad I never got enough details to definitively say that the bug is fixed, but I'm happy closing this issue if you are.

@rsc
Copy link
Contributor Author

rsc commented Dec 5, 2015

Thanks for retrying. I bet it is the pointer writes. Glad we fixed it.

@rsc rsc closed this as completed Dec 5, 2015
@golang golang locked and limited conversation to collaborators Jul 1, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants