-
Notifications
You must be signed in to change notification settings - Fork 18k
runtime: invalidptr check triggered on linux/armv5 with cgo #12157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@zeebo, if you can share the code you are using that involves the allocation and use of the objects for which you are setting finalizers, that might suffice to see what might be going on. As Austin said, it really does look like a finalizer is being set on an allocation that is being kept in C but not in Go, or something like that. Thanks. |
@aclements, it looks like SetFinalizer never checks that the object is actually allocated. It checks that the object is from an in-use span, but if it had already been freed, SetFinalizer would still succeed. That might go unnoticed until the whole span was freed, and then at the next GC we'd see a crash like this. This theory still implicates some kind of bug in the way SetFinalizer is being used in @zeebo's program, but maybe for Go 1.6 we can detect SetFinalizer on freed objects during the call? Or maybe it's not worth it. |
I suppose we could do this reasonably efficiently by first checking the second word of the object. If it's 0xdead... then we also have to scan the span's free list to see if it's really free, but this is quite unlikely. I could spin a patch for @zeebo to try so we can check this theory. With my sweep-free allocation proposal, we would just have to check the span sweep offset and possibly the mark bitmap. |
Unfortunately, I've been unable to reproduce any more panics with the environment variable set. We have many finalizers in the code base. Is there any hint in the information provided to narrow down where the search is? Do you have any guesses as to what kind of invalid behavior to look for specifically? I'll look for finalizers on C allocated memory (I believe that's what @rsc meant in his first comment on this issue), but are there any more? I'd be willing to annotate all of our finalizer stuff, but I don't know what I'd annotate. I'll begin looking now, regardless, to see if anything obvious sticks out. |
It's weirder than that. I don't think the object with the finalizer is the problem. The bad pointer is the pointer to the finalizer function, not the object. I can't imagine why one would pass that in to C (it's not even possible without some serious gymnastics and I don't know what you'd do with it in C).
Thanks for trying. Did you get any without the environment variable set?
The finalizer function would have to be a closure that actually captures variables from its enclosing scope or a method (which turns in to a closure that captures the method receiver). This should be the only way we can see a (bad) heap-allocated function pointer. Otherwise the funcval would point in to static data, and we wouldn't see this panic. I can also tell you that the closure is at most 1,408 bytes, but I doubt that narrows things down much. :)
Unfortunately, as I mentioned at the top of this comment in response to @rsc's comment, this isn't a good filter for this problem. It isn't about the finalized object, it's about the finalizer function. |
So far, I've only seen the one reported in the ticket. |
@zeebo Can you look at your calls to SetFinalizer for any func expressions (func(p *pointer) { ... }) or method values? (A method value is x.M where x is something with a method M. Normally you call it, as in x.M(); a method value is when you don't call it but instead save it for later.) If you can share those maybe something will jump out at us. Thanks. |
Here's all the possible usages of SetFinalizer in the program:
All of them seem innocuous from the body and almost all of them are performing CGO calls to free resources. Nothing stood out to me as weird. Let me know if you want to see the bodies of any of the dependent methods, and I'll try to get them. A significant portion are open source, though. Only the sm/... packages are not. |
Thanks! These are the three that could be culpable, since they're the only ones that capture variables from their enclosing scope (I assume ctx in the latter two is a local variable/argument and not a global).
These all look fine. Unfortunately, I'm not sure what else to do short of waiting for another panic and seeing what patterns we can extract. |
In the latter two, ctx a context.Context. You can see the full source for the first at https://github.com/spacemonkeygo/openssl/blob/master/conn.go#L159 I'll keep watching for panics. Do you still think it's worth it to keep the environment variable? For example, could it be making the trigger because of some funny race condition? |
Given that gccheckmark hasn't found anything yet, it's probably more valuable to run without it and get more panics if you can. You're right that it's entirely possible gccheckmark is perturbing things enough to prevent the problem from happening. |
We won't do anything for this for Go 1.5 but I'm still interested to understand what is wrong. |
Any new panics? It could be that this is due to the non-atomic pointer writes, fixed on the release branch and scheduled to be part of Go 1.5.2. If you can rebuild a custom Go version, use |
I'll run code on go1.5.2 for a while and see if anything breaks. Thanks for the update. |
It appears that the branch has not been made yet on the git repo at https://go.googlesource.com/go. Am I doing something wrong? I'll try again in a few hours. |
Russ meant to say |
Of course, thank you. I'm starting it now. |
Well, there have been no panics in 10 days. I think everything is fine. It's too bad I never got enough details to definitively say that the bug is fixed, but I'm happy closing this issue if you are. |
Thanks for retrying. I bet it is the pointer writes. Glad we fixed it. |
Split from issue #12138 to separate arm report (this one) from amd64 report (#12158).
Reported by @zeebo (2015-08-13 17:54:46)
I upgraded my system to use go1.5rc1 and let it run for a couple of days. I observed this panic:
Just as last time (#11640) I'm willing to do whatever I can to help diagnose.
Comment by @aclements (2015-08-13 20:34:54)
@zeebo, and chance you're running with -race? We fixed a memory corruption bug after rc1 was branched, but it only affects running in -race mode.
If possible, it would be great if you can run your system with GODEBUG=gccheckmark=1. It slows things down quite a bit, but enables some very useful debug code.
Comment by @RLH (2015-08-13 20:36:40)
Also if you have any of the other go routine stacks that would help.
Comment by @aclements (2015-08-13 21:03:02)
@zeebo, is this also with GOOS=linux GOARCH=arm GOARM=5? If so, are you actually running on an ARMv5 or something newer?
Just some quick diagnosis from the traceback: the bad pointer is the *funcval field of a "specialfinalizer" finalizer record. It looks like a pretty reasonable pointer, but points in to a span that's in state _MSpanDead. There's only one way we're supposed to be able to observe a dead span in the h_spans array: if we free a large span and coalesce it with neighboring spans, only the first and last page in the coalesced range are guaranteed to be in state "free". The pages in between may point to any span, including one in state "dead". Given how far "p" is from "s.start", this seems the most likely way we found the dead span. If this is ARM, there's also always the possibility of a missing memory barrier, since h_spans is read by the garbage collector without locking.
If we in fact freed that span, that means we either somehow created a bogus finalizer record, or we created a fine finalizer record, but failed to mark the funcval in an earlier cycle and the span got collected. One weird thing is that I would expect that funcval to be in a small object span that was only one page and hence could not leave behind "dead" spans in h_spans. This is a long shot, @zeebo, but do you have any finalizers that are closures that access particularly large variables from their enclosing frame? Or, in general, any unusual finalizers?
Comment by @zeebo (2015-08-13 21:08:07)
It is not in race mode. It is an ARMv5 in the sense that no binaries will run on it if I don't specify GOARM=5. Unfortunately, I'm not much of a hardware guy so I don't know what that all entails.
I don't know if we have any unusual finalizers, but it's possible. We use them mostly for freeing resources in our openssl wrapper and some debug checks if we forget to close a file handle or something. I'll do a double check of the code base to see if anything sticks out as really strange.
Here is more of the stack dump. I'm not sure if it's all of it, but this is what I have:
I'll see if I can start running it with GODEBUG=gccheckmark=1. Is there anything in particular I should be observing with that flag?
Comment by @aclements (2015-08-13 21:37:10)
I believe the default is GOARM=6, so if it doesn't run with the default, you must have ARMv5. (I ask because ARM made a mess of its concurrency primitives, so it's basically impossible to run a multithreaded binary compiled for ARM v5 correctly on ARM v7 hardware.)
Panics. :) This enables an assertion in the runtime, so if it fails, it will panic with a bunch of debugging info.
Comment by @zeebo (2015-08-14 17:44:41)
I bet we have similar code of dubious nature in our code base. I haven't had any crashes with the debug environment variable yet.
I wonder if automated tooling can be written that just detects if the last usage of a variable in some function is in an unsafe.Pointer(&var) kind of form. I'd be ok with false positives if it helped me find all those issues.
Comment by @aclements (2015-08-14 17:51:23)
One of the things we're hoping to do for 1.6 is to write down the rules of pointers and Cgo. We've even tinkered with ways of enforcing these rules, but so far no concrete changes have come of that.
I know that go vet does some checking of unsafe.Pointer use, though I'm not sure what exactly.
Comment by @zeebo (2015-08-14 18:02:28)
As far as I'm aware the only checking it does is to make sure you don't squirrel away pointer values as integer types so the GC is still aware of them. :(
The text was updated successfully, but these errors were encountered: